Hadoop Timelines

At one of the last sessions of the Hadoop Summit 2009, Arun Murthy (Yahoo) went over the changes that were necessary in Hadoop to sort a terabyte of data in less than 60 seconds. Besides all of the good wisdom in the work, what I liked most was his use of charts to understand where Hadoop could use some optimization. He described one of the charts (see image on the right) as the “ideal hadoop job”. I don’t remember everything, but the point is that you see smooth lines/waves of both mappers and reducers, quick startup time, few wasted jobs and so on. This left me thinking: what would my jobs look like? Hence, the reason for Hadoop Timelines.

Hadoop Timelines is a Web service built on App Engine, paired with a Python script written with Dumbo, that takes care of everything needed to replicate Arun’s Task Timelines for your own Hadoop jobs. My goal with this project is to raise awareness among Hadoop developers of how their jobs execute and perform and, maybe even crazier, to get us collaborating on the analysis of individual job performance through comments on specific graphs.

If you’re already comfortable with Hadoop, using Timelines should be really easy. The first thing you’ll need to do is follow Klaas’ tip for collecting job logs into HDFS using a simple cron job. If you don’t already have Dumbo and want to keep your dev environment clean, you can follow another excellent post from Klaas on using virtualenv with Dumbo. Once you are up and running with Dumbo, download my dumbo/timelines.py job script, which will process joblogs.txt and submit the results for public viewing to Hadoop Timelines.

dumbo start timelines.py -input joblogs.txt -output results

WARNING: Please be aware that very basic information on your job tasks will be uploaded for public viewing. Anybody will be able to see the number of tasks, job duration, start and end time but nothing else. Please take a look at an example job if you want to be sure you’ll be comfortable uploading the same information for your jobs. The entire source for the project is available as well.
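To make the idea of a task timeline a bit more concrete, here is a rough sketch of the kind of computation behind such a chart: counting how many map and reduce tasks are running at each second, given only their start and end times. This is not the code in timelines.py; the function name `task_timeline`, the `(kind, start, end)` tuple layout and the sample data are all assumptions made up for illustration.

```python
from collections import Counter


def task_timeline(tasks):
    """Count how many tasks of each kind are running at each second.

    `tasks` is an iterable of (kind, start, end) tuples, where kind is a
    label such as "map" or "reduce" and start/end are integer seconds.
    A task counts as running for every second in [start, end).
    This input layout is an assumption, not the real job log format.
    """
    timeline = {}
    for kind, start, end in tasks:
        counts = timeline.setdefault(kind, Counter())
        for second in range(start, end):
            counts[second] += 1
    return timeline


# Hypothetical sample data, not taken from a real job log.
sample = [
    ("map", 0, 4),
    ("map", 1, 5),
    ("map", 2, 6),
    ("reduce", 3, 9),
    ("reduce", 4, 10),
]

timeline = task_timeline(sample)

# Print a crude text chart: one row per second, one '#' per running task.
for kind in sorted(timeline):
    counts = timeline[kind]
    for second in sorted(counts):
        print(f"{kind:>6} t={second:>2} {'#' * counts[second]}")
```

Plotting these per-second counts for mappers and reducers gives the kind of waves described above, and only start and end times are needed, which is in line with the very basic information the warning mentions.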

Now that I’ve wrapped this little side project up, I’m going to start looking into my very own scary-looking job graphs, and I will possibly be blogging whatever lessons I extract from them in the near future. Many thanks to Arun and Owen for uploading their code and data to compute the TeraSort graphs, to Klaas for his amazing Dumbo, and to everyone else who makes writing these types of projects so much fun.
