
Thursday, October 1, 2009

Partly Cloudy with a Chance of Condor

We have been thinking about cloud computing quite a bit over the last month. As I noted earlier, cloud computing is hardly a new idea, but it does add a few new twists on some old concepts in distributed systems. So, we are spending some time to understand how we can take our existing big applications and make them work with cloud systems and software. It should come as no surprise that there are a number of ways to use Condor to harness clouds for big applications.
Two weeks ago, I gave a talk titled Science in the Clouds at an NSF workshop on Cloud Computing and the Geosciences. One of the points I made was that although clouds make it easy to allocate new machines with exactly the environment you want, they don't solve the problem of work management. That is, if you have one million tasks to do, how do you reliably distribute them among your workstation, your campus computer center, and your cloud workforce? For this, you need some kind of job execution system, which is largely what grid computing has focused on:
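The shape of that work-management problem can be sketched as a simple master/worker queue. The pool names and toy tasks below are purely illustrative; this is not Condor's API, and a real system like Condor also handles failures, retries, matchmaking, and data movement:

```python
import queue
import threading

def run_tasks(tasks, pools):
    """Distribute tasks across heterogeneous worker pools.

    `tasks` is a list of zero-argument callables; `pools` maps a pool
    name (e.g. "workstation", "campus", "cloud") to a worker count.
    Workers from every pool pull from one shared queue, so faster or
    larger pools naturally take on more of the work.
    """
    todo = queue.Queue()
    for t in tasks:
        todo.put(t)

    results = []
    lock = threading.Lock()

    def worker(pool_name):
        while True:
            try:
                task = todo.get_nowait()
            except queue.Empty:
                return  # queue drained; this worker is done
            r = task()
            with lock:
                results.append((pool_name, r))

    threads = [threading.Thread(target=worker, args=(name,))
               for name, count in pools.items() for _ in range(count)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Example: 100 small tasks spread across three "pools" of workers.
out = run_tasks([lambda i=i: i * i for i in range(100)],
                {"workstation": 1, "campus": 2, "cloud": 4})
```

The hard part, of course, is everything this sketch leaves out: the workers are real machines that crash, join, and leave, which is exactly why a dedicated job execution system is needed.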

As it stands, Condor is pretty good at managing work across many different kinds of systems. In fact, today you can go to a commercial service like Cycle Computing, which will build an on-demand Condor pool for you by allocating machines from Amazon:


Just today, we hosted Dhruba Borthakur at Notre Dame. Dhruba is the project lead for the open source Apache Hadoop system. We are cooking up some neat ways for Condor and Hadoop to play together. As a first step, one of my students, Peter Bui, has cooked up a module for Parrot that talks to HDFS, the Hadoop file system. This allows any Unix program -- not just Java programs -- to talk to HDFS, without requiring the kernel configuration and other headaches of using FUSE. Then, you can submit your jobs into a Condor pool and allow them to access data in HDFS as if it were a local file system. The next step is to co-locate the Condor jobs with the Hadoop data that they want to access.
Finally, if you are interested in cloud computing, you should attend CCA09 - Cloud Computing and Applications - to be held in Chicago on October 20th. This will be a focused, one-day meeting with speakers from industry and academia who are both building and using cloud computers.

Tuesday, December 9, 2008

Visualizing Clusters in Real Time

The end of the semester is nearing, so activity in our distributed system really shoots up as undergraduates finish their semester projects and graduate students hurry to generate those last few research results. You can see this activity reflected in a number of visual displays that we have created to track activity in the system.

For example, the following is a snapshot of an applet that displays the current state of all machines in our Chirp storage cluster. Each machine is represented by a box, where colors indicate resources on each machine (memory, disk, cpu), and arrows indicate active network transfers. You can click on the following snapshot, or view the current live display if you like.


At a glance, you can see several different things going on. Two students are working on their semester project in distributed systems, which is to create a dynamic cloud of web servers. This is reflected in the spread of network transfers going from two submit machines near the top, transmitting web server data to the hosts where they happen to run. One student is running a large All-Pairs job: this is reflected in the small vertical arrows about one third of the way down, indicating heavy disk traffic local to each machine. Closer to the bottom, someone is keeping the CPUs busy on the cluster named sc0-xx: this cluster runs Hadoop, so it is probably a Map-Reduce job.
Of course, this kind of display is only good for the high level immediate picture. It only tells you what is going on in the last minute. For a more historical view, we track and publish data from our Condor distributed batch system. For example, this shows the CPU utilization of our system over the last week. The red part shows the number of CPUs in use by the person at the keyboard, the green part shows the number of CPUs idle, and the blue part shows the idle CPUs harnessed by Condor. As you can tell, things picked up in the middle of the week:



This display shows the total CPU time consumed by different users over the last year. A few students have really racked up a lot of computation!


All of these displays rotate automatically on our "scoreboard", which is a monitor in a public hallway of the Engineering building at Notre Dame. It's not unusual for me to step out of my office only to find a few students checking to see who scored the highest over the last week! We ought to give a yearly award to the student who makes the best use of the system.

On a more philosophical note, the displays really help to make our work more concrete to outsiders. Because we work with intangible software instead of test tubes or fissile material, we don't have a "lab" to show visitors or prospective students. A live picture draws people in: random people from other departments stop in the hallway to look at the scoreboard and ask what is going on. A little effort put into "advertising" goes a long way.

Friday, October 31, 2008

An Abstraction for Ensemble Classifiers

In the last post, I presented the idea of abstractions for distributed computing, and explained the All-Pairs abstraction, which represents a very large Cartesian product. Of course, a single abstraction is meant to address a very focused kind of workload. If you have a different category of problem, then you need another abstraction.
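As a reminder of the shape of that abstraction, All-Pairs applies a function F to every pair drawn from two sets. A sequential sketch makes the semantics clear; the real implementation's job is to distribute this grid of invocations across a cluster:

```python
def all_pairs(A, B, F):
    """All-Pairs: apply F to the Cartesian product of A and B.

    Returns a matrix M where M[i][j] = F(A[i], B[j]). A distributed
    implementation partitions this grid of invocations across many
    CPUs and moves the data to them; this version just shows what
    the abstraction computes.
    """
    return [[F(a, b) for b in B] for a in A]

# Example: compare two small sets by a toy similarity function
# (absolute difference stands in for something like image comparison).
M = all_pairs([1, 2, 3], [10, 20], lambda a, b: abs(a - b))
```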

We discovered another abstraction while working with Nitesh Chawla's data mining group, also at Notre Dame. A common construction in data mining is the ensemble classifier. A single classifier examines the properties of a large number of objects and divides them up into groups that are roughly similar. You may be familiar with algorithms such as K-Nearest-Neighbors or Decision Trees. Improvements upon these algorithms continue to be an active area of research.

For any given classifier, you can often improve the runtime or the accuracy of the classification by breaking the data into pieces, running the classifier on each piece, and then collecting all of the results, using a majority vote to determine the final classification. For very large datasets, you may even need to use multiple processors and disks to complete the job in a reasonable amount of time.

To address this problem, our students Chris Moretti and Karsten Steinhauser created the Classify abstraction:

Classify( T, R, P, N, F ):
T - The testing set of data to classify.
R - The training set used to train each classifier.
P - The method of partitioning the data.
N - The number of classifiers to employ.
F - The classifier function to use.
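A sequential sketch of this abstraction, with round-robin dealing standing in for P and a toy one-nearest-neighbor classifier standing in for F (both are illustrative stand-ins, not the partitioners or classifiers actually evaluated in the work described here):

```python
from collections import Counter

def classify(T, R, P, N, F):
    """Sketch of the Classify abstraction.

    T: testing set; R: training set; P: partitioning function;
    N: number of classifiers; F: classifier constructor that takes a
    training partition and returns a function mapping an item to a label.
    Each of the N classifiers is trained on its own partition of R, and
    the final label for each test item is decided by majority vote. A
    real implementation runs each classifier on its own CPU and disk.
    """
    partitions = P(R, N)
    classifiers = [F(part) for part in partitions]
    votes = [[c(t) for c in classifiers] for t in T]
    return [Counter(v).most_common(1)[0][0] for v in votes]

def round_robin(R, N):
    """Deal the training set out into N partitions."""
    return [R[i::N] for i in range(N)]

def nearest_labeled(part):
    """Toy 1-nearest-neighbor classifier over (value, label) pairs."""
    def label(x):
        return min(part, key=lambda vl: abs(vl[0] - x))[1]
    return label

# Example: three voters, each trained on a third of the labeled points.
train = [(0, "low"), (1, "low"), (2, "low"),
         (8, "high"), (9, "high"), (10, "high")]
labels = classify([1.5, 9.5], train, round_robin, 3, nearest_labeled)
```

Because T, R, P, N, and F are all explicit parameters, the same machinery can scale any classifier function across partitions without the user writing any distributed code, which is the point of packaging it as an abstraction.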

Here is a schematic of the Classify abstraction:

This abstraction was also implemented on our local Condor pool and Chirp storage cluster, using one CPU and disk for each classifier function. With this implementation, Chris and Karsten were able to evaluate a number of classifier functions on multi-gigabyte datasets, using up to 128 CPUs simultaneously. In a week, they were able to accomplish what might have taken years to organize and execute by hand. You can read more about the technical details in our paper, which will be presented at the International Conference on Data Mining.

The observant reader may notice that the Classify abstraction looks a lot like the Map-Reduce abstraction implemented in Hadoop. In the next post, I'll discuss this similarity and explain the important difference between the two abstractions.