Monday, January 11, 2010

Green Cloud Online

The Green Cloud is now online!

The Green Cloud is the invention of Dr. Paul Brenner at the ND Center for Research Computing. It is a containerized data center located at the South Bend city greenhouse, stocked with used servers kindly donated by eBay, Inc. The first batch of machines was installed in December, and the facility will eventually reach about 400 cores once everything is turned on.


What makes the data center unique is that it has no air conditioning. Instead, the data center takes in ambient air, and then exhausts it into the greenhouse. This benefits Notre Dame, since we no longer pay the cost of cooling, but it also benefits the greenhouse, which has significant heating costs during the winter months. (We used to call this idea grid heating.)

Of course, this means the capacity of the system may change with the weather. During the winter, the system can run at full blast and deliver as much heat as possible to the greenhouse. During the spring and fall, the heat may not be needed, and can be vented outdoors. During the hottest part of the summer, we may need to shut some machines down to keep the temperature under control. However, recent studies by big data center operators suggest that machine temperatures can be safely raised to 80 or 90 degrees Fahrenheit, so there may be a fair amount of headroom available. We will see.
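To make the weather-driven operation concrete, here is a purely illustrative Python sketch of how a controller might decide how many servers to keep powered on as the intake air warms up. The server count and temperature thresholds are hypothetical, not the actual Green Cloud control logic.

    # Illustrative only: pick how many servers to keep running based on the
    # intake air temperature. The thresholds and server count are made up.
    TOTAL_SERVERS = 50   # hypothetical machine count

    def target_server_count(intake_temp_f):
        if intake_temp_f <= 70:
            return TOTAL_SERVERS          # winter: run full blast, heat the greenhouse
        if intake_temp_f <= 90:
            # shed load linearly as we approach the upper limit
            fraction = (90 - intake_temp_f) / 20.0
            return max(1, int(TOTAL_SERVERS * fraction))
        return 0                          # too hot: shut everything down

    print(target_server_count(80))        # roughly half the machines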

For a normal data center that runs web servers and databases, shutting down machines is not really an option. However, the Green Cloud provides fungible computing power for large computations in science and engineering at Notre Dame. If structured correctly, these workloads can adapt to 10 or 100 or 1000 cores. So, turning machines on and off will affect performance, but not correctness.

A good example of a flexible workload is genome assembly. Two of our students, Christopher Moretti and Michael Olson, presented initial results on a Scalable Genome Assembler at the MTAGS Workshop held at Supercomputing 2009. Their assembler uses our Work Queue framework to manage a variable workforce, pushing out sequence fragments to whatever machines are available. The system has scaled up to about 1000 cores, spread across the Notre Dame campus, the Green Cloud, Purdue University, and the University of Wisconsin.
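For readers curious what driving a variable workforce looks like in code, here is a rough sketch using the Work Queue Python bindings as they exist in CCTools today (the 2009 assembler predates these bindings, and the file names and the ./align program below are placeholders):

    # Sketch of a Work Queue master: queue up batches of sequence fragments
    # and let whatever workers are currently connected chew through them.
    import work_queue as wq

    q = wq.WorkQueue(port=9123)   # workers on any machine connect to this port

    for i in range(100):
        t = wq.Task("./align batch.%d.fa > overlaps.%d.txt" % (i, i))
        t.specify_input_file("align")                  # ship the alignment program
        t.specify_input_file("batch.%d.fa" % i)        # a batch of fragments
        t.specify_output_file("overlaps.%d.txt" % i)   # bring back the results
        q.submit(t)

    while not q.empty():
        t = q.wait(5)             # a finished task, or None on timeout
        if t:
            print("task %d finished with status %d" % (t.id, t.return_status))

Workers are started separately (for example, with the work_queue_worker command pointed at the master's host and port), so machines can join or leave the pool at any time without disturbing the computation.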

We are currently working on a journal paper and an open source release of the assembler, so stay tuned for details.

Tuesday, April 14, 2009

Distributed Genome Assembly on 1000 Computers

Lately, my research group has been collaborating with Prof. Scott Emrich on several problems in bioinformatics. Our students Chris Moretti and Mike Olson have been building a system for carrying out whole-genome assembly on campus grids. They recently scaled it up to run on nearly 1000 nodes spread across Notre Dame, Purdue, and Wisconsin, cutting the time to complete an assembly from a few weeks to a few hours. We are excited to move the system into production use and start working on some real assembly problems.

Here is what the genome assembly problem looks like from a computer science perspective. As you should remember from biology class, your entire genetic makeup is encoded in a long string of DNA, which is a chemical sequence of base pairs that we represent by the letters A, T, C, and G. A sequencing device takes a biological sample and, through some chemical manipulations, can extract the DNA and produce your entire string of DNA, which is some 3 billion characters (bases) long:

AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA

Except that it isn't that simple. The chemical process of extracting the sequence runs out of energy after about 100-1000 characters, depending on the exact process in use. Instead, what you end up with is a large set of "reads", which are random substrings of the entire genome. For example, here are three random substrings of the previous string (a toy simulation of this sampling appears after the examples):

1. ATCCTAGCTAGCTACGA


2. AGTCGATCGATCG

3. CGATCGATAATCGATCCTAG
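The sampling process above is easy to mimic in a few lines of code. This toy Python sketch (read lengths and counts are arbitrary) just pulls random substrings out of the example string:

    # Toy illustration: "reads" are random substrings of the genome.
    import random

    genome = "AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA"

    def sample_reads(genome, n_reads, min_len=10, max_len=20):
        reads = []
        for _ in range(n_reads):
            length = random.randint(min_len, min(max_len, len(genome)))
            start = random.randint(0, len(genome) - length)
            reads.append(genome[start:start + length])
        return reads

    print(sample_reads(genome, 3))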

Now, you have to examine all of the reads and figure out which ones overlap. In principle, you would compare all of them to each other with the All-Pairs framework, but that would be computationally infeasible. Instead, there are a number of heuristics that can be used to generate candidate pairs, which can then be aligned in detail and assembled (a minimal overlap check is sketched after the alignment below). For example, the three reads from before overlap like this:

AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA

.....................................

AGTCGATCGATCG........................

.......CGATCGATAATCGATCCTAG..........

....................ATCCTAGCTAGCTACGA
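The detailed matching step can be as simple as asking how long a suffix of one read exactly matches a prefix of another. This minimal Python check is only an illustration of the idea; real aligners tolerate sequencing errors and are far more clever:

    # Brute-force exact overlap: longest suffix of a that equals a prefix of b.
    def overlap(a, b, min_len=3):
        best = 0
        for k in range(min_len, min(len(a), len(b)) + 1):
            if a.endswith(b[:k]):
                best = k
        return best

    # reads 2 and 3 from above overlap by 6 characters ("CGATCG")
    print(overlap("AGTCGATCGATCG", "CGATCGATAATCGATCCTAG"))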

There are many wide-open questions about exactly which heuristics to use in selecting candidates, performing alignments, and completing the assembly. Our job is to give researchers a modular framework that allows them to try many different kinds of algorithms, using hundreds or thousands of CPUs to complete the job quickly.
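One common family of candidate heuristics is to index short k-mers and only consider pairs of reads that share at least one of them. The sketch below illustrates that trick in Python; it is not necessarily the heuristic our assembler uses:

    # k-mer heuristic: reads sharing a k-mer become candidate pairs.
    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(reads, k=8):
        index = defaultdict(set)                 # k-mer -> ids of reads containing it
        for rid, seq in enumerate(reads):
            for i in range(len(seq) - k + 1):
                index[seq[i:i+k]].add(rid)
        pairs = set()
        for rids in index.values():
            for a, b in combinations(sorted(rids), 2):
                pairs.add((a, b))
        return pairs

    reads = ["ATCCTAGCTAGCTACGA", "AGTCGATCGATCG", "CGATCGATAATCGATCCTAG"]
    print(candidate_pairs(reads, k=6))           # pairs (0,2) and (1,2): the two real overlaps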

We started with the Work Queue framework from the Wavefront abstraction. An assembly master process reads the candidates and sequences from disk, builds small units of work, and sends them out to worker processes running on various grids. No particular alignment code is baked into the system. Instead, the user provides an alignment program written in whatever language they find convenient. The system moves the executable and the necessary files out to the execution node, and puts it to work.
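Because the alignment step is pluggable, the user's program can be a short standalone script. Here is a hypothetical example of what such a program might look like; the input format (tab-separated candidate pairs on stdin) and the exact-overlap scoring are invented for illustration, since the real interface isn't spelled out here:

    # Hypothetical user-supplied alignment program: read candidate pairs on
    # stdin as "id_a <tab> seq_a <tab> id_b <tab> seq_b", print any overlaps.
    import sys

    def overlap(a, b, min_len=10):
        for k in range(min(len(a), len(b)), min_len - 1, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    for line in sys.stdin:
        id_a, seq_a, id_b, seq_b = line.rstrip("\n").split("\t")
        k = overlap(seq_a, seq_b)
        if k:
            print("%s\t%s\t%d" % (id_a, id_b, k))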

Here is an example of the system in action on a multi-institutional grid. The X axis shows time, and the various lines show number of tasks running (red), percent complete (blue), and cumulative speedup (green). We started by running a worker on one workstation, then another, then on a 32-node cluster, then on the Notre Dame campus grid, then on Condor pools at Purdue and Wisconsin, growing up to nearly 700 CPUs total. About halfway through, we forced a failure by unplugging the workstation running the master. Upon restarting, the master loaded the completed results, and picked up right where it left off.
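The restart behavior described above amounts to journaling completed work units to disk and skipping them on startup. A minimal sketch of that idea in Python (the file name and record format here are made up):

    # Journal completed work units; on restart, skip anything already done.
    import os

    CHECKPOINT = "completed.log"

    def load_completed():
        if not os.path.exists(CHECKPOINT):
            return set()
        with open(CHECKPOINT) as f:
            return {line.strip() for line in f if line.strip()}

    def record_completed(unit_id):
        with open(CHECKPOINT, "a") as f:
            f.write(unit_id + "\n")

    done = load_completed()
    all_units = ["batch.%d" % i for i in range(100)]
    pending = [u for u in all_units if u not in done]
    print("%d units already done, %d to go" % (len(done), len(pending)))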



I'm looking forward to putting our system into production and attacking some really big problems.