Beyond MapReduce: Scientific Data Processing in Real-time
Chris Hillman
October 13th 2015
chillman@dundee.ac.uk
Proteomics
Genome: ~21,000 genes
Proteome: 1,000,000+ proteins
Mass Spectrometry
Each experiment produces a 7 GB XML file containing 40,000 scans
(600,000,000 data points) in approximately 100 minutes.
Data processing can take over 24 hours:
• Pick 2D peaks
• De-isotope
• Pick 3D peaks
• Match weights to known peptides
Mass Spectrometry
The new lab has 12 machines: that is a lot of data, and a lot of data processing.
Parallel Computing
Parallel Processing
Amdahl's Law: the serial portion is fixed
Gustafson's Law: the size of the problem is not fixed
Gunther's Law: why linear scalability breaks down
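As a quick numeric illustration of the difference between the first two laws (a hypothetical sketch, not from the talk; the 5% serial fraction is assumed): Amdahl's Law caps speedup at 1/s for a fixed problem, while Gustafson's scaled speedup keeps growing because the problem grows with the machine.

```java
// Hypothetical illustration of Amdahl's vs Gustafson's Laws.
public class ScalingLaws {
    // Amdahl: fixed problem size; s = serial fraction, n = processors.
    static double amdahl(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    // Gustafson: problem size scales with n; scaled speedup.
    static double gustafson(double s, int n) {
        return n - s * (n - 1);
    }

    public static void main(String[] args) {
        double s = 0.05; // assume 5% of the work is serial
        System.out.printf("Amdahl    (n=64): %.1f%n", amdahl(s, 64));
        System.out.printf("Gustafson (n=64): %.1f%n", gustafson(s, 64));
    }
}
```

With 5% serial work and 64 nodes, Amdahl limits the speedup to about 15x while Gustafson's scaled speedup is about 61x, which is why growing the problem (more scans, more experiments) keeps a cluster useful.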
Working Environment
Parallel Algorithm
2D peak picking fits well into a Map task:
– Read into memory
– Decode base64 float array
– Peak pick, isotopic envelope detection
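The base64 decode step above can be sketched as follows (a minimal illustration; real mass-spectrometry XML declares the precision and byte order per scan, here assumed to be big-endian 32-bit floats):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

// Minimal sketch: decode a base64-encoded float array of the kind
// embedded in mass-spectrometry XML scans. Assumes big-endian
// 32-bit floats; real files state precision and byte order.
public class ScanDecoder {
    static float[] decodeFloats(String base64) {
        byte[] bytes = Base64.getDecoder().decode(base64);
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
        float[] out = new float[bytes.length / 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat();
        }
        return out;
    }

    public static void main(String[] args) {
        // Round-trip two known floats to demonstrate the decode.
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN);
        buf.putFloat(372.5f).putFloat(1234.0f);
        String encoded = Base64.getEncoder().encodeToString(buf.array());
        float[] decoded = decodeFloats(encoded);
        System.out.println(decoded[0] + " " + decoded[1]);
    }
}
```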
Parallel Algorithm
3D peak picking fits well into a Reduce task:
– Receive partitions of 2D peaks
– Detect 3D peaks
– Isotopic envelopes
– Output peak
[chart: peak intensity vs mass]
Issues
XML is not a good format for parallel processing.
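One common workaround, and what the transform step described later in the talk does, is to flatten each XML scan into a single tab-separated line so a line-oriented input format can split the file freely. A hypothetical sketch (field names and order are assumed for illustration):

```java
// Hypothetical sketch: flatten one scan's fields into a single
// tab-separated line, so HDFS can split the file on line boundaries
// instead of worrying about XML elements spanning block boundaries.
// Field names and order are invented for illustration.
public class ScanFlattener {
    static String toTsv(int scanNumber, int msLevel, double retentionTime,
                        String base64Peaks) {
        return scanNumber + "\t" + msLevel + "\t" + retentionTime + "\t" + base64Peaks;
    }

    public static void main(String[] args) {
        System.out.println(toTsv(42, 1, 372.5, "QzpAAA=="));
    }
}
```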
Issues
[chart: overlapping 2D peaks, axis ranges 38–41 and 372–374]
Issues
Data shuffle and skew on the cluster
[chart: record counts per partition key (50–1597), heavily skewed, peaks above 100,000]
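The skew above is what the custom MZPartitioner referenced in the MR job setup is meant to smooth out. A hypothetical sketch of the core idea, range-partitioning scans by m/z (the split points here are invented; a real implementation would derive them from a sample of the data):

```java
// Hypothetical sketch of a skew-aware partitioner: route m/z keys
// to reducers by range so that dense m/z regions do not all land
// on one reducer. Boundary values are invented for illustration.
public class MZRangePartitioner {
    // Precomputed split points over the m/z axis; a real
    // implementation would compute these from sampled data.
    static final int[] SPLITS = {400, 600, 800, 1000, 1200, 1400};

    static int getPartition(int mz, int numPartitions) {
        int bucket = 0;
        while (bucket < SPLITS.length && mz >= SPLITS[bucket]) {
            bucket++;
        }
        // Map the range bucket onto the available partitions.
        return bucket % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition(350, 4));   // first range
        System.out.println(getPartition(1500, 4));  // last range
    }
}
```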
Results
MapReduce
[diagram: Map → Shuffle → Reduce]
Transforming the XML and writing the modified data to:
• HDFS
• HBase
• Cassandra
Executing the MapReduce code reading from the above.
The batch process shows the potential to speed up the current
process by scaling the size of the cluster running it.
Flink
Experiences so far
Very easy to install
Very easy to understand
Good documentation
Very easy to adapt current code
I like it!
MR Job
public class PeakPick extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "peakpick");
    job.setJarByClass(PeakPick.class);
    job.setNumReduceTasks(104);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(MapHDFS.class);
    job.setReducerClass(ReduceHDFS.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setPartitionerClass(MZPartitioner.class);
    FileInputFormat.setInputPaths(job, new Path(args[1]));
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    job.waitForCompletion(true);
    return job.isSuccessful() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new PeakPick(), args);
    System.exit(res);
  }
}
MR Read
public class MapHDFS extends Mapper<LongWritable, Text, IntWritable, Text> {
  public void map(LongWritable key, Text value, Context context) {
    String inputLine = value.toString();
    tempStr = inputLine.split("\t");
    scNumber = tempStr[1];
    ……
    intensityString = tempStr[8];
  }
}

public class MapCassandra extends Mapper<ByteBuffer, SortedMap<ByteBuffer, Cell>, Text, IntWritable> {
  public void map(ByteBuffer key, SortedMap<ByteBuffer, Cell> columns, Context context) {
    scNumber = String.valueOf(key.getInt());
    for (Cell cell : columns.values()) {
      String name = ByteBufferUtil.string(cell.name().toByteBuffer());
      if (name.contains("scan")) scNumber = String.valueOf(ByteBufferUtil.toInt(cell.value()));
      if (name.contains("mslvl")) scLevel = String.valueOf(ByteBufferUtil.toInt(cell.value()));
      if (name.contains("rettime")) RT = String.valueOf(ByteBufferUtil.toDouble(cell.value()));
    }
  }
}
MR Write
public class MapHDFS extends Mapper<LongWritable, Text, IntWritable, Text> {
  public void map(LongWritable key, Text value, Context context) {
    …………
    for (int i = 0; i < outputPoints.size(); i++) {
      mzStringOut = scNumber + "\t" + scLevel + "\t" + RT + "\t" +
          Integer.toString(outputPoints.get(i).getCurveID()) + "\t" +
          Double.toString(outputPoints.get(i).getWpm());
      context.write(new IntWritable(outputPoints.get(i).getKey()), peakOut);
    }
  }
}

public class ReduceHDFS extends Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable key, Iterable<Text> values, Context context) {
    …………
    for (int k = 0; k < MonoISO.size(); k++) {
      outText = MonoISO.get(k).getCharge() + "\t" +
          MonoISO.get(k).getWpm() + "\t" +
          MonoISO.get(k).getSumI() + "\t" +
          MonoISO.get(k).getWpRT();
      context.write(new IntWritable(0), new Text(outText));
    }
  }
}
Flink Job
public class PeakPickFlink_MR {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    Job job = Job.getInstance();
    // Set up input format
    HadoopInputFormat<LongWritable, Text> hadoopIF =
        new HadoopInputFormat<LongWritable, Text>(
            new TextInputFormat(), LongWritable.class, Text.class, job);
    TextInputFormat.addInputPath(job, new Path(args[0]));
    // Read HDFS data
    DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopIF);
Flink Job
    // use Hadoop Mapper as MapFunction
    DataSet<Tuple2<IntWritable, Text>> result = text
        .flatMap(new HadoopMapFunction<LongWritable, Text, IntWritable, Text>(new MapHDFS()))
        .groupBy(0)
        // use Hadoop Reducer
        .reduceGroup(new HadoopReduceFunction<IntWritable, Text, IntWritable, Text>(new ReduceHDFS()));
Flink Job
    // Set up the Hadoop TextOutputFormat
    HadoopOutputFormat<IntWritable, Text> hadoopOF =
        new HadoopOutputFormat<IntWritable, Text>(
            new TextOutputFormat<IntWritable, Text>(), job);
    // Write results back to HDFS
    hadoopOF.getConfiguration().set("mapreduce.output.textoutputformat.separator", " ");
    TextOutputFormat.setOutputPath(job, new Path(args[1]));
    // Emit data using the Hadoop TextOutputFormat
    result.output(hadoopOF).setParallelism(1);
    // Execute code
    env.execute("Hadoop PeakPick");
  }
}
Interim Results
Mapper only: Hadoop 12m25s, Flink 4m50s
Mapper and Reducer: Hadoop 28m32s, Flink 10m20s
Existing code: 35m22s
Near real-time?
Still not fast enough to be called near real-time.
Processing x scans per second: if the cluster were big enough, then maybe…
But the mass spectrometer takes 100 minutes to complete its
processing for one experiment, so in fact we have more than
enough time to process the data if we stream the results and
process the data as it is produced…
Streaming the data
Simulate streaming data using an existing data file and Kafka.
Ingest the data using the Flink Streaming API and process the scans using the existing
mapper code.
[diagram: Existing Data File → Kafka → Flink Streaming API]
Streaming the results
A peptide elutes over a period of time, which means the data from many scans needs
to be compared at the same time.
A safe window to measure the quantity of a peptide is 10 seconds.
[diagram: 10-second window over the scan stream]
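The 10-second window can be sketched without any framework (a hypothetical illustration; the real pipeline would use Flink's windowed streams): assign each scan to a bucket by its retention time, then process each bucket's scans together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of 10-second tumbling windows over scan
// retention times; the real pipeline would use Flink windowing.
public class ScanWindows {
    static Map<Long, List<Double>> window(double[] retentionTimes, double windowSecs) {
        Map<Long, List<Double>> buckets = new TreeMap<>();
        for (double rt : retentionTimes) {
            long bucket = (long) (rt / windowSecs); // window index
            buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(rt);
        }
        return buckets;
    }

    public static void main(String[] args) {
        double[] rts = {1.2, 4.8, 9.9, 10.1, 15.0, 21.7};
        System.out.println(window(rts, 10.0));
    }
}
```

Overlapping (sliding) windows would assign each scan to more than one bucket, which is exactly why the de-duplication step discussed next becomes necessary.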
Interim Results
Overlapping 10-second windows capture 3D peaks from the 2D scans.
Interim Results
Processing the entire scan in the 10-second window means that we don't need the
overlapping window and the de-duplication step.
All this means that the data will be fully pre-processed just
over 10 seconds after the mass spectrometer completes the experiment.
Near real-time?
Stream Processing
To Do list
• Complete stable working system
• Contrast with Spark and Storm
• Hook up previous research on database lookup to create a complete system
• Pay for some EC2 system time to complete testing
• Write a thesis…
Questions?
chillman@dundee.ac.uk
@chillax7
http://www.thequeensarmskensington.co.uk
Editor's Notes
  • #2: Who am I: Teradata; working abroad a lot (time in hotels); part-time PhD at the University of Dundee; stubbornness.
  • #3: Proteomics intro: Genome project → proteomics, studying proteins in cells (treated vs non-treated). Reasons for disease → cures for disease. Complexity, current processing methods and times. Experiment: digesting proteins using an enzyme into smaller pieces called peptides… Full process including database search; area of research is pre-processing, linking to work already done.
  • #4: Peptides detected in a mass spectrometer… Current problem space, description of data files. Issues with lag between experiment and data output; biological and technical replicates; QC data between experiments.
  • #5: 140 GB XML data, 20 billion data points: that's big data to me. Parallel algorithm, tested on various distributed computing environments.
  • #6: Area of research is to write a parallel algorithm for processing the data and test it on different systems: HDFS, Cassandra and HBase for data storage; Hadoop MapReduce, Spark, Storm for processing. Doing some work for the EU commission, and then Flink came along…
  • #7: Because this is a research project, there needs to be a formal approach. Working in IT, you get what works, etc. These three laws are often quoted in conjunction with parallel processing.
  • #9: The experiment cuts the proteins into pieces (called peptides). Peaks are the molecular weight of the protein piece that we are trying to measure.
  • #10: The 3d peak matching is a complex algorithm
  • #11: XML is a poor format for parallel processing. Code to decode the XML as it is copied onto the cluster; write to local disk, HDFS, Cassandra, HBase. Describe fields: scan number, retention time, Base64 fields containing float arrays.
  • #12: Overlapping regions are required to process the reducers. We know how wide a 3D peak can be based on the biological properties of the peptides… De-duplication may be needed.
  • #13: A custom partitioner is needed to balance out the skew and produce the overlapping regions. Beware coding always on a single-node pseudo cluster: no distribution. Beware test datasets: too small can hide O(N²) problems.
  • #14: It works!
  • #15: Quick recap: what has been done so far using Hadoop v2. The map and reduce code produce correct results, and the potential is there to decrease the time taken to process the data. I know how many scans per second can be processed, and the relationship between the size of the reduce mass window and time to completion.
  • #16: The Hadoop MapReduce code works well, reading and writing from HDFS, HBase and Cassandra (writing to Cassandra was an issue). Current widely used software takes around 35 minutes, running on bare metal. Speed was a little disappointing: around 28 minutes to decode the test file, due to virtual machine issues and only 8 map slots. Started looking at Spark, but then, while doing some reviewing work, kept reading about Flink. So I had to try it, and the rest of the presentation is about my experiences with Flink.
  • #17: To begin with, I looked at how to port my existing code.
  • #18: Looking at some code to compare and contrast Hadoop MapReduce with Flink.
  • #19: Hadoop interface versions!! From HDFS, read in a key/value pair and use a string split to parse the tab-separated values in the value part. The Cassandra read uses the Cassandra Hadoop jars to read a Cassandra column family (or table).
  • #20: Hadoop interface versions!! Writing requires concatenating the output values into the value part and writing back to HDFS, which the reducer then reads back in and splits into parts. After all the calculations are done, the reducer concatenates all the results and writes back to HDFS.
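Since a Hadoop Text value carries a single string, the concatenate-and-split round trip described in notes #19–#20 amounts to joining fields with tabs on write and splitting them again on read. A rough sketch (the field set and order are illustrative):

```java
// Sketch: pack 2D-peak fields into one tab-separated value string and
// unpack them again in the reducer; the field order is illustrative.
public class PeakValue {
    public final int scan;
    public final double retentionTime;
    public final double mz;
    public final double intensity;

    public PeakValue(int scan, double retentionTime, double mz, double intensity) {
        this.scan = scan;
        this.retentionTime = retentionTime;
        this.mz = mz;
        this.intensity = intensity;
    }

    // Concatenate the fields for writing back to HDFS.
    public String toTabString() {
        return scan + "\t" + retentionTime + "\t" + mz + "\t" + intensity;
    }

    // Split the value part back into fields on the reduce side.
    public static PeakValue fromTabString(String value) {
        String[] parts = value.split("\t");
        return new PeakValue(Integer.parseInt(parts[0]),
                             Double.parseDouble(parts[1]),
                             Double.parseDouble(parts[2]),
                             Double.parseDouble(parts[3]));
    }
}
```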
  • #21: Flink Code
  • #22: Using the example code, I could drop in my mapper and reducer from the MapReduce code and run straight away… the map test worked perfectly, and it's fast (more later). The reducer ran much faster than expected (but the results were wrong!) (again, more later). Note there is no write out to HDFS between map and reduce for the shuffle.
  • #23: Hadoop interface versions!! Writing requires concatenating the output values into the value part and writing back to HDFS. Straight from the example code, the mapper worked first time, perfectly.
  • #24: Running just the map task is faster in Flink, more so than I expected – especially considering how little effort it took to get the mapper running. The full task with map and reduce was again faster, but… the results were wrong. Slow and correct, or fast and wrong… The issue is to do with my custom partitioner and the way results were sorted – more on that later. By rewriting the controlling job and making small changes to the map and reduce code, I could use data types such as the Tuple8 to store and pass my values. Which is nice…
  • #25: The point of the research was to produce the results in near real-time. Is 10 minutes good enough? What about when the next part of the processing is added to the batch? The current database lookup process can take hours to complete.
  • #26: Use Kafka to simulate streaming the output directly from the mass spec. In reality this would mean a major change by the manufacturer, but it can show the benefits of making such a change. The map code, which processes each scan, is fairly easy to incorporate into a streaming pattern: you just treat each scan as a single unit of data and use exactly the same code as already exists. The reducer is more of a problem, though.
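The "each scan is a single unit" point in note #26 can be illustrated without Kafka itself, using a plain queue as a stand-in for the topic (a sketch only; the real setup would consume from a Kafka topic via Flink's Kafka connector):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

// Sketch: drain a queue of scans, applying the same per-scan map
// function the batch job already uses; the queue stands in for a Kafka topic.
public class ScanStream {
    public static List<String> process(Queue<String> scans, Function<String, String> mapFn) {
        List<String> peaks = new ArrayList<>();
        String scan;
        while ((scan = scans.poll()) != null) {
            peaks.add(mapFn.apply(scan)); // one scan in, one result out
        }
        return peaks;
    }
}
```

Because the map step needs nothing beyond the current scan, it moves from batch to streaming unchanged; it is only the reduce side that has to be rethought.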
  • #27: The reduce code is more complicated
  • #28: Need to process the 2D results in overlapping 10-second windows to find the 3D peaks. A sliding window: 10 seconds is roughly 40 scans, so 40 scans sliding by 1 scan gives me what I need. From the profile of the 3D curve created by matching up the 2D peaks, we can tell whether we have the start and end of the peak in the window, so there is no need for de-duplication. It still seems to be a map and a reduce step, with the mapper decoding each scan and producing peaks, and the reducer aggregating the 2D peaks over time.
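Outside of any particular framework, the window shape in note #28 is simple: roughly 40 scans per 10 seconds, advancing one scan at a time. A stdlib-only sketch of that slicing (the real job would use Flink's windowing API rather than this):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: produce sliding windows of `size` scans advancing by `slide`
// scans, mirroring the ~40-scan / 10-second window described above.
public class SlidingScans {
    public static List<List<Integer>> windows(List<Integer> scans, int size, int slide) {
        List<List<Integer>> out = new ArrayList<>();
        for (int start = 0; start + size <= scans.size(); start += slide) {
            out.add(new ArrayList<>(scans.subList(start, start + size)));
        }
        return out;
    }
}
```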
  • #29: Removing the need to split the 2D scans removes the need for the overlapping window and the de-duplication. The previous method was constrained by the way MapReduce makes you think… we ended up with a complex system of overlapping data and de-duplication just to produce a way of running in parallel. The streaming method is still parallel, but it is each 10-second window that can run on a separate machine, not a mass window… This also removes the problem of having to decide how many mass windows to define, as that determines how many reduce slots are needed.
  • #30: Using Kafka to simulate the streaming of data, and Flink to ingest and process the stream, shows that it would be possible to build an information factory for life-sciences data that could produce fully pre-processed (peak-picked) mass spectrometer data in near real-time.