Beyond MapReduce: Scientific Data Processing in Real-time
Chris Hillman
October 13th 2015
chillman@dundee.ac.uk
Proteomics
Genome: ~21,000 genes
Proteome: 1,000,000+ proteins
Mass Spectrometry
Each experiment produces a 7 GB XML file containing 40,000 scans
(600,000,000 data points) in approximately 100 minutes.
Data processing can take over 24 hours:
• Pick 2D peaks
• De-isotope
• Pick 3D peaks
• Match weights to known peptides
Mass Spectrometry
The new lab has 12 machines: that is a lot of data, and a lot of data processing.
Parallel Computing
Parallel Processing
Amdahl's Law: the serial portion is fixed
Gustafson's Law: the size of the problem is not fixed
Gunther's Law: why linear scalability breaks down
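As a quick numeric illustration of the difference between the first two laws (a hypothetical sketch, not from the talk; the 5% serial fraction is assumed): Amdahl's Law caps speedup at 1/s for a fixed problem, while Gustafson's scaled speedup keeps growing because the problem grows with the machine.

```java
// Hypothetical illustration of Amdahl's vs Gustafson's Laws.
public class ScalingLaws {
    // Amdahl: fixed problem size; s = serial fraction, n = processors.
    static double amdahl(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    // Gustafson: problem size scales with n; scaled speedup.
    static double gustafson(double s, int n) {
        return n - s * (n - 1);
    }

    public static void main(String[] args) {
        double s = 0.05; // assume 5% of the work is serial
        System.out.printf("Amdahl    (n=64): %.1f%n", amdahl(s, 64));
        System.out.printf("Gustafson (n=64): %.1f%n", gustafson(s, 64));
    }
}
```

With 5% serial work and 64 nodes, Amdahl limits the speedup to about 15x while Gustafson's scaled speedup is about 61x, which is why growing the problem (more scans, more experiments) keeps a cluster useful.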
Working Environment
Parallel Algorithm
2D peak picking fits well into a Map task:
– Read into memory
– Decode base64 float array
– Peak pick, isotopic envelope detection
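The base64 decode step above can be sketched as follows (a minimal illustration; real mass-spectrometry XML declares the precision and byte order per scan, here assumed to be big-endian 32-bit floats):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

// Minimal sketch: decode a base64-encoded float array of the kind
// embedded in mass-spectrometry XML scans. Assumes big-endian
// 32-bit floats; real files state precision and byte order.
public class ScanDecoder {
    static float[] decodeFloats(String base64) {
        byte[] bytes = Base64.getDecoder().decode(base64);
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
        float[] out = new float[bytes.length / 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat();
        }
        return out;
    }

    public static void main(String[] args) {
        // Round-trip two known floats to demonstrate the decode.
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN);
        buf.putFloat(372.5f).putFloat(1234.0f);
        String encoded = Base64.getEncoder().encodeToString(buf.array());
        float[] decoded = decodeFloats(encoded);
        System.out.println(decoded[0] + " " + decoded[1]);
    }
}
```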
Parallel Algorithm
3D peak picking fits well into a Reduce task:
– Receive partitions of 2D peaks
– Detect 3D peaks
– Isotopic envelopes
– Output peak
[chart: peak intensity vs mass]
Issues
XML is not a good format for parallel processing.
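One common workaround, and what the transform step described later in the talk does, is to flatten each XML scan into a single tab-separated line so a line-oriented input format can split the file freely. A hypothetical sketch (field names and order are assumed for illustration):

```java
// Hypothetical sketch: flatten one scan's fields into a single
// tab-separated line, so HDFS can split the file on line boundaries
// instead of worrying about XML elements spanning block boundaries.
// Field names and order are invented for illustration.
public class ScanFlattener {
    static String toTsv(int scanNumber, int msLevel, double retentionTime,
                        String base64Peaks) {
        return scanNumber + "\t" + msLevel + "\t" + retentionTime + "\t" + base64Peaks;
    }

    public static void main(String[] args) {
        System.out.println(toTsv(42, 1, 372.5, "QzpAAA=="));
    }
}
```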
Issues
[chart: overlapping 2D peaks, axis ranges 38–41 and 372–374]
Issues
Data shuffle and skew on the cluster
[chart: record counts per partition key (50–1597), heavily skewed, peaks above 100,000]
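The skew above is what the custom MZPartitioner referenced in the MR job setup is meant to smooth out. A hypothetical sketch of the core idea, range-partitioning scans by m/z (the split points here are invented; a real implementation would derive them from a sample of the data):

```java
// Hypothetical sketch of a skew-aware partitioner: route m/z keys
// to reducers by range so that dense m/z regions do not all land
// on one reducer. Boundary values are invented for illustration.
public class MZRangePartitioner {
    // Precomputed split points over the m/z axis; a real
    // implementation would compute these from sampled data.
    static final int[] SPLITS = {400, 600, 800, 1000, 1200, 1400};

    static int getPartition(int mz, int numPartitions) {
        int bucket = 0;
        while (bucket < SPLITS.length && mz >= SPLITS[bucket]) {
            bucket++;
        }
        // Map the range bucket onto the available partitions.
        return bucket % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition(350, 4));   // first range
        System.out.println(getPartition(1500, 4));  // last range
    }
}
```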
Results
MapReduce
[diagram: Map → Shuffle → Reduce]
Transforming the XML and writing the modified data to:
• HDFS
• HBase
• Cassandra
Executing the MapReduce code reading from the above.
The batch process shows the potential to speed up the current
process by scaling the size of the cluster running it.
Flink
Experiences so far
Very easy to install
Very easy to understand
Good documentation
Very easy to adapt current code
I like it!
MR Job
public class PeakPick extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "peakpick");
    job.setJarByClass(PeakPick.class);
    job.setNumReduceTasks(104);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(MapHDFS.class);
    job.setReducerClass(ReduceHDFS.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setPartitionerClass(MZPartitioner.class);
    FileInputFormat.setInputPaths(job, new Path(args[1]));
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    job.waitForCompletion(true);
    return job.isSuccessful() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new PeakPick(), args);
    System.exit(res);
  }
}
MR Read
public class MapHDFS extends Mapper<LongWritable, Text, IntWritable, Text> {
  public void map(LongWritable key, Text value, Context context) {
    String inputLine = value.toString();
    tempStr = inputLine.split("\t");
    scNumber = tempStr[1];
    ……
    intensityString = tempStr[8];
  }
}

public class MapCassandra extends Mapper<ByteBuffer, SortedMap<ByteBuffer, Cell>, Text, IntWritable> {
  public void map(ByteBuffer key, SortedMap<ByteBuffer, Cell> columns, Context context) {
    scNumber = String.valueOf(key.getInt());
    for (Cell cell : columns.values()) {
      String name = ByteBufferUtil.string(cell.name().toByteBuffer());
      if (name.contains("scan")) scNumber = String.valueOf(ByteBufferUtil.toInt(cell.value()));
      if (name.contains("mslvl")) scLevel = String.valueOf(ByteBufferUtil.toInt(cell.value()));
      if (name.contains("rettime")) RT = String.valueOf(ByteBufferUtil.toDouble(cell.value()));
    }
  }
}
MR Write
public class MapHDFS extends Mapper<LongWritable, Text, IntWritable, Text> {
  public void map(LongWritable key, Text value, Context context) {
    …………
    for (int i = 0; i < outputPoints.size(); i++) {
      mzStringOut = scNumber + "\t" + scLevel + "\t" + RT + "\t" +
          Integer.toString(outputPoints.get(i).getCurveID()) + "\t" +
          Double.toString(outputPoints.get(i).getWpm());
      context.write(new IntWritable(outputPoints.get(i).getKey()), peakOut);
    }
  }
}

public class ReduceHDFS extends Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable key, Iterable<Text> values, Context context) {
    …………
    for (int k = 0; k < MonoISO.size(); k++) {
      outText = MonoISO.get(k).getCharge() + "\t" +
          MonoISO.get(k).getWpm() + "\t" +
          MonoISO.get(k).getSumI() + "\t" +
          MonoISO.get(k).getWpRT();
      context.write(new IntWritable(0), new Text(outText));
    }
  }
}
Flink Job
public class PeakPickFlink_MR {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    Job job = Job.getInstance();
    // Set up input format
    HadoopInputFormat<LongWritable, Text> hadoopIF =
        new HadoopInputFormat<LongWritable, Text>(
            new TextInputFormat(), LongWritable.class, Text.class, job);
    TextInputFormat.addInputPath(job, new Path(args[0]));
    // Read HDFS data
    DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopIF);
Flink Job
    // use Hadoop Mapper as MapFunction
    DataSet<Tuple2<IntWritable, Text>> result = text
        .flatMap(new HadoopMapFunction<LongWritable, Text, IntWritable, Text>(new MapHDFS()))
        .groupBy(0)
        // use Hadoop Reducer
        .reduceGroup(new HadoopReduceFunction<IntWritable, Text, IntWritable, Text>(new ReduceHDFS()));
Flink Job
    // Set up the Hadoop TextOutputFormat
    HadoopOutputFormat<IntWritable, Text> hadoopOF =
        new HadoopOutputFormat<IntWritable, Text>(
            new TextOutputFormat<IntWritable, Text>(), job);
    // Write results back to HDFS
    hadoopOF.getConfiguration().set("mapreduce.output.textoutputformat.separator", " ");
    TextOutputFormat.setOutputPath(job, new Path(args[1]));
    // Emit data using the Hadoop TextOutputFormat
    result.output(hadoopOF).setParallelism(1);
    // Execute code
    env.execute("Hadoop PeakPick");
  }
}
Interim Results
Mapper only: Hadoop 12m25s, Flink 4m50s
Mapper and Reducer: Hadoop 28m32s, Flink 10m20s
Existing code: 35m22s
Near real-time?
Still not fast enough to be called near real-time.
Processing x scans per second: if the cluster were big enough, then maybe…
But the mass spectrometer takes 100 minutes to complete its
processing for one experiment, so in fact we have more than
enough time to process the data if we stream the results and
process the data as it is produced…
Streaming the data
Simulate streaming data using an existing data file and Kafka.
Ingest the data using the Flink Streaming API and process the scans using the existing
mapper code.
[diagram: Existing Data File → Kafka → Flink Streaming API]
Streaming the results
A peptide elutes over a period of time, which means the data from many scans needs
to be compared at the same time.
A safe window to measure the quantity of a peptide is 10 seconds.
[diagram: 10-second window over the scan stream]
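The 10-second window can be sketched without any framework (a hypothetical illustration; the real pipeline would use Flink's windowed streams): assign each scan to a bucket by its retention time, then process each bucket's scans together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of 10-second tumbling windows over scan
// retention times; the real pipeline would use Flink windowing.
public class ScanWindows {
    static Map<Long, List<Double>> window(double[] retentionTimes, double windowSecs) {
        Map<Long, List<Double>> buckets = new TreeMap<>();
        for (double rt : retentionTimes) {
            long bucket = (long) (rt / windowSecs); // window index
            buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(rt);
        }
        return buckets;
    }

    public static void main(String[] args) {
        double[] rts = {1.2, 4.8, 9.9, 10.1, 15.0, 21.7};
        System.out.println(window(rts, 10.0));
    }
}
```

Overlapping (sliding) windows would assign each scan to more than one bucket, which is exactly why the de-duplication step discussed next becomes necessary.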
Interim Results
Overlapping 10-second windows capture 3D peaks from the 2D scans.
Interim Results
Processing the entire scan in the 10-second window means that we don't need the
overlapping window and the de-duplication step.
All this means that the data will be fully pre-processed just
over 10 seconds after the mass spectrometer completes the experiment.
Near real-time?
Stream Processing
To Do list
• Complete stable working system
• Contrast with Spark and Storm
• Hook up previous research on database lookup to create a complete system
• Pay for some EC2 system time to complete testing
• Write a thesis…
Questions?
chillman@dundee.ac.uk
@chillax7
http://www.thequeensarmskensington.co.uk
Editor's Notes
  • #2: Who am I: Teradata; working abroad a lot (time in hotels); part-time PhD at the University of Dundee; stubbornness.
  • #3: Proteomics intro: Genome project → proteomics, studying proteins in cells (treated vs non-treated). Reasons for disease → cures for disease. Complexity, current processing methods and times. Experiment: digesting proteins using an enzyme into smaller pieces called peptides… Full process including database search; area of research is pre-processing, linking to work already done.
  • #4: Peptides detected in a mass spectrometer… Current problem space, description of data files. Issues with lag between experiment and data output; biological and technical replicates; QC data between experiments.
  • #5: 140 GB XML data, 20 billion data points: that's big data to me. Parallel algorithm, tested on various distributed computing environments.
  • #6: Area of research is to write a parallel algorithm for processing the data and test it on different systems: HDFS, Cassandra and HBase for data storage; Hadoop MapReduce, Spark, Storm for processing. Doing some work for the EU commission, and then Flink came along…
  • #7: Because this is a research project, there needs to be a formal approach. Working in IT, you get what works, etc. These three laws are often quoted in conjunction with parallel processing.
  • #9: The experiment cuts the proteins into pieces (called peptides). Peaks are the molecular weight of the protein piece that we are trying to measure.
  • #10: The 3d peak matching is a complex algorithm
  • #11: XML is a poor format for parallel processing. Code to decode the XML as it is copied onto the cluster; write to local disk, HDFS, Cassandra, HBase. Describe fields: scan number, retention time, Base64 fields containing float arrays.
  • #12: Overlapping regions are required to process the reducers. We know how wide a 3D peak can be based on the biological properties of the peptides… De-duplication may be needed.
  • #13: A custom partitioner is needed to balance out the skew and produce the overlapping regions. Beware coding always on a single-node pseudo cluster: no distribution. Beware test datasets: too small can hide O(N²) problems.
  • #14: It works!
  • #15: Quick recap: what has been done so far using Hadoop v2. The map and reduce code produce correct results, and the potential is there to decrease the time taken to process the data. I know how many scans per second can be processed, and the relationship between the size of the reduce mass window and time to completion.
  • #16: The Hadoop MapReduce code works well, reading and writing from HDFS, HBase and Cassandra (writing to Cassandra was an issue). Current widely used software takes around 35 minutes, running on bare metal. Speed was a little disappointing: around 28 minutes to decode the test file, due to virtual machine issues and only 8 map slots. Started looking at Spark, but then, while doing some reviewing work, kept reading about Flink. So I had to try it, and the rest of the presentation is about my experiences with Flink.
  • #17: To begin with, I looked at how to port my existing code.
  • #18: Looking at some code to compare and contrast Hadoop MapReduce with Flink.
  • #19: Hadoop interface versions!! From HDFS, read in a key/value pair and use a string split to parse the tab-separated values in the value part. The Cassandra read uses the Cassandra Hadoop jars to read a Cassandra column family (or table).
  • #20: Hadoop interface versions!! Writing requires concatenating the output values into the value part and writing back to HDFS, which the reducer then reads back in and splits into parts. After all the calculations are done, the reducer concatenates all the results and writes back to HDFS.
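Since a Hadoop Text value carries a single string, the concatenate-and-split round trip described in notes #19–#20 amounts to joining fields with tabs on write and splitting them again on read. A rough sketch (the field set and order are illustrative):

```java
// Sketch: pack 2D-peak fields into one tab-separated value string and
// unpack them again in the reducer; the field order is illustrative.
public class PeakValue {
    public final int scan;
    public final double retentionTime;
    public final double mz;
    public final double intensity;

    public PeakValue(int scan, double retentionTime, double mz, double intensity) {
        this.scan = scan;
        this.retentionTime = retentionTime;
        this.mz = mz;
        this.intensity = intensity;
    }

    // Concatenate the fields for writing back to HDFS.
    public String toTabString() {
        return scan + "\t" + retentionTime + "\t" + mz + "\t" + intensity;
    }

    // Split the value part back into fields on the reduce side.
    public static PeakValue fromTabString(String value) {
        String[] parts = value.split("\t");
        return new PeakValue(Integer.parseInt(parts[0]),
                             Double.parseDouble(parts[1]),
                             Double.parseDouble(parts[2]),
                             Double.parseDouble(parts[3]));
    }
}
```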
  • #21: Flink Code
  • #22: Using the example code, I could drop in my mapper and reducer from the MapReduce code and run straight away… the map test worked perfectly, and it's fast (more later). The reducer ran much faster than expected (but the results were wrong!) (again, more later). Note there is no write out to HDFS between map and reduce for the shuffle.
  • #23: Hadoop interface versions!! Writing requires concatenating the output values into the value part and writing back to HDFS. Straight from the example code, the mapper worked first time, perfectly.
  • #24: Running just the map task is faster in Flink, more so than I expected – especially considering how little effort it took to get the mapper running. The full task with map and reduce was again faster, but… the results were wrong. Slow and correct, or fast and wrong… The issue is to do with my custom partitioner and the way results were sorted – more on that later. By rewriting the controlling job and making small changes to the map and reduce code, I could use data types such as the Tuple8 to store and pass my values. Which is nice…
  • #25: The point of the research was to produce the results in near real-time. Is 10 minutes good enough? What about when the next part of the processing is added to the batch? The current database lookup process can take hours to complete.
  • #26: Use Kafka to simulate streaming the output directly from the mass spec. In reality this would mean a major change by the manufacturer, but it can show the benefits of making such a change. The map code, which processes each scan, is fairly easy to incorporate into a streaming pattern: you just treat each scan as a single unit of data and use exactly the same code as already exists. The reducer is more of a problem, though.
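The "each scan is a single unit" point in note #26 can be illustrated without Kafka itself, using a plain queue as a stand-in for the topic (a sketch only; the real setup would consume from a Kafka topic via Flink's Kafka connector):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

// Sketch: drain a queue of scans, applying the same per-scan map
// function the batch job already uses; the queue stands in for a Kafka topic.
public class ScanStream {
    public static List<String> process(Queue<String> scans, Function<String, String> mapFn) {
        List<String> peaks = new ArrayList<>();
        String scan;
        while ((scan = scans.poll()) != null) {
            peaks.add(mapFn.apply(scan)); // one scan in, one result out
        }
        return peaks;
    }
}
```

Because the map step needs nothing beyond the current scan, it moves from batch to streaming unchanged; it is only the reduce side that has to be rethought.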
  • #27: The reduce code is more complicated
  • #28: Need to process the 2D results in overlapping 10-second windows to find the 3D peaks. A sliding window: 10 seconds is roughly 40 scans, so 40 scans sliding by 1 scan gives me what I need. From the profile of the 3D curve created by matching up the 2D peaks, we can tell whether we have the start and end of the peak in the window, so there is no need for de-duplication. It still seems to be a map and a reduce step, with the mapper decoding each scan and producing peaks, and the reducer aggregating the 2D peaks over time.
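Outside of any particular framework, the window shape in note #28 is simple: roughly 40 scans per 10 seconds, advancing one scan at a time. A stdlib-only sketch of that slicing (the real job would use Flink's windowing API rather than this):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: produce sliding windows of `size` scans advancing by `slide`
// scans, mirroring the ~40-scan / 10-second window described above.
public class SlidingScans {
    public static List<List<Integer>> windows(List<Integer> scans, int size, int slide) {
        List<List<Integer>> out = new ArrayList<>();
        for (int start = 0; start + size <= scans.size(); start += slide) {
            out.add(new ArrayList<>(scans.subList(start, start + size)));
        }
        return out;
    }
}
```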
  • #29: Removing the need to split the 2D scans removes the need for the overlapping window and the de-duplication. The previous method was constrained by the way MapReduce makes you think… we ended up with a complex system of overlapping data and de-duplication just to produce a way of running in parallel. The streaming method is still parallel, but it is each 10-second window that can run on a separate machine, not a mass window… This also removes the problem of having to decide how many mass windows to define, as that determines how many reduce slots are needed.
  • #30: Using Kafka to simulate the streaming of data, and Flink to ingest and process the stream, shows that it would be possible to build an information factory for life-sciences data that could produce fully pre-processed (peak-picked) mass spectrometer data in near real-time.