"BIG DATA" BIOINFORMATICS
MODELS FOR DISTRIBUTED, PARALLEL AND CONCURRENT PROCESSING
Brian Repko
December 4, 2014
AGENDA
What does this have to do with me?
High Performance Computing models
Traditional Software Engineering models
Distributed Systems Architecture
WHAT DOES THIS HAVE TO DO WITH ME?
Existing tools designed as single flow / single machine
Programming languages have limits (R, Python)
CPUs moving to multi-core
Single flow can't make use of this
Some processes hit CPU/memory issues
Data volume in bioinformatics
4 "V"s of "Big Data"
Volume, Velocity, Variety and Veracity
Limits on single node CPU / memory
TERMINOLOGY
Concurrent - dealing with a lot of things at once
Parallel - doing a lot of things at once
Why parallelize / use concurrency?
Throughput and Responsiveness
Distributed - anything non-shared
Why distribute?
Scalability and Availability
Goal is concurrent, distributed, resilient, simple
HPC MODELS
Von Neumann Machine
OpenMP
MPI
General Purpose GPU
Beyond HPC (HTC/MTC)
VON NEUMANN MACHINE
John von Neumann 1945
CPU, Memory, I/O
Program AND data in memory
SISD, SIMD, MIMD models
Multi-levels/types of cache, UMA/ccNUMA
Threading, pipelining, vectorization
OPENMP (MULTI-PROCESSING)
Shared memory programming model
Based on threads, fork/join and a known "memory model"
C/C++ pragmas, Fortran compiler directives
OpenMP directive categories:
control, work-sharing, data visibility, synchronization and context / environment
Pros/Cons with use of directives
Optimizations very architecture-specific
Can be difficult to get correct
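OpenMP itself is expressed as C/C++ pragmas or Fortran directives, so the following Python sketch only mirrors the fork/join and work-sharing shape, using the standard library's `concurrent.futures`. The function names (`gc_count`, `fork_join_gc`) and the GC-counting task are assumptions made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_count(chunk):
    """Work done by one thread: count G and C bases in its chunk."""
    return sum(1 for base in chunk if base in "GC")

def fork_join_gc(sequence, n_workers=4):
    # Work-sharing: split the sequence into roughly equal chunks.
    size = max(1, -(-len(sequence) // n_workers))  # ceiling division
    chunks = [sequence[i:i + size] for i in range(0, len(sequence), size)]
    # Fork: hand the chunks to a pool of threads.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(gc_count, chunks))
    # Join: leaving the "with" block waits for all threads, then we reduce.
    return sum(partials)

total = fork_join_gc("ACGTGGCCATAT" * 1000)
```

In CPython the global interpreter lock means these threads may not actually run on several cores at once; the sketch shows the structure of fork/join, not a speed-up.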
MESSAGE PASSING INTERFACE
Distributed memory programming model
Single-Program Multiple-Data (SPMD)
P2P and Broadcast, Synch and Asynch
Datatypes for message content
Communicators / network topology
Program instance gets rank / size
Rank 0 typically a coordinator / reducer
All others do work and send result to rank 0
Dynamic process management in MPI-2
Hybrid models with OpenMP
Bioinformatics example: Trinity (transcript assembly) on Cray
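The rank / size pattern can be sketched without any MPI library. The Python below is a single-process simulation under stated assumptions: `local_work` and `run_spmd` are hypothetical names, each "rank" is just a loop iteration, and rank 0 also takes a share of the work. A real MPI program would launch separate processes and exchange the partial results as messages.

```python
def local_work(rank, size, data):
    """What one program instance would compute: a strided slice of the data."""
    return sum(x * x for x in data[rank::size])

def run_spmd(size, data):
    # Every "rank" runs the same function on a different part of the data (SPMD).
    results = [local_work(rank, size, data) for rank in range(size)]
    # Rank 0 plays coordinator / reducer: it combines all the partial results.
    return sum(results)

data = list(range(10))
total = run_spmd(4, data)
```

Because the slices `data[rank::size]` are disjoint and cover the whole list, the reduced result does not depend on how many ranks are used.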
GENERAL PURPOSE GPU
OpenCL (computing language)
CUDA (specific to NVIDIA)
Standard library and device-specific driver (ICD)
Kernel routine written in OpenCL C
Global / Work-group / Work-item memory model
Devices are sent streams of work-groups
Kernel runs in parallel on work-items
Very useful together with OpenGL
Extension of this idea to custom chips (ASIC/FPGA)
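The work-group / work-item indexing can be illustrated with a sequential emulation in plain Python. This is a sketch only: `launch` and `vector_add_kernel` are hypothetical names, not part of OpenCL or CUDA, and on a real device the work-items of a group run in parallel rather than in a loop.

```python
def vector_add_kernel(global_id, a, b, out):
    """Kernel body: each work-item handles exactly one element."""
    out[global_id] = a[global_id] + b[global_id]

def launch(kernel, global_size, local_size, *args):
    # The global index space is split into work-groups of local_size items.
    n_groups = -(-global_size // local_size)  # ceiling division
    for group_id in range(n_groups):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            if global_id < global_size:  # guard the last, partial group
                kernel(global_id, *args)

a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
out = [0] * len(a)
launch(vector_add_kernel, len(a), 2, a, b, out)
```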
BEYOND HPC (HTC/MTC)
High-Throughput Computing
Long-running, parallel jobs
Many-Task Computing
Mix of job size, Workflows
Cluster/Grid Computing (Grid Engine)
Lots of workflow solutions
YAP (MPI)
Swift scripting language
bpipe (workflow DSL)
celery / gridmap (Python)
Process-level failure / error handling
TRADITIONAL ENGINEERING MODELS
Threads, Locks and Fork/Join
Functional Programming
Communicating Sequential Processes (CSP)
Actors
THREADS, LOCKS AND FORK/JOIN
These are the general terms
OpenMP is a particular style (via compiler directives)
Support varies by programming language
May or may not use multiple cores
For C, choose OpenMP or pthreads
Concurrency model (non-deterministic)
Difficult to get correct
The problem is shared mutable state
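A minimal sketch of shared mutable state guarded by a lock, using Python's standard `threading` module. The `Counter` class and `hammer` function are hypothetical names chosen for illustration.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write below is the critical section.
        with self._lock:
            self.value += 1

def hammer(counter, n):
    for _ in range(n):
        counter.increment()

counter = Counter()
threads = [threading.Thread(target=hammer, args=(counter, 10000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two threads can read the same old value and one update can be lost; with it, every increment is applied, at the cost of serializing the critical section.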
FUNCTIONAL PROGRAMMING
Alonzo Church, 1930s
Lambda Calculus
System for maths / computation
Declarative (vs Imperative)
Computation is the evaluation of functions
Avoids mutable state
Haskell (pure), Lisp (Scheme, Clojure, Common Lisp), Erlang, ML (OCaml), Javascript, C#, F#, Groovy, Scala, Java 8, Python, R, Julia,...
FUNCTIONAL PROGRAMMING
First-class functions
Higher-order functions
Pure functions (no side effects)
Referential transparency and beta-reduction in any order including parallel
(Tail) recursion, partial functions, currying
Strict (eager) vs non-strict (lazy) evaluation
Typed or Untyped - Category theory when typed
Software Transactional Memory (Clojure)
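Several of these ideas fit in a few lines of Python. The sketch below uses only the standard library; `compose`, `scale` and the other names are assumptions made up for illustration, and `functools.partial` shows partial application rather than true currying.

```python
from functools import partial, reduce
from itertools import count, islice

def compose(f, g):
    """Higher-order function: takes two functions, returns a new one."""
    return lambda x: f(g(x))

def scale(factor, x):
    """Pure function: the result depends only on the arguments."""
    return factor * x

double = partial(scale, 2)          # partial application of the first argument
add_one = lambda x: x + 1           # functions are first-class values
double_then_add_one = compose(add_one, double)

# Non-strict (lazy) evaluation: count() is infinite, islice takes what it needs.
first_five = list(islice(map(double_then_add_one, count(1)), 5))
total = reduce(lambda acc, x: acc + x, first_five, 0)
```

Because every function here is pure, the five elements could be computed in any order, or in parallel, and give the same result.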
FUNCTIONAL PROGRAMMING
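# Imperative: loop over the tokens and mutate an accumulator
expr = "28+32+++32++39"
res = 0
for t in expr.split("+"):
    if t != "":
        res += int(t)
print(res)

# Functional: compose filter, map and reduce; no mutable state
from functools import reduce
from operator import add
expr = "28+32+++32++39"
is_non_blank = lambda t: t != ""
print(reduce(add, map(int, filter(is_non_blank, expr.split("+")))))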
COMMUNICATING SEQUENTIAL PROCESSES
Tony Hoare 1978
One of multiple process calculi
Verifiable lack of deadlocks
Avoids shared state
Synchronous message passing via shared channels
Concurrently executing elements - send / receive
Functions can use and return channels
Implemented in Ada, Go and Clojure core.async
Distribution is possible but difficult
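A rough sketch of processes communicating over channels, using threads and `queue.Queue` from Python's standard library. This is an approximation under stated assumptions: a queue with `maxsize=1` is a one-slot buffer, not the fully synchronous rendezvous of CSP, and the names `producer`, `squarer` and `DONE` are hypothetical.

```python
import threading
import queue

DONE = object()  # sentinel marking the end of the stream

def producer(out_ch, items):
    for item in items:
        out_ch.put(item)      # blocks while the channel is full
    out_ch.put(DONE)

def squarer(in_ch, out_ch):
    while True:
        item = in_ch.get()    # blocks until a message arrives
        if item is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(item * item)

numbers = queue.Queue(maxsize=1)
squares = queue.Queue(maxsize=1)
threading.Thread(target=producer, args=(numbers, [1, 2, 3, 4])).start()
threading.Thread(target=squarer, args=(numbers, squares)).start()

received = []
while True:
    item = squares.get()
    if item is DONE:
        break
    received.append(item)
```

The two workers share no variables; the only coupling between them is the channel each one is handed.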
ACTORS
Carl Hewitt 1973
Avoids shared state (share nothing!)
Actor (the processing element) has
an identity, non-shared state, and a mailbox
asynchronous messaging
Actors can
do work, send messages, and create other actors
Built into some programming languages - Erlang, Scala
Frameworks available for almost all languages - Akka
Concurrency and (somewhat easier) Distribution
Bioinformatics example - Ray Meta (MPI) to BioSAL/Thorium
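A toy actor can be sketched with a thread and a queue from Python's standard library. This is not how Erlang or Akka implement actors; `CounterActor` and its message names are assumptions made up for illustration.

```python
import threading
import queue

class CounterActor:
    """A toy actor: private state, a mailbox, and one thread that drains it."""

    def __init__(self):
        self._count = 0                      # non-shared state
        self._mailbox = queue.Queue()        # asynchronous messages land here
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message, reply_to=None):
        # Senders never touch the state; they only enqueue a message.
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                return
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)

    def join(self):
        self._thread.join()

actor = CounterActor()
for _ in range(3):
    actor.send("increment")
reply = queue.Queue()
actor.send("get", reply_to=reply)
current = reply.get()
actor.send("stop")
actor.join()
```

Only the actor's own thread ever reads or writes `_count`, so no lock is needed; the mailbox is the single point of contact.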
DISTRIBUTED SYSTEM ARCHITECTURE
Distributed Storage (Filesystems/NoSQL)
Hadoop
Map-Reduce
Apache Spark
Lambda Architecture
DISTRIBUTED STORAGE (FS/NOSQL)
Filesystems
Lustre, GlusterFS (Red Hat), OneFS (Isilon)
Hadoop HDFS
Tachyon
NoSQL / NewSQL
Distribution one of the main reasons for NoSQL
Key-value (Dynamo, Redis, Riak, Voldemort)
Document (MongoDB, Couchbase)
Column (Cassandra, Accumulo, HBase, Vertica)
Graph (Neo4j, AllegroGraph, InfiniteGraph, OrientDB)
Relational (NuoDB, Teradata)
These all have to deal with standard distribution problems
HADOOP
Distributed storage AND computing
HDFS (file system storage)
NameNodes and DataNodes
Map-Reduce (computing model)
JobTrackers and TaskTrackers
Hadoop "ecosystem":
HBase (NoSQL DB), Parquet (columnar file format)
Pig (Hadoop job DSL / scripting)
Hadwrap (scripting / workflow)
Hive (data warehouse)
Drill or Impala (SQL query engine)
Sqoop (ETL - DB to Hadoop)
MAP-REDUCE
A Map-Reduce job has
an input data-set and an output directory
a mapper, reducer and optional combiner (classes)
all classes get and produce kv-pairs
The job runs as
1. The input is converted to kv-pairs (for plain text input: key=byte offset of the line, value=text)
2. Mapper gets and processes kv-pairs (data locality)
3. Sort/Shuffle phase on mapper output
4. Reducers get all kv-pairs for a given key (sorted)
5. Reducer output is stored in the output directory
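The five steps can be traced in a single-process Python sketch. It does not use Hadoop; `mapper`, `reducer` and `run_job` are hypothetical names, the job is a word count chosen for illustration, and the toy uses the line index as the input key.

```python
from collections import defaultdict

def mapper(key, value):
    """Emit (word, 1) for every word in one line of text."""
    for word in value.split():
        yield word.lower(), 1

def reducer(key, values):
    """Sum all counts for one word."""
    yield key, sum(values)

def run_job(lines):
    # 1. Input becomes kv-pairs; this toy uses the line index as the key.
    records = enumerate(lines)
    # 2. Map phase.
    mapped = [kv for k, v in records for kv in mapper(k, v)]
    # 3. Sort/shuffle: group values by key.
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # 4-5. Each reducer call sees one key with all its values, in key order.
    return dict(kv for k in sorted(groups) for kv in reducer(k, groups[k]))

counts = run_job(["A C G", "c g", "G"])
```

On a cluster, step 2 runs where the data blocks live and step 3 moves data across the network, which is usually the expensive part.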
APACHE SPARK
New computing model
RDD - Resilient Distributed Datasets
Read-only, distributed collection of objects
Stored on disk (HDFS/Cassandra) or in-memory
Memory usage on shared clusters can be an issue
Computation is transformations and actions on RDDs
Transformations convert data or RDD into an RDD
Actions convert RDD to object / data
Functional programming (immutability) paradigm
DAG (lineage) for how RDD was built (scheduling, failover)
Lazy evaluation of tasks
Actor-based (Akka) distribution of code
Often faster and more expressive than Hadoop Map-Reduce
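The split between lazy transformations and actions, and the role of lineage, can be sketched with a toy class in plain Python. `ToyRDD` is a made-up name and this is not the Spark API: there is no distribution, partitioning or caching here, only the idea that transformations record steps and actions replay them.

```python
class ToyRDD:
    """A read-only dataset that records its lineage instead of computing eagerly."""

    def __init__(self, source, lineage=()):
        self._source = source        # the original data
        self._lineage = lineage      # tuple of (kind, function) steps

    # Transformations return a new ToyRDD; nothing runs yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the lineage; the same replay could rebuild lost results.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions trigger the computation and return plain data.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

reads = ToyRDD(["ACGT", "GG", "ACGTACGT", "T"])
long_reads = reads.filter(lambda r: len(r) >= 4)
lengths = long_reads.map(len)
```

Nothing is evaluated until `lengths.collect()` or `long_reads.count()` is called, and `reads` itself is never modified.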
APACHE SPARK
Sub-projects
Spark SQL (was "Shark" - Spark for Hive)
GraphX (graph data)
MLlib (machine learning)
Spark Streaming (events)
Multiple languages - Java 8, Scala, Python, R
Bioinformatics examples:
gData Integration and Analytics Layers
ADAM (AMPlab and bdgenomics.org)
Why is Bioinformatics a Good Fit for Spark?
Real-time Image Processing and Analytics using Spark
Why Apache Spark is a Crossover Hit for Data Scientists
LAMBDA ARCHITECTURE
Slow batch layer for all data (Hadoop)
Fast speed layer for latest data (updating)
Serving layer for queries based on both layers
Raw data is immutable facts (append-only)
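The three layers can be sketched in memory with plain Python. `LambdaCounts` and its method names are assumptions made up for illustration; a real deployment would put the batch layer on something like Hadoop and the speed layer on a streaming system.

```python
from collections import Counter

class LambdaCounts:
    def __init__(self):
        self.master = []              # immutable, append-only facts
        self.batch_view = Counter()   # rebuilt from all data, slowly
        self.speed_view = Counter()   # updated per event since the last batch

    def ingest(self, sample_id):
        self.master.append(sample_id)
        self.speed_view[sample_id] += 1

    def run_batch(self):
        # Recompute from scratch over every fact, then reset the speed layer.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, sample_id):
        # Serving layer: merge both views.
        return self.batch_view[sample_id] + self.speed_view[sample_id]

store = LambdaCounts()
for sample in ["s1", "s1", "s2"]:
    store.ingest(sample)
store.run_batch()
store.ingest("s1")          # arrives after the batch run
```

A query for `"s1"` combines the two counts from the batch view with the one newer event held in the speed view.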
WHAT DOES THIS HAVE TO DO WITH ME (AGAIN)?
Bioinformatics has/will have a volume constraint
Some algorithms have a CPU constraint
For volume, move to distributed data
For computation on distributed data
Parallelize over standard data partitions (position, samples)
Distribute that computation
Actors > MPI and Spark > Map-Reduce
Functional programming as an algorithm goal
Additional advantages with Apache Spark
Availability of intermediate processing steps for workflows
Availability of graph and ML algorithms
THANK YOU! QUESTIONS?
Special thanks to Ken Robbins, Dave Tester, Steve Litster, Nick Holway, Timothy Danford, Laurent Gautier and Jason Calvert
OpenMP/MPI/Hadoop/Spark is available in
SciComp/DataEng clusters
What are your data / computation challenges?
