High Performance Processing of
Streaming Data
Workshops on Dynamic Data Driven Applications Systems (DDDAS), in conjunction with the 22nd International Conference on
High Performance Computing (HiPC), Bengaluru, India
Supun Kamburugamuve, Saliya Ekanayake, Milinda Pathirage
and Geoffrey Fox, December 16, 2015
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
Department of Intelligent Systems Engineering
School of Informatics and Computing, Digital Science Center
Indiana University Bloomington
Software Philosophy
• We use the concept of HPC-ABDS, the High Performance Computing enhanced Apache Big Data Software Stack, illustrated on the next slide.
• HPC-ABDS is a collection of 350 software systems used in either HPC or best-practice Big Data applications. The latter include Apache, other open-source and commercial systems
• HPC-ABDS helps ABDS by allowing HPC to add performance to ABDS software systems
• HPC-ABDS helps HPC by bringing the rich functionality and software sustainability model of commercial and open-source software. These bring a large community and expertise that is reasonably easy to find, as it is broadly taught both in traditional courses and by community activities such as Meetup groups, where for example:
– Apache Spark 107,000 meet-up members in 233 groups
– Hadoop 40,000 and installed in 32% of company data systems 2013
– Apache Storm 9,400 members
• This talk focuses on Storm: its use and how one can add high performance
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad,
Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA),
Jitterbit, Talend, Pentaho, Apatar, Docker Compose
16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, Azure Machine
Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM
Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana
Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud
Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero,
OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq,
Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream
Analytics, Floe
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Hama,
Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ,
NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub,
Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal
Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB,
Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J,
Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm,
Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat,
Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes,
Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,
Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages (May 15, 2015). Green implies HPC integration.
High Performance Computing Apache Big Data Software Stack
IOTCloud
• Device → Pub-Sub → Storm → Datastore → Data Analysis
• Apache Storm provides a scalable distributed system for processing data streams coming from devices in real time.
• For example, the Storm layer can decide to store the data in cloud storage for further analysis or to send control data back to the devices
• Evaluating Pub-Sub systems: ActiveMQ, RabbitMQ, Kafka, Kestrel
Turtlebot and Kinect
6 Forms of MapReduce cover "all" circumstances
Describes different aspects:
- Problem
- Machine
- Software
If these different aspects match, one gets good performance
Cloud controlled Robot Data Pipeline
Pipeline: Gateway sending to pub-sub → Message brokers (RabbitMQ, Kafka) → Streaming workflows (a stream application with some tasks running in parallel; multiple streaming workflows) → Persisting to storage
Apache Storm
Apache Storm comes from Twitter and supports a Map-Dataflow-Streaming computing model
Key ideas: Pub-Sub, fault tolerance (Zookeeper), Bolts, Spouts (a minimal topology sketch is given below)
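A minimal sketch of such a Storm streaming workflow in code (the class and component names are illustrative, not from the IoTCloud code base, and the org.apache.storm package names follow newer Storm releases rather than the 2015-era backtype.storm namespace):

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import java.util.Map;

public class SensorTopology {
  // Spout: source of the stream; a real deployment would read from a pub-sub broker
  public static class SensorSpout extends BaseRichSpout {
    private SpoutOutputCollector out;
    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector out) { this.out = out; }
    public void nextTuple() { out.emit(new Values(System.currentTimeMillis())); }
    public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("ts")); }
  }
  // Bolt: one parallel processing stage of the streaming workflow
  public static class ProcessBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      long latencyMs = System.currentTimeMillis() - tuple.getLong(0);
      System.out.println("latency ms: " + latencyMs);
      // store to a datastore or send control data back to the device here
    }
    public void declareOutputFields(OutputFieldsDeclarer d) { }
  }
  public static void main(String[] args) throws Exception {
    TopologyBuilder b = new TopologyBuilder();
    b.setSpout("sensor", new SensorSpout(), 1);
    b.setBolt("process", new ProcessBolt(), 4).shuffleGrouping("sensor"); // 4 parallel tasks
    new LocalCluster().submitTopology("iot-demo", new Config(), b.createTopology());
  }
}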
Simultaneous Localization & Mapping (SLAM)
š‘(š‘„1:š‘”, š‘š|š‘§1:š‘”, š‘¢1:š‘”āˆ’1) =
š‘ š‘š š‘„1:š‘”, š‘§1:š‘” š‘(š‘„1:š‘”|š‘§1:š‘”, š‘¢1:š‘”āˆ’1
Particles are distributed in parallel tasks
Application: Build a map given the distance measurements from the robot to objects around it, and its pose
Streaming Workflow: Rao-Blackwellized particle filtering based algorithm for SLAM. Distribute the particles across parallel tasks and compute in parallel. Map building happens periodically
Parallel SLAM Simultaneous Localization and
Mapping by Particle Filtering
Speedup
Robot Latency Kafka & RabbitMQ
Kinect with Turtlebot and RabbitMQ; RabbitMQ versus Kafka
SLAM Latency variations for 4 or 20 way parallelism
Jitter due to application or system influences such as network delays, garbage collection and scheduling of tasks
No cut; fluctuations decrease after a cut on #iterations per swarm member
Fault Tolerance at Message Broker
• RabbitMQ supports Queue replication and persistence to
disk across nodes for fault tolerance
• Can use a cluster of RabbitMQ brokers to achieve high
availability and fault tolerance
• Kafka stores the messages on disk and supports replication of topics across nodes for fault tolerance. Kafka's storage-first approach may increase reliability but can introduce increased latency
• Multiple Kafka brokers can be used to achieve high
availability and fault tolerance
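As an illustration of the RabbitMQ side, a durable queue with persistent messages can be declared from the Java client roughly as follows (host and queue names are made up; queue mirroring across the broker cluster is normally set through a broker policy rather than client code):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class DurableQueueExample {
  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("broker-node-1");               // one node of the RabbitMQ cluster
    Connection conn = factory.newConnection();
    Channel channel = conn.createChannel();
    // durable = true: the queue definition survives a broker restart
    channel.queueDeclare("robot-frames", true, false, false, null);
    // PERSISTENT_BASIC: the message is written to disk, trading latency for reliability
    channel.basicPublish("", "robot-frames",
        MessageProperties.PERSISTENT_BASIC, "frame-bytes".getBytes());
    conn.close();
  }
}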
Parallel Overheads SLAM Simultaneous Localization
and Mapping: I/O and Garbage Collection
Parallel Overheads SLAM Simultaneous Localization
and Mapping: Load Imbalance Overhead
Multi-Robot Collision Avoidance
Streaming Workflow
Information
from robots
Runs in
parallel
• Second parallel Storm application
• Velocity Obstacles (VOs) along with other constraints such as acceleration and max velocity limits,
• Non-Holonomic constraints, for differential robots, and localization uncertainty.
• NPC and NPS measure parallelism
Control latency and # collisions versus number of robots
Lessons from using Storm
• We successfully parallelized Storm as core software of two
robot planning applications
• We needed to replace Kafka by RabbitMQ to improve
performance
– Kafka had large variations in response time
• We reduced Garbage Collection overheads
• We see that we need to generalize Storm’s
– Map-Dataflow Streaming architecture to
– Map-Dataflow/Collective Streaming architecture
• Now we use HPC-ABDS to improve Storm communication
performance
Bringing Optimal Communications to Storm
Both process-based and thread-based parallelism are used
Worker and Task distribution of Storm
A worker hosts multiple tasks. B-1 is a
task of component B and W-1 is a task
of W
Communication links are
between workers
These are multiplexed among
the tasks
[Diagram: two nodes with two workers each, hosting task B-1 and tasks W-1 through W-7; a worker/task configuration sketch is given below]
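For reference, the worker/task split in the diagram corresponds to settings like these when a topology is built (the numbers are illustrative, and the spout/bolt classes reuse the earlier sketch):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class WorkerTaskConfig {
  public static void main(String[] args) throws Exception {
    Config conf = new Config();
    conf.setNumWorkers(4);  // 4 worker processes, e.g. 2 on each of 2 nodes

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("B", new SensorTopology.SensorSpout(), 1);   // task B-1
    builder.setBolt("W", new SensorTopology.ProcessBolt(), 7)     // tasks W-1 ... W-7
           .setNumTasks(7)                                        // one task per executor
           .shuffleGrouping("B");
    StormSubmitter.submitTopology("worker-task-demo", conf, builder.createTopology());
  }
}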
Memory Mapped File based
Communication
• Inter-process communication using shared memory for a single node
• Multiple writer, single reader design
• A memory mapped file is created for each worker of a node
• Create the file under /dev/shm
• Writer breaks the message into packets and writes them to the file
• Reader reads the packets and assembles the message
• When a file becomes full, move to another file
• PS: all of this is "well known" BUT not deployed (a minimal writer sketch follows)
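A minimal sketch of the writer side using Java NIO (the file name, mapping size, and single-packet layout are simplified assumptions, not the actual IoTCloud implementation):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWriter {
  public static void main(String[] args) throws Exception {
    // One memory-mapped file per worker, placed under /dev/shm so it stays in RAM
    try (RandomAccessFile file = new RandomAccessFile("/dev/shm/worker-1.data", "rw");
         FileChannel channel = file.getChannel()) {
      MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64L * 1024 * 1024);

      byte[] message = "hello from writer".getBytes();
      // Writer: break the message into packets; here it fits in a single packet
      buf.putInt(message.length);   // simple header: content length
      buf.put(message);             // packet body
      // A reader that maps the same file polls the length field and assembles the message
    }
  }
}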
Optimized Broadcast Algorithms
• Binary tree
– Workers arranged in a binary tree
• Flat tree
– Broadcast from the origin to 1 worker in each node sequentially. This worker broadcasts to the other workers in the node sequentially
• Bidirectional Rings
– Workers arranged in a line
– Starts two broadcasts from the origin and these traverse half of the line
• All well known, and we have used similar ideas of basic HPC-ABDS to improve MPI for machine learning (using Java); a sketch of the binary-tree fan-out is given below
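As a sketch of the first scheme, the children of a worker in a binary-tree broadcast can be computed from its rank (ranking the workers 0..n-1 from the origin is an assumption used only to illustrate the fan-out):

import java.util.ArrayList;
import java.util.List;

public class BinaryTreeBroadcast {
  // Children of worker `rank` in a binary tree rooted at worker 0 (the origin)
  static List<Integer> children(int rank, int numWorkers) {
    List<Integer> kids = new ArrayList<>();
    int left = 2 * rank + 1, right = 2 * rank + 2;
    if (left < numWorkers) kids.add(left);
    if (right < numWorkers) kids.add(right);
    return kids;
  }

  public static void main(String[] args) {
    // With 7 workers: 0 forwards to 1 and 2, 1 forwards to 3 and 4, and so on
    for (int rank = 0; rank < 7; rank++) {
      System.out.println(rank + " -> " + children(rank, 7));
    }
  }
}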
Java MPI performs better than Threads I
128 24-core Haswell nodes with Java Machine Learning
Default MPI much worse than threads
Optimized MPI using shared memory node-based messaging is much better
than threads
Java MPI performs better than Threads II
128 24-core Haswell nodes
200K Dataset Speedup
Speedups show classic parallel computing structure
with 48 node single core as ā€œsequentialā€
State-of-the-art dimension reduction routine
Speedups improve as problem size increases
48 nodes × 1 core to 128 nodes × 24 cores is a potential speedup of 64
Experimental Configuration
• 11 Node cluster
• 1 Node – Nimbus & ZooKeeper
• 1 Node – RabbitMQ
• 1 Node – Client
• 8 Nodes – Supervisors with 4 workers each
• Client sends messages with the current timestamp; the topology returns a response with the same timestamp. Latency = current time − timestamp
[Diagram: test topology with the Client, RabbitMQ brokers at both ends, and tasks R-1, B-1, W-1 … W-n, G-1]
Original and new Storm Broadcast Algorithms: Original, Binary Tree, Flat Tree, Bidirectional Ring
Speedup of latency with both TCP-based and shared-memory-based communications for different algorithms and sizes
Future Work
• Memory mapped communications require continuous polling by a thread. If this thread does the processing of the message, the polling overhead can be reduced.
• Scheduling of tasks should take the communications into account
• The current processing model has multiple threads processing a message at different stages. Reduce the number of threads to achieve predictable performance
• Improve the packet structure to reduce the overhead
• Compare with related Java MPI technology
• Add additional collectives to those supported by Storm
Conclusions on initial HPC-ABDS
use in Apache Storm
• Apache Storm worked well with performance enhancements
• The binary tree algorithm performed the best
• The algorithms reduce network traffic
• Shared memory communications reduce the latency further
• Memory mapped file communications improve performance
Thank You
• References
– Our software https://github.com/iotcloud
– Apache Storm http://storm.apache.org/
– We will donate software to Storm
– SLAM paper http://dsc.soic.indiana.edu/publications/SLAM_In_the_cloud.pdf
– Collision Avoidance paper http://goo.gl/xdB8LZ
Spare SLAM Slides
• IoTCloud uses Zookeeper,
Storm, Hbase, RabbitMQ
for robot cloud control
• Focus on high performance
(parallel) control functions
• Guaranteed real time
response
Parallel simultaneous localization and mapping (SLAM) in the cloud
Latency with RabbitMQ and latency with Kafka for different message sizes in bytes
Note the change in scales for latency and message size
Robot Latency Kafka & RabbitMQ
Kinect with Turtlebot and RabbitMQ; RabbitMQ versus Kafka
Parallel SLAM Simultaneous Localization
and Mapping by Particle Filtering
Spare High Performance
Storm Slides
Memory Mapped Communication
Multiple writers append packets to a shared file; each writer obtains the write location atomically and increments it
The reader reads packet by packet sequentially
A new file is used when the file size limit is reached
The reader deletes the files after it reads them fully
Packet structure (field widths in bytes):
ID (16), No of Packets (4), Packet No (4), Dest Task (4), Content Length (4), Source Task (4), Stream Length (4), Stream, Content; a serialization sketch follows
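A sketch of how a packet with this layout could be encoded with a ByteBuffer (the helper name and the exact 16-byte ID handling are assumptions; only the field order and widths come from the layout above):

import java.nio.ByteBuffer;

public class PacketHeader {
  static final int HEADER_BYTES = 16 + 4 + 4 + 4 + 4 + 4 + 4;  // fixed-width fields above

  static ByteBuffer encode(byte[] id16, int numPackets, int packetNo,
                           int destTask, int srcTask, byte[] stream, byte[] content) {
    ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + stream.length + content.length);
    buf.put(id16, 0, 16);          // 16-byte message ID
    buf.putInt(numPackets);        // number of packets in the message
    buf.putInt(packetNo);          // index of this packet
    buf.putInt(destTask);          // destination task id
    buf.putInt(content.length);    // content length
    buf.putInt(srcTask);           // source task id
    buf.putInt(stream.length);     // stream (name) length
    buf.put(stream).put(content);  // variable-length stream and content fields
    buf.flip();                    // ready for a reader or for copying into the shared file
    return buf;
  }
}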
Default Broadcast
[Diagram: two nodes with two workers each, hosting task B-1 and tasks W-1 through W-7]
When B-1 wants to broadcast a message to W, it sends 6 messages through 3 TCP communication channels and sends 1 message to W-1 via shared memory
Memory Mapped Communication
No significant difference beyond 30 workers because we are using all the worker capacity in the cluster
A topology with a pipeline going through all the workers
Non-optimized time
Spare Parallel Tweet
Clustering with Storm Slides
Parallel Tweet Clustering with Storm
• Judy Qiu, Emilio Ferrara and Xiaoming Gao
• Storm Bolts coordinated by ActiveMQ to synchronize
parallel cluster center updates – add loops to Storm
• 2 million streaming tweets processed in 40 minutes;
35,000 clusters
Sequential vs. Parallel – eventually 10,000 bolts
Parallel Tweet Clustering with Storm
• Speedup on up to 96 bolts on two clusters, Moe and Madrid
• Red curve is the old algorithm; green and blue are the new algorithm
• Full Twitter – 1,000-way parallelism
• Full Everything – 10,000-way parallelism