JEEConf 2015 - Introduction to real-time big data with Apache Spark

Introduction to Real-time
Big Data with Apache Spark

About Me
https://ua.linkedin.com/in/tarasmatyashovsky

Spark
Fast and general-purpose
cluster computing platform
for large-scale data processing

Why Spark?
As of mid-2014,
Spark is the most active Big Data project
[Chart: Contributors per month to Spark]
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east

Time to Sort 100TB

Why Is Spark Faster?
Spark keeps intermediate data in memory, while
Hadoop MapReduce writes results back to disk
after each map/reduce step


Powered by Spark
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html

Core Concepts
Automatically distribute data across the cluster
and
parallelize the operations performed on it
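
The idea of splitting data into partitions and processing them in parallel can be sketched on a single machine. This is a plain Python illustration using only the standard library, not Spark code: the helper names `split_into_partitions` and `parallel_sum_of_squares` are hypothetical, and threads stand in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, one per (hypothetical) worker."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def sum_of_squares(partition):
    """The work each worker does on its own partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    partitions = split_into_partitions(data, num_partitions)
    # Each partition is handed to a separate thread, standing in for a worker.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum_of_squares, partitions))
    # The partial results are combined on the calling ("driver") side.
    return sum(partial_sums)


total = parallel_sum_of_squares(list(range(1, 11)))
print(total)
```

In Spark the partitions live on different machines and the framework schedules the work; here the caller sees only one function call, which is the point of the slide.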

Distributed Application
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html

RDD API
Transformations:
• filter()
• map()
• flatMap()
• distinct()
• union()
• intersection()
• subtract()
• etc.
Actions:
• collect()
• reduce()
• count()
• countByValue()
• first()
• take()
• top()
• etc.
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
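
The split between transformations and actions can be illustrated with a toy class. The sketch below is plain Python, not Spark's API: `MiniRDD` and its methods are hypothetical names that borrow the operation names from the lists above. Transformations only describe a new dataset, while actions run the whole chain and return a value.

```python
import itertools
from functools import reduce as fold


class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source):
        # `source` is a zero-argument callable that returns a fresh iterator.
        self._source = source

    @classmethod
    def of(cls, items):
        items = list(items)
        return cls(lambda: iter(items))

    # Transformations: return a new MiniRDD and compute nothing yet.
    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._source() if predicate(x)))

    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def flat_map(self, fn):
        return MiniRDD(lambda: (y for x in self._source() for y in fn(x)))

    def distinct(self):
        # dict.fromkeys keeps the first occurrence of each element, in order.
        return MiniRDD(lambda: iter(dict.fromkeys(self._source())))

    # Actions: run the whole chain and return a plain Python value.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def reduce(self, fn):
        return fold(fn, self._source())

    def take(self, n):
        return list(itertools.islice(self._source(), n))


lines = MiniRDD.of(["spark is fast", "spark is general", ""])
words = lines.filter(lambda line: line).flat_map(str.split).distinct()

print(words.collect())
print(words.count())
print(words.map(len).reduce(lambda a, b: a + b))
```

Nothing is computed when `words` is defined; each of the three printed actions re-runs the chain from the source, which mirrors how an uncached RDD is recomputed per action.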

Sample Application
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv

Requirements
Analytics about Morning@Lohika events:
• unique participants by companies
• most loyal participants
• participants by position
• etc.

Data Format
Simple CSV files; all fields are optional

First Name  Last Name     Company       Position           Email                              Present
Vladimir    Tsukur        GlobalLogic   Tech/Team Lead     flushdia@gmail.com                 1
Mikalai     Alimenkou     XP Injection  Tech Lead          mikalai.alimenkou@xpinjection.com  1
Taras       Matyashovsky  Lohika        Software Engineer  taras.matyashovsky@gmail.com       0
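
A minimal sketch of how records in this format could be read and turned into two of the required reports, using only the Python standard library. It is not the code from the sample application. The comma delimiter, the header row, and the function names are assumptions, and the last CSV row is invented to show a record with missing optional fields.

```python
import csv
import io
from collections import Counter, defaultdict

# Assumed file layout: comma-separated with a header row. The first three
# rows mirror the slide; the fourth is invented to show missing fields.
SAMPLE_CSV = """First Name,Last Name,Company,Position,Email,Present
Vladimir,Tsukur,GlobalLogic,Tech/Team Lead,flushdia@gmail.com,1
Mikalai,Alimenkou,XP Injection,Tech Lead,mikalai.alimenkou@xpinjection.com,1
Taras,Matyashovsky,Lohika,Software Engineer,taras.matyashovsky@gmail.com,0
Vladimir,Tsukur,GlobalLogic,,,1
"""


def read_participants(text):
    """Parse CSV text into dicts, treating absent fields as empty strings."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({key: (value or "").strip() for key, value in raw.items()})
    return rows


def unique_participants_by_company(rows):
    """Count distinct (first name, last name) pairs per company."""
    people = defaultdict(set)
    for row in rows:
        if row["Company"]:
            people[row["Company"]].add((row["First Name"], row["Last Name"]))
    return {company: len(names) for company, names in people.items()}


def most_loyal(rows, limit=3):
    """Rank participants by the number of events they actually attended."""
    visits = Counter(
        (row["First Name"], row["Last Name"])
        for row in rows
        if row["Present"] == "1"
    )
    return visits.most_common(limit)


rows = read_participants(SAMPLE_CSV)
print(unique_participants_by_company(rows))
print(most_loyal(rows, limit=1))
```

Because every field is optional, the reports skip rows with an empty company and count attendance only where `Present` is `1`.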

Demo Time

Demo Explained
[Diagram: Driver (Spark Context), Cluster Manager, and two Workers, each running an Executor with Tasks]
http://spark.apache.org/docs/latest/cluster-overview.html

Spark SQL
Structured data processing

Data Frame
Distributed collection of data
organized into named columns

Data Frame API
• selecting columns
• joining different data sources
• aggregation, e.g. sum, count, average
• filtering
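
The four kinds of operations listed above can be shown on rows with named columns. The sketch below uses plain Python lists of dicts rather than Spark's Data Frame API; the helper names (`select`, `where`, `join`, `aggregate`) and all sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; names and numbers are illustrative only.
participants = [
    {"name": "Ann", "company_id": 1, "visits": 3},
    {"name": "Bob", "company_id": 1, "visits": 1},
    {"name": "Eve", "company_id": 2, "visits": 4},
]
companies = [
    {"company_id": 1, "company": "Lohika"},
    {"company_id": 2, "company": "GlobalLogic"},
]


def select(rows, *columns):
    """Keep only the named columns."""
    return [{column: row[column] for column in columns} for row in rows]


def where(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Inner join of two row lists on a shared column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def aggregate(rows, key, column):
    """Group by `key` and compute sum, count and average of `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {
        group: {"sum": sum(vals), "count": len(vals), "average": sum(vals) / len(vals)}
        for group, vals in groups.items()
    }


joined = join(participants, companies, "company_id")
frequent = select(where(joined, lambda row: row["visits"] >= 2), "name", "company")
stats = aggregate(joined, "company", "visits")
print(frequent)
print(stats)
```

With a real Data Frame the same steps are declared against column names and left to the engine to plan and execute across the cluster.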

Plan Optimization & Execution
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf

Faster than RDD
http://www.slideshare.net/databricks/spark-sqlsse2015public

https://spark.apache.org/docs/latest/tuning.html

Product
Cloud-based analytics application

Use Cases
• supplement Neo4j database used to
store/query big dimensions
• supplement RDBMS for querying of
high volumes of data

Use Cases
• represent existing computational graph
as flow of Spark-based operations
• predictive analytics based on Spark
MLlib component

Lessons Learned
• Spark simplicity is deceptive
• Each use case is unique
• Be really aware:
  • Databricks blog
  • Mailing lists & Jira
  • Pull requests
Spark is kind of magic

http://www.techrepublic.com/article/can-anything-dim-apache-spark/

Project Tungsten
• the largest change to Spark’s execution
engine since the project’s inception
• focuses on substantially improving the
efficiency of memory and CPU for
Spark applications
• sun.misc.Unsafe
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

Thank you!
Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/

References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/
