Introduction to Real-time
Big Data with Apache Spark
About Me
https://ua.linkedin.com/in/tarasmatyashovsky
Spark
Fast and general-purpose
cluster computing platform
for large-scale data processing
Why Spark?
As of mid 2014,
Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Contributors per month to Spark
History
Time to Sort 100TB
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Why is Spark Faster?
Spark keeps intermediate data in memory, while
Hadoop MapReduce writes results back to disk
after each map/reduce step
JEEConf 2015 - Introduction to real-time big data with Apache Spark
Powered by Spark
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Core Concepts
automatically distribute data across the cluster
and
parallelize the operations performed on it
Distributed Application
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Spark Core Abstraction
RDD API
Transformations:
• filter()
• map()
• flatMap()
• distinct()
• union()
• intersection()
• subtract()
• etc.
Actions:
• collect()
• reduce()
• count()
• countByValue()
• first()
• take()
• top()
• etc.
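The difference between the two groups is that transformations are lazy and only describe a computation, while actions trigger it. The sketch below illustrates that idea in plain Python, not with the Spark API: generator expressions stand in for transformations, and the functions that consume them stand in for actions. The `lines` data and the `log` list are made up for illustration.

```python
from functools import reduce

log = []  # records which input elements have actually been read

def source(items):
    # Stand-in for a distributed dataset: yields elements only on demand.
    for item in items:
        log.append(item)
        yield item

lines = ["spark is fast", "spark is general", "hadoop persists to disk"]

# "Transformations": building the pipeline reads nothing yet.
words = (w for line in source(lines) for w in line.split())  # like flatMap()
long_words = (w for w in words if len(w) > 4)                # like filter()
upper = (w.upper() for w in long_words)                      # like map()

assert log == []  # no input has been touched so far

# "Actions": consuming the pipeline runs the whole chain.
result = list(upper)                                         # like collect()
total_chars = reduce(lambda a, b: a + b, (len(w) for w in result))  # like reduce()
```

After `list(upper)` runs, `log` contains all three input lines, which shows that the work happened only when a result was requested.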
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
Sample Application
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Requirements
Analytics about Morning@Lohika events:
• unique participants by companies
• most loyal participants
• participants by position
• etc.
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Data Format
Simple CSV files
all fields are optional
First Name | Last Name    | Company      | Position          | Email                             | Present
Vladimir   | Tsukur       | GlobalLogic  | Tech/Team Lead    | flushdia@gmail.com                | 1
Mikalai    | Alimenkou    | XP Injection | Tech Lead         | mikalai.alimenkou@xpinjection.com | 1
Taras      | Matyashovsky | Lohika       | Software Engineer | taras.matyashovsky@gmail.com      | 0
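To make the requirements concrete, here is a small local sketch in plain Python (not the Spark code from the sample repository) that computes two of the listed analytics over CSV rows in this format. The sample rows, the function names, and the choice to identify a participant by email are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

FIELDS = ["first_name", "last_name", "company", "position", "email", "present"]

# Hypothetical rows; the last one is mostly empty because all fields are optional.
SAMPLE = """\
Ann,Lee,Acme,Tech Lead,ann@example.com,1
Bob,Ray,Acme,Software Engineer,bob@example.com,1
Ann,Lee,Acme,Tech Lead,ann@example.com,1
Cid,Poe,Initech,Software Engineer,cid@example.com,0
,,Initech,,,1
"""

def read_records(text):
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    # Normalize missing or blank values to "" so later checks stay simple.
    return [{k: (row.get(k) or "").strip() for k in FIELDS} for row in reader]

def unique_participants_by_company(records):
    # Count distinct emails per company, skipping rows that lack either field.
    emails = defaultdict(set)
    for r in records:
        if r["company"] and r["email"]:
            emails[r["company"]].add(r["email"].lower())
    return {company: len(found) for company, found in emails.items()}

def most_loyal(records, n=3):
    # Rank participants by the number of rows marked as present.
    visits = Counter(
        r["email"].lower() for r in records if r["email"] and r["present"] == "1"
    )
    return visits.most_common(n)

records = read_records(SAMPLE)
by_company = unique_participants_by_company(records)
loyal = most_loyal(records)
```

Because every field is optional, each aggregation decides explicitly which incomplete rows to skip; a distributed version would need the same filtering step before grouping.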
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Time
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Cluster overview (diagram): the Driver, which holds the Spark Context, connects to the Cluster Manager; each Worker runs an Executor, and each Executor runs Tasks.
http://spark.apache.org/docs/latest/cluster-overview.html
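The driver/executor/task relationship can be mimicked on a single machine. The sketch below is only an analogy in plain Python, not Spark: a thread pool plays the executors, each function call on one slice of the data plays a task, and the calling function plays the driver that combines partial results. The names `partition`, `task`, and `run_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    # Split the data into n slices; each slice is processed by one task.
    return [data[i::n] for i in range(n)]

def task(part):
    # Work done by a single task on a single partition: sum of squares.
    return sum(x * x for x in part)

def run_job(data, num_partitions=4, num_workers=2):
    parts = partition(data, num_partitions)
    # The pool stands in for executors; this function acts as the driver.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(task, parts))
    # The driver combines the per-partition results into the final answer.
    return sum(partials)

total = run_job(list(range(10)))
```

In a real cluster the tasks run in executor processes on worker nodes rather than in local threads, so the function and its data have to be serialized and shipped over the network.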
Demo Explained
Structured data processing
Spark SQL
Distributed collection of data
organized into named columns
Data Frame
Data Frame API
• selecting columns
• joining different data sources
• aggregation, e.g. sum, count, average
• filtering
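These four operations correspond to familiar relational queries. The sketch below uses Python's built-in `sqlite3` module as a stand-in SQL engine, purely to show what each operation computes; it is not Spark SQL or the Data Frame API, and the tables, columns, and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE participants (email TEXT, company TEXT, position TEXT);
    CREATE TABLE visits (email TEXT, event TEXT, present INTEGER);
    INSERT INTO participants VALUES
        ('ann@example.com', 'Acme', 'Tech Lead'),
        ('bob@example.com', 'Acme', 'Software Engineer'),
        ('cid@example.com', 'Initech', 'Software Engineer');
    INSERT INTO visits VALUES
        ('ann@example.com', 'event-1', 1),
        ('ann@example.com', 'event-2', 1),
        ('bob@example.com', 'event-1', 1),
        ('cid@example.com', 'event-1', 0);
""")

# Selecting columns
emails = conn.execute(
    "SELECT email FROM participants ORDER BY email"
).fetchall()

# Filtering
engineers = conn.execute(
    "SELECT email FROM participants "
    "WHERE position = 'Software Engineer' ORDER BY email"
).fetchall()

# Joining two data sources, then aggregating with count
attendance = conn.execute("""
    SELECT p.company, COUNT(*) AS attended
    FROM participants p
    JOIN visits v ON v.email = p.email
    WHERE v.present = 1
    GROUP BY p.company
    ORDER BY p.company
""").fetchall()
```

Declaring the query this way, instead of writing arbitrary functions over records, is what gives an engine room to plan and optimize the execution.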
Plan Optimization & Execution
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
Faster than RDD
http://www.slideshare.net/databricks/spark-sqlsse2015public
Demo Time
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
https://spark.apache.org/docs/latest/tuning.html
Our Spark Integration
Product
Cloud-based analytics application
Use Cases
• supplement Neo4j database used to
store/query big dimensions
• supplement RDBMS for querying of
high volumes of data
Use Cases
• represent existing computational graph
as a flow of Spark-based operations
• predictive analytics based on the Spark
MLlib component
Lessons Learned
• Spark simplicity is deceptive
• Each use case is unique
• Be really aware:
• Databricks blog
• Mailing lists & Jira
• Pull requests
Spark is kind of magic
Spark is on the Rise
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
Project Tungsten
• the largest change to Spark’s execution
engine since the project’s inception
• focuses on substantially improving the
efficiency of memory and CPU for
Spark applications
• sun.misc.Unsafe
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Thank you!
Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/

Editor's Notes

  • #9:
      • Real-time, streaming
      • Structures which could not be decomposed to key-value pairs
      • Jobs/algorithms which do not yield to the MapReduce programming model
  • #22: Functional programming API. Drawback: limited opportunities for automatic optimization.
  • #32:
      • Cluster Manager: Standalone, Apache Mesos, Hadoop YARN
      • Cluster Manager should be chosen and configured properly
      • Monitoring via web UI(s) and metrics
      • Web UI:
          • master web UI
          • worker web UI
          • driver web UI - available only during execution
          • history server - spark.eventLog.enabled = true
      • Metrics are based on the Coda Hale Metrics library and can be reported via HTTP, JMX, and CSV files
  • #33:
      • Serialization: default and Kryo
      • Tune Executor Memory Fraction: RDD Storage (60%), Shuffle and Aggregation Buffers (20%), User code (20%)
      • Tune storage level:
          • store in memory and/or on disk
          • store as unserialized/serialized objects
          • replicate each partition on 1 or 2 cluster nodes
          • store in Tachyon
      • Level of Parallelism:
          • spark.task.cpus
          • 1 task per partition using 1 core to execute
          • spark.default.parallelism
          • can be controlled: repartition() and coalesce() functions
          • degree of parallelism as an operation parameter
          • storage system matters
      • Data locality:
          • check data locality via UI
          • configure data locality settings if needed: spark.locality.wait timeout
          • execute certain jobs on a driver: spark.localExecution.enabled
  • #34:
      • An API can be experimental or intended just for development
      • The Spark Java API may lag behind, as the Scala API is the main focus
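The monitoring and tuning settings named in notes #32 and #33 can be collected in one place. The sketch below is plain Python that renders a set of properties either as `spark-defaults.conf` lines or as `spark-submit --conf` flags. The property names are Spark 1.x settings mentioned in or implied by the notes; the values are illustrative assumptions, not recommendations, and the two memory fraction properties are legacy settings in releases from Spark 1.6 onward.

```python
# Illustrative values only; tune them for the actual workload.
settings = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.storage.memoryFraction": "0.6",   # RDD storage share
    "spark.shuffle.memoryFraction": "0.2",   # shuffle and aggregation buffers
    "spark.default.parallelism": "8",        # hypothetical value
    "spark.task.cpus": "1",                  # cores reserved per task
    "spark.locality.wait": "3000",           # milliseconds
    "spark.eventLog.enabled": "true",        # required by the history server
}

def to_spark_defaults(props):
    # spark-defaults.conf holds one whitespace-separated "key value" pair per line.
    return "\n".join(f"{key} {value}" for key, value in sorted(props.items()))

def to_submit_args(props):
    # The same settings as spark-submit flags: --conf key=value
    args = []
    for key, value in sorted(props.items()):
        args += ["--conf", f"{key}={value}"]
    return args

conf_text = to_spark_defaults(settings)
submit_args = to_submit_args(settings)
```

Keeping the settings in a single mapping makes it easier to review what was changed from the defaults before a job is submitted.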