Your First Spark Application
In One Hour
Hadoop introduction
MR vs Spark
Spark application from scratch
Vitaliy Bashun
• Data Architect
• SoftServe, London, UK
• Background: Java, Databases, Micro-Services, Big Data
Which level are you at?
I have heard a bit about Big Data. Want more.
I know some Big Data principles. Need some practice.
I tried some tools. Want to try Spark.
I’m not bad at Big Data. Want to work with Spark.
I know Spark a bit. Want to hear more.
I’m an active Spark user. Came to show how cool I am.
My PM (Tech Lead, etc.) sent me here. Leave me in my quiet corner.
Vitalii Bashun "First Spark application in one hour"
Agenda
• Big Data. What and Why
• Big Data tool groups
• Why Spark became an industry standard
• Spark concepts
• RDD, DataFrame, Dataset
• The First Spark Application
• Spark Shell
Big Data. What and Why?
Long, long way to Hadoop
Vertical scale
Horizontal scale
HDFS
Why Spark became an industry standard
Intermediate data is kept in memory where possible, so processing is faster than with MapReduce (MR)
More convenient programming abstraction than MR
MLlib
Spark SQL
Streaming support
Big Data Tool Groups
2 concepts you need to know about to
create a Spark application
Spark Session
RDD, DataFrame, Dataset
Spark Session (alternatively SparkContext)
The entry point to programming Spark
with the Dataset and DataFrame API.
In Spark versions before 2.0, SparkConf
and SparkContext are used instead.
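A minimal PySpark sketch of both entry points, assuming the `pyspark` package is installed. The application name and the `local[*]` master are illustrative assumptions, not values from the talk, and the helper names `build_session` and `legacy_context` are hypothetical.

```python
def build_session(app_name: str = "first-spark-app", master: str = "local[*]"):
    """Create (or reuse) a SparkSession, the entry point since Spark 2.0."""
    # Imported inside the function so this module loads even without PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)   # assumed application name, shown in the Spark UI
        .master(master)      # "local[*]" runs Spark in-process on all cores
        .getOrCreate()
    )


def legacy_context(app_name: str = "first-spark-app", master: str = "local[*]"):
    """Older style: configure with SparkConf, then obtain a SparkContext."""
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext.getOrCreate(conf)
```

With a session in hand, `build_session().sparkContext` gives access to the underlying SparkContext, so the session can serve both the DataFrame API and the lower-level RDD API.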
RDD
RDD – Resilient Distributed Dataset.
A distributed collection.
• Low-level type
• Immutable
• Transformations create other RDDs (no real work occurs yet)
• Actions initiate the "real work"
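The split between transformations and actions can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark's API: `LazyCollection` is a hypothetical class that only mimics how `map` and `filter` build a new immutable collection while `collect` triggers the computation.

```python
from typing import Callable, Iterable, Iterator, List


class LazyCollection:
    """Toy stand-in for an RDD: immutable, with lazy transformations."""

    def __init__(self, source: Callable[[], Iterator], log: List[str]):
        self._source = source  # factory that produces the elements on demand
        self._log = log        # records each element actually processed

    @classmethod
    def from_items(cls, items: Iterable, log: List[str]) -> "LazyCollection":
        data = tuple(items)    # frozen copy, so the collection never changes
        return cls(lambda: iter(data), log)

    def map(self, fn) -> "LazyCollection":
        """Transformation: returns a new collection, computes nothing yet."""
        def gen():
            for x in self._source():
                self._log.append(f"map({x})")
                yield fn(x)
        return LazyCollection(gen, self._log)

    def filter(self, pred) -> "LazyCollection":
        """Transformation: returns a new collection, computes nothing yet."""
        def gen():
            for x in self._source():
                self._log.append(f"filter({x})")
                if pred(x):
                    yield x
        return LazyCollection(gen, self._log)

    def collect(self) -> list:
        """Action: runs the whole chain and returns the results."""
        return list(self._source())


log: List[str] = []
numbers = LazyCollection.from_items([1, 2, 3, 4], log)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
work_before_action = len(log)      # still 0: only a plan has been built
result = evens_squared.collect()   # the action triggers the real work
```

After `collect()`, `result` is `[4, 16]` and `log` holds the six processing steps, while `numbers` itself is unchanged. Real RDDs additionally partition the data across a cluster and track lineage for fault recovery, which this toy omits.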
DataFrame – Built on top of RDD. Data is
represented as named columns (like a table)
Dataset – Adds compile-time types
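As a rough illustration of the "named columns" idea, the following PySpark sketch builds a small DataFrame and queries it by column name. The helper `people_over`, the sample rows, and the column names are made up for this example; `spark` is assumed to be an existing SparkSession.

```python
def people_over(spark, min_age: int):
    """Build a small DataFrame with named columns and filter it by age."""
    # Made-up sample data: each tuple is one row.
    rows = [("Alice", 34), ("Bob", 19), ("Carol", 45)]

    # The schema list gives each column a name, like a table header.
    df = spark.createDataFrame(rows, schema=["name", "age"])

    # Columns are addressed by name; this is still lazy until an action runs.
    return df.filter(df.age > min_age).select("name")
```

Calling `people_over(spark, 30).show()` would run the query and print the matching names. The typed Dataset API is available in Scala and Java only; in Python the DataFrame is the structured abstraction to use.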
Let’s get our hands dirty
Demo time
