Vitalii Bashun "First Spark application in one hour"

The document provides an introduction to creating a first Spark application in one hour. It begins with an overview of Hadoop and why Spark became an industry standard due to its ability to keep intermediate data in memory for faster processing. The key concepts covered are Spark Session, which acts as the entry point for Spark programming, and Resilient Distributed Datasets (RDDs), DataFrames, and Datasets, which are the main abstractions Spark uses for distributed data. The document concludes by stating it will demonstrate creating a hands-on first Spark application using the Spark Shell.

Your First Spark Application
In One Hour
Hadoop introduction
MR vs Spark
Spark application from scratch

Vitaliy Bashun
- Data Architect
- SoftServe, London, UK
- Background: Java, Databases, Micro-Services, Big Data

Which level are you at?
I have heard a bit about Big Data. Want more.
I know some Big Data principles. Need some practice.
I tried some tools. Want to try Spark.
I’m not bad at Big Data. Want to work with Spark.
I know Spark a bit. Want to hear more.
I’m an active Spark user. Came to show how cool I am.
My PM (Tech Lead, etc.) sent me here. Leave me in my quiet corner.

Agenda
- Big Data. What and Why
- Big Data tool groups
- Why Spark became an industry standard
- Spark concepts
- RDD, DataFrame, Dataset
- The First Spark Application
- Spark Shell

Long, long way to Hadoop
Horizontal scale

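Why Spark became an industry standard
Intermediate data can stay in memory between steps instead of being written to disk as in MapReduce, so processing is faster
More convenient programming abstraction than MapReduce (MR)
MLlib (machine learning library)
Spark SQL
Streaming support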

Two concepts you need to know
to create a Spark application
SparkSession
RDD, DataFrame, Dataset

SparkSession (alternatively SparkContext)
The entry point to programming Spark
with the Dataset and DataFrame API.
In Spark versions before 2.0, SparkConf
and SparkContext are used instead.
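
To make the entry point concrete, here is a minimal Scala sketch of creating a SparkSession and reaching the underlying SparkContext from it. The application name `first-spark-app` and the `local[*]` master are assumptions chosen for illustration. Inside spark-shell a session already exists as `spark`, and `getOrCreate()` simply returns it.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Build a session, or reuse the one that is already running.
val spark: SparkSession = SparkSession.builder()
  .appName("first-spark-app") // hypothetical application name
  .master("local[*]")         // assumption: run locally on all available cores
  .getOrCreate()

// The lower-level SparkContext is still available through the session.
val sc: SparkContext = spark.sparkContext

println(s"Spark version: ${spark.version}")
println(s"Application name: ${sc.appName}")
```

On Spark 1.x the comparable starting point would be `new SparkContext(new SparkConf().setAppName(...).setMaster(...))`.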

RDD
RDD: Resilient Distributed Dataset.
A distributed collection.
- Low-level type.
- Immutable.
- Transformations create other RDDs (no real work happens yet).
- Actions initiate the "real work".
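
The split between transformations and actions could be sketched as follows in Scala. The numbers and variable names are made up for illustration, and a local master is assumed. In spark-shell the `spark` and `sc` values already exist, so the first two statements can be skipped there.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-sketch") // hypothetical application name
  .master("local[*]")    // assumption: local run
  .getOrCreate()
val sc = spark.sparkContext

// Distribute a local collection as an RDD.
val numbers = sc.parallelize(1 to 10)

// Transformations: each returns a new RDD and only records what to do.
val squares = numbers.map(n => n * n)
val evenSquares = squares.filter(_ % 2 == 0)

// Actions: these trigger the actual computation on the cluster.
val total = evenSquares.reduce(_ + _)  // 4 + 16 + 36 + 64 + 100 = 220
val collected = evenSquares.collect()  // Array(4, 16, 36, 64, 100)

println(s"total = $total, values = ${collected.mkString(", ")}")
```

Because RDDs are immutable, `numbers` and `squares` are left unchanged; every transformation yields a new RDD.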

DataFrame: built on top of RDD. Data is
represented as named columns (like a table).
Dataset: adds compile-time types.

6. Big Data. What and Why?

7-19. Long, long way to Hadoop (slide 8: Vertical scale; slide 9: Horizontal scale; slides 14-15: HDFS)

23. Big Data Tool Groups

28. Let’s get our hands dirty. Demo time.