Getting Started with Apache
Spark and Scala
Dinko Srkoč
Instantor Technology Services
About Apache Spark
(inevitable but hopefully quick intro)
● Started at UC Berkeley in 2009
● General purpose cluster computing system
● Fast: up to 10x faster on disk and up to 100x faster in memory than Hadoop MapReduce
● Runs locally, in the cloud, on Hadoop, Mesos
● High level APIs in:
○ Scala
○ Python
○ Java
○ R
About Apache Spark
The Stack:
● SQL - SQL queries and processing of structured and semi-structured data
● MLlib - machine learning algorithms
● GraphX - graph processing
● Streaming - stream processing of live data streams
Data collections in Spark
Collections: immutable, distributed, partitioned across nodes, operated in parallel
● Resilient Distributed Dataset (RDD)
○ Basic abstraction
○ Low-level API
○ Suitable for unstructured data (media, streams of text)
● Dataset/DataFrame
○ Dataset[T] - typed API, DataFrame (a.k.a. Dataset[Row]) - untyped API
○ High-level expressions: filters/maps, aggregations, averages, SQL queries, columnar access
○ Optimized execution through the Catalyst query optimizer
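
To make the difference between the three collection types concrete, here is a minimal sketch rather than code from the talk. It assumes Spark SQL is on the classpath and runs with a local master; the `Person` case class, the `CollectionsSketch` object and the sample data are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.avg

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Int)

object CollectionsSketch {
  // Typed API: the lambda is checked against Person at compile time.
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(_.age >= 18)

  // Untyped API: the column is referenced by name and resolved at runtime.
  def averageAge(people: DataFrame): Double =
    people.agg(avg("age")).first().getDouble(0)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for experimenting.
    val spark = SparkSession.builder()
      .appName("collections-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    try {
      val sample = Seq(Person("Ana", 34), Person("Ivo", 12), Person("Maja", 50))

      // RDD: the low-level abstraction, plain functions over elements.
      val rdd = spark.sparkContext.parallelize(sample)
      println(rdd.map(_.age).sum())

      // Dataset[Person]: typed, high-level operations.
      val ds = sample.toDS()
      adults(ds).show()

      // DataFrame (Dataset[Row]): untyped, columnar access and aggregations.
      println(averageAge(ds.toDF()))

      // SQL queries over the same data.
      ds.createOrReplaceTempView("people")
      spark.sql("SELECT name FROM people WHERE age >= 18").show()
    } finally spark.stop()
  }
}
```

In the spark shell a `SparkSession` named `spark` is already available, so only the bodies of these functions would need to be typed in.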
Demo
The Menu:
● Starter - spark shell
○ Loading from different sources
○ The inevitable word count example
● Intermediate - spark notebook
○ Documentation, data visualization
● Main course - back to shell
○ Streaming
○ Spark UI
● Dessert - mini project:
○ SBT
○ Deploying to Google Cloud Dataproc
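
The word count mentioned in the menu could look roughly like the sketch below. This is not the code shown in the demo: the `WordCount` object, the `countWords` helper and the input lines are hypothetical, and a local master is assumed so that the example runs without a cluster.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  // Splits each line on whitespace and counts occurrences of every word.
  def countWords(spark: SparkSession, lines: Seq[String]): Map[String, Long] =
    spark.sparkContext
      .parallelize(lines)          // distribute the input as an RDD
      .flatMap(_.split("\\s+"))    // one element per word
      .filter(_.nonEmpty)          // drop empty tokens from leading whitespace
      .map(word => (word, 1L))     // pair each word with a count of one
      .reduceByKey(_ + _)          // sum the counts per word
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; on a cluster the master is usually set by spark-submit.
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]")
      .getOrCreate()

    try {
      val counts = countWords(spark, Seq("to be or not to be", "that is the question"))
      counts.toSeq.sortBy(-_._2).foreach { case (word, n) => println(s"$word: $n") }
    } finally spark.stop()
  }
}
```

To read from a file instead of an in-memory sequence, `spark.sparkContext.textFile(path)` can replace the `parallelize` call.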
Thank you!
Questions?

Javantura v4 - Getting started with Apache Spark - Dinko Srkoč
