IBM SparkTechnology Center
Paris Open Source Summit – Apache Software Foundation – Dec 2017
Building IoT Applications with
Apache Spark and Apache Bahir
Luciano Resende
IBM | Spark Technology Center
About me - Luciano Resende
Data Science Platform Architect – IBM – Spark Technology Center
• Has been contributing to open source at the ASF for over 10 years
• Currently contributing to: Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, and Apache Toree, among other projects related to the Apache Spark ecosystem
lresende@apache.org
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
Open Source Community Leadership
Spark Technology Center
Founding Partner | 188+ Project Committers | 77+ Projects
Key open source steering committee memberships | OSS Advisory Board
Open Source
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com
Key statistics:
About 40 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now a top-level Apache project!
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
Agenda
Introductions
Apache Spark
Apache Bahir
IoT Applications
Live Demo
Summary
References
Apache Spark
Apache Spark Introduction
What is Apache Spark?
• Spark Core: general compute engine; handles distributed task dispatching, scheduling, and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• Spark ML: common machine learning and statistical algorithms
• Spark GraphX: distributed graph processing framework
• Data sources: a large variety of data sources and formats can be supported, both on-premises and in the cloud, e.g. BigInsights (HDFS), Cloudant, dashDB, SQL DB
Apache Spark Evolution
Apache Spark – Spark SQL
▪ Unified data access APIs: query structured data sets with SQL or the Dataset/DataFrame APIs
▪ Fast, familiar query language across all of your enterprise data
▪ Works across RDBMS data sources and structured streaming data sources
Apache Spark – Spark SQL
You can run SQL statements through the SparkSession.sql(…) interface:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
// "stored as parquet" is Hive DDL, so this assumes a session with Hive support enabled
spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
val ds = spark.sql("select * from T1")
You can further transform the resulting dataset:
val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
val ds2 = ds.orderBy("c1")
The result is a DataFrame, that is, a Dataset[Row].
ds.show() displays the rows.
Apache Spark – Spark SQL
You can read from data sources using SparkSession.read.format(…)
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // encoders for .as[Bank] and the 'age column syntax
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type (column names and types must match the case class)
val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]
// loading JSON data into a Dataset of Bank type
val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]
// select a column from the Dataset: shows all rows of column "age"
bankFromCSV.select('age).show()
Apache Spark – Spark SQL
You can also configure a data source with source-specific options
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // encoders for .as[Bank] and the 'age column syntax
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .option("delimiter", " ")
  .csv("/users/lresende/data.csv")
  .as[Bank]
bankFromCSV.select('age).show() // shows all rows of column "age" from this dataset
Apache Spark – Spark SQL
Data Sources under the covers
• Data source registration (e.g. spark.read.datasource)
• Provide BaseRelation implementation
• That implements support for table scans:
• TableScan, PrunedScan, PrunedFilteredScan, CatalystScan
• Detailed information available at
• http://www.spark.tc/exploring-the-apache-spark-datasource-api/
Apache Spark – Spark SQL Structured Streaming
Unified programming model for streaming, interactive and batch queries
Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Treats the data stream as an unbounded table
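The unbounded-table idea can be illustrated without Spark. The sketch below, in plain Scala, treats each incoming batch of rows as an append to a logical input table and updates a result table (a count per key) incrementally instead of recomputing it. `UnboundedTable`, `update`, and `resultTables` are hypothetical names chosen for illustration; they are not Spark APIs.

```scala
// Plain-Scala illustration only; these names are hypothetical, not Spark APIs.
object UnboundedTable {
  // Fold newly appended rows into the running result table (count per key).
  def update(counts: Map[String, Long], newRows: Seq[String]): Map[String, Long] =
    newRows.foldLeft(counts) { (acc, key) =>
      acc.updated(key, acc.getOrElse(key, 0L) + 1L)
    }

  // One result table per trigger: the state after each batch has been appended.
  def resultTables(batches: Seq[Seq[String]]): Seq[Map[String, Long]] =
    batches.scanLeft(Map.empty[String, Long])(update).tail
}
```

Calling `UnboundedTable.resultTables(Seq(Seq("cat", "dog"), Seq("cat")))` yields one result table per batch, with the second table reflecting all rows seen so far.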
Apache Spark – Spark SQL Structured Streaming
SQL regular APIs
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
// schema: a StructType describing the input, assumed to be defined elsewhere
val input = spark.read
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.write
  .format("json")
  .save("dest-path")
Structured Streaming APIs
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.readStream
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
// In released Spark 2.x the streaming write goes through writeStream and start
result.writeStream
  .format("json")
  .option("checkpointLocation", "checkpoint-path") // placeholder path; the file sink requires a checkpoint location
  .start("dest-path")
Apache Spark – Spark Streaming
▪ Micro-batch event processing for near-real-time analytics
▪ e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hub), etc.
▪ No multi-threading or parallel process programming required
Apache Spark – Spark Streaming
Built on the discretized stream (DStream) abstraction
Represents a continuous stream of data
Based on micro-batching
Based on RDDs
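A minimal sketch of the micro-batching idea in plain Scala, without Spark: timestamped events are grouped into fixed-width intervals, and each interval is then processed as a small finite collection, much like a DStream job processes one RDD per interval. `Event`, `MicroBatch`, `toBatches`, and `wordCounts` are hypothetical names for illustration, not Spark APIs, and timestamps are assumed to be non-negative.

```scala
// Plain-Scala illustration of micro-batching (no Spark dependency).
// Event and MicroBatch are hypothetical names, not Spark APIs.
final case class Event(timestampMs: Long, payload: String)

object MicroBatch {
  // Group events into consecutive intervals of batchMs milliseconds,
  // keyed by the interval start time and returned in time order.
  // Assumes non-negative timestamps.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] = {
    require(batchMs > 0, "batch interval must be positive")
    events
      .groupBy(e => (e.timestampMs / batchMs) * batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, batch) => (start, batch.sortBy(_.timestampMs)) }
  }

  // Word count over a single batch, mirroring the per-interval computation.
  def wordCounts(batch: Seq[Event]): Map[String, Int] =
    batch
      .flatMap(_.payload.split(" ").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```

With a 2000 ms interval, events at 0 ms and 1500 ms fall into the batch starting at 0, while an event at 2100 ms falls into the batch starting at 2000.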
Apache Spark – Spark Streaming
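import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils
// brokerUrl and topic are assumed to be defined, e.g. parsed from the program arguments
val sparkConf = new SparkConf()
  .setAppName("MQTTWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2)) // 2-second micro-batches
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()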
Apache Bahir
Origins of the Apache Bahir Project
MAY/2016: Established as a top-level Apache project.
• PMC formed by Apache Spark committers and PMC members, and Apache Members
• Initial contributions imported from Apache Spark
AUG/2016: The Flink community joins Apache Bahir
• Initial contributions of Flink extensions
• In October 2016, Robert Metzger was elected committer
Origins of the Bahir name
Naming an Apache Project is a science !!!
• We needed a name that wasn’t used yet
• Needed to be related to Spark
We ended up with : Bahir
• A name of Arabic origin that means "sparkling"
• Also associated with a guy who succeeds at everything
Why Apache Bahir
It’s an Apache project
• And if you are here, you know what it means
What are the benefits of curating your extensions at Apache Bahir
• Apache Governance
• Apache License
• Apache Community
• Apache Brand
Why Apache Bahir
Flexibility
• Release flexibility
• Bound to platform or component release
Shared infrastructure
• Release, CI, etc
Shared knowledge
• Collaborate with experts on both platform and component areas
Bahir extensions for Apache Spark
MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
• http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming.
Twitter – Enables reading social data from Twitter using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
Akka – Enables reading data from Akka Actors using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
Bahir extensions for Apache Spark
Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub
• https://issues.apache.org/jira/browse/BAHIR-116
Apache Spark extensions in Bahir
Adding Bahir extensions into your application
• Using SBT
• libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
• Using Maven
• <dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-mqtt_2.11</artifactId>
<version>2.2.0</version>
</dependency>
Apache Spark extensions in Bahir
Submitting applications with Bahir extensions to Spark
• Spark-shell
• bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
• Spark-submit
• bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
IoT - Internet of Things
IoT – Definition by Wikipedia
The Internet of things (IoT) is the network of physical devices, vehicles, home
appliances, and other items embedded with electronics, software, sensors,
actuators, and network connectivity which enable these objects to connect and
exchange data.
IoT – Interaction between multiple entities
Diagram: Things, Software, and People interact; the connections are labeled "actuate" and "inform".
IoT Ecosystem in a Nutshell
Manufacturer
Chipset, Board, Appliance
Cloud
Service provider
Consumer
IoT Platform
Connectivity, Security, Analysis, Management, Integration
IoT Patterns – Some of them …
• Remote control
• Security analysis
• Edge analytics
• Historical data analysis
• Distributed Platforms
• Real-time decisions
IoT Patterns – Real-time decisions
• Action is triggered if an anomaly (+/-) is identified
• MTTR (mean time to repair) is critical
• High throughput might hide real issues
• QoS tradeoffs
• Payload size and format
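The first bullet above can be sketched as a simple rule in plain Scala: an action is triggered when a reading deviates from a baseline by more than a tolerance, in either direction. Everything here is an assumption made for illustration: the `Reading`, `Decision`, and `AnomalyRule` names, and the idea of a fixed baseline and tolerance, are not taken from the talk or from any library.

```scala
// Hypothetical sketch of the "real-time decision" pattern: trigger an action
// when a reading deviates from a baseline by more than a tolerance (+/-).
final case class Reading(deviceId: String, metric: String, value: Double)

sealed trait Decision
case object NoAction extends Decision
final case class Alert(deviceId: String, metric: String, deviation: Double) extends Decision

object AnomalyRule {
  // baseline and tolerance are illustrative parameters, not values from the talk.
  def evaluate(reading: Reading, baseline: Double, tolerance: Double): Decision = {
    val deviation = reading.value - baseline
    if (math.abs(deviation) > tolerance) Alert(reading.deviceId, reading.metric, deviation)
    else NoAction
  }
}
```

In a streaming job a rule like this would typically be applied to every record of a micro-batch, which keeps the decision latency close to the batch interval.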
MQTT – M2M / IoT Connectivity Protocol
Connect + Publish + Subscribe
History
• ~1990: IBM / Eurotech
• 2010: Published
• 2011: Eclipse M2M / Paho
• 2014: OASIS
• Soon: V5
Open spec
• 40+ client implementations
Minimal overhead
• Header: 2-4 bytes (publish), 14 bytes (connect)
Tiny clients
• Java: 170 KB
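To make the small header overhead concrete, here is a sketch of how the fixed header of a PUBLISH packet is laid out according to the MQTT 3.1.1 specification: one control byte followed by a variable-length "Remaining Length" field that uses 7 data bits per byte. The object and method names (`MqttFixedHeader`, `encodeRemainingLength`, `publishHeader`) are hypothetical and written for this illustration; this is not the API of Paho, Bahir, or any other client.

```scala
import scala.annotation.tailrec

// Illustrative encoder for the MQTT 3.1.1 fixed header; names are hypothetical.
object MqttFixedHeader {
  // Largest value representable in the 4-byte Remaining Length field.
  val MaxRemainingLength: Int = 268435455

  // Encode the Remaining Length: 7 data bits per byte,
  // with the high bit set when more bytes follow.
  def encodeRemainingLength(length: Int): Vector[Byte] = {
    require(length >= 0 && length <= MaxRemainingLength, s"length out of range: $length")
    @tailrec
    def loop(x: Int, acc: Vector[Byte]): Vector[Byte] = {
      val digit = x % 128
      val rest = x / 128
      if (rest > 0) loop(rest, acc :+ (digit | 0x80).toByte)
      else acc :+ digit.toByte
    }
    loop(length, Vector.empty)
  }

  // Fixed header of a PUBLISH packet: control byte (packet type 3 in the high
  // nibble, DUP/QoS/RETAIN flags in the low nibble) plus the Remaining Length.
  def publishHeader(remainingLength: Int,
                    qos: Int = 0,
                    retain: Boolean = false,
                    dup: Boolean = false): Vector[Byte] = {
    require(qos >= 0 && qos <= 2, s"invalid QoS: $qos")
    val flags = (if (dup) 0x08 else 0) | (qos << 1) | (if (retain) 0x01 else 0)
    ((3 << 4) | flags).toByte +: encodeRemainingLength(remainingLength)
  }
}
```

For packets whose remaining length is below 128 bytes, `publishHeader` returns two bytes, which corresponds to the lower bound quoted on the slide.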
MQTT – Quality of Service
MQTT Broker
• QoS0 – At most once: no connection failover, never duplicates
• QoS1 – At least once: has connection failover, can duplicate
• QoS2 – Exactly once: has connection failover, never duplicates
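The three delivery guarantees can be compared with a small, deliberately simplified simulation in plain Scala. It models a sender that retransmits until it sees an acknowledgement (QoS 1 and 2) or sends exactly once (QoS 0), and a receiver that discards repeated copies for QoS 2. The real QoS 2 exchange uses a four-step handshake, which this sketch collapses into de-duplication. All names (`QoS`, `QosSimulation`, `deliveredCopies`) and the loss model are assumptions for illustration only.

```scala
import scala.annotation.tailrec

// Simplified model of MQTT delivery guarantees; names and loss model are hypothetical.
sealed abstract class QoS(val level: Int)
case object AtMostOnce extends QoS(0)
case object AtLeastOnce extends QoS(1)
case object ExactlyOnce extends QoS(2)

object QosSimulation {
  // Returns how many copies of one message the subscriber application sees.
  // lostMessages / lostAcks hold the attempt numbers (1-based) on which the
  // message or its acknowledgement is dropped by the network.
  def deliveredCopies(qos: QoS,
                      lostMessages: Set[Int],
                      lostAcks: Set[Int],
                      maxAttempts: Int = 5): Int = {
    // QoS 0 never retransmits; QoS 1 and 2 retry until acknowledged.
    val attempts = if (qos == AtMostOnce) 1 else maxAttempts
    @tailrec
    def loop(attempt: Int, received: Int): Int =
      if (attempt > attempts) received
      else if (lostMessages(attempt)) loop(attempt + 1, received)
      else if (qos != AtMostOnce && lostAcks(attempt)) loop(attempt + 1, received + 1)
      else received + 1
    val raw = loop(1, 0)
    // QoS 2: the receiver discards copies it has already seen.
    if (qos == ExactlyOnce) math.min(raw, 1) else raw
  }
}
```

Under this model a lost message means no delivery at QoS 0, a lost acknowledgement produces a duplicate at QoS 1, and QoS 2 delivers one copy in both cases.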
MQTT – World usage
Smart Home Automation
Messaging
Notable Mentions:
• IBM IoT Platform
• AWS IoT
• Microsoft IoT Hub
• Facebook Messenger
Live Demo
IoT Simulator using MQTT
The demo environment
https://github.com/lresende/bahir-iot-demo
Node.js Web app
Simulates Elevator IoT devices
Elevator simulator Metrics:
• Weight
• Speed
• Power
• Temperature
• System
MQTT (Mosquitto)
Summary
Summary – Take away points
Apache Spark
• IoT analytics runtime with support for "Continuous Applications"
Apache Bahir
• Brings access to IoT data via supported connectors (e.g. MQTT)
IoT Applications
• Using Spark and Bahir to start processing IoT data in near real time
using Spark Streaming and Spark Structured Streaming
Join the Apache Bahir community !!!
References
Apache Bahir
http://bahir.apache.org
Documentation for Apache Spark extensions
http://bahir.apache.org/docs/spark/current/documentation/
Source Repositories
https://github.com/apache/bahir
https://github.com/apache/bahir-website
Demo Repository
https://github.com/lresende/bahir-iot-demo
43
Image source: http://az616578.vo.msecnd.net/files/2016/03/21/6359412499310138501557867529_thank-you-1400x800-c-default.gif

More Related Content

What's hot (20)

PPT
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
PPTX
LEGO: Data Driven Growth Hacking Powered by Big Data
DataWorks Summit/Hadoop Summit
 
PPTX
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
DataWorks Summit
 
PDF
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Artem Ervits
 
PPTX
Hadoop first ETL on Apache Falcon
DataWorks Summit
 
PPTX
Redis for Security Data : SecurityScorecard JVM Redis Usage
Timothy Spann
 
PDF
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Lucidworks
 
PPTX
YARN Ready: Apache Spark
Hortonworks
 
PDF
Spark Security
Yifeng Jiang
 
PPTX
Sharing metadata across the data lake and streams
DataWorks Summit
 
PDF
Hortonworks Technical Workshop - HDP Search
Hortonworks
 
PPTX
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
 
PDF
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Lucidworks
 
PPTX
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
PPTX
Oracle Office Hours - Exposing REST services with APEX and ORDS
Doug Gault
 
PDF
Apache Zeppelin Helium and Beyond
DataWorks Summit/Hadoop Summit
 
PDF
Spark mhug2
Joseph Niemiec
 
PDF
20150627 bigdatala
gethue
 
PPTX
Apache MetaModel - unified access to all your data points
Kasper Sørensen
 
PDF
Full Stack Scala
Ramnivas Laddad
 
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
LEGO: Data Driven Growth Hacking Powered by Big Data
DataWorks Summit/Hadoop Summit
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
DataWorks Summit
 
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Artem Ervits
 
Hadoop first ETL on Apache Falcon
DataWorks Summit
 
Redis for Security Data : SecurityScorecard JVM Redis Usage
Timothy Spann
 
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Lucidworks
 
YARN Ready: Apache Spark
Hortonworks
 
Spark Security
Yifeng Jiang
 
Sharing metadata across the data lake and streams
DataWorks Summit
 
Hortonworks Technical Workshop - HDP Search
Hortonworks
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
 
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Lucidworks
 
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
Oracle Office Hours - Exposing REST services with APEX and ORDS
Doug Gault
 
Apache Zeppelin Helium and Beyond
DataWorks Summit/Hadoop Summit
 
Spark mhug2
Joseph Niemiec
 
20150627 bigdatala
gethue
 
Apache MetaModel - unified access to all your data points
Kasper Sørensen
 
Full Stack Scala
Ramnivas Laddad
 

Similar to Building iot applications with Apache Spark and Apache Bahir (20)

PDF
Boston Spark Meetup event Slides Update
vithakur
 
PDF
Started with-apache-spark
Happiest Minds Technologies
 
PDF
Apache Spark PDF
Naresh Rupareliya
 
PPTX
Spark Summit Presentation by Anjul Bhambhri
Spark Summit
 
PPTX
Spark Summit East Keynote by Anjul Bhambhri
Jen Aman
 
PPTX
Keynote at spark summit east anjul
Anjul Bhambhri
 
PPTX
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
PPTX
Interactive Analytics using Apache Spark
Sachin Aggarwal
 
PDF
Spark Summit EU: IBM Keynote
sparktc
 
PPTX
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
PPTX
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
David Taieb
 
PPTX
Apache Spark in Industry
Dorian Beganovic
 
PDF
Introduction to Apache Spark
datamantra
 
PDF
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
PPT
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Romeo Kienzler
 
PDF
Dev Ops Training
Spark Summit
 
PPTX
Apache Spark Fundamentals
Zahra Eskandari
 
PDF
39.-Introduction-to-Sparkspark and all-1.pdf
ajajkhan16
 
PPT
Spark_Part 1
Shashi Prakash
 
PPTX
Building highly scalable data pipelines with Apache Spark
Martin Toshev
 
Boston Spark Meetup event Slides Update
vithakur
 
Started with-apache-spark
Happiest Minds Technologies
 
Apache Spark PDF
Naresh Rupareliya
 
Spark Summit Presentation by Anjul Bhambhri
Spark Summit
 
Spark Summit East Keynote by Anjul Bhambhri
Jen Aman
 
Keynote at spark summit east anjul
Anjul Bhambhri
 
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
Interactive Analytics using Apache Spark
Sachin Aggarwal
 
Spark Summit EU: IBM Keynote
sparktc
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
David Taieb
 
Apache Spark in Industry
Dorian Beganovic
 
Introduction to Apache Spark
datamantra
 
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Romeo Kienzler
 
Dev Ops Training
Spark Summit
 
Apache Spark Fundamentals
Zahra Eskandari
 
39.-Introduction-to-Sparkspark and all-1.pdf
ajajkhan16
 
Spark_Part 1
Shashi Prakash
 
Building highly scalable data pipelines with Apache Spark
Martin Toshev
 
Ad

More from Luciano Resende (20)

PDF
A Jupyter kernel for Scala and Apache Spark.pdf
Luciano Resende
 
PDF
Using Elyra for COVID-19 Analytics
Luciano Resende
 
PDF
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Luciano Resende
 
PDF
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
Luciano Resende
 
PDF
Ai pipelines powered by jupyter notebooks
Luciano Resende
 
PDF
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Luciano Resende
 
PDF
Scaling notebooks for Deep Learning workloads
Luciano Resende
 
PDF
Jupyter Enterprise Gateway Overview
Luciano Resende
 
PPTX
Inteligencia artificial, open source e IBM Call for Code
Luciano Resende
 
PDF
Open Source AI - News and examples
Luciano Resende
 
PDF
Building analytical microservices powered by jupyter kernels
Luciano Resende
 
PDF
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
PDF
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
PDF
What's new in Apache SystemML - Declarative Machine Learning
Luciano Resende
 
PDF
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
PDF
Jupyter con meetup extended jupyter kernel gateway
Luciano Resende
 
PDF
How mentoring can help you start contributing to open source
Luciano Resende
 
PDF
Luciano Resende's keynote at Apache big data conference
Luciano Resende
 
PPT
Asf icfoss-mentoring
Luciano Resende
 
PDF
Open Source tools overview
Luciano Resende
 
A Jupyter kernel for Scala and Apache Spark.pdf
Luciano Resende
 
Using Elyra for COVID-19 Analytics
Luciano Resende
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Luciano Resende
 
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
Luciano Resende
 
Ai pipelines powered by jupyter notebooks
Luciano Resende
 
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Luciano Resende
 
Scaling notebooks for Deep Learning workloads
Luciano Resende
 
Jupyter Enterprise Gateway Overview
Luciano Resende
 
Inteligencia artificial, open source e IBM Call for Code
Luciano Resende
 
Open Source AI - News and examples
Luciano Resende
 
Building analytical microservices powered by jupyter kernels
Luciano Resende
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
What's new in Apache SystemML - Declarative Machine Learning
Luciano Resende
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
Jupyter con meetup extended jupyter kernel gateway
Luciano Resende
 
How mentoring can help you start contributing to open source
Luciano Resende
 
Luciano Resende's keynote at Apache big data conference
Luciano Resende
 
Asf icfoss-mentoring
Luciano Resende
 
Open Source tools overview
Luciano Resende
 
Ad

Recently uploaded (20)

DOCX
AI/ML Applications in Financial domain projects
Rituparna De
 
PDF
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
PPTX
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
PDF
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
PDF
AUDITABILITY & COMPLIANCE OF AI SYSTEMS IN HEALTHCARE
GAHI Youssef
 
PDF
Incident Response and Digital Forensics Certificate
VICTOR MAESTRE RAMIREZ
 
PDF
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
PPTX
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
PPT
Lecture 2-1.ppt at a higher learning institution such as the university of Za...
rachealhantukumane52
 
PPTX
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
PDF
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
PPTX
Introduction to Artificial Intelligence.pptx
StarToon1
 
PPTX
Usage of Power BI for Pharmaceutical Data analysis.pptx
Anisha Herala
 
PDF
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
PDF
WEF_Future_of_Global_Fintech_Second_Edition_2025.pdf
AproximacionAlFuturo
 
PPTX
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
PPT
01 presentation finyyyal معهد معايره.ppt
eltohamym057
 
PPTX
GenAI-Introduction-to-Copilot-for-Bing-March-2025-FOR-HUB.pptx
cleydsonborges1
 
PPT
Data base management system Transactions.ppt
gandhamcharan2006
 
PPTX
apidays Munich 2025 - Building Telco-Aware Apps with Open Gateway APIs, Subhr...
apidays
 
AI/ML Applications in Financial domain projects
Rituparna De
 
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
AUDITABILITY & COMPLIANCE OF AI SYSTEMS IN HEALTHCARE
GAHI Youssef
 
Incident Response and Digital Forensics Certificate
VICTOR MAESTRE RAMIREZ
 
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
Lecture 2-1.ppt at a higher learning institution such as the university of Za...
rachealhantukumane52
 
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
Introduction to Artificial Intelligence.pptx
StarToon1
 
Usage of Power BI for Pharmaceutical Data analysis.pptx
Anisha Herala
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
WEF_Future_of_Global_Fintech_Second_Edition_2025.pdf
AproximacionAlFuturo
 
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
01 presentation finyyyal معهد معايره.ppt
eltohamym057
 
GenAI-Introduction-to-Copilot-for-Bing-March-2025-FOR-HUB.pptx
cleydsonborges1
 
Data base management system Transactions.ppt
gandhamcharan2006
 
apidays Munich 2025 - Building Telco-Aware Apps with Open Gateway APIs, Subhr...
apidays
 

Building iot applications with Apache Spark and Apache Bahir

  • 1. IBM SparkTechnology Center Paris Open Surce Summit – Apache Software Foundation – Dec 2017 Building IoT Applications with Apache Spark and Apache Bahir Luciano Resende IBM | Spark Technology Center
  • 2. 2 Data Science Platform Architect – IBM – Spark Technology Center • Have been contributing to open source at ASF for over 10 years • Currently contributing to : Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, Apache Toree among other projects related to Apache Spark ecosystem [email protected] http://lresende.blogspot.com/ https://www.linkedin.com/in/lresende @lresende1975 https://github.com/lresende @ About me - Luciano Resende
  • 3. Open Source Community Leadership Spark Technology Center Founding Partner 188+ Project Committers 77+ Projects Key Open source steering committee memberships OSS Advisory Board Open Source
  • 4. IBM SparkTechnology Center IBM Spark Technology Center Founded in 2015. Location: Physical: 505 Howard St., San Francisco CA Web: http://spark.tc Twitter: @apachespark_tc Mission: Contribute intellectual and technical capital to the Apache Spark community. Make the core technology enterprise- and cloud-ready. Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com Key statistics: About 40 developers, co-located with 25 IBM designers. Major contributions to Apache Spark http://jiras.spark.tc Apache SystemML is now a top level Apache project ! Founding member of UC Berkeley AMPLab and RISE Lab Member of R Consortium and Scala Center 4
  • 5. IBM SparkTechnology Center Agenda Introductions Apache Spark Apache Bahir IoT Applications Live Demo Summary References 5
  • 7. IBM SparkTechnology Center Apache Spark Introduction What is Apache Spark ? 7 Spark Core Spark SQL Spark Streaming Spark ML Spark GraphX executes SQL statements performs streaming analytics using micro-batches common machine learning and statistical algorithms distributed graph processing framework general compute engine, handles distributed task dispatching, scheduling and basic I/O functions large variety of data sources and formats can be supported, both on- premise or cloud BigInsights (HDFS) Cloudant dashDB SQL DB
  • 9. IBM SparkTechnology Center Apache Spark – Spark SQL 9 Spark SQL ▪Unified data access APIS: Query structured data sets with SQL or Dataset/DataFrame APIs ▪Fast, familiar query language across all of your enterprise data RDBMS Data Sources Structured Streaming Data Sources
  • 10. IBM SparkTechnology Center Apache Spark – Spark SQL You can run SQL statement with SparkSession.sql(…) interface: val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() spark.sql(“create table T1 (c1 int, c2 int) stored as parquet”) val ds = spark.sql(“select * from T1”) You can further transform the resultant dataset: val ds1 = ds.groupBy(“c1”).agg(“c2”-> “sum”) val ds2 = ds.orderBy(“c1”) The result is a DataFrame / Dataset[Row] ds.show() displays the rows 10
  • 11. IBM SparkTechnology Center Apache Spark – Spark SQL You can read from data sources using SparkSession.read.format(…) val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer) // loading csv data to a Dataset of Bank type val bankFromCSV = spark.read.csv(“hdfs://localhost:9000/data/bank.csv").as[Bank] // loading JSON data to a Dataset of Bank type val bankFromJSON = spark.read.json(“hdfs://localhost:9000/data/bank.json").as[Bank] // select a column value from the Dataset bankFromCSV.select(‘age).show() will return all rows of column “age” from this dataset. 11
  • 12. IBM SparkTechnology Center Apache Spark – Spark SQL You can also configure a specific data source with specific options val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer) // loading csv data to a Dataset of Bank type val bankFromCSV = sparkSession.read .option("header", ”true") // Use first line of all files as header .option("inferSchema", ”true") // Automatically infer data types .option("delimiter", " ") .csv("/users/lresende/data.csv”) .as[Bank] bankFromCSV.select(‘age).show() // will return all rows of column “age” from this dataset. 12
  • 13. IBM SparkTechnology Center Apache Spark – Spark SQL Data Sources under the covers • Data source registration (e.g. spark.read.datasource) • Provide BaseRelation implementation • That implements support for table scans: • TableScans, PrunedScan, PrunedFilteredScan, CatalystScan • Detailed information available at • http://www.spark.tc/exploring-the-apache-spark-datasource-api/ 13
  • 14. IBM SparkTechnology Center Apache Spark – Spark SQL Structured Streaming Unified programming model for streaming, interactive and batch queries 14 Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Considers the data stream as unbounded table
  • 15. IBM SparkTechnology Center Apache Spark – Spark SQL Structured Streaming SQL regular APIs val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() val input = spark.read .schema(schema) .format(”csv") .load(”input-path") val result = input .select(”age”) .where(”age > 18”) result.write .format(”json”) . save(” dest-path”) 15 Structured Streaming APIs val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() val input = spark.readStream .schema(schema) .format(”csv") .load(”input-path") val result = input .select(”age”) .where(”age > 18”) result.write .format(”json”) . startStream(” dest-path”)
  • 16. IBM SparkTechnology Center Apache Spark – Spark Streaming 16 Spark Streaming ▪Micro-batch event processing for near- real time analytics ▪e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hub), etc. ▪No multi-threading or parallel process programming required
  • 17. IBM SparkTechnology Center Apache Spark – Spark Streaming Also known as discretized stream or Dstream Abstracts a continuous stream of data Based on micro-batching Based on RDDs 17
  • 18. IBM SparkTechnology Center Apache Spark – Spark Streaming val sparkConf = new SparkConf() .setAppName("MQTTWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2) val words = lines.flatMap(x => x.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() 18
  • 20. IBM SparkTechnology Center Origins of the Apache Bahir Project • MAY/2016: Established as a top-level Apache project • PMC formed by Apache Spark committers/PMC members and Apache Members • Initial contributions imported from Apache Spark • AUG/2016: Flink community joins Apache Bahir • Initial contributions of Flink extensions • In October 2016 Robert Metzger was elected committer
  • 21. IBM SparkTechnology Center Origins of the Bahir name • Naming an Apache Project is a science! • We needed a name that wasn't used yet • It needed to be related to Spark • We ended up with: Bahir • A name of Arabic origin that means "sparkling" • Also associated with a guy who succeeds at everything
  • 22. IBM SparkTechnology Center Why Apache Bahir • It's an Apache project • And if you are here, you know what it means • What are the benefits of curating your extensions at Apache Bahir? • Apache Governance • Apache License • Apache Community • Apache Brand
  • 23. IBM SparkTechnology Center Why Apache Bahir • Flexibility • Release flexibility • Bound to platform or component release • Shared infrastructure • Release, CI, etc. • Shared knowledge • Collaborate with experts on both platform and component areas
  • 24. IBM SparkTechnology Center Bahir extensions for Apache Spark • MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming. • http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/ • http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/ • CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming. • Twitter – Enables reading social data from Twitter using Spark Streaming. • http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/ • Akka – Enables reading data from Akka Actors using Spark Streaming or Structured Streaming. • http://bahir.apache.org/docs/spark/current/spark-streaming-akka/ • ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming. • http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
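For the Structured Streaming side of the MQTT extension, a minimal sketch could look like the following. It assumes the `spark-sql-streaming-mqtt` artifact is on the classpath and that a broker is reachable at the given URL; the broker address and topic name are placeholders. Because the columns exposed by the source differ between Bahir releases, the sketch prints the schema instead of assuming column names.

```scala
import org.apache.spark.sql.SparkSession

object MQTTStructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val brokerUrl = "tcp://localhost:1883" // assumed local broker (e.g. Mosquitto)
    val topic = "demo/topic"               // hypothetical topic name

    val spark = SparkSession.builder().appName("MQTTStructuredStreaming").getOrCreate()

    // The Bahir MQTT source is selected by its provider class name;
    // the broker URL is passed as the load path.
    val messages = spark.readStream
      .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
      .option("topic", topic)
      .load(brokerUrl)

    // Inspect the columns offered by the installed Bahir version.
    messages.printSchema()

    // Echo incoming messages to the console as they arrive.
    val query = messages.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```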
  • 25. IBM SparkTechnology Center Bahir extensions for Apache Spark • Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub • https://issues.apache.org/jira/browse/BAHIR-116
  • 26. IBM SparkTechnology Center Apache Spark extensions in Bahir Adding Bahir extensions into your application • Using SBT • libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0" • Using Maven • <dependency> <groupId>org.apache.bahir</groupId> <artifactId>spark-streaming-mqtt_2.11</artifactId> <version>2.2.0</version> </dependency>
  • 27. IBM SparkTechnology Center Apache Spark extensions in Bahir Submitting applications with Bahir extensions to Spark • Spark-shell • bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 ….. • Spark-submit • bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
  • 28. IBM SparkTechnology Center IoT - Internet of Things 28
  • 29. IBM SparkTechnology Center IoT – Definition by Wikipedia The Internet of things (IoT) is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and network connectivity which enable these objects to connect and exchange data. 29
  • 31. IBM SparkTechnology Center IoT – Interaction between multiple entities • Entities: Things, Software, People • Interactions: actuate, inform
  • 32. IBM SparkTechnology Center IoT Ecosystem in a Nutshell (diagram labels): Manufacturer, Chipset, Board, Appliance, Cloud, Service provider, Consumer, IoT Platform, Connectivity, Security, Analysis, Management, Integration
  • 33. IBM SparkTechnology Center IoT Patterns – Some of them … • Remote control • Security analysis • Edge analytics • Historical data analysis • Distributed Platforms • Real-time decisions
  • 34. IBM SparkTechnology Center IoT Patterns – Real-time decisions • Action is triggered if an anomaly (+/-) is identified • MTTR (mean time to repair) is critical • High throughput might hide real issues • QoS tradeoffs • Payload size and format
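The "anomaly (+/-)" trigger can be sketched in plain Scala as a two-sided threshold check. The slide does not say how anomalies are detected, so the rule used here (flag a reading more than `k` standard deviations above or below a known baseline) is an assumption, as are the `Reading` type and its field names.

```scala
object AnomalyDetector {
  // Hypothetical shape of one sensor reading.
  final case class Reading(deviceId: String, temperature: Double)

  sealed trait Verdict
  case object Normal extends Verdict
  case object TooHigh extends Verdict // positive anomaly
  case object TooLow extends Verdict  // negative anomaly

  // Two-sided check against a baseline mean and standard deviation.
  def classify(value: Double, mean: Double, stdDev: Double, k: Double = 3.0): Verdict = {
    val deviation = value - mean
    if (deviation > k * stdDev) TooHigh
    else if (deviation < -k * stdDev) TooLow
    else Normal
  }

  // Keep only the readings that should trigger an action.
  def anomalies(readings: Seq[Reading], mean: Double, stdDev: Double): Seq[(Reading, Verdict)] =
    readings
      .map(r => r -> classify(r.temperature, mean, stdDev))
      .filter { case (_, verdict) => verdict != Normal }
}
```

In a streaming job the same `classify` function could be applied to each micro-batch, with the baseline computed from historical data.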
  • 35. IBM SparkTechnology Center MQTT – M2M / IoT Connectivity Protocol • Connect + Publish + Subscribe • History: 1999 IBM / Eurotech; 2010 Published; 2011 Eclipse M2M / Paho; 2014 OASIS; soon V5 • Open spec + 40 client implementations • Minimal overhead: header 2-4 bytes (publish), 14 bytes (connect) • Tiny clients (Java 170KB)
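To make the "minimal overhead" point concrete, here is a small sketch of how a PUBLISH fixed header is laid out under MQTT 3.1.1: one control byte (packet type and flags) followed by a variable-length "remaining length" field that uses 7 bits per byte plus a continuation bit. The object and method names are made up for this example; only the byte layout comes from the specification.

```scala
object MqttFixedHeader {
  // Largest value representable in four 7-bit groups.
  private val MaxRemainingLength = 268435455

  // Variable-length encoding: low 7 bits carry data, the high bit means "more bytes follow".
  def encodeRemainingLength(length: Int): List[Int] = {
    require(length >= 0 && length <= MaxRemainingLength, "remaining length out of range")
    val digit = length % 128
    val rest = length / 128
    if (rest > 0) (digit | 0x80) :: encodeRemainingLength(rest)
    else List(digit)
  }

  // PUBLISH is packet type 3; the flags are DUP (bit 3), QoS (bits 2-1), RETAIN (bit 0).
  def publishFixedHeader(
      qos: Int,
      remainingLength: Int,
      dup: Boolean = false,
      retain: Boolean = false): List[Int] = {
    require(qos >= 0 && qos <= 2, "QoS must be 0, 1 or 2")
    val control = (3 << 4) | ((if (dup) 1 else 0) << 3) | (qos << 1) | (if (retain) 1 else 0)
    control :: encodeRemainingLength(remainingLength)
  }
}
```

A message whose variable header and payload total fewer than 128 bytes therefore needs only a 2-byte fixed header, which matches the lower bound quoted on the slide.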
  • 36. IBM SparkTechnology Center MQTT – Quality of Service (MQTT Broker) • QoS0, at most once: no connection failover, never duplicates • QoS1, at least once: has connection failover, can duplicate • QoS2, exactly once: has connection failover, never duplicates
  • 37. IBM SparkTechnology Center MQTT – World usage Smart Home Automation Messaging Notable Mentions: • IBM IoT Platform • AWS IoT • Microsoft IoT Hub • Facebook Messenger
  • 39. IBM SparkTechnology Center IoT Simulator using MQTT • The demo environment: https://github.com/lresende/bahir-iot-demo • Elevator simulator: Node.js web app that simulates elevator IoT devices • Metrics: Weight, Speed, Power, Temperature, System • MQTT broker: Mosquitto
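The demo's simulator is a Node.js app; purely as a stand-in, the sketch below shows how a similar publisher could be written in Scala with the Eclipse Paho Java client. The broker URL, topic name, JSON payload shape, and value ranges are all assumptions and are not taken from the demo repository.

```scala
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttConnectOptions, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence
import scala.util.Random

object ElevatorPublisherSketch {
  def main(args: Array[String]): Unit = {
    val brokerUrl = "tcp://localhost:1883"     // assumed local Mosquitto broker
    val topic = "elevators/elevator-1/metrics" // hypothetical topic name

    val client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
    val options = new MqttConnectOptions()
    options.setCleanSession(true)
    client.connect(options)

    try {
      for (_ <- 1 to 10) {
        // Made-up payload format with random values for two of the listed metrics.
        val weight = Random.nextInt(800)
        val temperature = 20 + Random.nextInt(15)
        val payload = s"""{"weight":$weight,"temperature":$temperature}"""

        val message = new MqttMessage(payload.getBytes("UTF-8"))
        message.setQos(1) // at least once: duplicates are possible
        client.publish(topic, message)

        Thread.sleep(1000) // one reading per second
      }
    } finally {
      client.disconnect()
      client.close()
    }
  }
}
```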
  • 41. IBM SparkTechnology Center Summary – Take away points • Apache Spark: IoT analytics runtime with support for "Continuous Applications" • Apache Bahir: brings access to IoT data via supported connectors (e.g. MQTT) • IoT Applications: use Spark and Bahir to start processing IoT data in near real time with Spark Streaming and Spark Structured Streaming
  • 42. IBM SparkTechnology Center Join the Apache Bahir community !!! 42
  • 43. IBM SparkTechnology Center References • Apache Bahir: http://bahir.apache.org • Documentation for Apache Spark extensions: http://bahir.apache.org/docs/spark/current/documentation/ • Source Repositories: https://github.com/apache/bahir and https://github.com/apache/bahir-website • Demo Repository: https://github.com/lresende/bahir-iot-demo