© 2017 KNIME.com AG. All Rights Reserved.
Heterogeneous Data Mining
with Spark
Tobias Kötter
KNIME
What is KNIME?
The KNIME® Analytics Platform
Over 1500 native and embedded nodes included:
• Analysis & Mining
– Statistics, Machine Learning, Data Mining, Web Analytics, Text Mining, Network Analysis, Social Media Analysis, R, Weka, Python, Community / 3rd party, ...
• Data Access
– MySQL, Oracle, ...
– SAS, SPSS, ...
– Excel, Flat, ...
– Hive, Impala, ...
– XML, JSON, PMML
– Text, Doc, Image, ...
– Web Crawlers, Industry Specific, Community / 3rd party, ...
• Transformation
– Row, Column, Matrix
– Text, Image, Networks, Time Series, Java, Python, Community / 3rd party, ...
• Visualization
– R, Python, JFreeChart, JavaScript, Community / 3rd party, ...
• Deployment
– via BIRT
– PMML, XML, JSON
– Databases, Excel, Flat, etc.
– Text, Doc, Image
– Industry Specific
– Community / 3rd party, ...
• Big Data
– Hive, Impala, HDFS, Vertica, Teradata/Aster, Spark, MLlib, Community / 3rd party, ...
Broad Range of KNIME Application Areas & Customers
• Advanced Analytics
• Pharma
• Health Care
• Finance
• Retail
• Customer Intelligence
• Manufacturing
KNIME Analytics Platform: Try it Now!
• Download from www.knime.com
• Browse the KNIME Learning Hub at www.knime.com/learning-hub
• Check out KNIME Press “KNIME Beginner’s Guide”
• Become active on our Forum!
KNIME Software Overview
KNIME Big Data Connectors
• Packages the required drivers/libraries for HDFS, Hive, and Impala access
• Runs on Hadoop
• Preconfigured connectors
– Hive
– Cloudera Impala
– (secured) HDFS, webHDFS, httpFS
• Support for Kerberos-secured clusters
• Extends the open source database and remote file handling integration
Database Extension
• Visually assemble complex SQL statements
• Connect to almost all JDBC-compliant databases
• Harness the power of your database within KNIME
• Operations are performed within the database
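A minimal sketch of what "operations are performed within the database" can mean in practice. Each step only composes SQL text, and the database does the filtering and aggregation when the final statement runs. The helper functions, table name and column names are hypothetical, and an in-memory SQLite database stands in for a JDBC source. This is an illustration, not KNIME's actual implementation.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (dest TEXT, dep_delay REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("ORD", 30.0), ("ORD", 10.0), ("TXL", 5.0), ("TXL", 0.0), ("JFK", 60.0)],
)


def row_filter(query, condition):
    """Wrap an existing query in a filtering step (still just SQL text)."""
    return f"SELECT * FROM ({query}) AS t WHERE {condition}"


def group_by(query, key, aggregate):
    """Wrap an existing query in an aggregation step (still just SQL text)."""
    return (
        f"SELECT {key}, {aggregate} FROM ({query}) AS t "
        f"GROUP BY {key} ORDER BY {key}"
    )


# Each step only composes SQL; nothing is fetched until the final statement runs.
query = "SELECT * FROM flights"
query = row_filter(query, "dep_delay > 0")
query = group_by(query, "dest", "AVG(dep_delay) AS avg_delay")

# The database filters and aggregates; only the small result set comes back.
result = conn.execute(query).fetchall()
print(result)
```

Only the aggregated rows leave the database, which is the point of pushing the work down rather than loading the full table first.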
KNIME Spark Executor
• Commercial extension
• Based on Spark MLlib
• Scalable machine learning library
• Algorithms for
– Classification (decision tree, naïve Bayes, …)
– Regression (logistic regression, linear regression, …)
– Clustering (k-means)
– Collaborative filtering (ALS)
– Dimensionality reduction (SVD, PCA)
• Supports Spark versions 1.2, 1.3, 1.5, 1.6, 2.0, 2.1 and 2.2
• Kerberos-secured cluster support
Machine Learning – Supervised Learning Example
Let KNIME Control Your Spark Jobs
Use Case
The Question
• Wouldn’t it be great to know if your flight will be delayed?
https://pixabay.com/en/staircase-airport-modern-technology-1149599/
The Answer
• Of course, so let’s learn a model that tells us!
https://pixabay.com/en/banner-yes-no-decision-choice-1183407/
The Airport
• Chicago O’Hare International Airport
https://commons.wikimedia.org/wiki/File:O%27Hare_with_AA_plane.JPG Photo: Ad Meskens
The Airport
• Berlin Brandenburg Airport
https://de.wikipedia.org/wiki/Datei:BBI_2010-07-23_5.JPG
The Data
• Historical Flight Data
• Airport and City Information
• Geo Coordinates
• Airplane Data
• Radar Images
• Textual Weather Reports
https://commons.wikimedia.org/wiki/File:World-airline-routemap-2009.png
The Challenges
• Many Data Sources and Formats
• Large Unstructured Data
• Analyze the Data
Spark Reader and Writer Nodes
• Read and write various data formats from scalable storage, e.g. HDFS
• Data preview in the node dialog
• Based on the Spark Data Source API
CSV to Spark Node
CSV to Spark Node
Virtual Data Warehouse
The Challenges
• Many Data Sources and Formats
• Large Unstructured Data
• Analyze the Data
https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/
Radar Data Analysis
Radar Images
Image Processing Result
KNIME Image Processing on Spark
KNIME Image Processing on Spark
Sentiment Analysis
https://pixabay.com/en/emotions-man-happy-sad-face-adult-371238/
Textual Weather Reports
Text Processing Result
KNIME Text Processing on Spark
KNIME Text Processing on Spark
The Challenges
• Many Data Sources and Formats
• Large Unstructured Data
• Analyze the Data
https://pixabay.com/en/ball-binary-magnifying-glass-hand-958950/
Ad-hoc Analysis and Model Learning on Spark
Model Learning on Spark
Decision Tree Model
Ad-hoc Analysis on Spark
Do you speak SQL?
• Spark SQL Query node with syntax highlighting and query completion
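A rough sketch of the kind of query that could feed a "delay by destination" view. The SQL is plain ANSI-style SQL of the sort one might type into a SQL query node. The schema, column names and values are hypothetical, and SQLite is used here only as a self-contained stand-in for Spark SQL.

```python
import sqlite3

# Hypothetical schema and values; the slides do not show the real tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flights  (dest TEXT, arr_delay REAL);
    CREATE TABLE airports (code TEXT, lat REAL, lon REAL);
    INSERT INTO flights  VALUES ('ORD', 25.0), ('ORD', 15.0), ('SFO', 40.0), ('JFK', 5.0);
    INSERT INTO airports VALUES ('ORD', 41.98, -87.90), ('SFO', 37.62, -122.38), ('JFK', 40.64, -73.78);
""")

# Average arrival delay per destination, joined with coordinates for a map view.
DELAY_BY_DESTINATION = """
    SELECT f.dest, a.lat, a.lon,
           AVG(f.arr_delay) AS avg_delay,
           COUNT(*)         AS num_flights
    FROM flights AS f
    JOIN airports AS a ON a.code = f.dest
    GROUP BY f.dest, a.lat, a.lon
    ORDER BY avg_delay DESC
"""

rows = conn.execute(DELAY_BY_DESTINATION).fetchall()
for dest, lat, lon, avg_delay, num_flights in rows:
    print(f"{dest}: {avg_delay:.1f} min over {num_flights} flights at ({lat}, {lon})")
```

The same result table, one row per destination with coordinates and an average delay, is the shape a map view or bar chart would need.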
OSM Map View with Delay by Destination
Bar Chart with Delay by Destination
Behind the Scene
Behind the Scene
[Diagram: Execute KNIME workflow on Spark. Labels: KNIME Workflow (KNIME Analytics Platform, KNIME Server); two Cluster Worker Nodes, each with a Spark Executor JVM running a KNIME Workflow (OSGi) as Workflow Replica; Input RDD with RDD Partitions; Output RDD with RDD Partitions.]
Behind the Scene
• Variation (1): Send RDD data through a single workflow replica
[Diagram: Execute KNIME workflow on Spark. Labels: KNIME Workflow (KNIME Analytics Platform, KNIME Server); two Cluster Worker Nodes with Spark Executor JVMs; a single KNIME Workflow (OSGi) as Workflow Replica; Input RDD with RDD Partitions; Output RDD with RDD Partitions.]
Behind the Scene
• Variation (2): Send pre-grouped RDD data through workflow replicas
[Diagram: Execute KNIME workflow on Spark. Labels: KNIME Workflow (KNIME Analytics Platform, KNIME Server); two Cluster Worker Nodes, each with a Spark Executor JVM running a KNIME Workflow (OSGi) as Workflow Replica; Input RDD with RDD Partitions; Output RDD with RDD Partitions.]
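One possible reading of the three "Behind the Scene" diagrams, sketched in plain Python with lists standing in for RDD partitions. The `workflow_replica` function, the sample rows and the 15-minute threshold are hypothetical. In Spark terms the three cases loosely resemble per-partition mapping, collapsing to one partition, and grouping by key, but the slides do not say which Spark calls the prototype actually uses.

```python
from collections import defaultdict


def workflow_replica(rows):
    """Stand-in for one KNIME workflow replica: flags delays over 15 minutes."""
    return [(dest, delay, delay > 15) for dest, delay in rows]


# Hypothetical input "RDD": a list of partitions holding (destination, delay) rows.
input_partitions = [
    [("ORD", 30), ("TXL", 5)],
    [("ORD", 10), ("JFK", 60)],
]

# Base case: one replica per partition, each partition processed independently.
per_partition = [workflow_replica(p) for p in input_partitions]

# Variation (1): send all rows through a single workflow replica.
all_rows = [row for p in input_partitions for row in p]
single = [workflow_replica(all_rows)]

# Variation (2): pre-group rows by key so each replica sees one complete group.
groups = defaultdict(list)
for p in input_partitions:
    for row in p:
        groups[row[0]].append(row)
pre_grouped = [workflow_replica(rows) for _, rows in sorted(groups.items())]

print(per_partition)
print(single)
print(pre_grouped)
```

The single-replica form suits steps that need to see all rows at once. The pre-grouped form suits steps that need every row for a key, such as all flights to one destination, while still running in parallel.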
Summary
• Visual assembly of Spark jobs
– No coding required
– Works together with other KNIME nodes, e.g. loops
• Spark Data Source API to read from various sources
– Supports various data formats
– Reads from any JDBC-compliant database
• Prototype to use “all” KNIME nodes in Spark
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by
KNIME.com AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.

