Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted 1
Oracle Stream Analytics
Complex Event Processing for Apache Spark Streaming
Prabhu Thukkaram, Senior Director, Oracle Product Development
Hoyong Park, Architect, Oracle Product Development
Complex Event Processing with Continuous Query Processor
[Diagram: input events (E1@T1, E2@T2, E3@T3) and heartbeats flow into the Continuous Query Processor, which holds pre-registered queries and emits results (R1@T1, R2@T2, R3@T3).]
Why Continuous Query Processor?
• Complex event processing requires events to be processed one at a time
– Each event must be processed according to its individual timestamp
– Real-world events originate at different times and must be processed as such
– CEP applications seek correlations and patterns across events in time, within or across batches, irrespective of batch boundaries
  • Window length can span fractional batches
• Micro-batching with Spark Streaming
– All events in a batch are identified by the same time (the RDD time)
– No progression of time between events in the same batch
– No progression of time when RDD partitions are empty
  • Critical for missing-event scenarios, e.g. alert when order status “Received” is not followed by “Shipped” for order ID 10001 within 1 hour
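The missing-event scenario above can be illustrated with a minimal Python sketch. It is not CQP code: the `Event` and `MissingShipmentDetector` names, and the convention that an event without an order ID acts as a heartbeat, are assumptions made for illustration. The point is that the alert can only fire when something advances time, so a heartbeat is needed when no real events arrive.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HOUR = 3600  # seconds


@dataclass
class Event:
    ts: int                         # event time in seconds
    order_id: Optional[str] = None  # None marks a heartbeat (assumed convention)
    status: Optional[str] = None


class MissingShipmentDetector:
    """Alerts when "Received" is not followed by "Shipped" within `timeout`."""

    def __init__(self, timeout: int = HOUR) -> None:
        self.timeout = timeout
        self.pending: Dict[str, int] = {}  # order_id -> time "Received" was seen

    def on_event(self, ev: Event) -> List[str]:
        # Every input, including a heartbeat, advances time, so expire first.
        expired = [oid for oid, t in self.pending.items()
                   if ev.ts - t > self.timeout]
        for oid in expired:
            del self.pending[oid]
        # Then apply the event itself, if it carries an order status.
        if ev.status == "Received" and ev.order_id is not None:
            self.pending[ev.order_id] = ev.ts
        elif ev.status == "Shipped" and ev.order_id is not None:
            self.pending.pop(ev.order_id, None)
        return expired
```

Feeding the detector a "Received" event and then only heartbeats produces the alert on the first heartbeat that arrives more than an hour later; without heartbeats, nothing would trigger the check.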
Pattern Detection in CQP
Checks whether temperature readings from a power sensor are wobbling during a given time interval.
The CQP code looks for a W-pattern in the temperature readings within a 10-minute interval and outputs the support levels, i.e. the last value of each falling leg (A and C).
SELECT LAST(A.value), LAST(C.value) FROM TEMP_STREAM
MATCH_RECOGNIZE (
  PARTITION BY DEVICE_ID
  PATTERN (A+ B+ C+ D+) DURATION OF 10 MINUTES
  DEFINE
    A AS (value < PREV(value))
    B AS (value > PREV(value))
    C AS (value < PREV(value))
    D AS (value > PREV(value))
)
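[Figure: W-shaped temperature curve with segments A (falling), B (rising), C (falling), and D (rising) spanning 10 minutes.]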
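As a rough illustration of what the pattern clause matches, the following Python sketch encodes consecutive readings as falling (`d`) or rising (`u`) steps and searches for one or more falls, then rises, then falls, then rises. The function name `find_w_pattern` is hypothetical, and the sketch ignores the PARTITION BY and DURATION clauses; it is not how CQP implements matching.

```python
import re
from typing import List, Optional, Tuple

# One or more falling steps, rising steps, falling steps, rising steps.
W_PATTERN = re.compile(r"(d+)(u+)(d+)(u+)")


def find_w_pattern(values: List[float]) -> Optional[Tuple[float, float]]:
    """Return (last value of leg A, last value of leg C) for the first W found."""
    # Step i compares values[i + 1] with values[i].
    steps = "".join(
        "d" if cur < prev else "u" if cur > prev else "="
        for prev, cur in zip(values, values[1:])
    )
    match = W_PATTERN.search(steps)
    if match is None:
        return None
    # A leg ending at step index e - 1 ends at values[e].
    return values[match.end(1)], values[match.end(3)]
```

For readings such as `[5, 4, 3, 4, 5, 4, 2, 3, 6]` the two falling legs end at 3 and 2, which correspond to `LAST(A.value)` and `LAST(C.value)` in the query.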
Distributed Complex Event Processing
• Continuous Query Processor
– Event-by-event processing
  • Each event is assigned a unique timestamp
• Apache Spark
– Distributed computing with scale-out and fault tolerance
Spark Streaming + Continuous Query Processor = Distributed, Scalable, and Fault-Tolerant CEP Platform
How CQP complements Spark Streaming?
• Continuous, event-by-event, and stateful processing
• Flexible temporal windows (Time & Rows)
• Automatic progression of time
– Heartbeat propagation to advance time
• Pattern detection without batch boundaries
– Built-in finite state automaton
• Declarative SQL-like language
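To make the idea of a temporal window that ignores batch boundaries concrete, here is a minimal Python sketch of a time-range window. The class name `TimeRangeWindow` and the convention that calling `advance` without a value acts as a heartbeat are assumptions for illustration only; CQP expresses windows declaratively in its query language.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class TimeRangeWindow:
    """Keeps values whose timestamp is within `range_ms` of the current time."""

    def __init__(self, range_ms: int) -> None:
        self.range_ms = range_ms
        self.items: Deque[Tuple[int, float]] = deque()  # (timestamp, value)

    def advance(self, now_ms: int, value: Optional[float] = None) -> List[float]:
        """Advance time to `now_ms`; a call without a value is a heartbeat."""
        if value is not None:
            self.items.append((now_ms, value))
        # Expire everything that has fallen out of the window.
        while self.items and self.items[0][0] <= now_ms - self.range_ms:
            self.items.popleft()
        return [v for _, v in self.items]
```

With a 2500 ms window and 1000 ms micro-batches, the window covers two and a half batches, and a heartbeat alone is enough to expire old values when no new events arrive.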
How Spark complements CQP?
• Distributed computing framework for CQP
• Out-of-the-box (OOTB) sources for data ingestion
• Horizontal scale-out
• Fault tolerance and high availability
– Spark can detect failures, restart required resources, and automatically replay the parts of the stream that experienced a failure
High-Level Architecture
[Diagram: a data stream feeds Oracle’s CQP engine; engine state is journaled to HDFS. Diagram labels:]
• Finite state automaton for pattern detection across discrete events
• Geo-sensing cartridge for spatial analytics
• RETE rules for conditional logic
• Distributed cache for ultra-fast context lookup
• Complex pattern detection, temporal queries, spatial queries, contextual queries, and conditional logic are all executed in Oracle’s CEP engine
• HDFS
– Journaled CQP engine state is serialized to HDFS after computing each partition
– CQP engine state is restored on executor restart and recompute of a partition
CQP Spark Application
[Diagram: a cluster (Spark Standalone, YARN, or Mesos) with one Spark Driver and three Spark Executors, each executor hosting a CQP instance.]
• Transformation delegated to CQP
• CQP application parsed in Spark’s Driver process
• Driver builds a DAG (job definition) from the CQP application
• Driver runs a job for each micro-batch
Spark Integration - High Level
• Extensions
– CQLDStream extends DStream overriding compute()
– CQLRDD extends RDD overriding compute()
• Initialization/DAG creation phase
– Create Spark Context, Streaming Context, and start CQP on each Executor
– Set up DAG of CQLDStream
• Physical plan execution phase
– Assign unique timestamp to each event in micro-batch based on RDD time
– Register query with CQP on RDD compute if not already registered
– Send micro-batch to CQP and accept returned results
– Checkpoint CQP state if HA enabled
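The execution-phase steps above can be sketched in plain Python. The deck does not say how unique timestamps are derived from the RDD time, so the nanosecond-offset scheme below is an assumption, and `ToyEngine` and `compute_partition` are hypothetical stand-ins rather than the Spark or CQP API.

```python
from typing import Any, Callable, Dict, List, Tuple

Stamped = Tuple[int, Any]  # (timestamp in nanoseconds, event payload)


def assign_timestamps(batch_time_ms: int, events: List[Any]) -> List[Stamped]:
    """Give each event a distinct, increasing timestamp based on the batch time.

    Assumed scheme: batch time in nanoseconds plus the event's position.
    """
    base_ns = batch_time_ms * 1_000_000
    return [(base_ns + i, ev) for i, ev in enumerate(events)]


class ToyEngine:
    """Hypothetical stand-in for a per-executor query engine."""

    def __init__(self) -> None:
        self.queries: Dict[str, Callable[[Stamped], Any]] = {}
        self.registrations = 0

    def register(self, query_id: str, fn: Callable[[Stamped], Any]) -> None:
        self.queries[query_id] = fn
        self.registrations += 1

    def process(self, query_id: str, batch: List[Stamped]) -> List[Any]:
        fn = self.queries[query_id]
        return [fn(item) for item in batch]


def compute_partition(engine: ToyEngine, query_id: str,
                      fn: Callable[[Stamped], Any],
                      batch_time_ms: int, events: List[Any]) -> List[Any]:
    """Register the query on first use, then hand the stamped batch to the engine."""
    if query_id not in engine.queries:
        engine.register(query_id, fn)
    return engine.process(query_id, assign_timestamps(batch_time_ms, events))
```

Calling `compute_partition` for successive micro-batches registers the query only once, which mirrors the "register if not already registered" step; checkpointing is left out of this sketch.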
Spark Integration - High Availability
• Executor Failure
– Restart CQP on newly restarted Executor
– On RDD compute, re-register query and restore query state from snapshots
– Continue processing stream
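The snapshot-and-restore cycle can be illustrated with a small Python sketch. It writes state to a local file with `pickle` as a stand-in for HDFS, and `CountingQuery`, `save_snapshot`, and `restore_or_create` are hypothetical names; the actual CQP snapshot format and API are not described in the deck.

```python
import os
import pickle
from typing import Dict, List


class CountingQuery:
    """Toy stateful query: running count of events per key."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, keys: List[str]) -> Dict[str, int]:
        for key in keys:
            self.counts[key] = self.counts.get(key, 0) + 1
        return dict(self.counts)


def save_snapshot(query: CountingQuery, path: str) -> None:
    """Write state to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(query.counts, f)
    os.replace(tmp_path, path)  # avoids leaving a half-written snapshot


def restore_or_create(path: str) -> CountingQuery:
    """Re-create the query and restore its state if a snapshot exists."""
    query = CountingQuery()
    if os.path.exists(path):
        with open(path, "rb") as f:
            query.counts = pickle.load(f)
    return query
```

After a simulated restart, `restore_or_create` returns a query whose counts continue from the last snapshot instead of starting from zero.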
Demo
Q & A
CQP scalability in Spark
[Charts: three bar charts, each with configurations w1(c3), w2(c5), w3(c7), w4(c9) on the x-axis:]
• Processing time (seconds) for 40 million records
• Average processing time per batch (milliseconds)
• Number of batches processed over 10 minutes
Legend: wN = N workers (executors); for example, w2(c5) means 2 executors and a total of 5 cores across both executors.

Bringing complex event processing to Spark streaming
