Apache Apex
Intro to Apex
Ingestion and Dimensions Compute for a customer use-case
Devendra Tagare
devendrat@datatorrent.com
@devtagare
9th July 2016
What is Apex
• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application
Applications on Apex
• Distributed processing
• Application logic broken into components called operators that run in a distributed fashion across your cluster
• Scalable
• Operators can be scaled up or down at runtime according to the load and SLA
• Fault tolerant
• Automatically recover from node outages without having to reprocess from beginning
• State is preserved
• Long running applications
• Operators
• Use library to build applications quickly
• Write your own in Java using the API
• Operational insight – DataTorrent RTS
• See how each operator is performing and even record data
Apex Stack Overview
Apex Operator Library - Malhar
Native Hadoop Integration
• YARN is the resource manager
• HDFS is used for storing any persistent state
Application Development Model
• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations & emits one or more output streams
ᵒ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
ᵒ An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: Directed Acyclic Graph (DAG). Operators are connected by streams of tuples and end in an output stream.]
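The development model above can be illustrated with a minimal sketch in plain Python. It does not use the Apex API (which is Java); `run_dag`, `parse`, and `keep_positive` are hypothetical names, and the DAG here is just a linear chain, the simplest possible case.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in for an operator: a function that consumes one tuple
# and emits zero or more output tuples.
Operator = Callable[[object], Iterable[object]]

def run_dag(source: Iterable[object], operators: List[Operator]) -> List[object]:
    """Push every tuple of the source stream through a linear chain of operators."""
    stream = list(source)
    for op in operators:
        next_stream = []
        for tup in stream:
            next_stream.extend(op(tup))
        stream = next_stream
    return stream

def parse(line):
    # Parser operator: split "key,value" into a (key, int) tuple.
    key, value = line.split(",")
    yield (key, int(value))

def keep_positive(tup):
    # Filter operator: emit the tuple only when its value is positive.
    if tup[1] > 0:
        yield tup

result = run_dag(["a,1", "b,-2", "c,3"], [parse, keep_positive])
print(result)  # [('a', 1), ('c', 3)]
```

Each function plays the role of one operator, and the list passed between them plays the role of a stream.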
Advanced Windowing Support
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
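As a rough illustration of the difference between tumbling and sliding windows, the sketch below assumes events were already counted per fixed-size base window. The function names and sample counts are hypothetical, and checkpoint windows are not modelled.

```python
from typing import List

def tumbling(values: List[int], size: int) -> List[int]:
    """Sum non-overlapping groups of `size` consecutive base windows."""
    return [sum(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def sliding(values: List[int], size: int, slide: int = 1) -> List[int]:
    """Sum overlapping groups of `size` base windows, advancing by `slide`."""
    return [sum(values[i:i + size]) for i in range(0, len(values) - size + 1, slide)]

counts = [1, 2, 3, 4, 5, 6]   # hypothetical event counts per base window
print(tumbling(counts, 3))    # [6, 15]
print(sliding(counts, 3))     # [6, 9, 12, 15]
```

A tumbling window is the special case of a sliding window whose slide equals its size.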
Application in Java
Partitioning and unification
[Figure: NxM partitions with a unifier. Panels: Logical DAG (operators 0, 1, 2, 3); physical diagram with operator 1 in 3 partitions (1a, 1b, 1c) followed by a unifier; physical DAG with (1a, 1b, 1c) and (2a, 2b): no bottleneck; physical DAG with (1a, 1b, 1c) and (2a, 2b): bottleneck on intermediate unifier.]
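The idea of partitioning an operator and merging the partial results in a unifier can be sketched as follows. This is an illustration in plain Python, not the Apex partitioner or unifier API; `partition`, `unify`, and the CRC32-based routing are assumptions chosen for the example.

```python
import zlib
from collections import Counter

def partition(keys, n):
    """Route each key to one of n partitions using a stable hash of the key."""
    parts = [[] for _ in range(n)]
    for key in keys:
        parts[zlib.crc32(key.encode()) % n].append(key)
    return parts

def unify(partials):
    """Unifier: merge the per-partition counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

events = ["a", "b", "a", "c", "b", "a"]
partials = [Counter(part) for part in partition(events, 3)]  # one count per partition
totals = unify(partials)
print(sorted(totals.items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

Because routing depends only on the key, every occurrence of a key lands in the same partition, and the unifier only has to add up disjoint partial counts.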
Advanced Partitioning
[Figure: Parallel Partition. Panels: Logical DAG (operators 0 to 4); physical DAG with operator 1 partitioned (1a, 1b) and a unifier; physical DAG with parallel partition (1a, 2a, 3a and 1b, 2b, 3b) and a unifier.]
[Figure: Cascading Unifiers. Panels: Logical Plan (uopr, dopr); Execution Plan, for N = 4; M = 1; Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers. Operators and unifiers run in containers connected through NICs.]
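A cascading unifier can be sketched as a multi-round merge. The example below mirrors the N = 4, K = 2 case, reading K as the maximum number of inputs per unifier, which is an assumption about the slide's notation. `cascade_unify` is a hypothetical helper, not an Apex call.

```python
def cascade_unify(partials, fan_in):
    """Merge partial sums in rounds so that no unifier reads more than `fan_in` inputs."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not partials:
        raise ValueError("need at least one partial result")
    level = list(partials)
    rounds = 0
    while len(level) > 1:
        # Each slice of up to `fan_in` inputs is handled by one unifier in this round.
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        rounds += 1
    return level[0], rounds

total, rounds = cascade_unify([10, 20, 30, 40], fan_in=2)
print(total, rounds)  # 100 2
```

With a fan-in of 4 the same input would be merged in a single round, at the cost of one unifier receiving all four streams.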
Dynamic Partitioning
• Partitioning can change while the application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to the number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner
[Figure: physical DAGs showing operators 1, 2 and 3 with different partition counts (for example 1a, 1b; 2a to 2d; 3a, 3b) as partitioning changes. Unifiers not shown.]
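Re-distribution of state when the partition count changes can be sketched as below. This is a simplified illustration that assumes keyed, additive state and hash-based key ownership; `owner` and `repartition` are hypothetical names and do not reflect the Apex partitioner interface.

```python
import zlib

def owner(key, n):
    """Partition index that owns `key` when there are n partitions."""
    return zlib.crc32(key.encode()) % n

def repartition(states, new_n):
    """Move keyed state from the old partitions onto new_n partitions."""
    new_states = [{} for _ in range(new_n)]
    for state in states:
        for key, value in state.items():
            target = new_states[owner(key, new_n)]
            # Add rather than overwrite, in case two old partitions held the same key.
            target[key] = target.get(key, 0) + value
    return new_states

old = [{"a": 3, "b": 1}, {"c": 5}]   # hypothetical state of 2 partitions
new = repartition(old, 4)            # scale out to 4 partitions
print(sum(sum(s.values()) for s in new))  # 9
```

The total is unchanged by the move; only the assignment of keys to partitions differs.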
Use Case ...
• Ingest from Kafka and S3
• Parse, Filter and Enrich
• Dimensional compute for key performance indicators
• Reporting of critical metrics around campaign monetization
• Aggregate counters & reporting on top N metrics
• Low latency querying using Kafka in pub-sub model
Screenshots - Demo UI
Proprietary and Confidential
Scale
• 6 geographically distributed data centers
• Combination of co-located & AWS-based DCs
• > 5 PB of data under management
• 22 TB / day of data generated from auction & client logs
• Heterogeneous log data formats
• More than 15 Bn impressions / day
• Average data inflow of 200K events/s
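A quick back-of-the-envelope check relates these figures to each other. The arithmetic below assumes decimal terabytes and that the 22 TB / day and the 200K events/s describe the same event stream, which the slide does not state explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

impressions_per_day = 15e9            # "15 Bn impressions / day"
bytes_per_day = 22e12                 # "22 TB / day", assuming decimal terabytes
events_per_second = 200_000           # "Average data inflow of 200K events/s"

impressions_per_second = impressions_per_day / SECONDS_PER_DAY
events_per_day = events_per_second * SECONDS_PER_DAY
avg_bytes_per_event = bytes_per_day / events_per_day

print(round(impressions_per_second))  # 173611
print(round(avg_bytes_per_event))     # 1273
```

Under these assumptions, impressions alone account for roughly 174K events/s, and the average event works out to about 1.3 KB.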
Initial Requirements
• Ad server log events consumed as Avro-encoded, Snappy-compressed files from S3. New files uploaded every 10-20 minutes.
• Data may arrive in S3 out of order (by timestamp).
• Event size is about 2 KB uncompressed; only a subset of fields is retrieved for aggregation.
• Aggregates kept in memory (checkpointed) with an expiration policy; query processing against in-memory data.
• Front-end integration through a Kafka-based query protocol for real-time dashboard components.
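The combination of out-of-order arrival and in-memory aggregates with an expiration policy can be sketched as follows. This is a minimal illustration, assuming time-bucketed sums and a retention measured in buckets; the class name, its parameters, and the sample events are hypothetical.

```python
from collections import defaultdict

class ExpiringAggregates:
    """In-memory sums per (time bucket, key); only the newest `retain_buckets` buckets are kept."""

    def __init__(self, bucket_seconds, retain_buckets):
        self.bucket_seconds = bucket_seconds
        self.retain_buckets = retain_buckets
        self.sums = defaultdict(int)
        self.latest_bucket = None

    def add(self, event_time, key, value):
        bucket = event_time // self.bucket_seconds
        if self.latest_bucket is None or bucket > self.latest_bucket:
            self.latest_bucket = bucket
        oldest_kept = self.latest_bucket - self.retain_buckets + 1
        if bucket < oldest_kept:
            return False  # too late: the event's bucket has already expired
        self.sums[(bucket, key)] += value
        # Expire buckets that fell out of the retention range.
        for stale in [k for k in self.sums if k[0] < oldest_kept]:
            del self.sums[stale]
        return True

agg = ExpiringAggregates(bucket_seconds=60, retain_buckets=2)
agg.add(10, "campaign-1", 1)    # bucket 0
agg.add(70, "campaign-1", 1)    # bucket 1
agg.add(50, "campaign-1", 1)    # out of order, but bucket 0 is still retained
agg.add(130, "campaign-1", 1)   # bucket 2 arrives, bucket 0 expires
late = agg.add(20, "campaign-1", 1)
print(late, dict(agg.sums))     # False {(1, 'campaign-1'): 1, (2, 'campaign-1'): 1}
```

Out-of-order events are accepted as long as their bucket is still retained; anything older is dropped, which bounds memory use.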
Architecture 1.0 - Batch Reads + Streaming Aggregations
[Figure: Real-time architecture, powered by Apex. Components: AdServer, REST proxy, S3 (auction logs and client logs), S3Reader, Filter Operator, Dimensions Aggregator, Dimensions Store, Query, Query Result, Kafka cluster, Middleware. Edge labels: Auction Logs, Filtered Events, Aggregates, Query from MW, Query, Query Results.]
Challenges
• Unstable S3 client libraries
– Unpredictable hangs and corrupted data
– On a hang, the master kills the container and restarts reading of the file from a different container
– Corrupt files caused containers to be killed; handled with an application-configurable retry mechanism that skips bad files
– Limited read throughput: 1 reader per file
• Out-of-order data
– Some timestamps in the future and in the past
• Spike in load when new files are added, followed by a period of inactivity
• Memory requirement for the Store
– Cardinality estimation for incoming data
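A configurable retry-then-skip policy for unreliable file reads can be sketched as below. This is an illustration of the idea only; `read_with_retry`, the fake reader, and the file names are hypothetical and no real S3 client is involved.

```python
def read_with_retry(paths, read_file, max_retries):
    """Try each file up to max_retries + 1 times; skip files that keep failing."""
    records, skipped = [], []
    for path in paths:
        for _attempt in range(max_retries + 1):
            try:
                records.extend(read_file(path))
                break                  # file read successfully, move on
            except OSError:
                continue               # transient or permanent failure: try again
        else:
            skipped.append(path)       # retries exhausted: treat as a bad file
    return records, skipped

attempts = {}

def fake_read(path):
    # Hypothetical reader: one file is always corrupt, one fails only on the first attempt.
    attempts[path] = attempts.get(path, 0) + 1
    if path == "corrupt.avro":
        raise OSError("corrupt file")
    if path == "flaky.avro" and attempts[path] == 1:
        raise OSError("transient hang")
    return [path + ":record"]

records, skipped = read_with_retry(
    ["ok.avro", "flaky.avro", "corrupt.avro"], fake_read, max_retries=2)
print(records, skipped)  # ['ok.avro:record', 'flaky.avro:record'] ['corrupt.avro']
```

Skipping after a bounded number of attempts keeps one bad file from stalling the rest of the ingestion.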
Architecture 2.0 - Batch + Streaming
[Figure: Real-time architecture, powered by Apex. Components: AdServer, REST proxy, Kafka cluster, Kafka Input (Auction logs), S3, S3Reader, ETL operator, Filter Operator, Dimensions Aggregator, Dimensions Store/HDHT, Query, Query Result, Middleware. Edge labels: Auction Logs, Client logs, Kafka Messages, Decompress & Flatten, Filtered Events, Aggregates, Query from MW, Query, Query Results.]
Challenges
• Complex Logical DAG
• Kafka Operator Issues
– Dynamic Partitioning
– Memory Configuration
– Offset snapshotting to ensure exactly once semantics
• Resource Allocation
– Higher memory requirement for the Store (large number of unifiers)
• Harder debugging (more components)
– GB(s) of container logs
– Difficult to locate the sequence of failures
• More data transferred over the wire within the cluster
• Limit Kafka read rate
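One simple way to limit a read rate is to cap the number of messages consumed per processing window. The sketch below assumes a per-window cap. `drain_in_windows` is a hypothetical helper, not a Kafka or Apex operator setting.

```python
from collections import deque

def drain_in_windows(messages, max_per_window):
    """Split a backlog into per-window batches of at most max_per_window messages."""
    if max_per_window < 1:
        raise ValueError("max_per_window must be at least 1")
    backlog = deque(messages)
    windows = []
    while backlog:
        take = min(max_per_window, len(backlog))
        windows.append([backlog.popleft() for _ in range(take)])
    return windows

windows = drain_in_windows(range(7), max_per_window=3)
print(windows)  # [[0, 1, 2], [3, 4, 5], [6]]
```

A cap like this spreads a burst over several windows instead of pushing it downstream all at once.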
Architecture 3.0 - Streaming
[Figure: Real-time architecture, powered by Apex. Components: User Browser, CDN (caching of logs), AdServer, REST proxy, Kafka cluster, Kafka Input (Auction logs), Kafka Input (Client logs), ETL operator, Filter Operator, Dimensions Aggregator, Dimensions Store, Query, Query Result, Middleware. Edge labels: Auction Logs, Client logs, Kafka Messages, Decompress & Flatten, Filtered Events, Aggregates, Query from MW, Query, Query Results.]
Operational Architecture
Application Configuration
• 64 Kafka Input operators reading from 6 geographically distributed DCs
• Under 40 seconds end-to-end latency, from ad serving to visualization
• 32 instances of the in-memory distributed store
• 64 aggregators
• 1.2 TB memory footprint at peak load
• The in-memory store was later replaced by HDHT for fault tolerance
Learnings
• DAG: sizing, locality & partitioning (benchmark)
• Memory sizing for the store or other memory-heavy operators.
• Cardinality estimation for incoming data is critical.
• Upstream operators tend to require more memory than downstream operators for high-velocity reads.
• Back pressure from downstream failures (due to skew in velocity of events) and upstream failures: Buffer Server sizing is critical.
• For end-to-end exactly-once, it is necessary to understand the external systems' semantics & delivery guarantees.
• Think about fault tolerance & recovery before starting implementation.
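To show why cardinality estimation drives memory sizing, the sketch below computes an upper bound on the number of stored aggregates. All numbers (key cardinalities, retained buckets, bytes per aggregate) are hypothetical and are not taken from the use case.

```python
from math import prod

def max_aggregates(cardinalities, combinations, buckets_retained):
    """Upper bound on stored aggregates: for each combination, the product of its
    key cardinalities, summed over combinations and multiplied by the retained time buckets."""
    per_bucket = sum(prod(cardinalities[k] for k in combo) for combo in combinations)
    return per_bucket * buckets_retained

cardinalities = {"campaignId": 1_000, "adId": 5_000}        # hypothetical
combinations = [("campaignId",), ("campaignId", "adId")]    # hypothetical
upper_bound = max_aggregates(cardinalities, combinations, buckets_retained=24)
print(upper_bound)  # 120024000

bytes_per_aggregate = 100                                   # hypothetical
print(upper_bound * bytes_per_aggregate / 1e9, "GB")
```

Real data rarely hits the full cross product, so a bound like this is a worst case. It still shows how adding one high-cardinality key to a combination multiplies the memory requirement.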
Before And After
Before Scenario: No Real-time (5 Hours + 20 Minutes)
• No real-time processing system in place.
• Publishers and buyers could only rely on a batch processing system for gathering relevant data
• Outdated data, not relevant to current time
• Current data being pushed to a waiting queue
• Cumbersome batch-processing lifecycle
• No visualization for reports
• No glimpse into everyday happenings, translating to lost decisions or untimely decision-making scenarios
After Scenario (~ 30 seconds)
Phase 1: Batch + Real-time
• With DataTorrent RTS (built on Apache Apex), the dev team put together the first real-time analytics platform
• This enabled reporting of critical metrics around campaign monetization
• Reuse of the batch ingestion mechanism for the impression data, shared with other pipelines (S3)
Phase 2: Real-time Streaming
• Reduce end-to-end latency through real-time ingestion of impression data from Kafka
• Results available much sooner to the user
• Balances load (no more batch ingestion spikes), reduces resource consumption
• Handles ever-growing traffic with more efficient resource utilization.
Operators used
S3 reader (File Input Operator)
• Recursively reads the contents of an S3 bucket based on a partitioning pattern
• Inclusion & exclusion support
• Fault tolerance (replay and idempotent)
• Throughput of over 12K reads/second for event size of 1.2 KB each
Kafka Input Operator
• Ability to consume from multiple Kafka clusters
• Offset management support
• Fault tolerant reads
• Support for idempotent & exactly once semantics
• Controlled reads for managing back-pressure
POJO Enrichment Operator
• Takes a POJO as input and does a look-up in a store for a given key
• Supports caching
• Stores are pluggable
• App builder ready
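The enrichment pattern described above (key look-up in a pluggable store, with caching) can be sketched as follows. The class, field names, and the dictionary-backed store are hypothetical. This is not the Malhar enrichment operator's API.

```python
class CachingEnricher:
    """Adds fields looked up by key from a pluggable store, caching each result."""

    def __init__(self, lookup, key_field):
        self.lookup = lookup          # pluggable store: any callable key -> dict of extra fields
        self.key_field = key_field
        self.cache = {}
        self.store_lookups = 0

    def enrich(self, record):
        key = record[self.key_field]
        if key not in self.cache:
            self.cache[key] = self.lookup(key)   # only hit the store on a cache miss
            self.store_lookups += 1
        return {**record, **self.cache[key]}

store = {"c1": {"advertiser": "acme"}}           # hypothetical backing store
enricher = CachingEnricher(lambda key: store.get(key, {}), "campaignId")
out1 = enricher.enrich({"campaignId": "c1", "clicks": 1})
out2 = enricher.enrich({"campaignId": "c1", "clicks": 2})
print(out2, enricher.store_lookups)
```

Because the store is just a callable, a database, file, or in-memory map can be swapped in without changing the enrichment logic.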
Operators used (cont …)
Parser
• Specify JSON schema
• Emits a POJO based on the output schema
• No user code required
Dimension Store
• Distributed in-memory store
• Supports re-aggregation of events
• Partitioning of aggregates per view
• Low latency query support with a pub/sub model using Kafka
HDHT
• HDFS backed embedded key-value store
• Fault tolerant, random read & write
• Durability in case of cold restarts
Dimensional Model - Key Concepts
Metrics: pieces of information we want to collect statistics about.
Dimensions: variables which can impact our metrics.
Combinations: sets of dimensions for which one or more metrics are aggregated. They are subsets of the dimensions.
Aggregations: the aggregate functions, e.g. SUM, TOPN, standard deviation.
Example:
Dimensions - campaignId, advertiserId, time
Metrics - cost, revenue, clicks, impressions
Aggregate functions - SUM, AM, etc.
Combinations:
1. campaignId x time - cost, revenue
2. advertiserId - revenue, impressions
3. campaignId x advertiserId x time - revenue, clicks, impressions
How to aggregate on the combinations?
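One way to answer this question is to keep a separate set of running sums per combination. The sketch below uses the three combinations from the example with SUM as the only aggregate function; the sample events are hypothetical, and this is plain Python rather than the Apex dimensions operator.

```python
from collections import defaultdict

# Combinations from the example: (dimension keys, metrics to SUM).
COMBINATIONS = [
    (("campaignId", "time"), ("cost", "revenue")),
    (("advertiserId",), ("revenue", "impressions")),
    (("campaignId", "advertiserId", "time"), ("revenue", "clicks", "impressions")),
]

def aggregate(events):
    """Return {combination keys: {dimension values: {metric: sum}}}."""
    result = {keys: defaultdict(lambda: defaultdict(int)) for keys, _ in COMBINATIONS}
    for event in events:
        for keys, metrics in COMBINATIONS:
            group = tuple(event[k] for k in keys)   # e.g. (1, "10:00")
            for metric in metrics:
                result[keys][group][metric] += event[metric]
    return result

events = [  # hypothetical sample events
    {"campaignId": 1, "advertiserId": 7, "time": "10:00",
     "cost": 2, "revenue": 5, "clicks": 1, "impressions": 10},
    {"campaignId": 1, "advertiserId": 8, "time": "10:00",
     "cost": 3, "revenue": 4, "clicks": 0, "impressions": 20},
]
totals = aggregate(events)
print(dict(totals[("campaignId", "time")][(1, "10:00")]))  # {'cost': 5, 'revenue': 9}
```

Each incoming event updates one group per combination, so the work per event grows with the number of combinations rather than with the number of distinct key values.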
Dimensional Model
Dimensions Schema
{"keys":[{"name":"campaignId","type":"integer"},
{"name":"adId","type":"integer"},
{"name":"creativeId","type":"integer"},
{"name":"publisherId","type":"integer"},
{"name":"adOrderId","type":"integer"}],
"timeBuckets":["1h","1d"],
"values":
[{"name":"impressions","type":"integer","aggregators":["SUM"]},
{"name":"clicks","type":"integer","aggregators":["SUM"]},
{"name":"revenue","type":"integer"}],
"dimensions":
[{"combination":["campaignId","adId"]},
{"combination":["creativeId","campaignId"]},
{"combination":["campaignId"]},
{"combination":["publisherId","adOrderId","campaignId"],"additionalValues":["revenue:SUM"]}]
}
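To make the schema concrete, the sketch below expands it into the list of (combination, time bucket, aggregates) views it describes. It reads `additionalValues` as extra value:aggregator pairs that apply to that combination only, which is an interpretation rather than something the slide states; `expand` is a hypothetical helper.

```python
import json

SCHEMA = json.loads("""
{"keys":[{"name":"campaignId","type":"integer"},
         {"name":"adId","type":"integer"},
         {"name":"creativeId","type":"integer"},
         {"name":"publisherId","type":"integer"},
         {"name":"adOrderId","type":"integer"}],
 "timeBuckets":["1h","1d"],
 "values":[{"name":"impressions","type":"integer","aggregators":["SUM"]},
           {"name":"clicks","type":"integer","aggregators":["SUM"]},
           {"name":"revenue","type":"integer"}],
 "dimensions":[{"combination":["campaignId","adId"]},
               {"combination":["creativeId","campaignId"]},
               {"combination":["campaignId"]},
               {"combination":["publisherId","adOrderId","campaignId"],
                "additionalValues":["revenue:SUM"]}]}
""")

def expand(schema):
    """List one (combination, time bucket, aggregates) row per view in the schema."""
    # Aggregates that apply to every combination, e.g. "impressions:SUM".
    base = [f'{v["name"]}:{a}' for v in schema["values"] for a in v.get("aggregators", [])]
    rows = []
    for dim in schema["dimensions"]:
        aggregates = base + dim.get("additionalValues", [])
        for bucket in schema["timeBuckets"]:
            rows.append((tuple(dim["combination"]), bucket, aggregates))
    return rows

rows = expand(SCHEMA)
print(len(rows))  # 8
for combination, bucket, aggregates in rows:
    print(combination, bucket, aggregates)
```

Four combinations times two time buckets give eight views, and under this reading only the last combination carries a revenue sum.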
More Use-cases
• Real-time Monitoring
Alerts on deal tracking & monetization
Campaign & deal health
• Real-time Learning
Using the lost bid insights for price recommendations.
• Allocation Engine
Feedback to ad serving for guaranteed delivery & line item pacing
Data Processing Pipeline Example
App Builder
Monitoring Console
Logical View
Monitoring Console
Physical View
Real-Time Dashboards
Real Time Visualization
Q&A
Resources
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups – http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/

Ingestion and Dimensions Compute and Enrich using Apache Apex

  • 1. Apache Apex Intro to Apex Ingestion and Dimensions Compute for a customer use-case Devendra Tagare [email protected] @devtagare 9h July 2016
  • 2. What is Apex 2 • Platform and runtime engine that enables development of scalable and fault- tolerant distributed applications • Hadoop native • Process streaming or batch big data • High throughput and low latency • Library of commonly needed business logic • Write any custom business logic in your application
  • 3. Applications on Apex 3 • Distributed processing • Application logic broken into components called operators that run in a distributed fashion across your cluster • Scalable • Operators can be scaled up or down at runtime according to the load and SLA • Fault tolerant • Automatically recover from node outages without having to reprocess from beginning • State is preserved • Long running applications • Operators • Use library to build applications quickly • Write your own in Java using the API • Operational insight – DataTorrent RTS • See how each operator is performing and even record data
  • 6. Native Hadoop Integration 6 • YARN is the resource manager • HDFS used for storing any persistent state
  • 7. Application Development Model 7  A Stream is a sequence of data tuples  A typical Operator takes one or more input streams, performs computations & emits one or more output streams • Each Operator is YOUR custom business logic in java, or built-in operator from our open source library • Operator has many instances that run in parallel and each instance is single-threaded  Directed Acyclic Graph (DAG) is made up of operators and streams Directed Acyclic Graph (DAG) Output StreamTuple Tuple er Operator er Operator er Operator er Operator er Operator er Operator
  • 8. Advanced Windowing Support 8  Application window  Sliding window and tumbling window  Checkpoint window  No artificial latency
  • 10. Partitioning and unification 10 NxM PartitionsUnifier 0 1 2 3 Logical DAG 0 1 2 1 1 Unifier 1 20 Logical Diagram Physical Diagram with operator 1 with 3 partitions 0 Unifier 1a 1b 1c 2a 2b Unifier 3 Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck Unifier Unifier0 1a 1b 1c 2a 2b Unifier 3 Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier
  • 11. Advanced Partitioning 11 0 1a 1b 2 3 4Unifier Physical DAG 0 4 3a2a1a 1b 2b 3b Unifier Physical DAG with Parallel Partition Parallel Partition Container uopr uopr1 uopr2 uopr3 uopr4 uopr1 uopr2 uopr3 uopr4 dopr dopr doprunifier unifier unifier unifier Container Container NICNIC NICNIC NIC Container NIC Logical Plan Execution Plan, for N = 4; M = 1 Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers Cascading Unifiers 0 1 2 3 4 Logical DAG
  • 12. Dynamic Partitioning 12 • Partitioning change while application is running ᵒ Change number of partitions at runtime based on stats ᵒ Determine initial number of partitions dynamically • Kafka operators scale according to number of kafka partitions ᵒ Supports re-distribution of state when number of partitions change ᵒ API for custom scaler or partitioner 2b 2c 3 2a 2d 1b 1a1a 2a 1b 2b 3 1a 2b 1b 2c 3b 2a 2d 3a Unifiers not shown
  • 13. • Ingest from Kafka and S3 • Parse, Filter and Enrich • Dimensional compute for key performance indicators • Reporting of critical metrics around campaign monetization • Aggregate counters & reporting on top N metrics • Low latency querying using Kafka in pub-sub model Use Case ...
  • 15. Proprietary and Confidential Scale • 6 geographically distributed data centers • Combination of co-located & AWS based DC’s. • > 5 PB under the data management • 22 TB / day data generated from auction & client logs • heterogeneous data log formats • North of 15 Bn impressions / day. • Average data inflow of 200K events/s 15
  • 16. • Ad server log events consumed as Avro encoded, Snappy compressed files from S3. New files uploaded every 10-20 minutes. • Data may arrive in S3 out of order (time stamps). • Event size is about 2KB uncompressed, only subset of fields retrieved for aggregation. • Aggregates kept in memory (checkpointed) with expiration policy, query processing against in memory data. • Front-end integration through Kafka based query protocol for realtime dashboard components. Initial Requirements
  • 17. Proprietary and Confidential Apache Apex 17 AdServer REST proxy REST proxy Real-time architecture- Powered By Apex Kafka Cluster S3Reader S3Reader Filter Operator Filter Operator Dimensions Aggregator Dimensions Aggregator Dimensions Store Query Query Result Kafka Cluster Auction Logs Middleware Auction Logs Filtered Events Filtered Events Aggregates Query from MW Query Query Results S3 S3 Client logsAuction Logs Architecture 1.0 - Batch Reads + Streaming Aggregations
  • 18. • Unstable S3 client libraries – Unpredictable hangs and Corrupted data – On Hang, Master kills the container and restart reading of file from different container – Corrupt files caused containers to kill – application configurable retry mechanism and skip bad files – Limited read consumption throughput – 1 reader/file • Out of Order data – Some timestamp in future and past • Spike in load when new files are added followed by period of inactivity • Memory Requirement for Store – Cardinality Estimation for incoming data Challenges
  • 19. Proprietary and Confidential Apache Apex 19 REST proxy Real-time architecture- Powered By Apex Client logs Kafka Input (Auction logs) ETL operator Filter Operator Filter Operator Dimensions Aggregator Dimensions Aggregator Dimensions Store/HDHT Query Query Result Kafka Cluster Auction Logs Kafka Cluster Middleware AdServer REST proxy Kafka Cluster Auction Logs Client logs Kafka Messages Decompress & Flatten Filtered Events Filtered Events Aggregates Query from MW Query Query Results S3 S3Reader Kafka Input (Auction logs) Auction Logs Architecture 2.0 - Batch + Streaming
  • 20. Challenges • Complex Logical DAG • Kafka Operator Issues – Dynamic Partitioning – Memory Configuration – Offset snapshotting to ensure exactly once semantics • Resource Allocation – More memory requirement for Store (Large number of Unifiers) • Harder Debugging (More number of components) – GB(s) of container logs – Difficult to locate the sequence of failure • More of data transferred over wire within cluster • Limit Kafka read rate
  • 21. Proprietary and Confidential 21 User Browser AdServer REST proxy REST proxy Real-time architecture- Powered By Apex Kafka Cluster Client logs Kafka Input (Auction logs) Kafka Input (Client logs) CDN (Caching of logs) ETL operator ETL operator Filter Operator Filter Operator Dimensions Aggregator Dimensions Aggregator Dimensions Store Query Query Result Kafka Cluster Auction Logs Client logs Middleware Auction Logs Client logs Kafka Messages Kafka Messages Decompress & Flatten Decompress & Flatten Filtered Events Filtered Events Aggregates Query from MW Query Query Results Architecture 3.0 - Streaming
  • 23. Application Configuration • 64 Kafka Input operators reading from 6 geographically distributed data centers • Under 40 seconds end-to-end latency, from ad serving to visualization • 32 instances of the in-memory distributed store • 64 aggregators • 1.2 TB memory footprint at peak load • The in-memory store was later replaced by HDHT for fault tolerance
  • 24. Learnings • DAG – sizing, locality & partitioning (benchmark) • Memory sizing for the store and other memory-heavy operators • Cardinality estimation for incoming data is critical • Upstream operators tend to require more memory than downstream operators for high-velocity reads • Back pressure from downstream failures, due to skew in the velocity of events and upstream failures, makes Buffer Server sizing critical • For end-to-end exactly-once, it is necessary to understand the external systems' semantics & delivery guarantees • Think about fault tolerance & recovery before starting implementation
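To illustrate why cardinality estimation drives memory sizing for the store, here is a rough back-of-the-envelope sketch. It assumes the worst case, where every combination of dimension values occurs, so the key count for a combination is the product of its dimensions' cardinalities. `StoreSizing`, the per-aggregate byte size, and any numbers fed into it are hypothetical and not taken from the talk.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: worst-case memory estimate for a dimensions store. */
final class StoreSizing {

    /** Upper bound on distinct keys for one combination: product of its dimensions' cardinalities. */
    static long maxKeys(Map<String, Long> cardinality, List<String> combination) {
        long keys = 1;
        for (String dim : combination) {
            Long c = cardinality.get(dim);
            if (c == null) {
                throw new IllegalArgumentException("no cardinality estimate for " + dim);
            }
            keys = Math.multiplyExact(keys, c);   // fails loudly on overflow
        }
        return keys;
    }

    /** Sums the worst-case key counts over all combinations and multiplies by an assumed entry size. */
    static long estimateBytes(Map<String, Long> cardinality,
                              List<List<String>> combinations,
                              long bytesPerAggregate) {
        long total = 0;
        for (List<String> combination : combinations) {
            long bytes = Math.multiplyExact(maxKeys(cardinality, combination), bytesPerAggregate);
            total = Math.addExact(total, bytes);
        }
        return total;
    }
}
```

Real key counts are usually well below this bound because not every combination of values occurs, so measured cardinalities from sample data give a tighter estimate.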
  • 25. Before And After • Before Scenario (No Real-time) – No real-time processing system in place – Publishers and buyers could only rely on a batch processing system for gathering relevant data – Outdated data, not relevant to current time – Current data being pushed to a waiting queue – Cumbersome batch-processing lifecycle – No visualization for reports – No glimpse into everyday happenings, translating to lost decisions or untimely decision-making scenarios • After Scenario, Phase 1 (Batch + Real-time) – With DataTorrent RTS (built on Apache Apex), the dev team put together the first real-time analytics platform – This enabled reporting of critical metrics around campaign monetization – Reuse of the batch ingestion mechanism for the impression data, shared with other pipelines (S3) • After Scenario, Phase 2 (Real-time Streaming) – Reduce end-to-end latency through real-time ingestion of impression data from Kafka – Results available much sooner to the user – Balances load (no more batch ingestion spikes), reduces resource consumption – Handles ever-growing traffic with more efficient resource utilization • Latency labels on the slide: 5 Hours + 20 Minutes; ~30 seconds
  • 26. Operators used • S3 reader (File Input Operator) – Recursively reads the contents of an S3 bucket based on a partitioning pattern – Inclusion & exclusion support – Fault tolerance (replay and idempotent) – Throughput of over 12K reads/second for an event size of 1.2 KB each • Kafka Input Operator – Ability to consume from multiple Kafka clusters – Offset management support – Fault-tolerant reads – Support for idempotent & exactly-once semantics – Controlled reads for managing back pressure • POJO Enrichment Operator – Takes a POJO as input and does a look-up in a store for a given key – Supports caching – Stores are pluggable – App Builder ready
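The enrichment pattern described above (key look-up in a pluggable store, with caching) can be sketched in plain Java as follows. This is not the Malhar enrichment operator: `CachingEnricher` is a hypothetical class, the store is modelled as a simple `Function`, and the cache is a small LRU built on `LinkedHashMap`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: key look-up against a pluggable store, fronted by an LRU cache. */
final class CachingEnricher<K, V> {

    private final Function<K, V> store;     // pluggable back end (assumed shape)
    private final Map<K, V> cache;
    private int storeLookups;               // counts cache misses that reached the store

    CachingEnricher(Function<K, V> store, final int capacity) {
        this.store = store;
        // Access-ordered LinkedHashMap evicts the least recently used entry once over capacity.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the enrichment value for the key, or null if the store has none. */
    V lookup(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = store.apply(key);
            storeLookups++;
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    int storeLookups() {
        return storeLookups;
    }
}
```

Because the store is just a function of the key, a database client, a file-backed map, or a remote service could be swapped in without changing the look-up path.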
  • 27. Operators used (cont.) • Parser – Specify a JSON schema – Emits a POJO based on the output schema – No user code required • Dimension Store – Distributed in-memory store – Supports re-aggregation of events – Partitioning of aggregates per view – Low-latency query support with a pub/sub model using Kafka • HDHT – HDFS-backed embedded key-value store – Fault-tolerant random reads & writes – Durability in case of cold restarts
  • 28. Dimensional Model - Key Concepts • Metrics: pieces of information we want to collect statistics about. • Dimensions: variables which can impact our metrics. • Combinations: sets of dimensions for which one or more metrics would be aggregated; they are subsets of the dimensions. • Aggregations: the aggregate functions, e.g. SUM, TOPN, standard deviation. • Example: Dimensions - campaignId, advertiserId, time; Metrics - cost, revenue, clicks, impressions; Aggregate functions - SUM, AM, etc.; Combinations: 1. campaignId x time - cost, revenue 2. advertiserId - revenue, impressions 3. campaignId x advertiserId x time - revenue, clicks, impressions • How to aggregate on the combinations?
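As one possible answer to the question on this slide, the sketch below keeps a running SUM per metric for every configured combination, keyed by the event's values for that combination's dimensions. It is a minimal plain-Java illustration, not the Malhar dimensions computation operators: `DimensionsSumSketch`, the `|`-joined key format, and the string time bucket are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch: SUM aggregation of metrics over configured dimension combinations. */
final class DimensionsSumSketch {

    // combination (list of dimension names) -> metrics to aggregate for that combination
    private final Map<List<String>, List<String>> metricsByCombination;

    // combination -> key built from the dimension values -> metric -> running SUM
    private final Map<List<String>, Map<String, Map<String, Double>>> aggregates = new HashMap<>();

    DimensionsSumSketch(Map<List<String>, List<String>> metricsByCombination) {
        this.metricsByCombination = metricsByCombination;
    }

    /** Folds one event into every configured combination. */
    void process(Map<String, String> dimensions, Map<String, Double> metrics) {
        for (Map.Entry<List<String>, List<String>> e : metricsByCombination.entrySet()) {
            // Key = the event's values for this combination's dimensions, e.g. "c1|t1".
            StringJoiner key = new StringJoiner("|");
            for (String dim : e.getKey()) {
                key.add(dimensions.get(dim));
            }
            Map<String, Double> sums = aggregates
                    .computeIfAbsent(e.getKey(), c -> new HashMap<>())
                    .computeIfAbsent(key.toString(), k -> new HashMap<>());
            for (String metric : e.getValue()) {
                sums.merge(metric, metrics.getOrDefault(metric, 0.0), Double::sum);
            }
        }
    }

    /** Returns the running SUM, or 0.0 if nothing has been aggregated for that key and metric. */
    double sum(List<String> combination, String key, String metric) {
        Map<String, Map<String, Double>> byKey = aggregates.get(combination);
        if (byKey == null) {
            return 0.0;
        }
        Map<String, Double> sums = byKey.get(key);
        if (sums == null) {
            return 0.0;
        }
        Double value = sums.get(metric);
        return value == null ? 0.0 : value;
    }
}
```

With the slide's example, the combination `campaignId x time` would be configured with `cost` and `revenue`, and each incoming event would update the entry for its own campaign and time bucket. SUM is used here because partial sums can be merged later; other aggregate functions would need their own merge logic.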
  • 30. More Use-cases • Real-time Monitoring – Alerts on deal tracking & monetization – Campaign & deal health • Real-time Learning – Using the lost bid insights for price recommendations • Allocation Engine – Feedback to ad serving for guaranteed delivery & line item pacing
  • 31. Data Processing Pipeline Example - App Builder
  • 34. Real-Time Dashboards - Real Time Visualization
  • 36. Resources • http://apex.apache.org/ • Learn more: http://apex.apache.org/docs.html • Subscribe - http://apex.apache.org/community.html • Download - http://apex.apache.org/downloads.html • Follow @ApacheApex - https://twitter.com/apacheapex • Meetups – http://www.meetup.com/pro/apacheapex/ • More examples: https://github.com/DataTorrent/examples • Slideshare: http://www.slideshare.net/ApacheApex/presentations • https://www.youtube.com/results?search_query=apache+apex • Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/

Editor's Notes

  • #27: Thomas – Mention these are extensions of malhar (Open Source)