Pulsar: Real-time Analytics at Scale
with Kafka, Kylin and Druid
September 2016
Tony Ng
Director of Engineering
EBAY
BUSINESS
Evolving from our Auction Roots
New Items: 80%
Fixed Price Items: 86%
Ships for Free: 65%
*Q2 2016 data
EBAY AT A GLANCE
$8.6B Revenue in 2015
$82B GMV in 2015
1B Live Listings*
164M Global Active Buyers*
190 Countries eBay apps are available in#
326M App downloads*
*Q2 2016 data
#Q4 2015 data
CHECKOUT, WATCH LIST, SHARE, LOCAL, PREFERENCES, BID, CLICK, SEARCH
10’s TB USER BEHAVIORAL DATA / DAY
10’s B DATABASE QUERIES / DAY
100’s PB DATA
ENTERPRISE DATA PLATFORM
Data Stores
Data Streams & Processing
Machine Learning
Enterprise Data Ecosystem
Data Ingestion
Personalization / Optimization
Insights / Reporting
Kylin
Open-source real-time analytics and stream processing framework
Pulsar Stream
• Focus on user behavioral data processing
• Complex event processing
– Streaming SQL with extensible annotations
– Java
• SQL for common stream operations (filtering, mutation, aggregation) with time windows
• Declarative topology construction
• Each stage can adopt its own release and deployment cycles
• Dynamic partitioning and flow control
Multi Stage Distributed Pipeline
Event Filtering and Routing Example
// create filtered stream
insert into FilteredStream select guid, evt_type, C1, C2, C3
from RawStream where evt_type = 'bid';
// publish and route filtered stream
@PublishOn(topics="Topic1")
@Output("OutboundChannel")
@ClusterAffinityTag(column = guid)
select * from FilteredStream;
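The filter-then-route idea above can be sketched outside the stream engine. This is a minimal Python illustration, not Pulsar's implementation: the function names, the sample events and the use of a CRC32 hash modulo the node count are all assumptions made for the example. It shows the two steps the annotations describe: keep only `bid` events with the selected columns, then pick a downstream node from the affinity column `guid` so that all events of one user reach the same node.

```python
import zlib

# Columns projected by the filtering statement on the slide.
COLUMNS = ("guid", "evt_type", "C1", "C2", "C3")


def filter_events(events, evt_type):
    """Keep only events of the given type and project the selected columns."""
    return [
        {column: event.get(column) for column in COLUMNS}
        for event in events
        if event.get("evt_type") == evt_type
    ]


def route(event, num_nodes):
    """Pick a node index from the affinity key (hypothetical hash-mod scheme)."""
    return zlib.crc32(event["guid"].encode("utf-8")) % num_nodes


raw_stream = [
    {"guid": "u1", "evt_type": "bid", "C1": 1, "C2": 2, "C3": 3, "extra": "x"},
    {"guid": "u2", "evt_type": "click", "C1": 4, "C2": 5, "C3": 6},
    {"guid": "u1", "evt_type": "bid", "C1": 7, "C2": 8, "C3": 9},
]

filtered_stream = filter_events(raw_stream, "bid")
nodes = [route(event, num_nodes=4) for event in filtered_stream]
```

Both remaining events belong to `u1`, so they map to the same node index. That property is what makes per-user state such as sessions local to one node.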
Aggregate Computation Example
// create 10-second time window context
create context MCContext start @now end pattern
[timer:interval(10)];
// create aggregated stream within the specified time window
context MCContext insert into AggStream
select count(*) as M1, guid, evt_type from RawStream
group by guid, evt_type output snapshot when terminated;
// publish aggregated stream
select * from AggStream;
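To make the windowed count concrete, here is a small Python sketch of the same aggregation: count events per (guid, evt_type) inside consecutive 10-second windows. It is an illustration only. The event tuples are made up, and windows are aligned to multiples of 10 seconds for simplicity, whereas the context above starts at engine time (`@now`).

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 10


def aggregate(events, window_seconds=WINDOW_SECONDS):
    """Count events per (guid, evt_type) in fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, guid, evt_type).
    Returns {window_start: {(guid, evt_type): count}}.
    """
    windows = defaultdict(Counter)
    for ts, guid, evt_type in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][(guid, evt_type)] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


events = [
    (0, "u1", "bid"),
    (3, "u1", "bid"),
    (9, "u2", "click"),
    (12, "u1", "bid"),
]
agg_stream = aggregate(events)
```

Each entry of `agg_stream` corresponds to one snapshot emitted when a window terminates.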
Stream Aggregation Time Window
Sliding Window
Tumbling Window
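The slide only names the two window types. As a sketch using their common definitions (an assumption, since the deck does not define them): a tumbling window assigns each event to exactly one fixed interval, while a sliding window of the same size advances by a smaller step, so one event can fall into several overlapping windows. The function and parameter names below are illustrative.

```python
def tumbling_windows(ts, size):
    """Return the single [start, end) window of length `size` containing ts."""
    start = (ts // size) * size
    return [(start, start + size)]


def sliding_windows(ts, size, slide):
    """Return every [start, end) window containing ts.

    Windows have length `size` and start at non-negative multiples of `slide`.
    """
    # Smallest multiple of `slide` that is strictly greater than ts - size.
    first = ((ts - size) // slide + 1) * slide
    first = max(first, 0)
    return [(start, start + size) for start in range(first, ts + 1, slide)]


tumbling = tumbling_windows(12, size=10)
sliding = sliding_windows(12, size=10, slide=5)
```

With a 10-second size, the event at t=12 belongs to one tumbling window but to two sliding windows when the slide step is 5 seconds.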
TopN Computation Example
// create 60-second time window context
create context MCContext start @now end pattern
[timer:interval(60)];
// create topN stream via sorting
context MCContext insert into TopNStream
select count(*) as M1, guid, evt_type from RawStream
group by guid, evt_type order by M1 desc limit 10;
// publish topN stream
select * from TopNStream;
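A rough Python equivalent of the TopN step for a single window, assuming that "TopN" means the groups with the highest counts (descending order). The sample events are made up, and `top_n` is a name introduced here for illustration.

```python
from collections import Counter


def top_n(events, n=10):
    """Return the n most frequent (guid, evt_type) groups with their counts.

    events: iterable of (guid, evt_type) pairs seen in one time window.
    """
    counts = Counter(events)
    return counts.most_common(n)


window_events = (
    [("u1", "bid")] * 3
    + [("u2", "click")] * 2
    + [("u3", "bid")]
)
top_two = top_n(window_events, n=2)
```

`Counter.most_common` sorts by count in descending order, which matches the ordering a top-N query needs.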
PULSAR BEHAVIORAL DATA PIPELINE
Pulsar Behavioral Data Pipeline
[Diagram: Real-time Pipeline. Components shown: Producing Applications, Collector, Sessionizer, BOT Detection, Event Distributor, Metrics Calculator, Metrics Store, Real Time Consumers, Real Time Dashboard and Services, Kafka, Druid, Batch Loader, Hadoop / Kylin. Data label: Enriched Sessionized Events.]
Sessionization: Group together events of a single user visit
e1 e2 e3 e4 e5 e6 . . . e7 e8
User 1: >30 min of inactivity between e4 and e7
Session A (User 1): e1, e2, e4
Session B (User 2): e3, e5, e6
Session C (User 1): e7, e8
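The grouping rule on this slide can be sketched in a few lines of Python: an event joins the user's open session unless more than 30 minutes have passed since that user's previous event, in which case a new session starts. Timestamps and user names below are invented to reproduce Sessions A, B and C, and the sketch assumes events arrive in timestamp order.

```python
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group events into per-user sessions split by inactivity gaps.

    events: iterable of (timestamp_seconds, user, event_id), sorted by time.
    Returns a list of (user, [event_id, ...]) in session start order.
    """
    sessions = []
    open_session = {}  # user -> index of that user's current session
    last_seen = {}     # user -> timestamp of that user's previous event
    for ts, user, event_id in events:
        if user not in open_session or ts - last_seen[user] > gap:
            sessions.append((user, []))
            open_session[user] = len(sessions) - 1
        sessions[open_session[user]][1].append(event_id)
        last_seen[user] = ts
    return sessions


events = [
    (0, "user1", "e1"),
    (60, "user1", "e2"),
    (120, "user2", "e3"),
    (180, "user1", "e4"),
    (240, "user2", "e5"),
    (300, "user2", "e6"),
    (2400, "user1", "e7"),  # more than 30 minutes after e4
    (2460, "user1", "e8"),
]
sessions = sessionize(events)
```

This version keeps all state in memory and never closes a session explicitly. Expiring idle sessions promptly is the harder part, which the following slides address.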
Sessionization Challenges
• Session state management
– High read/write throughput
– State recovery when a node crashes or fails
• Session Expiration
– Full table scan is not acceptable
Sessionization Solution
• Long-lived state management (at least 30 minutes)
– Local Off-Heap Cache
• Instantaneous Session Expiration (<= 1 sec delay)
– Doubly Linked Off-Heap Map (Local Access)
– Ordered by expiration time (O(1))
• Pluggable Sessionization logic
– SQL with customized annotation
– Counter
– State
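Why ordering by expiration time avoids a full table scan can be shown with a small sketch. It uses Python's `OrderedDict` (a hash map backed by a doubly linked list) as an on-heap stand-in for the off-heap map named above. The class, its methods and the per-session counter are assumptions for illustration, not Pulsar's API. Because every session has the same timeout, keeping sessions in last-activity order is the same as keeping them in expiration order, so the next session to expire is always at the head.

```python
from collections import OrderedDict


class SessionCache:
    """Sessions kept in expiration order; expiry only inspects the head."""

    def __init__(self, timeout_seconds=30 * 60):
        self.timeout = timeout_seconds
        self._sessions = OrderedDict()  # key -> (last_seen, state)

    def touch(self, key, now):
        """Record activity and move the session to the tail (latest expiry).

        Assumes `now` never decreases between calls.
        """
        _, state = self._sessions.pop(key, (None, {"events": 0}))
        state["events"] += 1  # example of a per-session counter
        self._sessions[key] = (now, state)

    def expire(self, now):
        """Remove and return expired sessions, oldest first."""
        expired = []
        while self._sessions:
            key, (last_seen, state) = next(iter(self._sessions.items()))
            if now - last_seen <= self.timeout:
                break  # head is still alive, so everything behind it is too
            self._sessions.popitem(last=False)
            expired.append((key, state))
        return expired


cache = SessionCache(timeout_seconds=1800)
cache.touch("u1", now=0)
cache.touch("u2", now=100)
cache.touch("u1", now=200)  # u1 moves behind u2
first_expired = cache.expire(now=1901)
second_expired = cache.expire(now=2001)
```

Each call to `expire` does constant work per expired session plus one check of the head, regardless of how many live sessions the cache holds.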
Sessionizer Architecture
[Diagram components: Collector, Sessionizer, IMC, Local Off-Heap Session Cache, Timer, Remote Store Client, Remote Session Store, OMC, BotDetection, Distributor. Labels: Persist, Sync, Recovery.]
Bot Detection Overview
• Detect non-human activities in near real time
• May treat bot traffic differently during analysis
• High level bot rules
– Self-declared bots by user agent
– Behavior within a session or time window
• Tag events with bot flag
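The two rule families above can be illustrated with a short Python sketch. The user-agent pattern, the per-session event threshold and the field names are all hypothetical values chosen for the example. The deck does not give the actual rules.

```python
import re

# Hypothetical pattern for self-declared bots.
BOT_USER_AGENT = re.compile(r"bot|crawler|spider", re.IGNORECASE)
# Hypothetical behavioral rule: too many events in one session.
MAX_EVENTS_PER_SESSION = 100


def tag_bots(events, max_events=MAX_EVENTS_PER_SESSION):
    """Return copies of the events with an `is_bot` flag added."""
    per_session = {}
    for event in events:
        session = event["session"]
        per_session[session] = per_session.get(session, 0) + 1

    tagged = []
    for event in events:
        self_declared = bool(BOT_USER_AGENT.search(event["user_agent"]))
        too_active = per_session[event["session"]] > max_events
        tagged.append({**event, "is_bot": self_declared or too_active})
    return tagged


events = (
    [{"session": "s1", "user_agent": "Mozilla/5.0"}]
    + [{"session": "s2", "user_agent": "Googlebot/2.1"}]
    + [{"session": "s3", "user_agent": "Mozilla/5.0"}] * 3
)
tagged = tag_bots(events, max_events=2)
```

Events are tagged rather than dropped, so downstream analysis can decide how to treat bot traffic.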
Bot Detection
[Diagram components: Collector, Sessionizer, BotDetection, Distributor, Bot Detection Service. Labels: Behavior Events And Metrics, BotSignature, Bot tagged Stream.]
PULSAR INTEGRATION WITH OTHER SYSTEMS
Pulsar Integration with Kafka
•Kafka
– Persistent messaging queue
– High availability, scalability and throughput
•Pulsar leveraging Kafka
– Supports pull and hybrid messaging model
– Loading of data from real-time pipeline into Hadoop and other metric stores
– Use schema to validate event payload
Messaging Models
[Diagram: three models, labeled Push Model, Pull Model and Hybrid Model. Components shown: Producer, Queue, Kafka, Replayer, Netty, Consumer. Labels: Pause/Resume, (At most once delivery semantics), (At least once delivery semantics).]
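The difference between the two delivery semantics can be sketched with a toy simulation. It assumes the pairing that the order of the slide labels suggests: push gives at-most-once delivery, pull from a persistent queue gives at-least-once. The hybrid model is not covered. The `Consumer`, `Log`, `push` and `pull` names are invented for the example and do not correspond to Pulsar or Kafka APIs.

```python
class Consumer:
    """Toy consumer that can be switched off to simulate an outage."""

    def __init__(self):
        self.available = True
        self.received = []

    def handle(self, message):
        if not self.available:
            raise ConnectionError("consumer unavailable")
        self.received.append(message)


def push(messages, consumer):
    """Push: fire and forget, so a failed delivery is lost (at most once)."""
    for message in messages:
        try:
            consumer.handle(message)
        except ConnectionError:
            pass  # no retry, the message is dropped


class Log:
    """Stand-in for a persistent queue: messages stay readable by offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to deliver


def pull(log, consumer):
    """Pull: advance the offset only after the message was handled."""
    while log.committed < len(log.messages):
        try:
            consumer.handle(log.messages[log.committed])
        except ConnectionError:
            return  # offset unchanged, the next pull re-reads this message
        log.committed += 1


pushed = Consumer()
pushed.available = False
push(["m1"], pushed)  # lost while the consumer is down
pushed.available = True
push(["m2"], pushed)

log = Log(["m1", "m2"])
pulled = Consumer()
pulled.available = False
pull(log, pulled)     # nothing committed during the outage
pulled.available = True
pull(log, pulled)     # both messages are delivered after recovery
```

The pull side pays for its reliability with storage and replay latency, which is a common reason to offer both models.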
Pulsar Integration with Kylin
•Apache Kylin
– Distributed analytics engine
– SQL interface and multi-dimensional analysis (OLAP) on Hadoop
– Interactive Query on Billions of Rows
•Pulsar leveraging Kylin
– Build multi-dimensional OLAP cube over long time period
– Aggregate/drill-down on dimensions such as browser, OS, device, geo location
– Capture metrics such as session length, page views, event counts
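To show what a multi-dimensional cube buys, here is a toy in-memory version in Python: the measure is pre-aggregated for every combination of dimensions, so a drill-down becomes a dictionary lookup instead of a scan. This only illustrates the precomputation idea behind an OLAP cube. The rows, the dimension names and the `build_cube` helper are made up and say nothing about how Kylin stores or builds cubes.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`.

    Returns {dimension_subset: {dimension_values: summed_measure}}.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


rows = [
    {"browser": "chrome", "os": "mac", "page_views": 3},
    {"browser": "chrome", "os": "win", "page_views": 2},
    {"browser": "firefox", "os": "win", "page_views": 5},
]
cube = build_cube(rows, dimensions=("browser", "os"), measure="page_views")
```

`cube[()]` holds the grand total, `cube[("browser",)]` the per-browser totals, and `cube[("browser", "os")]` the finest level. The number of combinations doubles with each added dimension, which is why cube design matters over long time periods.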
Pulsar Integration with Druid
•Druid
– Real-time ROLAP engine for aggregation, drill-down and slice-n-dice
•Pulsar leveraging Druid
– Real-time analytics dashboard
– Near real-time metrics like number of visitors in the last 5 minutes, refreshing every 10 seconds
– Aggregate/drill-down on dimensions such as browser, OS, device, geo location
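The "visitors in the last 5 minutes" metric is a distinct count over a trailing time range that is re-evaluated on every refresh. A minimal Python sketch of that computation follows. In practice this would be a query against the metrics store rather than application code, and the event tuples and function name here are assumptions.

```python
WINDOW_SECONDS = 5 * 60


def visitors_in_last(events, now, window_seconds=WINDOW_SECONDS):
    """Count distinct visitors with an event in (now - window, now].

    events: iterable of (timestamp_seconds, guid).
    """
    return len({
        guid
        for ts, guid in events
        if now - window_seconds < ts <= now
    })


events = [(0, "u1"), (100, "u2"), (350, "u1"), (400, "u3")]
at_300 = visitors_in_last(events, now=300)
at_400 = visitors_in_last(events, now=400)
```

A dashboard refreshing every 10 seconds would call this with `now` advanced by 10 each time, so consecutive results overlap heavily.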
BEHAVIORAL DATA DRIVEN APPLICATIONS
http://www.ebay.com/trending
TRENDING: ALGORITHMS NARROW THE FOCUS
Algorithms and machine learning identify significant trends
Humans provide the context and an interesting story
Activity Timeline for Personalization
[Diagram: timeline of search, view, watch, bid and purchase events over time, with segments labeled Offline (historical), Nearline and Online (in-session).]
Customer Profile
• Price
• Category
• Sale Type
• Item Condition
• Deals
Customer Intent
• Price
• Category
• Sale Type
• Item Condition
• Deals
EXAMPLE PERSONALIZED CONTENT
Personalized digest for a consumer interested in jewelry and accessories
Personalized digest for a consumer interested in auto and electronics
Behavioral Data: A/B Testing
More Information
•GitHub: http://github.com/pulsarIO
–repos: pipeline, framework, docker files
•Website: http://gopulsar.io
–Technical whitepaper
–Getting started
–Documentation
•Google group: http://groups.google.com/d/forum/pulsar

Editor's Notes

  • #3: spectrum of our inventory (old and new) … if it exists, probably on sale @ ebay