APACHE IGNITE AS A DATA PROCESSING HUB
ROMAN SHTYKH
CYBERAGENT, INC.
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org
INTRODUCTION
ABOUT ME
Roman Shtykh
• R&D Engineer at CyberAgent, Inc.
• Areas of focus
  • Data streaming and NLP
• Committer on the Apache Ignite and MyBatis projects
• Judoka
• @rshtykh
CYBERAGENT, INC.
• Internet ads
• Games
• Media
• Investing
[Pie chart: Games 25%, Media 13%, Internet ads 52%, Investing 3%, Other 7%]
* As of Sep 2015
AMEBA SERVICES
・ Monthly visitors (DUB total): 6 billion*
・ Number of member users: about 39 million*
[Diagram labels: CyberAgent, Inc.; Ameba Services]
* As of Dec 2014
• Games
• Community services
• Content curation
• Other
AMEBA SERVICES
Ameba Pigg
CONTENTS
• Apache Ignite
• Feed your data
• Log Aggregation with Apache Flume
• Integration with Apache Ignite
• Streaming Data with Apache Kafka
• Data Pipeline with Kafka and Ignite: Example
APACHE IGNITE
• “High-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.”
• High performance, unlimited scalability and resiliency
• High-performance transactions and fast analytics
• Hadoop Acceleration, Apache Spark
• Apache project
https://ignite.apache.org/
MAKING APACHE IGNITE A DATA PROCESSING HUB
• Question: How to feed data?
• A simple solution: Create a client node
  • Is it reliable?
  • Does it scale?
  • Ignite-only solution?
  • Does it keep your operational costs low?
LOG AGGREGATION WITH APACHE FLUME
• Flume
• “Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.”
• Scalable
• Flexible
• Robust and fault tolerant
• Declarative configuration
• Apache project
DATA FLOW IN FLUME
[Diagram: incoming data → Source → Channel → Sink (inside an Agent) → to another Agent or destination]
DATA FLOW IN FLUME (REPLICATION/MULTIPLEXING)
[Diagram: incoming data → Source → Channel Selector → two Channels, each with its own Sink (inside an Agent)]
DATA FLOW IN FLUME (RELIABILITY)
• No data is lost (configurable)
[Diagram: incoming data → Source → Channel → Sink (inside an Agent); labels: "Source tx" between Source and Channel, "Sink tx" between Channel and Sink]
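The reliability slide rests on channel transactions: a sink takes events inside a transaction and commits only once delivery has succeeded; if delivery fails, it rolls back and the events stay in the channel for a later retry. The following is a minimal, library-free sketch of that idea. `SketchChannel` and `SketchSink` are hypothetical stand-ins written for this illustration, not the Flume API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a Flume channel with one sink-side transaction. */
class SketchChannel {
    private final Deque<String> queue = new ArrayDeque<>();
    private final List<String> inFlight = new ArrayList<>(); // taken, not yet committed

    void put(String event) { queue.addLast(event); }

    /** Takes the next event inside the current transaction, or null if empty. */
    String take() {
        String e = queue.pollFirst();
        if (e != null) inFlight.add(e);
        return e;
    }

    /** Delivery succeeded: the taken events are removed for good. */
    void commit() { inFlight.clear(); }

    /** Delivery failed: put the taken events back at the head, in order. */
    void rollback() {
        for (int i = inFlight.size() - 1; i >= 0; i--) queue.addFirst(inFlight.get(i));
        inFlight.clear();
    }

    int size() { return queue.size(); }
}

class SketchSink {
    /** Drains up to batchSize events; returns true if the batch was committed. */
    static boolean process(SketchChannel ch, int batchSize, Consumer<List<String>> destination) {
        List<String> batch = new ArrayList<>();
        try {
            String e;
            while (batch.size() < batchSize && (e = ch.take()) != null) batch.add(e);
            destination.accept(batch);   // may throw if the destination is unavailable
            ch.commit();
            return true;
        } catch (RuntimeException ex) {
            ch.rollback();               // nothing is lost; the batch can be retried
            return false;
        }
    }
}
```

A failed delivery leaves the channel exactly as it was, so the next call to `process` sees the same events in the same order.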
LOG TRANSFER AT AMEBA
[Diagram labels: Ameba Service (x3); Aggregator (x3); Monitoring; Recommender System; Elastic Search; Hadoop; Batch processing; HBase; Stream Processing (Onix); Stream Processing (HBaseSink)]
LOG TRANSFER AT AMEBA
• Web Hosts
  • More than 1600
• Size
  • 5.0 TB/day (raw)
• Traffic at peak
  • 160 Mbps (compressed)
IGNITE SINK
• Reads Flume events from a channel
• Converts them into cacheable entries with a user-implemented pluggable transformer
• Adding it requires no modification to the existing architecture
FLUME ⇒ IGNITE (1)
[Diagram: incoming data → Source → Channel → Ignite Sink (inside an Agent); label: "new connection"]
FLUME ⇒ IGNITE (2)
[Diagram: same flow; labels: "Sink tx", "start tx"]
FLUME ⇒ IGNITE (3)
[Diagram: same flow; labels: "Sink tx", "take event", "send events"]
ENABLING FLUME SINK
• Steps
1. Implement EventTransformer
   • Converts Flume events into cacheable entries (java.util.Map<K, V>)
2. Put the transformer’s jar in ${FLUME_HOME}/plugins.d/ignite/lib
3. Put the IgniteSink and Ignite core jar files in ${FLUME_HOME}/plugins.d/ignite/libext
4. Set up a Flume agent
• Sink setup
a1.sinks.k1.type = org.apache.ignite.stream.flume.IgniteSink
a1.sinks.k1.igniteCfg = /some-path/ignite.xml
a1.sinks.k1.cacheName = testCache
a1.sinks.k1.eventTransformer = my.company.MyEventTransformer
a1.sinks.k1.batchSize = 100
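Step 1 is the only code a user has to write, so a sketch of what such a transformer might look like may help. To keep it runnable without Flume or Ignite on the classpath, the event type and the transformer interface below are hypothetical stand-ins (`SketchEvent`, `SketchEventTransformer`); the real class would implement the EventTransformer interface from the Ignite Flume module and receive Flume events. The "key:value" body format is also an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a Flume event: only the body is modelled. */
class SketchEvent {
    final byte[] body;
    SketchEvent(String text) { this.body = text.getBytes(StandardCharsets.UTF_8); }
}

/** Stand-in for the transformer contract: a batch of events in, cache entries out. */
interface SketchEventTransformer<K, V> {
    Map<K, V> transform(List<SketchEvent> events);
}

/** Parses bodies of the assumed form "key:value" and skips malformed ones. */
class KeyValueTransformer implements SketchEventTransformer<String, Integer> {
    @Override
    public Map<String, Integer> transform(List<SketchEvent> events) {
        Map<String, Integer> entries = new HashMap<>();
        for (SketchEvent event : events) {
            String[] parts = new String(event.body, StandardCharsets.UTF_8).split(":", 2);
            if (parts.length != 2) continue;   // no separator: not a key/value body
            try {
                entries.put(parts[0].trim(), Integer.valueOf(parts[1].trim()));
            } catch (NumberFormatException ignored) {
                // Non-numeric value: drop this event rather than fail the whole batch.
            }
        }
        return entries;
    }
}
```

Skipping malformed events instead of throwing is a design choice of this sketch; a production transformer might prefer to log or count them.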
FLUME SINKS
• HDFS
• THRIFT
• AVRO
• HBASE
• ElasticSearch
• IRC
• IGNITE
APACHE FLUME & APACHE IGNITE
• If you do data aggregation with Flume
  • Adding an Ignite cluster is as simple as writing a data transformer and deploying a new Flume agent
• If you store your data (and do computations) in Ignite
  • Improving data injection becomes easy with the Flume sink
• Combining Apache Flume and Ignite makes/keeps your data pipeline (both aggregation and processing)
  • Scalable
  • Reliable
  • Highly performant
STREAMING DATA WITH APACHE KAFKA
APACHE KAFKA
“Publish-subscribe messaging rethought as a distributed commit log”
• Low latency
• High Throughput
• Partitioned and Replicated
• Kafka is an essential component of any data pipeline today
http://kafka.apache.org/
APACHE KAFKA
• Messages are grouped in topics
• Topics are split into partitions; each partition is a log
• Each partition is managed by a broker (when replicated, one broker is the partition leader)
• Producers & consumers (consumer groups)
• Used for
  • Log aggregation
  • Activity tracking
  • Monitoring
  • Stream processing
http://kafka.apache.org/documentation.html
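The log-per-partition model on this slide can be made concrete with a toy in-memory topic. The sketch below is illustrative only: `SketchTopic` is a hypothetical class, the key-to-partition mapping uses `String.hashCode` for simplicity (Kafka's own partitioner uses a different hash), and replication and brokers are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a topic: each partition is an append-only log addressed by offset. */
class SketchTopic {
    private final List<List<String>> partitions = new ArrayList<>();
    // Read position per "group/partition"; each consumer group tracks its own.
    private final Map<String, Integer> positions = new HashMap<>();

    SketchTopic(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) partitions.add(new ArrayList<>());
    }

    /** Maps a key to a partition (simplified; not Kafka's real partitioner). */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends a message and returns the offset it was written at. */
    int send(String key, String value) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(value);
        return log.size() - 1;
    }

    /** Returns the messages a group has not read yet and advances its position. */
    List<String> poll(String group, int partition) {
        List<String> log = partitions.get(partition);
        String id = group + "/" + partition;
        int from = positions.getOrDefault(id, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        positions.put(id, log.size());
        return batch;
    }
}
```

Because positions are kept per group, two groups (say, analytics and monitoring) each read the full log independently, which is what makes one topic usable for several of the purposes listed on the slide.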
KAFKA CONNECT
• Designed for large scale stream data integration using Kafka
• Provides an abstraction from communication with your Kafka cluster
  • Offset management
  • Delivery semantics
  • Fault tolerance
  • Monitoring, etc.
• Worker (scalability & fault tolerance)
• Connector (task config)
• Task (thread)
• Standalone & Distributed execution models
http://www.confluent.io/blog/apache-kafka-0.9-is-released
INGESTING DATA STREAMS
• Two ways
  • Kafka Streamer
  • Sink Connector
[Diagram labels: SQL queries; Distributed closures; Transactions; Connect; ETL]
STREAMING VIA SINK CONNECTOR
• Configure your connector
• Configure Kafka Connect worker
• Start your connector
# connector
name=my-ignite-connector
connector.class=IgniteSinkConnector
tasks.max=2
topics=someTopic1,someTopic2
# cache
cacheName=myCache
cacheAllowOverwrite=true
igniteCfg=/some-path/ignite.xml
$ bin/connect-standalone.sh myconfig/connect-standalone.properties myconfig/ignite-connector.properties
STREAMING VIA SINK CONNECTOR
• Easy data pipeline
• Records from Kafka are written to Ignite grid via high-performance IgniteDataStreamer
• At-least-once delivery guarantee
• As of 1.6, start a new connector to write to a different cache
[Diagram: Kafka records a, b, c, d, e at Kafka offsets 0, 1, 2, … become cache entries (a.key, a.val), (b.key, b.val), …; a second row of records is labelled a2, b2, c2, d2, e2]
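At-least-once delivery means a record can arrive more than once, typically when a task fails after writing a record but before its offset is committed. The sketch below simulates that case with plain collections; `SketchSinkTask` is a hypothetical class for illustration and does not reflect how Kafka Connect or IgniteDataStreamer are implemented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sink task: writes records to a map-backed "cache", then commits the offset. */
class SketchSinkTask {
    final Map<String, String> cache = new HashMap<>();
    int committedOffset = 0;   // next offset to read after a restart
    int writes = 0;            // counts every put, including repeated ones

    /**
     * Processes the log from the last committed offset. If crashBeforeCommitAt >= 0,
     * the task "crashes" right after writing that offset but before committing it,
     * so the same record is delivered again on the next run.
     */
    void run(List<String[]> log, int crashBeforeCommitAt) {
        for (int offset = committedOffset; offset < log.size(); offset++) {
            String[] record = log.get(offset);        // record[0] = key, record[1] = value
            cache.put(record[0], record[1]);          // keyed write: a repeat changes nothing
            writes++;
            if (offset == crashBeforeCommitAt) return; // simulated failure
            committedOffset = offset + 1;
        }
    }
}
```

In this model the redelivered record costs one extra write but leaves the cache contents unchanged, because writes are keyed. Consumers of the data only need to worry about duplicates if the write itself is not idempotent (for example, incrementing a counter per record).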
INGESTING DATA STREAMS
• Bi-directional streaming
[Diagram labels: SQL queries; Distributed closures; Transactions; Events; Continuous queries; Connect (Sink); Connect (Source)]
STREAMING BACK TO KAFKA
• Listening to cache events
  • PUT
  • READ
  • REMOVED
  • EXPIRED, etc.
• Remote filtering can be enabled
• Kafka Connect offsets are ignored
• Currently, no delivery guarantees
[Diagram: cache events evt1, evt2, evt3 are sent as records]
ENABLING SOURCE CONNECTOR
• Configure your connector
  • Define a remote filter if needed
cacheFilterCls=MyCacheEventFilter
  • Make sure that event listening is enabled on the server nodes
• Configure Kafka Connect worker
• Start your connector
#connector
name=ignite-src-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSourceConnector
tasks.max=2
#topics, events
topicNames=test
cacheEvts=put,removed
#cache
cacheName=myCache
igniteCfg=myconfig/ignite.xml
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.ignite.stream.kafka.connect.serialization.CacheEventConverter
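The two knobs in this configuration, the subscribed event types (`cacheEvts`) and the optional filter class (`cacheFilterCls`), both narrow down which cache events end up as Kafka records. The sketch below shows that selection logic with plain Java types. Everything in it is a hypothetical stand-in: `SketchCacheEvent`, `AdKeyFilter` and `SketchSourceTask` are illustration names, the "ad:" key prefix is an assumption, and a real filter class would implement the predicate type the connector expects rather than `java.util.function.Predicate` (check the connector documentation for the exact type).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a cache event: type, key and value. */
class SketchCacheEvent {
    final String type;   // e.g. "put", "read", "removed"
    final String key;
    final String value;
    SketchCacheEvent(String type, String key, String value) {
        this.type = type;
        this.key = key;
        this.value = value;
    }
}

/** Example filter: forward only events whose key starts with the assumed prefix "ad:". */
class AdKeyFilter implements Predicate<SketchCacheEvent> {
    @Override
    public boolean test(SketchCacheEvent evt) {
        return evt.key.startsWith("ad:");
    }
}

class SketchSourceTask {
    /** Keeps subscribed event types, applies the filter, renders events as records. */
    static List<String> toRecords(List<SketchCacheEvent> events,
                                  List<String> subscribedTypes,
                                  Predicate<SketchCacheEvent> filter) {
        List<String> records = new ArrayList<>();
        for (SketchCacheEvent evt : events) {
            if (subscribedTypes.contains(evt.type) && filter.test(evt)) {
                records.add(evt.type + ":" + evt.key);
            }
        }
        return records;
    }
}
```

Filtering where the events originate, rather than in the connector, keeps unwanted events off the network, which matters when only a small share of cache updates is of interest downstream.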
APACHE KAFKA & APACHE IGNITE
• If you do data streaming with Kafka
  • Adding an Ignite cluster is as simple as writing a configuration file (and creating a filter if you need one for the source)
• If you store your data (and do computations) in Ignite
  • Improving data injection and listening for events on data become easy with Kafka Connectors
• Combining Apache Kafka and Ignite makes/keeps your data pipeline
  • Scalable
  • Reliable
  • Highly performant
• Covers a wide range of ETL contexts
DATA PIPELINE WITH KAFKA AND IGNITE: EXAMPLE
DATA PIPELINE WITH KAFKA AND IGNITE
• Requirements
  • instant processing and analysis
  • scalable and resilient to failures
  • low latency
  • high throughput
  • flexibility
DATA PIPELINE WITH KAFKA AND IGNITE
• Filter and aggregate events
[Diagram: data → Flume (filter/transform) → data; notes: "slow down on heavy loads", "more channels/layers"]
DATA PIPELINE WITH KAFKA AND IGNITE
[Diagram labels: data; Other sources; filter, transform, etc.]
• Parsimonious resource use
• Replay enabled
• More operations on streams
• Flexibility
DATA PIPELINE WITH KAFKA AND IGNITE
• Filter and aggregate events
• Store events
• Notify about updates on aggregates
[Diagram labels: data; filter, transform, etc.; Connectors]
DATA PIPELINE WITH KAFKA AND IGNITE
• Improving ads delivery
[Diagram labels: clicks; impressions; ads; Ads delivery; Ads recommender; storage/computation; Image storage; data & computation in one place]
DATA PIPELINE WITH KAFKA AND IGNITE
• Improving ads delivery
• Better network utilization and reliability
[Diagram labels: clicks; impressions; ads; Ads delivery; Ads recommender; storage/computation; Image storage; Anomaly detection]
OTHER INTEGRATIONS
OTHER COMPLETED INTEGRATIONS
• CAMEL
• MQTT
• STORM
• FLINK SINK
• TWITTER
THE END

IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
