Lambda Architecture Platform 
Using SQL 
Sep 13 2014 
HadoopCon 2014 Taiwan 
TAGOMORI Satoshi (@tagomoris)
Taipei
Topics 
About Me & LINE 
Data analytics workloads 
Batch processing 
Stream processing 
Lambda architecture 
Lambda architecture using SQL 
Norikra: Stream processing with SQL 
13:30-14:20 4F
@tagomoris 
Satoshi Tagomori (田籠 聡) 
LINE Corporation Analytics Platform Team
Tokyo
Lambda Architecture Using SQL
LINE Offices 
Tokyo HQ 
Spain 
Thailand 
Taipei 
USA 
Korea
LINE is born! JUNE 23, 2011
Data Analytics 
Workload 
Part 01
Various Data Analytics Workloads 
Reports 
Monthly/Daily reports 
Hourly (or shorter) news 
Real-time metrics 
Automatically updated reports/graphs 
Alerts for abuse of services, overload, ...
Batch Processing 
Hadoop 
MapReduce (or Spark, Tez) & DSLs (Hive, Pig, ...) 
For reports 
MPP Engines 
Cloudera Impala, Apache Drill, Facebook Presto, ... 
For interactive analysis 
For reports of shorter window
Stream Processing 
Apache Storm 
Incubator project 
“Distributed and fault-tolerant realtime computation” 
Norikra 
by tagomoris 
Non-distributed “Stream processing with SQL”
Why Stream Processing? 
Lower latency 
Realtime metrics 
Short-term prompt reports 
Less computing power 
10 Mbps for batch processing: about 100 GB/day 
10 Mbps for stream processing: 1 server 
No query schedule management 
Once a query is registered, it runs forever
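The 100 GB/day figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch, assuming a sustained 10 Mbps input and decimal units (1 GB = 10^9 bytes); the variable names are illustrative.

```python
# Rough check: how much data does a sustained 10 Mbps stream produce per day?
MBPS = 10                        # megabits per second (figure from the slide)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

bits_per_day = MBPS * 1_000_000 * SECONDS_PER_DAY
gigabytes_per_day = bits_per_day / 8 / 1_000_000_000  # bits -> bytes -> decimal GB

print(f"{gigabytes_per_day:.0f} GB/day")
```

The result is on the order of 100 GB/day, which a batch platform has to store and scan, while a stream processor only has to keep up with 10 Mbps at any moment.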
Disadvantages of Stream Processing 
Queries must be written before the data arrives 
There should be another way to query past data 
Queries cannot be run twice 
All results are lost when any error occurs 
The data is already gone by the time bugs are found 
Out-of-order events break results 
Recorded time based queries? Or arrival time based queries?
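The recorded-time versus arrival-time question can be illustrated with a small sketch. The events, the 10-minute window size, and the helper `window_start` are assumptions made up for this example; it only shows that one delayed event lands in a different window depending on which timestamp is used.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 10-minute window."""
    return ts - (ts - EPOCH) % WINDOW


# Hypothetical events as (recorded_time, arrival_time) pairs.
events = [
    (datetime(2014, 9, 13, 10, 41), datetime(2014, 9, 13, 10, 41)),
    (datetime(2014, 9, 13, 10, 48), datetime(2014, 9, 13, 10, 49)),
    (datetime(2014, 9, 13, 10, 49), datetime(2014, 9, 13, 10, 52)),  # delivered late
]

# Count events per window, once by recorded time and once by arrival time.
by_recorded = Counter(window_start(recorded) for recorded, _ in events)
by_arrival = Counter(window_start(arrived) for _, arrived in events)

for name, counts in (("recorded", by_recorded), ("arrival", by_arrival)):
    print(name, {w.strftime("%H:%M"): n for w, n in sorted(counts.items())})
```

By recorded time all three events belong to the 10:40 window; by arrival time the late event is counted in the 10:50 window instead, so the two reports disagree.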
Part 02 
Lambda Architecture
Lambda Architecture 
“The Lambda-architecture aims to satisfy the needs for a 
robust system that is fault-tolerant, both against hardware 
failures and human mistakes, being able to serve a wide 
range of workloads and use cases, and in which low-latency 
reads and updates are required. The resulting system should 
be linearly scalable, and it should scale out rather than up.” 
http://lambda-architecture.net/
Lambda Architecture: Overview 
new data 
batch layer 
master dataset 
serving layer 
view 
speed layer 
real-time view 
query
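The flow in this overview can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular framework implements it: events are `(timestamp, path)` tuples, both views are simple counters, and the function names `batch_view`, `realtime_view`, and `query` are hypothetical.

```python
from collections import Counter


def batch_view(master_dataset, upto):
    """Batch layer: recompute counts from the full master dataset up to `upto`."""
    return Counter(path for ts, path in master_dataset if ts < upto)


def realtime_view(recent_events, since):
    """Speed layer: count only the events the batch view has not covered yet."""
    return Counter(path for ts, path in recent_events if ts >= since)


def query(batch, realtime):
    """Answer a query by merging the batch view and the real-time view."""
    return batch + realtime


# Hypothetical new data: (timestamp, path)
events = [(10, "/api/a"), (20, "/api/b"), (110, "/api/a"), (120, "/api/a")]

BATCH_CUTOFF = 100  # the last batch run covered everything before this time
batch = batch_view(events, upto=BATCH_CUTOFF)
realtime = realtime_view(events, since=BATCH_CUTOFF)
merged = query(batch, realtime)
print(dict(merged))
```

Because the master dataset is kept, the batch view can always be recomputed from scratch, which is what makes a broken real-time view recoverable.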
Twitter Summingbird 
Lambda architecture library 
Batch mode: Scalding on Hadoop MapReduce 
Realtime mode: Storm 
Word counting by Summingbird (Scala): 
  def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
      source.flatMap { sentence =>
        toWords(sentence).map(_ -> 1L)
      }.sumByKey(store)
https://github.com/twitter/summingbird 
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
What Lambda Architecture Provides 
Replayable queries 
Redo queries anytime if the speed layer's results are broken 
Accurate results on demand 
Prompt reports in speed layer with arrival time 
Fixed reports in batch layer with recorded time 
... And many more benefits of stream processing
Why Doesn’t Everyone Use It? 
Storm doesn’t fit well with many use cases 
Storm requires too many computing resources to deploy 
Summingbird requires many steps to deploy 
Many directors/analysts don’t write Scala/Java 
The Summingbird DSL is not easy enough for non-specialists
Lambda Architecture 
Using SQL 
Part 03
Existing Hadoop Platform 
new data 
HDFS hive 
query 
Fluentd 
presto 
query
Norikra 
Schema-less stream processing with SQL 
“Norikra is a open source server software provides "Stream 
Processing" with SQL, written in JRuby, runs on JVM, licensed 
under GPLv2.” 
  SELECT
    path,
    COUNT(1, status=200) AS success_count,
    COUNT(1, status=500) AS server_error_count,
    COUNT(*) AS count
  FROM AccessLog.win:time_batch(10 min, 0L)
  WHERE service='myservice' AND path LIKE '/api/%'
  GROUP BY path
http://norikra.github.io/
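What the query above computes can be emulated in plain Python. This is a sketch of the semantics only and does not use Norikra: the event dictionaries, their `time` field (epoch seconds), and the function `time_batch_counts` are assumptions for illustration. A real stream engine emits each window's rows when the window closes rather than returning everything at the end.

```python
from collections import defaultdict

WINDOW_SEC = 10 * 60  # 10-minute tumbling window, as in time_batch(10 min, 0L)


def time_batch_counts(events):
    """Aggregate events into 10-minute windows, keyed by (window_start, path)."""
    results = defaultdict(
        lambda: {"success_count": 0, "server_error_count": 0, "count": 0}
    )
    for e in events:
        # WHERE service='myservice' AND path LIKE '/api/%'
        if e["service"] != "myservice" or not e["path"].startswith("/api/"):
            continue
        window = e["time"] - e["time"] % WINDOW_SEC  # align window to time 0
        row = results[(window, e["path"])]           # GROUP BY path, per window
        row["count"] += 1                            # COUNT(*)
        if e["status"] == 200:                       # COUNT(1, status=200)
            row["success_count"] += 1
        if e["status"] == 500:                       # COUNT(1, status=500)
            row["server_error_count"] += 1
    return dict(results)


# Hypothetical access-log events.
events = [
    {"time": 0,   "service": "myservice", "path": "/api/x", "status": 200},
    {"time": 30,  "service": "myservice", "path": "/api/x", "status": 500},
    {"time": 599, "service": "myservice", "path": "/api/x", "status": 200},
    {"time": 600, "service": "myservice", "path": "/api/x", "status": 200},
    {"time": 10,  "service": "other",     "path": "/api/x", "status": 200},
    {"time": 20,  "service": "myservice", "path": "/index", "status": 200},
]

counts = time_batch_counts(events)
for key in sorted(counts):
    print(key, counts[key])
```

The event at `time=600` starts a new window, and the events from another service or outside `/api/` are filtered out before counting.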
Added-on Lambda Architecture Platform 
new data 
presto 
query 
HDFS hive 
query 
norikra 
query
“Pseudo Lambda” Architecture Using SQL 
Lambda architecture platform 
with almost the same queries 
Batch side (Hive or Presto): 
  SELECT path,
    COUNT(IF(status=200,1,NULL)) AS success_count,
    COUNT(IF(status=500,1,NULL)) AS server_error_count,
    COUNT(*) AS count
  FROM AccessLog
  WHERE service='myservice' AND path LIKE '/api/%'
    AND timestamp >= '2014-09-13 10:40:00'
    AND timestamp < '2014-09-13 10:50:00'
  GROUP BY path
Stream side (Norikra): 
  SELECT path,
    COUNT(1, status=200) AS success_count,
    COUNT(1, status=500) AS server_error_count,
    COUNT(*) AS count
  FROM AccessLog.win:time_batch(10 min, 0L)
  WHERE service='myservice' AND path LIKE '/api/%'
  GROUP BY path
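The batch-side query can be tried locally with Python's built-in `sqlite3` module. This is a stand-in for experimentation, not the Hive/Presto setup from the talk: the table contents are made up, and the conditional counts are written with `CASE WHEN`, a portable equivalent of the `IF(...)` form used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AccessLog (service TEXT, path TEXT, status INTEGER, timestamp TEXT)"
)

# Hypothetical rows; only the first three fall inside the queried window and filter.
rows = [
    ("myservice", "/api/a", 200, "2014-09-13 10:41:00"),
    ("myservice", "/api/a", 500, "2014-09-13 10:45:00"),
    ("myservice", "/api/b", 200, "2014-09-13 10:49:59"),
    ("myservice", "/api/a", 200, "2014-09-13 10:50:00"),  # next window
    ("other",     "/api/a", 200, "2014-09-13 10:42:00"),  # different service
    ("myservice", "/index", 200, "2014-09-13 10:43:00"),  # not under /api/
]
conn.executemany("INSERT INTO AccessLog VALUES (?, ?, ?, ?)", rows)

BATCH_QUERY = """
SELECT path,
       COUNT(CASE WHEN status = 200 THEN 1 END) AS success_count,
       COUNT(CASE WHEN status = 500 THEN 1 END) AS server_error_count,
       COUNT(*) AS count
FROM AccessLog
WHERE service = 'myservice' AND path LIKE '/api/%'
  AND timestamp >= '2014-09-13 10:40:00'
  AND timestamp <  '2014-09-13 10:50:00'
GROUP BY path
ORDER BY path
"""

result = conn.execute(BATCH_QUERY).fetchall()
for row in result:
    print(row)
```

The half-open range (`>=` start, `<` end) matters: it keeps adjacent 10-minute batch windows from double-counting an event that falls exactly on a boundary, which matches how a tumbling window on the stream side partitions time.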
“Pseudo Lambda” Architecture Using SQL 
SQL dialects are easy to learn! 
Standard SQL, Hive, Presto, Impala, Drill, ... 
+ Norikra 
For non-professional people too! 
SQL queries are very easy to write twice!
Use Cases in LINE 
Prompt reports for Ads service 
Short-term prompt reports by Norikra 
Daily fixed reports by Hive 
Summary of application server error log 
Aggregate error log for alerting by Norikra 
Check details with Hive, Presto (or grep!) 
See you later for details!
TMTOWTDI 
“There’s more than one way to do it.” 
- Perl programming language
SHARE 
What I want & What I’m doing! 
- tagomoris
Q & A
