2016.08
From Java Stream to Java DataFrame
Popcorny (陸振恩)
Outline
• Motivation
• The journey from Java Stream to DataFrame
• Introduction to Poppy
• Demo
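Motivation
• TenMax is an advertising platform
• Advertising is all about reading reports
• We call every event that occurs a rawlog
• Every hour, the rawlog is rolled up into Aggregated Data
• When viewing a report, you pick a time range and, for the chosen dimensions, read off the corresponding values (metrics)
• This is a common OLAP technique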
[Diagram: Ingest → Raw Log → Batch aggregate → Aggregated Data (Cube) → Interactive Query]
With a plain RDBMS
[Diagram: Ingest → RDBMS (RawLog) → Batch aggregate → RDBMS (Cube) → Interactive Query]
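The RDBMS dilemma
• A traditional RDBMS is not suited to very high-volume log ingestion
• Better fits include
– DFS: best suited to append-only workloads
– Cassandra/HBase: besides insert, they also support row-based update, delete, and partition scan
[Diagram: Ingest → DFS or Cassandra → Batch aggregate → RDBMS → Interactive Query]
But then you have to do the aggregation yourself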
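What solutions exist for aggregation?
• Computation engines
– Hadoop MapReduce
– Hive
– Spark SQL
– Impala
• But they all share the following drawbacks
– Designed from the start for cluster environments
– Heavyweight
– Too many dependencies (if you want to bundle the driver into your own program)
– Friendly mainly to HDFS-compatible data sources
– Slow job startup
– Defining your own UDF / UDAF is complicated
– Learning curve
– Operations and maintenance burden
– …
These are all good solutions for big data,
but what about medium data?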
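Medium data
• Data volume
– 1 GB to 1 TB of uncompressed data added per day
• Assumptions
– One record = 1 KB, so 1 TB of data = 1 billion records
– One CPU core can process 10,000 records per second
– Four cores can process 3.456 billion records per day
• A single machine is actually more than enough
• Besides, cloud machines can scale up; 16 cores is no problem
• I/O and network throughput are gradually ceasing to be the bottleneck
• A single-machine solution removes a lot of overhead
• A single-process solution is also easier to write and debug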
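The capacity estimate above can be checked with a few lines of arithmetic. This is only a sketch under the slide's own round numbers (decimal units, a fixed rate of 10,000 records per core per second); the class and method names are made up for illustration.

```java
class MediumDataEstimate {
    // Number of records in a data volume, given a fixed record size.
    static long recordsIn(long totalBytes, long bytesPerRecord) {
        return totalBytes / bytesPerRecord;
    }

    // Records one machine can process per day at a fixed per-core rate.
    static long recordsPerDay(int cores, long recordsPerCorePerSecond) {
        long secondsPerDay = 24L * 60 * 60; // 86,400
        return cores * recordsPerCorePerSecond * secondsPerDay;
    }

    public static void main(String[] args) {
        long oneTB = 1_000_000_000_000L; // decimal terabyte, matching the slide's round figures
        long oneKB = 1_000L;

        long dailyRecords = recordsIn(oneTB, oneKB);   // 1,000,000,000
        long capacity = recordsPerDay(4, 10_000L);     // 3,456,000,000

        System.out.println("records to process per day: " + dailyRecords);
        System.out.println("4-core capacity per day:    " + capacity);
    }
}
```

Under these assumptions a four-core machine has roughly 3.5 times the capacity needed for the worst-case 1 TB day, which is the basis for the single-machine argument.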
So let's write the aggregation ourselves
Java8
• Language feature: Lambda
• Three power tools
– Stream
– Optional
– CompletableFuture
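Java Stream
• Functional Reactive Programming (FRP)
• Pipeline style: input passes through one transformation stage after another and ends up at the output
• Streaming by nature: a very small memory footprint, so it can process very large amounts of data
forEach() map() filter() flatMap() peek()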
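A minimal sketch of such a pipeline, using the five operations listed above. The sample log lines and the class name are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamPipelineSketch {
    // Splits each line into tokens, keeps the numeric ones, and parses them.
    static List<Integer> numbersIn(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))) // one line -> many tokens
                .filter(token -> token.matches("\\d+"))          // keep digit-only tokens
                .map(Integer::parseInt)                           // String -> Integer
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("click 3 ad 7", "view 10");
        numbersIn(lines).stream()
                .peek(n -> System.out.println("seen " + n))       // observe without changing
                .forEach(n -> System.out.println("value " + n));  // terminal operation
    }
}
```

Each element flows through the whole chain before the next one is pulled, which is why the intermediate stages do not need to hold the full data set in memory.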
What about aggregation?
Let's first look at SQL
[Diagram: From RawLog → Where (DayRange) → Group By; each resulting group is keyed by hour=?, dim1=?, dim2=? and computes sum(), sum(), sum() over val1, val2, val3]
Java Stream Aggregation
From
Where
GroupBy Aggregation
count(), sum()
Then the Mapper is
Then the Reducer is
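The code shown on these slides is not preserved in the transcript. The following is one possible way to express the same From / Where / Group By / aggregation steps with the standard Stream API, not a reconstruction of the original. RawLog, Metrics, and the "hour|dim1" key format are hypothetical names chosen for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class StreamAggregationSketch {
    // Hypothetical raw-log record: two dimensions (hour, dim1) and one value.
    static final class RawLog {
        final int hour;
        final String dim1;
        final long val1;
        RawLog(int hour, String dim1, long val1) {
            this.hour = hour; this.dim1 = dim1; this.val1 = val1;
        }
    }

    // Hypothetical per-group result carrying two metrics: count() and sum(val1).
    static final class Metrics {
        final long count;
        final long sum;
        Metrics(long count, long sum) { this.count = count; this.sum = sum; }

        // Mapper: turns one record into a single-row partial result.
        static Metrics of(RawLog log) { return new Metrics(1, log.val1); }

        // Reducer: merges two partial results that belong to the same group.
        Metrics merge(Metrics other) {
            return new Metrics(count + other.count, sum + other.sum);
        }
    }

    static Map<String, Metrics> aggregate(List<RawLog> logs, int fromHour, int toHour) {
        return logs.stream()                                          // From
                .filter(l -> l.hour >= fromHour && l.hour <= toHour)  // Where
                .collect(Collectors.toMap(                            // Group By
                        l -> l.hour + "|" + l.dim1,                   // group key
                        Metrics::of,                                  // mapper
                        Metrics::merge,                               // reducer
                        TreeMap::new));                               // keys kept sorted
    }

    public static void main(String[] args) {
        List<RawLog> logs = Arrays.asList(
                new RawLog(0, "a", 5), new RawLog(0, "a", 7),
                new RawLog(1, "a", 1), new RawLog(5, "b", 100));
        // prints "0|a count=2 sum=12" then "1|a count=1 sum=1"
        aggregate(logs, 0, 1).forEach((key, m) ->
                System.out.println(key + " count=" + m.count + " sum=" + m.sum));
    }
}
```

Even in this small case, every extra metric needs a new field plus matching mapper and reducer logic, and the group key has to be assembled by hand.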
From Java Stream to Java DataFrame
Java Stream
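• Feels somewhat complicated for this kind of application
• Parallel processing is awkward to use
• java.util.stream.Collector is cumbersome for multi-metric aggregation
• Sometimes we want column-based operations rather than operations on a single type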
So we developed Poppy
http://tenmax.github.io/poppy/
Introduction to Poppy
• Poppy is a DataFrame library for Java
• What is a data frame?
– Column based (schema)
– Supports RDBMS-like operations: select, from, where, group by, aggregation, order by
• Poppy also offers
– Stream based (suited to larger data sets)
– Support for partitions and parallel computation
– User Defined Function, User Defined Aggregation Function
– Lightweight
• In short, it is a Java Stream with a schema
Poppy looks roughly like this
from
where
group by
aggregation
That’s All!!
Poppy
• The pipeline has three parts
– Input
– Operations
– Output
[Diagram: Input → Operation → Operation → Operation → Operation → Output]
Input
• By Iterable
DataFrame.from(Class<T> clazz, java.util.Iterable... iterables)
• By DataSource
DataFrame.from(io.tenmax.DataSource dataSource)
• where DataSource is defined as
Output
• iterator(), forEach()
• toList(), toMap(), print()
• DataFrame.to(DataSink dataSink)
• where DataSink is defined as
Operations
• project()
• filter()
• Aggregation()
• groupby()
• Sort()
• distinct()
• peek()
• cache()
Projection (Select)
Filter (Where, Having)
Aggregation (Count, Sum, Avg, …)
Sort (Order by)
Distinct
Demo
User-Defined Function
• Uses java.util.function.Function<T,R>
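For readers unfamiliar with java.util.function.Function, here is a small standalone sketch of what a function of that type looks like and how it can be applied to every value in a column. It uses only the JDK; DAY_PART and applyToColumn are hypothetical helpers written for this example and are not part of Poppy.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class UdfSketch {
    // Hypothetical user-defined function: hour of day (0-23) -> coarse label.
    static final Function<Integer, String> DAY_PART =
            hour -> hour < 12 ? "AM" : "PM";

    // Applies any Function<T,R> to each value of a column, producing a new column.
    static <T, R> List<R> applyToColumn(List<T> column, Function<T, R> udf) {
        return column.stream().map(udf).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> hours = Arrays.asList(1, 13, 23);
        System.out.println(applyToColumn(hours, DAY_PART)); // prints [AM, PM, PM]

        // Functions compose, so small UDFs can be chained into larger ones.
        Function<Integer, String> lowerCased = DAY_PART.andThen(s -> s.toLowerCase());
        System.out.println(applyToColumn(hours, lowerCased)); // prints [am, pm, pm]
    }
}
```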
User-Defined Aggregation Function
• Uses java.util.stream.Collector<T,A,R>
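The three type parameters of Collector<T,A,R> are the input type, the mutable accumulator type, and the result type. As an illustration, the sketch below builds an averaging collector with the JDK's Collector.of; it is a generic example of the interface, and the class and method names are invented here rather than taken from Poppy.

```java
import java.util.stream.Collector;
import java.util.stream.Stream;

class UdafSketch {
    // T = Long (input), A = long[] holding {sum, count}, R = Double (average).
    static Collector<Long, long[], Double> average() {
        return Collector.of(
                () -> new long[2],                                // supplier: empty accumulator
                (acc, value) -> { acc[0] += value; acc[1]++; },   // accumulator: add one input
                (left, right) -> {                                // combiner: merge two partial
                    left[0] += right[0];                          // accumulators (used when the
                    left[1] += right[1];                          // work is split across threads)
                    return left;
                },
                acc -> acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1]); // finisher: sum / count
    }

    public static void main(String[] args) {
        System.out.println(Stream.of(1L, 2L, 6L).collect(average())); // prints 3.0
    }
}
```

The combiner is what makes the same aggregation usable on partitioned data: each partition fills its own accumulator, and the partial results are merged at the end.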
Parallel computation
• A partition is the basic unit of parallelism
• One DataSource can provide multiple partitions
• dataFrame.parallel(n) sets the number of parallel threads
Execution Context
• An Execution Context represents a thread pool.
• It may contain n threads and m partitions
• Usually m >= n; after finishing one partition, each thread pulls the next unprocessed partition
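The n-threads, m-partitions scheme can be illustrated with a plain JDK thread pool. This is a sketch of the general idea only and makes no claim about how Poppy implements its execution context; the class name, the method name, and the use of a sum as the per-partition work are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ExecutionContextSketch {
    // Sums m partitions using a pool of n worker threads.
    static long sumPartitions(List<List<Long>> partitions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Long> partition : partitions) {
                // One task per partition. Tasks wait in the pool's queue, and a
                // worker that finishes one partition takes the next queued one.
                Callable<Long> task =
                        () -> partition.stream().mapToLong(Long::longValue).sum();
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // wait for each partial result
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("partition task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Long>> partitions = Arrays.asList(
                Arrays.asList(1L, 2L), Arrays.asList(3L), Arrays.asList(4L, 5L, 6L));
        System.out.println(sumPartitions(partitions, 2)); // prints 21
    }
}
```

With m larger than n, a slow partition occupies only one worker while the others keep draining the queue, which is the load-balancing behavior the bullets describe.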
Execution Context
• Each call to aggregation, sort, or distinct creates a new execution context.
Demo
Conclusion
• Java Stream does not handle column-based requirements easily.
• Our DataFrame library, Poppy, offers a simpler way to process column-based data.
• It parallelizes easily to process large amounts of data.
• Yet it remains very lightweight.
Reference
• Project Site - http://tenmax.github.io/poppy/
• Poppy User Manual - http://tenmax.github.io/poppy/
• Poppy Javadoc - http://tenmax.github.io/poppy/docs/javadoc/index.html
• Java multithreading fundamentals - https://www.gitbook.com/book/popcornylu/java_multithread/details
• pq - https://github.com/tenmax/pq
If you like it, please give the project a star
Thank you! Questions?
