Building a Unified Data Pipeline in Spark
Aaron Davidson
Slides adapted from Matei Zaharia
spark.apache.org
What is Apache Spark? 
Fast and general cluster computing system 
interoperable with Hadoop 
Improves efficiency through: 
»In-memory computing primitives 
»General computation graphs 
Improves usability through: 
»Rich APIs in Java, Scala, Python 
»Interactive shell 
Up to 100× faster 
(2-10× on disk) 
2-5× less code 
A Hadoop-compatible cluster computing system 
that improves both performance and usability
Project History 
Started at UC Berkeley in 2009, open 
sourced in 2010 
50+ companies now contributing 
»Databricks, Yahoo!, Intel, Cloudera, IBM, … 
Most active project in Hadoop ecosystem 
Born at UC Berkeley 
50+ companies participate in its open source development
A General Stack 
Spark core, with Spark Streaming (real-time), Spark SQL (structured), GraphX (graph), MLlib (machine learning), … 
Structured queries, real-time analytics, graph processing, and machine learning
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Introduction to Spark and its use cases
Why a New Programming 
Model? 
MapReduce greatly simplified big data 
analysis 
But once started, users wanted more: 
»More complex, multi-pass analytics (e.g. ML, 
graph) 
»More interactive ad-hoc queries 
»More real-time stream processing 
All 3 need faster data sharing in parallel 
What users want after MapReduce: 
more complex analytics, interactive queries, real-time processing
Data Sharing in MapReduce 
[Diagram: iterative job — each iteration reads its input from HDFS and writes its output back to HDFS; ad-hoc queries — every query re-reads the same input from HDFS] 
Slow due to replication, serialization, and disk IO 
Data sharing in MapReduce is slow because of disk I/O
What We’d Like 
[Diagram: after one-time processing, iterations and queries share data through distributed memory] 
10-100× faster than network and disk 
We want a 10-100× speedup over network and disk
Spark Model 
Write programs in terms of transformations 
on distributed datasets 
Resilient Distributed Datasets (RDDs) 
»Collections of objects that can be stored in 
memory or disk across a cluster 
»Built via parallel transformations (map, filter, …) 
»Automatically rebuilt on failure 
Self-healing resilient distributed datasets (RDDs) 
RDDs are transformed in parallel with methods such as map and filter
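The RDD model above can be miniaturized in plain Python: a dataset is a recipe of transformations, evaluated lazily, with cache() materializing the result once. `ToyRDD` is a toy sketch for illustration only, not Spark's API:

```python
# Toy, Spark-free sketch of the RDD idea: a dataset is described by the
# transformations that produce it, evaluated lazily, and cacheable.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute      # thunk that produces the data
        self._cached = None          # filled in by cache()

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, f):
        # builds a new recipe; nothing runs until collect()
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cached = self._compute()   # materialize once, reuse afterwards
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

nums = ToyRDD.from_list([1, 2, 3, 4])
evens = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * 10).cache()
```

Calling `evens.collect()` repeatedly now returns the cached `[20, 40]` without recomputing the chain, which is the essence of in-memory data sharing.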
Example: Log Mining 
Load error messages from a log into memory, 
then interactively search for various patterns 
lines = spark.textFile("hdfs://...")                    # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()           # Action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to workers; each worker scans its block, caches its partition of messages, and returns results] 
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data), scaling to 1 TB of data in 5-7 sec (vs 170 sec on disk) 
Search interactively with various patterns; processing time for 1 TB drops from 170 sec to 5-7 sec
Fault Tolerance 
RDDs track lineage info to rebuild lost data 
file.map(lambda rec: (rec.type, 1)) 
    .reduceByKey(lambda x, y: x + y) 
    .filter(lambda pair: pair[1] > 10)  # (type, count) pairs

[Diagram: input file → map → reduce → filter] 
Lineage information is tracked to rebuild lost data
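Lineage-based recovery can be sketched in plain Python: a lost partition is rebuilt by replaying the recorded transformations over its input, rather than by restoring a replica. `build_partition` is an illustrative helper, not Spark's implementation:

```python
# Toy sketch of lineage-based recovery: each partition remembers its input
# and the chain of functions that produced it, so a lost partition can be
# recomputed instead of replicated.
def build_partition(input_part, lineage):
    data = list(input_part)
    for fn in lineage:            # replay the recorded transformations
        data = fn(data)
    return data

lineage = [
    # map: rec -> (rec.type, 1)
    lambda d: [(rec["type"], 1) for rec in d],
    # reduceByKey: sum counts per key (single-partition version)
    lambda d: [(k, sum(v for t, v in d if t == k)) for k in dict(d)],
]

input_part = [{"type": "ERROR"}, {"type": "WARN"}, {"type": "ERROR"}]
partition = build_partition(input_part, lineage)

# If this partition is lost, replaying the same lineage rebuilds it exactly.
recovered = build_partition(input_part, lineage)
```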
Example: Logistic Regression 

[Chart: running time (s) vs number of iterations (1-30), Hadoop vs Spark] 
Hadoop: 110 s / iteration 
Spark: first iteration 80 s, further iterations 1 s 
Logistic regression
Behavior with Less RAM 

[Chart: iteration time (s) vs % of working set in memory] 
Cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s 
Behavior as the cache is reduced
Spark in Scala and Java 

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Spark in Scala and Java 

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java 8:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
Supported Operators 
map 
filter 
groupBy 
sort 
union 
join 
leftOuterJoin 
rightOuterJoin 
reduce 
count 
fold 
reduceByKey 
groupByKey 
cogroup 
cross 
zip 
sample 
take 
first 
partitionBy 
mapWith 
pipe 
save 
...
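Two of the pair operators listed above can be pinned down with tiny single-machine sketches. The names `reduce_by_key` and `cogroup` here are illustrative Python stand-ins showing the semantics only, not Spark's distributed implementation:

```python
from collections import defaultdict

def reduce_by_key(pairs, f):
    # fold together all values that share a key
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return sorted(out.items())

def cogroup(left, right):
    # group both datasets by key: key -> (values from left, values from right)
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return sorted(groups.items())

counts = reduce_by_key([("a", 1), ("b", 2), ("a", 3)], lambda x, y: x + y)
grouped = cogroup([("a", 1)], [("a", 2), ("b", 3)])
```

Here `counts` is `[("a", 4), ("b", 2)]` and `grouped` pairs each key with the values it has on each side, which is also the building block for the join variants in the list.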
Spark Community 
250+ developers, 50+ companies contributing 
Most active open source project in big data 
[Chart: commits in the past 6 months — Spark ahead of MapReduce, YARN, HDFS, and Storm] 
The most active open source project in big data
Continuing Growth 
source: ohloh.net 
Contributors per month to Spark 
The number of contributors keeps growing
Get Started 
Visit spark.apache.org for docs & tutorials 
Easy to run on just your laptop 
Free training materials: spark-summit.org 
You can start with a single laptop
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Modules built on Spark
The Spark Stack 
Spark core, with Spark Streaming (real-time), Spark SQL (structured), GraphX (graph), MLlib (machine learning), … 
The Spark stack
Spark SQL 
Evolution of the Shark project 
Allows querying structured data in Spark 

From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")

{"text": "hi",
 "user": {
   "name": "matei",
   "id": 123
 }}

The successor to Shark: query structured data in Spark
Spark SQL 
Integrates closely with Spark’s language APIs 
c.registerFunction("hasSpark", lambda text: "Spark" in text)
c.sql("select * from tweets where hasSpark(text)")

Uniform interface for data access: SQL from Python, Scala, and Java over Hive, Parquet, JSON, Cassandra, … 
Integrates with Spark's language APIs 
Provides a uniform interface over many data sources
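The registerFunction call above has a close analogue in Python's standard library: sqlite3 lets a Python function be called from SQL via `create_function`. This is only an analogy for the UDF idea, not Spark SQL itself, and the table and data are made up:

```python
import sqlite3

# In-memory database with a tiny tweets table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?)",
                 [("I love Spark",), ("hello world",)])

# Register a Python predicate as a 1-argument SQL function,
# mirroring Spark SQL's registerFunction("hasSpark", ...)
conn.create_function("hasSpark", 1,
                     lambda text: 1 if "Spark" in text else 0)

rows = conn.execute("SELECT text FROM tweets WHERE hasSpark(text)").fetchall()
```

The query returns only the row containing "Spark", showing how a language-level function becomes usable inside SQL.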
Spark Streaming 
Stateful, fault-tolerant stream processing 
with the same API as batch jobs 
sc.twitterStream(...)
  .map(tweet => (tweet.language, 1))
  .reduceByWindow("5s", _ + _)

[Chart: throughput comparison, Spark vs Storm]
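The reduceByWindow idea — bucketing timestamped events into fixed windows and reducing within each — can be sketched without Spark. `count_by_window` is a hypothetical single-machine helper:

```python
# Toy sketch of windowed reduction: count events per (window, language),
# using fixed, non-overlapping windows of window_secs seconds.
def count_by_window(events, window_secs):
    windows = {}
    for t, lang in events:
        w = (t // window_secs) * window_secs   # start time of the window
        key = (w, lang)
        windows[key] = windows.get(key, 0) + 1
    return sorted(windows.items())

# (timestamp in seconds, tweet language)
events = [(0, "en"), (1, "ja"), (3, "en"), (6, "en")]
counts = count_by_window(events, 5)
```

With a 5-second window this yields two "en" events and one "ja" event in the first window and one "en" event in the second, which is what the slide's reduceByWindow("5s", _ + _) computes over a live stream.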
MLlib 
Built-in library of machine learning 
algorithms 
»K-means clustering 
»Alternating least squares 
»Generalized linear models (with L1 / L2 reg.) 
»SVD and PCA 
»Naïve Bayes 
points = sc.textFile(...).map(parsePoint) 
model = KMeans.train(points, 10) 
A built-in machine learning library
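KMeans.train boils down to Lloyd's iteration: assign each point to its nearest center, then move each center to the mean of its cluster. A minimal pure-Python sketch of that loop (MLlib's version is distributed and far more robust):

```python
# Minimal Lloyd's iteration for k-means on small 2-D tuples.
def kmeans(points, centers, iterations):
    for _ in range(iterations):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)], iterations=5)
```

On this toy data the centers settle at the means of the two obvious clusters, (0.25, 0.0) and (10.25, 10.0).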
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
The power of a unified stack
Big Data Systems Today 
MapReduce 
Pregel 
Dremel 
GraphLab 
Storm 
Giraph 
Drill 
Tez 
Impala 
S4 
… 
Specialized systems 
(iterative, interactive and 
streaming apps) 
General batch 
processing 
Today: a proliferation of specialized big data systems
Spark’s Approach 
Instead of specializing, generalize MapReduce 
to support new apps in same engine 
Two changes (general task DAG & data 
sharing) are enough to express previous 
models! 
Unification has big benefits 
»For the engine 
»For users 

[Diagram: Spark Streaming, GraphX, Shark, MLbase, … built on one Spark engine] 
Spark's approach: instead of specializing, 
support new applications on the same general engine
What it Means for Users 
Separate frameworks: 
HDFS read → ETL → HDFS write; HDFS read → train → HDFS write; HDFS read → query → HDFS write; … 

Spark: 
HDFS read → ETL → train → query, plus interactive analysis 

Everything runs within Spark, including interactive analysis
Combining Processing 
Types 
// Load data using SQL
val points = ctx.sql(
  "select latitude, longitude from historic_tweets")

// Train a machine learning model
val model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

Combine different processing types: SQL, machine learning, application to streams
This Talk 
Spark introduction & use cases 
Modules built on Spark 
The power of unification 
Demo 
Demo
The Plan 
Raw JSON 
Tweets 
SQL 
Streaming 
Machine 
Learning 
- Read raw JSON tweets from HDFS 
- Extract tweet text with Spark SQL 
- Extract feature vectors and train a k-means model 
- Cluster the tweet stream with the trained model
Demo!
Summary: What We Did 
Raw JSON 
SQL 
Streaming 
Machine 
Learning 
- Read raw JSON from HDFS 
- Extract tweet text with Spark SQL 
- Extract feature vectors and train a k-means model 
- Cluster the tweet stream with the trained model
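The first half of this pipeline can be mimicked on a single machine to make the data flow concrete. The `featurize` below is a toy hashed bag-of-words standing in for the demo's real feature extractor, and the sample tweets are made up:

```python
import json
import zlib

def featurize(text, dims=8):
    # Hashed bag-of-words: bucket each word into one of `dims` slots.
    # crc32 is used instead of hash() so results are stable across runs.
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dims] += 1.0
    return vec

# One JSON document per line, as the demo reads from HDFS
raw = ['{"text": "spark spark streaming"}', '{"text": "good morning"}']

texts = [json.loads(line)["text"] for line in raw]   # the Spark SQL step
vectors = [featurize(t) for t in texts]              # input to KMeans.train
```

Each vector sums to the tweet's word count, and these vectors are what a k-means trainer would cluster before the streaming step applies the model to live tweets.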
import org.apache.spark.sql._
val ctx = new org.apache.spark.sql.SQLContext(sc)
val tweets = sc.textFile("hdfs:/twitter")
val tweetTable = JsonTable.fromRDD(ctx, tweets, Some(0.1))
tweetTable.registerAsTable("tweetTable")

ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println)
ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable " +
  "GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println)

val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)

def featurize(str: String): Vector = { ... }
val vectors = texts.map(featurize).cache()
val model = KMeans.train(vectors, 10, 10)
sc.makeRDD(model.clusterCenters, 10).saveAsObjectFile("hdfs:/model")

// Streaming application (modelFile and clusterNumber are defined elsewhere)
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val model = new KMeansModel(
  ssc.sparkContext.objectFile(modelFile).collect())

val tweets = TwitterUtils.createStream(ssc, /* auth */)
val statuses = tweets.map(_.getText)
val filteredTweets = statuses.filter {
  t => model.predict(featurize(t)) == clusterNumber
}
filteredTweets.print()
ssc.start()
Conclusion 
Big data analytics is evolving to include: 
»More complex analytics (e.g. machine learning) 
»More interactive ad-hoc queries 
»More real-time stream processing 
Spark is a fast platform that unifies these 
apps 
Learn more: spark.apache.org 
Big data analytics is evolving toward more complex, interactive, and real-time workloads 
Spark is a fast platform that unifies these applications


Editor's Notes

  • #4: TODO: Apache incubator logo
  • #8: Each iteration is, for example, a MapReduce job
  • #11: Add “variables” to the “functions” in functional programming
  • #14: 100 GB of data on 50 m1.xlarge EC2 machines
  • #19: Alibaba, Tencent. At Berkeley, we have been working on a solution since 2009. This solution consists of a software stack for data analytics, called the Berkeley Data Analytics Stack. The centerpiece of this stack is Spark. Spark has seen significant adoption, with hundreds of companies using it, of which around sixteen have contributed code back. In addition, Spark has been deployed on clusters that exceed 1,000 nodes.
  • #20: Despite Hadoop having been around for 7 years, the Spark community is still growing; to us this shows that there’s still a huge gap in making big data easy to use and contributors are excited about Spark’s approach here