Tutorial(release)

MODERN
TECHNOLOGIES IN
DATA SCIENCE
CHIH-CHIEH HUNG
CHU-CHENG HSIEH

HELLO!
Chih-Chieh Hung 洪智傑

WE WILL COVER…
• 9:00-10:40
1. Introduction to Big Data Systems
2. Brief Introduction to NoSQL
3. Good Practice of Cassandra Modeling
4. Introduction to Kafka
• 11:10-12:30
1. Map-Reduce
2. Introduction to Spark
3. Spark Ecosystem

RAKUTEN INC.
• The biggest e-commerce company in
Japan
• Business model:

SYSTEM
REQUIREMENTS
• General system requirements
• Handle millions of users
• Support collection and storage of complex
data
• Real-time system requirements
• Quickly retrieve subsets of a single user’s
data
• Aggregate/derive new analytics results
per user

LAMBDA ARCHITECTURE
FOR BIG DATA SYSTEMS

LAMBDA ARCHITECTURE WITH
STATE-OF-THE-ART TECHNOLOGIES

HELLO! NOSQL!
• NoSQL = Not Only SQL
• NoSQL = data storage
• Schema-less
• Poor at joining data

CAP DEFINITION (1/3)
• Consistency
• Data is consistent and the same for all
nodes

• Availability
• Every request to non-failing node should
be processed and receive response

• Partition-Tolerance
• If some nodes crash/communication fails,
the service still performs as expected

YARN ARCHITECTURE
(MR V2)
Master Node
Slave Nodes

RUNNING AN
APPLICATION IN YARN
(1/3)

RUNNING AN
APPLICATION IN YARN
(2/3)

RUNNING AN
APPLICATION IN YARN
(3/3)

C* ARCHITECTURE
• Cluster: Ring
• Peer-to-Peer model
• Gossip protocol
(Figure labels: Coordinator node, DATA)

POWER OF C*
• Elastic
• R/W throughput increases linearly as new
machines are added
• Decentralized
• Fault tolerant with no single point of
failure; no “master” node

GOOD PRACTICE OF
CASSANDRA MODELING
*. Some ideas are from:
1. http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1
2. http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-2

CASSANDRA IN LAMBDA
ARCHITECTURE

INTRO TO C* DATA
MODEL
Column Family
Key Space

FOR NEWCOMERS
• We can think of it this way:
• Key Space ≈ Database
• Column Family ≈ Table
• Row Key ≈ Primary Key

PRINCIPLE #1
• Sorted map = efficient key lookup
• E.g., getting the employees from id=5 to id=10.
Row key   Age   Name    State
1         27    Steve   California
2         29    Chris   Montana
150       37    Ken     California
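The range lookup mentioned above can be sketched in a few lines. This is only an illustration of the sorted-map idea, not Cassandra's API: the SortedMap class, its method names, and the employee ids are all hypothetical.

```python
from bisect import bisect_left, bisect_right


class SortedMap:
    """Toy sorted map (hypothetical): keys stay ordered so range scans are cheap."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            # Insert the new key at its sorted position.
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def range(self, low, high):
        """Return (key, value) pairs with low <= key <= high, in key order."""
        start = bisect_left(self._keys, low)
        end = bisect_right(self._keys, high)
        return [(k, self._values[k]) for k in self._keys[start:end]]


employees = SortedMap()
for emp_id in (12, 5, 7, 150, 10, 1):  # made-up ids, inserted out of order
    employees.put(emp_id, {"name": f"employee-{emp_id}"})

in_range = employees.range(5, 10)  # binary search to the start, then a short scan
```

Because the keys are sorted, the lookup never touches entries outside the requested range.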

PRINCIPLE #2
• Storing value in column name is perfectly
ok
• Column key maximum: 64KB
• Use wide row for ordering, grouping and
filtering
• e.g.: Items sold per state per city
Row key: 9527
Column name     Value
CA|San Diego    3000
CA|San Jose     207
NV|Las Vegas    10000
NV|Reno         3227
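A minimal sketch of why values in column names help with grouping and filtering, assuming columns within a row are kept sorted by name. The row is modeled as a plain Python dict and slice_by_prefix is a hypothetical helper, not a Cassandra call; the figures are the ones from the example.

```python
from bisect import bisect_left

# One wide row: the column name carries the data ("state|city"), the value is a count.
wide_row = {
    "CA|San Diego": 3000,
    "CA|San Jose": 207,
    "NV|Las Vegas": 10000,
    "NV|Reno": 3227,
}


def slice_by_prefix(row, prefix):
    """Return the columns whose name starts with prefix, in sorted order."""
    names = sorted(row)
    start = bisect_left(names, prefix)
    result = []
    for name in names[start:]:
        if not name.startswith(prefix):
            break  # names are sorted, so no later name can match
        result.append((name, row[name]))
    return result


ca_columns = slice_by_prefix(wide_row, "CA|")        # filter: only California
ca_total = sum(count for _, count in ca_columns)     # group: total per state
```

All cities of one state sit next to each other, so a per-state query is one contiguous slice.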

PRINCIPLE #3
• Choose a proper row key
• E.g.: Query: (date, state) → # of sales.
Which is better?
A. Row key: 20150409
Column name     Value
CA|San Diego    3000
UT|Ogden        10000
UT|Salt Lake    3227
B. Row key: 20150409|CA
Column name     Value
San Diego       3000
Sunnyvale       10000
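The two row-key choices can be compared with a small sketch. It assumes the query always supplies both date and state; the dict layouts and function names are hypothetical stand-ins for the two column families, using the numbers from the example.

```python
# Date-only row key: column name = "state|city"
by_date = {
    "20150409": {"CA|San Diego": 3000, "UT|Ogden": 10000, "UT|Salt Lake": 3227},
}

# "date|state" row key: column name = city
by_date_state = {
    "20150409|CA": {"San Diego": 3000, "Sunnyvale": 10000},
}


def sales_with_date_key(date, state):
    """One row per date: the state is selected by slicing on the column prefix."""
    row = by_date.get(date, {})
    return sum(v for name, v in row.items() if name.startswith(state + "|"))


def sales_with_date_state_key(date, state):
    """One row per (date, state): the whole row is the answer."""
    return sum(by_date_state.get(f"{date}|{state}", {}).values())


ut_sales = sales_with_date_key("20150409", "UT")
ca_sales = sales_with_date_state_key("20150409", "CA")
```

With the date|state key the query is a single row read, and a day's data is spread over one row per state instead of piling into one ever-growing row. Which layout is preferable still depends on the other queries the table must serve.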

PRINCIPLE #4
• Order of composite column name matters
• E.g., two orderings of the composite name give two
ways of grouping
A. Row key: 123
Column name      Value
20150401|Buy     5
20150401|Sell    2
20150409|Buy     0
20150409|Sell    10
B. Row key: 123
Column name      Value
Buy|20150401     5
Buy|20150409     0
Sell|20150401    2
Sell|20150409    10
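A short sketch of the effect, assuming column names are compared as plain strings. The records are the ones from the example; the variable names are made up for illustration.

```python
# Each record: (date, action, count) for row 123.
records = [
    ("20150401", "Buy", 5),
    ("20150401", "Sell", 2),
    ("20150409", "Buy", 0),
    ("20150409", "Sell", 10),
]

# "date|action": sorted by date first, so one day's columns are adjacent.
date_first = sorted(f"{date}|{action}" for date, action, _ in records)

# "action|date": sorted by action first, so one action's history is adjacent.
action_first = sorted(f"{action}|{date}" for date, action, _ in records)
```

The first component of the name decides which questions become a contiguous slice: "everything on 20150409" for date-first, "all Buy counts over time" for action-first.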

PRINCIPLE #5
• Make sure the column key and row key
are unique
• Otherwise, data could easily get
accidentally overwritten.
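The overwrite risk can be shown with a dict standing in for one row, since a write to an existing key simply replaces the old value. The write helper and the order ids are hypothetical.

```python
def write(row, column_key, value):
    """Upsert: a write to an existing column key replaces the old value."""
    row[column_key] = value


# Column key = date only: two orders on the same day collide.
colliding_row = {}
write(colliding_row, "20150409", "order-1")
write(colliding_row, "20150409", "order-2")  # silently replaces order-1

# Column key = date plus a unique order id: both orders survive.
unique_row = {}
write(unique_row, "20150409|order-1", "order-1")
write(unique_row, "20150409|order-2", "order-2")
```

No error is raised in the colliding case, which is why the loss is easy to miss.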

PRINCIPLE #6
• Split hot & cold data in separate column
families
• E.g., better to split into two column families
*. Examples are from: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1

PRINCIPLE #7
• Data stored in a single column family should all
be of the same type
• E.g., better to break this row into multiple column families
Row key: 20150409
Column name    Value
User|Name      Oshin
Buy|1234       1
View|1234      4
User|Email     o@a.b

PRINCIPLE #8 (1/7)
• De-normalize and duplicate for read
performance
• In the relational world, normalization has:
PRO
1. Less data duplication
2. Fewer data modification anomalies
3. Conceptually clear
4. Easy to maintain
CON
1. Queries might perform slowly if many tables are
joined (10x worse in C*)

PRINCIPLE #8 (2/7)
• Example: ‘Like’ relationship between
User and Item
*. Examples are from: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-
best-practices-part-1

PRINCIPLE #8 (3/7)
• Want to perform 4 queries:
1. Get user info by user ID
2. Get item info by item ID
3. Get all items that a given user likes
4. Get all users who like a given item

PRINCIPLE #8 (4/7)
• Option 1:
3. Get all items that a given
user likes
4. Get all users who like a
given item

PRINCIPLE #8 (5/7)
• Option 2:
3. Get all items that a given user likes
4. Get all users who like a given item

PRINCIPLE #8 (6/7)
• Option 3:
3. Get all items that a given user likes
4. Get all users who like a given item

PRINCIPLE #8 (7/7)
• Option 4:
3. Get all items that a given user likes
4. Get all users who like a given item

SUMMARY
• Think about query patterns and indexing
from the beginning
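Working backwards from the four queries in Principle #8, one possible de-normalized layout looks like the sketch below. It is not necessarily any of the four options shown earlier; the dict names, the add_like helper, and the sample users and items are all hypothetical.

```python
from collections import defaultdict

users = {"u1": {"name": "Ann"}, "u2": {"name": "Bob"}}           # query 1: user by id
items = {"i1": {"title": "Camera"}, "i2": {"title": "Guitar"}}   # query 2: item by id

# Two extra "column families", one per remaining query. Each duplicates the
# field the reader needs, so no join is required at read time.
items_by_user = defaultdict(dict)  # query 3: user_id -> {item_id: item title}
users_by_item = defaultdict(dict)  # query 4: item_id -> {user_id: user name}


def add_like(user_id, item_id):
    """Record one 'like' in both directions (two writes, zero joins on read)."""
    items_by_user[user_id][item_id] = items[item_id]["title"]
    users_by_item[item_id][user_id] = users[user_id]["name"]


add_like("u1", "i1")
add_like("u1", "i2")
add_like("u2", "i1")

liked_by_u1 = items_by_user["u1"]   # answered from a single row
fans_of_i1 = users_by_item["i1"]    # answered from a single row
```

The cost is extra writes and duplicated data: if a title or name changes, every copy has to be updated.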

WHAT IS KAFKA?
IN ONE SENTENCE
Kafka is a distributed publish-subscribe
system designed for web-scale stream
processing
• Created by LinkedIn, contributed to Apache in 2011
• Written in Scala
• Multi-language support for Consumer API
(Scala, Java, Ruby, Python, C++, Go, PHP…)

WHY DO WE NEED KAFKA?
The story starts with just one data pipeline

WHY DO WE NEED KAFKA?
Reuse of data pipelines for new providers

WHY DO WE NEED KAFKA?
Reuse of existing providers for new
consumers

WHY DO WE NEED KAFKA?
Eventually the solution becomes the
problem

WHY DO WE NEED KAFKA?
Kafka decouples data pipelines

PHYSICAL
COMPONENTS
• Producer
• Consumer
• Broker
• ZooKeeper
• A centralized service for maintaining configuration
information, naming, distributed synchronization,
and group services

LOGICAL
COMPONENTS
• Topics
• The feed name to which messages are published
• Partitions
• 1 topic = n partitions
• Message
• key/value pair
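How a key/value message ends up in one of a topic's n partitions can be sketched as follows. The sketch assumes the partition is chosen by hashing the key; zlib.crc32 is a stand-in, not the hash Kafka actually uses, and choose_partition and the sample messages are hypothetical.

```python
import zlib

NUM_PARTITIONS = 3


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition (stand-in hash: crc32)."""
    return zlib.crc32(key) % num_partitions


messages = [
    (b"user-1", b"login"),
    (b"user-2", b"click"),
    (b"user-1", b"logout"),
]

# Each partition is an append-only list of (key, value) pairs.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, value in messages:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))
```

Messages with the same key always land in the same partition, so their relative order is preserved there.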

MESSAGE FROM
PRODUCER TO BROKER
• Kafka can guarantee messages are
handled in FIFO order within a partition.
• Three modes:
• At most once (Async)
• At least once (Sync)
• Exactly once (not supported as of v0.9)
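The difference between the first two modes can be illustrated with a toy simulation. FlakyBroker and both send functions are hypothetical and only model the retry behavior: at most once never retries, at least once retries until an acknowledgement arrives.

```python
class FlakyBroker:
    """Hypothetical in-memory broker that can lose a message or an ack."""

    def __init__(self, drop_message_on=(), drop_ack_on=()):
        self.log = []                        # messages the broker stored
        self.attempts = 0                    # number of send() calls seen
        self.drop_message_on = set(drop_message_on)
        self.drop_ack_on = set(drop_ack_on)

    def send(self, message):
        """Return True if the producer receives an ack, False otherwise."""
        self.attempts += 1
        if self.attempts in self.drop_message_on:
            return False                     # lost before being stored
        self.log.append(message)
        return self.attempts not in self.drop_ack_on  # stored; ack may be lost


def send_at_most_once(broker, message):
    broker.send(message)                     # fire and forget: never retry


def send_at_least_once(broker, message, max_retries=5):
    for _ in range(max_retries):
        if broker.send(message):
            return True                      # acked: stop retrying
    return False


lossy = FlakyBroker(drop_message_on={1})
send_at_most_once(lossy, "m1")               # message is lost, nobody notices

duplicating = FlakyBroker(drop_ack_on={1})
acked = send_at_least_once(duplicating, "m1")  # first ack lost, so it is sent again
```

At most once can lose data; at least once can store duplicates, which is why consumers in that mode should tolerate seeing a message twice.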

TOPIC
• Feed name to which messages are
published
• Example: zerg.hydra

PARTITION,
CONSUMER (1/3)
• Queue model
• 1 Topic
• 1 Partition
• 1 Consumer

PARTITION,
CONSUMER (2/3)
• Pub/Sub Model
• 1 Topic
• 1 Partition
• N Consumers

PARTITION,
CONSUMER (3/3)
• Pub/Sub Model
• 1 Topic
• N Partitions
• N Consumers

TOPICS, PARTITIONS,
REPLICAS
*. Thanks: http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/

PUT IT TOGETHER
*. Thanks: http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
• Partition: 3
• Replica: 2

CONSUMER GROUP
• Multiple high-level consumers can participate in a single consumer group.
• Coordinated by ZooKeeper
• Each partition will be consumed by exactly one consumer in the group.
• Share offsets
LOG COMPACTION
• Many data systems retain the latest state
of the data for each key
• Log compaction: keep at least the latest
message for each key in the log
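The idea behind log compaction can be sketched as a pure function over a list of keyed messages. This models the outcome only (the latest value per key survives, offsets are kept), not how a broker performs it; the compact function and the sample log are hypothetical.

```python
def compact(log):
    """Keep only the latest record for each key, preserving offsets and order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset          # later offsets overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
compacted = compact(log)
```

Replaying the compacted log still rebuilds the current state of every key, just with fewer messages.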

CODE PIECES
• Create topic
$ bin/kafka-topics.sh --zookeeper localhost:2181 \
    --create --topic zerg.hydra --partitions 3 \
    --replication-factor 2

CODE PIECES
• Inspect the config of a topic

CODE PIECES
• Start a producer
$ bin/kafka-console-producer.sh \
    --broker-list localhost:9092,localhost:9093,localhost:9094 \
    --sync --topic zerg.hydra
• After that, type some messages in the console:
Hello world!
Rock: Nerf Paper. Scissors is fine.

CODE PIECES
• Start a consumer
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 \
    --topic zerg.hydra --from-beginning
• After the consumer starts, the end of the output shows:
Hello world!
Rock: Nerf Paper. Scissors is fine.
