Demonstration

Outline
● Some comments on what we're trying to show
○ high level cluster configuration
○ an example application that might use this config
■ based on a Gowalla data set
● Launch cluster nodes on EC2
● Launch/configure Cassandra on cluster
● Demonstrate use of Cassandra
○ cassandra-cli, pycassa scripts to interact with db
● Demonstrate use of Hadoop
● Demonstrate use of Pig on the real data

Cluster configuration
● Four EC2 nodes
○ m1.medium instances
■ realistically a bit small for real-world use
● Three nodes form the Cassandra cluster
○ data can be input dynamically into db via Thrift API
● All Cassandra nodes run a Hadoop TaskTracker
● MapReduce runs close to (Cassandra) data
● JobTracker on separate node

Cluster config

[Diagram: four nodes. One node runs the JobTracker; the other three each run Cassandra and a TaskTracker.]

All nodes m1.small for demo

Let's get the cluster up...
...over to Lamine!

Let's get Cassandra running...
...and show the basic cli...

Application data
● Used Gowalla data in this test application
● Gowalla provide anonymized data for test/research purposes:
○ Graph of UID connections
○ List of checkins - UID, LocID
● Size of data set:
○ 400MB checkins
■ 6.4m checkins
○ ~200k users
● Also generated a simpler variant of this data for the demonstration
○ more realistic user information
○ more realistic location information
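
A minimal sketch of reading one checkin record, for illustration only. It assumes the tab-separated layout of the public Gowalla checkin file (UID, timestamp, latitude, longitude, LocID); the slides only mention UID and LocID, so the other fields, the `Checkin` tuple and `parse_checkin` are assumptions, not the demo's import script.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; field names are illustrative.
Checkin = namedtuple("Checkin", ["uid", "timestamp", "lat", "lon", "loc_id"])

def parse_checkin(line):
    """Parse one tab-separated checkin line (assumed layout, see above)."""
    uid, ts, lat, lon, loc_id = line.rstrip("\n").split("\t")
    return Checkin(
        uid=int(uid),
        timestamp=datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"),
        lat=float(lat),
        lon=float(lon),
        loc_id=int(loc_id),
    )

# Made-up sample line in the assumed format.
sample = "0\t2010-10-19T23:55:27Z\t30.2359\t-97.7951\t22847"
checkin = parse_checkin(sample)
```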

Application data - User Graph

Simple graph structure: unidirectional graph with UIDs as nodes

Application Data - Checkin info

How this data can be used
● Application interested in:
○ my checkins
○ list my friends
○ checkins at given location
○ my friends' checkins
● Analytics:
○ top ten most active users - most checkins
○ aggregate checkins per week
○ aggregate checkins per week per city
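
The analytics listed above can be sketched in plain Python on a tiny in-memory sample. This only illustrates what is being computed; the demo runs these as Hadoop/Pig jobs, and the sample data and the function names `top_users` and `checkins_per_week` are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows: (uid, loc_id, day of checkin).
checkins = [
    (1, 10, date(2010, 10, 18)),
    (1, 11, date(2010, 10, 19)),
    (2, 10, date(2010, 10, 19)),
    (1, 12, date(2010, 10, 26)),
    (3, 10, date(2010, 10, 27)),
]

def top_users(rows, n=10):
    """Return the n users with the most checkins as (uid, count) pairs."""
    counts = Counter(uid for uid, _, _ in rows)
    return counts.most_common(n)

def checkins_per_week(rows):
    """Count checkins per (ISO year, ISO week)."""
    weeks = Counter()
    for _, _, day in rows:
        iso = day.isocalendar()
        weeks[(iso[0], iso[1])] += 1
    return dict(weeks)

most_active = top_users(checkins, n=1)
weekly = checkins_per_week(checkins)
```

Aggregating per week per city would follow the same pattern, with the city (looked up from the LocID) added to the grouping key.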

Cassandra data models
● The following data models were used:
○ User
○ Location
○ Checkin
○ FriendRels
■ graph of friend relationships
○ UserCheckins
■ checkins by user
○ LocationCheckins
■ checkins by location
○ FriendCheckins
■ checkins by friends

Cassandra data models
● Use of valueless columns
○ FriendRels, UserCheckins, LocationCheckins, FriendCheckins are just sets of valueless columns
● FriendRels:
○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...}
■ row_key is a uid
● UserCheckins:
○ row_key: {checkinid1: '', checkinid2: '', ...}
■ row_key is a uid
● LocationCheckins use LocID as row key
● FriendCheckins use my UID to get my friends' checkins
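
The valueless-column layout above can be modelled with nested dicts, where each row is a set of column names with empty values. This is a sketch, not the demo's pycassa code: the helpers `add_friend` and `add_checkin` are hypothetical, and writing to FriendCheckins at checkin time is an assumption about how that row gets filled.

```python
from collections import defaultdict

# Each "column family" maps row_key -> {column_name: ''}.
friend_rels = defaultdict(dict)        # uid    -> friend uids
user_checkins = defaultdict(dict)      # uid    -> checkin ids
location_checkins = defaultdict(dict)  # loc_id -> checkin ids
friend_checkins = defaultdict(dict)    # uid    -> friends' checkin ids

def add_friend(uid, friend_uid):
    """Record friend_uid as a valueless column in uid's FriendRels row."""
    friend_rels[uid][friend_uid] = ""

def add_checkin(checkin_id, uid, loc_id):
    """Index a checkin by user, by location, and for users who list uid as a friend."""
    user_checkins[uid][checkin_id] = ""
    location_checkins[loc_id][checkin_id] = ""
    # Assumed fan-out on write; scanning every row is only acceptable in a toy model.
    for other_uid, friends in friend_rels.items():
        if uid in friends:
            friend_checkins[other_uid][checkin_id] = ""

add_friend("alice", "bob")
add_checkin("c1", "bob", "loc9")

alice_feed = list(friend_checkins["alice"])  # column names only; values are empty
```

Reading a row back needs only the column names, which is why the values can stay empty.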

Let's import the data into Cassandra...

Using Hadoop and Pig
...and we can do some analytics...
