Demonstration
Outline
● Some comments on what we're trying to
  show
  ○ high level cluster configuration
  ○ an example application that might use this config
    ■ based on a Gowalla data set
● Launch cluster nodes on EC2
● Launch/configure Cassandra on cluster
● Demonstrate use of Cassandra
  ○ cassandra-cli, pycassa scripts to interact with db
● Demonstrate use of Hadoop
● Demonstrate use of Pig on the real data
Cluster configuration
● Four EC2 nodes
  ○ m1.medium instances
    ■ realistically a bit small for real-world use
● 3 nodes part of Cassandra
  ○ data can be input dynamically into db via Thrift API
● All nodes run Hadoop TaskTracker
● MapReduce runs close to (Cassandra) data
● JobTracker on separate node
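
The layout above can be written down as a small data structure. This is only an illustrative sketch: the node names (`node1` to `node4`) and both helper functions are hypothetical, and the role split follows the cluster diagram (one JobTracker node, three nodes running Cassandra and a TaskTracker together).

```python
# Hypothetical node names; roles follow the cluster diagram in the slides.
CLUSTER = {
    "node1": {"jobtracker"},
    "node2": {"cassandra", "tasktracker"},
    "node3": {"cassandra", "tasktracker"},
    "node4": {"cassandra", "tasktracker"},
}


def nodes_with_role(cluster, role):
    """Return the sorted names of all nodes that run the given role."""
    return sorted(name for name, roles in cluster.items() if role in roles)


def data_local_nodes(cluster):
    """Nodes where a map task can read Cassandra data on the same machine."""
    return sorted(
        name
        for name, roles in cluster.items()
        if {"cassandra", "tasktracker"} <= roles
    )


if __name__ == "__main__":
    print("Cassandra nodes:", nodes_with_role(CLUSTER, "cassandra"))
    print("Data-local task nodes:", data_local_nodes(CLUSTER))
```

Co-locating a TaskTracker with each Cassandra node is what lets MapReduce run close to the data.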
Cluster config

  [Job Tracker]

  [Cassandra + Task Tracker]   [Cassandra + Task Tracker]   [Cassandra + Task Tracker]

All nodes m1.small for demo
Let's get the cluster up...
       ...over to Lamine!
Let's get Cassandra
      running...
  ...and show the basic cli...
Application data
● Used Gowalla data in this test application
● Gowalla provide anonymized data for
  test/research purposes:
  ○ Graph of UID connections
  ○ List of checkins - UID, LocID
● Size of data set:
  ○ 400MB checkins
    ■ 6.4m checkins
  ○ ~200k users
● Also generated simpler variant of this data
  for demonstration
  ○ more real user information
  ○ more real location information
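
A minimal sketch of reading the check-in list into records. The slides only state that a check-in carries a UID and a LocID; the tab-separated layout assumed here (user id, UTC timestamp, latitude, longitude, location id) and the names `Checkin`, `parse_checkin` and `parse_checkins` are assumptions for illustration, and the sample rows are made up.

```python
from collections import namedtuple
from datetime import datetime

# Assumed record layout; the slides only guarantee UID and LocID.
Checkin = namedtuple("Checkin", "uid time lat lon loc_id")


def parse_checkin(line):
    """Parse one tab-separated check-in line into a Checkin record."""
    uid, time, lat, lon, loc_id = line.rstrip("\n").split("\t")
    return Checkin(
        uid=int(uid),
        time=datetime.strptime(time, "%Y-%m-%dT%H:%M:%SZ"),
        lat=float(lat),
        lon=float(lon),
        loc_id=int(loc_id),
    )


def parse_checkins(lines):
    """Yield Checkin records, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_checkin(line)


if __name__ == "__main__":
    # Made-up sample rows, not taken from the real data set.
    sample = [
        "1\t2010-10-18T09:00:00Z\t10.5\t20.5\t100\n",
        "\n",
        "2\t2010-10-19T10:30:00Z\t11.5\t21.5\t101\n",
    ]
    for checkin in parse_checkins(sample):
        print(checkin.uid, checkin.loc_id, checkin.time.isoformat())
```

Because `parse_checkins` is a generator, a 400MB file can be streamed line by line rather than loaded into memory at once.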
Application data - User Graph
 Simple graph structure -
 unidirectional graph with
 UIDs as nodes
Application Data - Checkin info
How this data can be used
● Application interested in:
   ○   my checkins
   ○   list my friends
   ○   checkins at given location
   ○   my friends' checkins
● Analytics:
   ○ top ten most active users - most checkins
   ○ aggregate checkins per week
   ○ aggregate checkins per week per city
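
The first two analytics questions can be sketched in a few lines of plain Python. This is not how the demo computes them (that is done with Hadoop and Pig); it only shows what the results look like. The tuple shape `(uid, loc_id, day)`, the function names and the sample data are assumptions made up for this example.

```python
from collections import Counter
from datetime import date


def top_users(checkins, n=10):
    """Return the n users with the most check-ins as (uid, count) pairs."""
    return Counter(uid for uid, _loc, _day in checkins).most_common(n)


def checkins_per_week(checkins):
    """Count check-ins per ISO (year, week)."""
    weeks = Counter()
    for _uid, _loc, day in checkins:
        iso_year, iso_week, _weekday = day.isocalendar()
        weeks[(iso_year, iso_week)] += 1
    return dict(weeks)


# Made-up sample: (uid, loc_id, day).
SAMPLE = [
    (1, 10, date(2010, 10, 18)),
    (1, 11, date(2010, 10, 19)),
    (2, 10, date(2010, 10, 19)),
    (1, 12, date(2010, 10, 25)),
]

if __name__ == "__main__":
    print(top_users(SAMPLE, 2))
    print(checkins_per_week(SAMPLE))
```

Grouping by ISO week keeps weeks that straddle a year boundary together, at the cost of the ISO year occasionally differing from the calendar year.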
Cassandra data models
● The following data models were used:
  ○ User
  ○ Location
  ○ Checkin
  ○ FriendRels
    ■ graph of friend relationships
  ○ UserCheckins
    ■ checkins by user
  ○ LocationCheckins
    ■ checkins by location
  ○ FriendCheckins
    ■ checkins by friends
Cassandra data models
● Use of valueless columns
  ○ FriendRels, UserCheckins, LocationCheckins,
    FriendCheckins are just sets of valueless columns
● FriendRels:
  ○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...}
    ■ row_key is a uid
● UserCheckins:
  ○ row_key: {checkinid1: '', checkinid2: '', ...}
    ■ row_key is uid
● LocationCheckins use LocID as row key
● FriendCheckins use my UID to get my
  friends' checkins
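
The valueless-column idea can be illustrated with plain Python dictionaries standing in for the column families. This is not the pycassa API, just an in-memory sketch: each row is a dict whose keys are the column names and whose values are always the empty string. The function names are hypothetical, and populating FriendCheckins by fanning out at write time is an assumption; the slides do not say how that family is filled.

```python
from collections import defaultdict

# Each "column family" maps row_key -> {column_name: ''}.
# The column names carry the data; the values are always empty strings.
friend_rels = defaultdict(dict)        # uid    -> {friend uid: ''}
user_checkins = defaultdict(dict)      # uid    -> {checkin id: ''}
location_checkins = defaultdict(dict)  # loc id -> {checkin id: ''}
friend_checkins = defaultdict(dict)    # uid    -> {friends' checkin id: ''}


def add_friend(uid, friend_uid):
    """Record friend_uid as a friend of uid (one direction only)."""
    friend_rels[uid][friend_uid] = ""


def add_checkin(checkin_id, uid, loc_id):
    """Index one check-in by user, by location, and for each follower."""
    user_checkins[uid][checkin_id] = ""
    location_checkins[loc_id][checkin_id] = ""
    # Assumed fan-out on write: everyone who lists uid as a friend
    # gets the check-in id added to their FriendCheckins row.
    # The full scan is fine for a sketch, not for a real data set.
    for other, friends in friend_rels.items():
        if uid in friends:
            friend_checkins[other][checkin_id] = ""


if __name__ == "__main__":
    add_friend("alice", "bob")
    add_checkin("c1", "bob", "loc9")
    print(sorted(user_checkins["bob"]))       # bob's own check-ins
    print(sorted(location_checkins["loc9"]))  # check-ins at loc9
    print(sorted(friend_checkins["alice"]))   # check-ins by alice's friends
```

With this shape, each of the application queries ("my checkins", "list my friends", "checkins at given location", "my friends' checkins") is a single row read by key.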
Let's import the data into
       Cassandra...
You deserve a coffee...
Using Hadoop and Pig
 ...and we can do some analytics...
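
To show the shape of such an analytics job, here is a plain-Python, in-process sketch of "aggregate checkins per week per city" as a map, shuffle and reduce. It is not Hadoop or Pig code. The `CITY_BY_LOC` lookup, the function names, the tuple shape `(uid, loc_id, day)` and the sample rows are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical lookup from location id to city.
CITY_BY_LOC = {10: "CityA", 11: "CityA", 20: "CityB"}


def map_checkin(checkin):
    """Map phase: emit ((city, iso_year, iso_week), 1) for one check-in."""
    _uid, loc_id, day = checkin
    iso_year, iso_week, _weekday = day.isocalendar()
    yield (CITY_BY_LOC[loc_id], iso_year, iso_week), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the ones emitted for a key."""
    return key, sum(values)


def run_job(checkins):
    """Run map, shuffle and reduce in-process and return {key: count}."""
    pairs = (pair for c in checkins for pair in map_checkin(c))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Made-up sample: (uid, loc_id, day).
JOB_SAMPLE = [
    (1, 10, date(2010, 10, 18)),
    (2, 11, date(2010, 10, 20)),
    (1, 20, date(2010, 10, 19)),
    (3, 10, date(2010, 10, 26)),
]

if __name__ == "__main__":
    for key, count in sorted(run_job(JOB_SAMPLE).items()):
        print(key, count)
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and only the grouped counts move across the network.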
