PYCON INDIA 2012
28-30th September 2012
Dharmaram Vidya Kshetram, Bangalore, Karnataka

Pycassa – Python Cassandrified
Ramesh Rajini
Infosys Limited, Education & Research, Bangalore
Session Plan
•   Need & Introduction to NoSQL DB
•   Cassandra Introduction
•   Data model creation
•   Pycassa in action
Heard of NoSQL?
•   Stands for Not Only SQL
•   Class of non-relational data storage systems
•   No fixed table schema
•   No joins!
•   Relaxes one or more of the ACID properties, typically following the
    BASE model within the trade-offs of the CAP theorem
Do we “REALLY” need them ?



                    •   RDBMS …So strong
                    •   so crisp
                    •   so vast
                    •   And WE know it well!
Trends shrends!


 – Gartner's 10 key IT trends for 2012
    • unstructured data will grow some 80% over the
      course of the next five years




What made some apps go No-SQLized?
•   Explosion of social media sites with large data needs
•   Open-source community
•   Upsurge of cloud-based solutions
•   Migration to dynamically-typed languages
RDBMS..hmmm
• Normalization => Joins => Slow Queries /Complications
• Consistency => locks /transactions => Performance issues in
  distributed environments
• Scalability becomes a mess as our apps grow in size and
  demand
Current Approach to Scalability
•   Add hardware
•   Upgrade hardware
•   More machines
•   Turn off unwanted services
•   Caching
•   De-normalize…
RDBMS ..tends to



   Massive [terabytes]

   Elastic scalability

   Easily achieve Fault tolerance

   Tunable Consistency
But Why..


 • ACID
    – transactions slow down under heavy load
    – in a distributed/replicated environment, two-phase commit can
      leave either a node or the coordinator waiting indefinitely
But RDBMS is still holding up!!
•   Yes..it is
•   Will continue to Co-exist with NOSQL
•   What if data is no longer a problem for me?
•   What new problems would I like to have?
Seeds of NoSQL
• Three major papers
   – BigTable (Google)
   – Dynamo (Amazon)
      • Gossip protocol (discovery and error detection)
      • Distributed key-value data store
      • Eventual consistency
   – CAP Theorem
Brewer's CAP Theorem
• Properties of a system:
   – Consistency
   – Availability
   – Partition tolerance
Brewer’s CAP Theorem
• You can have it good, you can have it fast, you can have
  it cheap: pick two




BASE vs ACID - Eventual Consistency
• If no new updates are made for long enough, all updates eventually
  propagate through the system and all the nodes become consistent
• For any accepted update and any given node, eventually either the
  update reaches the node or the node is removed from service
• Known as BASE (Basically Available, Soft state,
  Eventual consistency)
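
The convergence described above can be sketched with a toy model. This is only an illustration, not Cassandra's implementation: the `Replica` class, the ring-style exchange and the "newest timestamp wins" rule are assumptions chosen to keep the example small.

```python
class Replica(object):
    """Hypothetical node that stores key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Keep only the value carrying the newest timestamp.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def sync_from(self, other):
        # Pull every entry the other replica knows about.
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


def exchange_round(replicas):
    """Each replica swaps state with its neighbour in a ring."""
    count = len(replicas)
    for i in range(count):
        a, b = replicas[i], replicas[(i + 1) % count]
        a.sync_from(b)
        b.sync_from(a)


def is_consistent(replicas):
    return all(r.data == replicas[0].data for r in replicas)


replicas = [Replica("node%d" % i) for i in range(3)]
# Two conflicting updates are accepted by different nodes.
replicas[0].write("010052", "I'm logged in!", timestamp=1)
replicas[2].write("010052", "Logged out", timestamp=2)

consistent_before = is_consistent(replicas)
exchange_round(replicas)
consistent_after = is_consistent(replicas)
```

Right after the writes the replicas disagree; once updates stop and the nodes have exchanged state, every replica holds the newer value.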
What kinds of NoSQL
• 2 Major areas:
   – Key/Value or 'the big hash table'.
      • Dynamo
      • Voldemort
      • Scalaris
   – Schema-less
      • column-based, document-based or graph-based.
          –   Cassandra (column-based)
          –   CouchDB (document-based)
          –   Neo4J (graph-based)
          –   HBase (column-based)
Any users?
Cassandra to the Rescue!
•   Open source
•   Distributed, decentralized
•   Elastically scalable
•   Highly available / fault-tolerant
•   Tunably consistent
•   Column-oriented database
•   Automatic sharding
•   Gossip architecture




Distributed and Decentralized
• Distributed: can be running on multiple machines
   – appearing to users as a single instance
• Decentralized: there is no single point of failure
   – all the nodes in the cluster function exactly the same
     [server symmetry]




Elastic Scalability


• Vertical scaling :
   – more hardware capacity /memory


• Horizontal scaling :
      • More machines that have all or some
        of the data
      • So that no machine is bearing the
        complete load



Elastic Scalability, No Single Point of Failure
• Elastic scalability:
   – The cluster can scale up & down
• Master/slave issue




Scale UP & Scale down

• Add nodes and they can start serving clients!
   – NO server restart / NO query change / NO
     balancing
   – JUST add another machine.
• Just unplug the system.
   – Since Cassandra keeps copies of the same data on more than
     one node [configurable], there won't be any loss of data.
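
One way to picture why adding or removing a machine disturbs so little data is a hash ring. The sketch below is a toy under stated assumptions: the `Ring` class, the node names and the use of MD5 for ring positions are illustrative, and replication is left out entirely. It is not Cassandra's own partitioner code.

```python
import bisect
import hashlib


def token(value):
    """Map a string to an integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring(object):
    """Hypothetical token ring: a key belongs to the next node clockwise."""

    def __init__(self, nodes):
        self._tokens = []   # sorted ring positions
        self._owners = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        position = token(node)
        bisect.insort(self._tokens, position)
        self._owners[position] = node

    def remove_node(self, node):
        position = token(node)
        self._tokens.remove(position)
        del self._owners[position]

    def node_for(self, key):
        # First node at or after the key's position, wrapping around.
        index = bisect.bisect_left(self._tokens, token(key))
        return self._owners[self._tokens[index % len(self._tokens)]]


keys = [str(i).zfill(6) for i in range(10051, 10151)]
ring = Ring(["node-a", "node-b", "node-c"])
before = dict((k, ring.node_for(k)) for k in keys)

ring.add_node("node-d")          # scale up: one more machine
after = dict((k, ring.node_for(k)) for k in keys)
moved = [k for k in keys if before[k] != after[k]]

ring.remove_node("node-d")       # scale down again
restored = dict((k, ring.node_for(k)) for k in keys)
```

Only the keys that land on the new node change owner; every other key stays where it was, and removing the node restores the original placement.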
High Availability and Fault Tolerance
• High availability + central server based system = problem
   – Internal hardware redundancy
   – Sounds cool but Extremely Costly




High Availability and Fault Tolerance
  – Cassandra allows you to:
     • replace failed nodes with no downtime
     • replicate data to multiple data centers to prevent
       downtime [automatic]
Tunable Consistency
• Consistency: all reads return the most recently written
  value
   – Cassandra follows an "eventually consistent" model by
     default.
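
Cassandra lets a client choose a consistency level per operation (for example ONE, QUORUM or ALL). A common rule of thumb is that a read is guaranteed to see the latest write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (N). The helper names below are hypothetical; the brute-force check is there only to confirm the arithmetic on a small cluster.

```python
from itertools import combinations


def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def overlaps(replication_factor, write_replicas, read_replicas):
    """True when R + W > N, so read and write sets must share a node."""
    return read_replicas + write_replicas > replication_factor


def always_intersect(replication_factor, write_replicas, read_replicas):
    """Brute force: does every write set share a node with every read set?"""
    nodes = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, write_replicas)
        for read in combinations(nodes, read_replicas)
    )


n = 3
quorum_is_safe = overlaps(n, quorum(n), quorum(n))   # W=2, R=2
one_one_is_safe = overlaps(n, 1, 1)                  # W=1, R=1
```

With three replicas, quorum reads plus quorum writes always overlap, while reading and writing a single replica each trades that guarantee for lower latency.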




But then!


 • Amazon, Facebook, Google and Twitter use this
   model.
    – DATA is their main sales item
    – High performance!
Setting up Apache Cassandra
• From the DataStax community project
   – www.datastax.com/download
• From the Apache Cassandra project:
   – http://cassandra.apache.org/


                  Believe it.. It's easy to
                     install & set up!
Keyspace & Column Family creation



 Column family 1
Key1            ColumnName1           ColumnName2
                Value                 Value
Key2            ColumnName1           ColumnName2
                Value                 Value
Key3            ColumnName1           ColumnName2     ColumnName3
                Value                 Value           Value

 Column family 2
 Key1   ColumnName1           ColumnName2     ColumnName3
        Value                 Value           Value
Data makes sense..



 Column family Close Friends
 010051         Mail id                tweets
                Ramesh_Rajini          Hello
 010052         Mail id                tweets
                Vinz_Raj               I'm logged in!
 010053         Mail id                tweet1              tweet2
                Ragh_Rao               Hey, how r u ?      Movie..

  Column family Colleagues
 020061   Mail id               City               Likes
          Puru_lal              Bangalore          Ladoos!
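
The slides do not show how the keyspace and column families are created. A minimal sketch using pycassa's `SystemManager` might look like the following. Treat it as an assumption-laden outline: the server address, the replication factor and the lower-case column family names (`closefriends`, `colleagues`) are placeholders, and the calls should be checked against the pycassa API reference for the version in use.

```python
def create_schema(server="localhost:9160"):
    """Create the example keyspace and column families (needs a live node)."""
    # Imported inside the function so this file can be read and loaded
    # on a machine where pycassa is not installed.
    from pycassa.system_manager import (
        SystemManager, SIMPLE_STRATEGY, UTF8_TYPE)

    manager = SystemManager(server)
    # Assumed single-node setup: one copy of each row.
    manager.create_keyspace(
        "FriendsInfo", SIMPLE_STRATEGY, {"replication_factor": "1"})
    for name in ("closefriends", "colleagues"):
        manager.create_column_family(
            "FriendsInfo", name,
            comparator_type=UTF8_TYPE,            # column names as text
            default_validation_class=UTF8_TYPE,   # column values as text
            key_validation_class=UTF8_TYPE)       # row keys as text
    manager.close()
```

Calling `create_schema()` once against a running Cassandra node would be enough; rows with different column sets, as in the tables above, need no further schema changes.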
Cassandra Data Structure

 key space              Ex: Colony Name
   column family        Ex: UserIDs, EmpIDs
     column             Ex: Address, Tweets, Likes, Skill Set
       name   value   timestamp
Key-in the Key space..




Pycassa in action!
Multi-level Dictionary

  {"FriendsInfo":                                  Keyspace
      {"closefriends":                             Column Family
          {"010053": OrderedDict(                  Key
               [("MailId", "Ragh_Rao"),            Columns:
                ("tweet1", "Hey, how r u ?"),        (ColumnKey, ColumnValue)
                ("tweet2", "Movie..")]),
           ..
   }}}
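
The `col_fam` object used on the following slides is never defined in the deck. A plausible setup, assuming a Cassandra node listening on `localhost:9160` and the keyspace and column family names from the dictionary above, is sketched here; the helper name `open_column_family` is hypothetical.

```python
def open_column_family(keyspace="FriendsInfo", name="closefriends",
                       servers=("localhost:9160",)):
    """Return a pycassa ColumnFamily handle (needs a live node)."""
    # Imported inside the function so this file can be read and loaded
    # on a machine where pycassa is not installed.
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    # The pool manages connections to the listed servers.
    pool = ConnectionPool(keyspace, server_list=list(servers))
    return ColumnFamily(pool, name)
```

With this in place, `col_fam = open_column_family()` gives the handle on which `insert`, `get` and `multiget` are called.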
Can I insert in bulk?
• Yes, pass a dict that maps each row key to a dict of columns..
 col_fam.batch_insert({
     '010054': {'Name': 'Vinayak', 'Id': '9308'},
     '010057': {'Name': 'Poorvi'}
 })
__________________________________
 for i in range(1000, 1010):
     col_fam.insert('EmpIDs', {str(i): 'Hello'})

Is the data stored?
• With key, get all details:
 col_fam.get('010052')
        OrderedDict([('MailId', 'Vinz_Raj'), ('tweets', "I'm logged in!")])

• With key, get specific details:
 col_fam.get('010053', columns=['MailId', 'tweet2'])
         OrderedDict([('MailId', 'Ragh_Rao'), ('tweet2', 'Movie..')])
• Specifying start & end columns:
  col_fam.get('EmpIDs', column_start='1002', column_finish='1006')
           OrderedDict([('1002', 'Hello'), ('1003', 'Hello'), ('1004', 'Hello'),
           ('1005', 'Hello'), ('1006', 'Hello')])


Can the columns be sliced?
• Reading columns in reverse order, limited by count
    col_fam.get('EmpIDs', column_reversed=True, column_count=3)
    OrderedDict([('1009', 'Hello'), ('1008', 'Hello'), ('1007', 'Hello')])
• Fetching multiple rows
    col_fam.multiget(['010053', '010051'])
    OrderedDict(
    [('010053',
    OrderedDict([('MailId', 'Ragh_Rao'), ('tweet1', 'Hey, how r u?'),
                      ('tweet2', 'Movie..')])),
    ('010051',
    OrderedDict([('MailId', 'Ramesh_Rajini'), ('tweets', 'Hello')]))])
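
The slicing behaviour shown on these slides can be tried without a cluster using a small in-memory stand-in. This is not pycassa code: `slice_columns` is a hypothetical helper that borrows the keyword names from the slides and assumes columns are kept sorted by name as plain strings.

```python
from collections import OrderedDict


def slice_columns(row, columns=None, column_start="", column_finish="",
                  column_reversed=False, column_count=100):
    """Select columns from a plain dict `row`, in sorted column order."""
    names = sorted(row)
    if columns is not None:
        # Explicit column names: keep only the ones that exist.
        chosen = [n for n in names if n in columns]
    else:
        if column_reversed:
            # Walking backwards, start is the upper bound.
            names.reverse()
            low, high = column_finish, column_start
        else:
            low, high = column_start, column_finish
        chosen = [n for n in names
                  if (not low or n >= low) and (not high or n <= high)]
        chosen = chosen[:column_count]
    return OrderedDict((n, row[n]) for n in chosen)


# Same shape as the 'EmpIDs' row built by the insert loop.
emp_ids = dict((str(i), "Hello") for i in range(1000, 1010))

forward = slice_columns(emp_ids, column_start="1002", column_finish="1006")
newest = slice_columns(emp_ids, column_reversed=True, column_count=3)
```

Both ends of a start/finish range are inclusive in this stand-in, matching the output on the slide, and the reversed read returns the last three column names first.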

Counting..
• get_count()
   – Counts the number of columns in the row with the given key.
• multiget_count()
   – Performs a column count in parallel on a set of rows.
   – Takes parameters similar to multiget(); a list of keys
     may be used.
   – Returns a dictionary of the form {key: int}.




What Next?
• Explore more of the Pycassa modules..
   – http://pycassa.github.com/pycassa/api/index.html
• Start using it.. I'm sure you'll enjoy it because it is simply
  superb!




Recap
•   Need & Introduction to NoSQL DB
•   Cassandra Introduction
•   Data model creation
•   Pycassa in action




Time for R&R?
                - Requests & Responses
Thank you!




      - R&R
                                  Ramesh Rajini


Disclaimer : All logos and images belong to the creator and companies which own them
