analyze(NoSQL,BigData);
/* history, hype, opportunities */




              // By: Vishy Poosala
          // Head of Bell Labs, India
       // poosala@alcatel-lucent.com
                   // @vishyp
                                        1
The dark ages of COBOL




                         2
..then Codd said
let there be tables

              Rows & Columns

              SQL

              Normal Forms

              ACID


                                 3
www.data-for-humans.com


              WHAT COLUMNS?

              SET-VALUED ATTRIBUTES

              XML

              Schema Evolution
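
The two pain points on this slide can be illustrated with a small sketch. It is not from the talk: the records, the field names (`phones`, `twitter`) and the helper `phones_of` are hypothetical, chosen only to contrast a normalized layout with a document-style record.

```python
# Relational style: a set-valued attribute (several phone numbers per person)
# cannot live in one column, so normalization moves it into a second table.
people = [(1, "Ada")]                          # (person_id, name)
phones = [(1, "555-0100"), (1, "555-0101")]    # (person_id, phone)

def phones_of(person_id):
    """Emulate the join needed to collect one person's phone numbers."""
    return [phone for pid, phone in phones if pid == person_id]

# Document style: the set is stored inline with the record, and a new field
# can be added to one record without changing a shared schema.
doc = {"id": 1, "name": "Ada", "phones": ["555-0100", "555-0101"]}
doc["twitter"] = "@ada"   # hypothetical new attribute, no schema migration

print(phones_of(1) == doc["phones"])
```

In the relational layout, adding the `twitter` field would mean altering the table definition for every row, which is the "schema evolution" problem the slide names.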




                                  4
Billions of Keys & Values

                        GFS



                       Google
                      Big Table



                       Hadoop



                      Cassandra
                       Dynamo


                                  5
How would you build a super-fast,
 FB-scale chat service, in 2012?

          (for example)



                                    6
I want my own DB!

Main Memory      • Memcached
                 • redis

Distr. K-V       • MongoDB

Versions         • CouchDB

Social Graphs    • Neo4j


                                    7
BIG          KB       GB          TB                 PB

Data         FILES    TABLES      Semi-Structured    Variety, Dynamic

Analytics    STATS    OLAP Cube   Apps               Mahout

Language     COBOL    SQL         XML                NoSQL

             60’s     80-96       96-’07             ’07-

                                                 8
The following *AMAZING* slides are courtesy of Gregory Piatetsky-Shapiro, kdnuggets.com

You can find all the slides from his talk at:

http://www.slideshare.net/gpiatetskyshapiro/analytics-and-data-mining-industry-overview

                                                                                          9
Data Tsunami
• In 2010 enterprises stored 7 exabytes (= 7,000,000,000 GB) of new data (McKinsey)
• 90 percent of the world's data has been generated in the past two years (IBM)

Image with apologies to KDD-2011
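
The unit conversion in the first bullet can be checked with a short sketch. It assumes decimal (SI) units, where 1 GB is 10^9 bytes and 1 EB is 10^18 bytes; the function name `exabytes_to_gigabytes` is illustrative only.

```python
# Decimal (SI) storage units, an assumption; binary units would differ slightly.
GB = 10**9    # bytes in one gigabyte
EB = 10**18   # bytes in one exabyte

def exabytes_to_gigabytes(exabytes):
    """Convert a whole number of exabytes to gigabytes."""
    return exabytes * EB // GB

print(f"{exabytes_to_gigabytes(7):,} GB")   # prints "7,000,000,000 GB"
```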
                                                             10
Pre-history




Statistics is the biggest term in the 20th century, but
data mining and analytics appear only in the late 1990s
From Google Ngram Viewer – English language books
Note: Our analysis uses only English language data.
Other languages, especially Chinese, need to be considered for the full picture
                                                                               11
Recent History:
Analytics, Data Mining, Knowledge Discovery




Analytics has been used since 1800, but started to rise in 2005
Data Mining jumps around 1996 (soon after the first KDD conference) but declines after
2003 (TIA controversy, associated with government invasion of privacy).
Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000
                                                                           12
Google Trends:
After 2006, Data Mining < Analytics




                                  13
Google Insights: searches for
data mining, analytics -google
are most popular in India, US




                                 14
Analytics > Data Mining > Data
            Science




                                 15
Data Science, Big Data




                         16
Data Types Analyzed/Mined




www.KDnuggets.com/polls/2011/data-types-analyzed-mined.html   17
Largest Dataset Analyzed?
                                               2011 median dataset
                                               size ~10-20 GB,
                                               vs 8-10 GB in 2010.

                                               Increase in
                                               10 GB to 1 PB range




www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html
                                                                 18
Which methods/algorithms did you
  use for data analysis in 2011
(bar chart; x-axis: % of analysts who used each method, 0% to 70%)

                 Decision Trees
                     Regression
                     Clustering
                       Statistics
                   Visualization
  Time series/Sequence analysis
           Support Vector (SVM)
               Association rules
             Ensemble methods
                    Text Mining
                    Neural Nets
                       Boosting
                      Bayesian
                       Bagging
                Factor Analysis
    Anomaly/Deviation detection
        Social Network Analysis
               Survival Analysis
             Genetic algorithms
                 Uplift modeling



 www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
                                                                  19
Cloud Analytics is not common
             (yet)




www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html
                                                                 20
Shortage of Skills
• McKinsey: shortage by 2018 in the US of
  – 140,000-190,000 people with deep analytical skills

  – 1.5 M managers/analysts with the know-how
    to use the analysis of big data to make
    effective decisions.

  Source: www.mckinsey.com/mgi/publications/big_data/
                                             21
Job data: Data Scientist




                           22
Jobs: Data Mining >> Data
        Scientist




                            23
“Ground” Analytics (LinkedIn
          Skills)
                 ~ 75,000 with Data Mining skill

                  ~ 7,000 with Predictive Modeling



                  Also
                  ~ 20,000 with Predictive Analytics
                  (not related to Predictive Modeling??)




                                             24
Analytics LinkedIn Skills




  Predictive Analytics Machine Learning


 Text
 Mining                                   MapReduce



                                                      25
Big Data Bubble?

Big Data




            Gartner Hype Cycle

                                 26
NoSQL & Big Data Analytics: History, Hype, Opportunities
