Why Data is Drowning the (IT) World?



                                Sanjeev Kumar
                     VP & MD, Informatica India

                        Infovision 2012 Summit
                                  October 2012


Agenda

• Why the Data Deluge?
• Trends Affecting Data Growth
• New Use-cases Enabled by Big Data
• Trends Underlying Big Data
• Building-blocks for Managing Big Data
• Q&A




Data is the New Plastic




Where Are We? Computing Circa 2012!

•   Six decades into the Computer Revolution

•   Four decades since the invention of the microprocessor

•   Two decades into the rise of the modern Internet

•   Two billion people using broadband Internet


      Major businesses and industries running on
       software and delivered as online services*
                       *“Why Software is Eating the World,” Marc Andreessen, WSJ, Aug 2011




Trends: Exploding Data Volumes, “Big Data”




       Complex, Unstructured




     Relational

     Kilo – Mega – Giga – Tera – Peta – Exa – Zetta – Yotta

 • 2,500 Exabytes of new information in 2012 with Internet as primary driver
 • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2
   “Zettabytes” this year
   Source: IDC White Paper, sponsored by EMC, “As the Economy Contracts, the Digital Universe Expands,” May 2009.
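The figures on this slide mix petabytes, exabytes and zettabytes. As a rough sketch, assuming decimal (SI) prefixes where each step is a factor of 1,000, they can be put on one scale. The helper name `convert` is illustrative and not from the talk.

```python
# Decimal (SI) prefixes, assumed here: each step up is a factor of 1,000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
FACTOR = {name: 1000 ** (i + 1) for i, name in enumerate(PREFIXES)}


def convert(value, from_prefix, to_prefix):
    """Convert a byte count from one SI prefix to another."""
    return value * FACTOR[from_prefix] / FACTOR[to_prefix]


# Figures quoted on the slide.
last_year_zb = convert(800_000, "peta", "zetta")   # 800K petabytes
this_year_zb = 1.2                                  # projected zettabytes
new_info_2012_zb = convert(2_500, "exa", "zetta")  # 2,500 exabytes

# Growth implied by going from last year's total to this year's projection.
implied_growth = this_year_zb / last_year_zb - 1

print(f"800K PB  = {last_year_zb:.1f} ZB")
print(f"2,500 EB = {new_info_2012_zb:.1f} ZB")
print(f"Implied growth to 1.2 ZB: {implied_growth:.0%}")
```

Putting everything in zettabytes makes the quoted numbers directly comparable.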
Big Data Buzz!
•   16 Big Data “V”s; Original 3: Volume, Variety & Velocity
•   120+ Twitter accounts relating to Big Data
•   9,000 job search results for “data scientists”
•   70,000 Wikipedia “big data” hits per month
•   2,000,000 PDFs from search on “big data white paper”
•   112,000,000 Blog posts discussing big data
•   1,350,000,000 Google results for “What is big data?”


                                                    Source: IBM, 2012




Why Now? Exploding Data Volumes
• Proliferation of web connected devices
• Increased consumption of digital content
• Explosion in user generated content
• Internet of things
Trends: Changing Data Economics

Return on Byte = value to be extracted from that
byte / cost of storing that byte.




                                     High ROB


                                     Low ROB



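The Return on Byte ratio can be sketched in a few lines. The dataset names, the monetary figures and the 1.0 cut-off between "High ROB" and "Low ROB" are assumptions made for illustration; the slide defines only the ratio itself.

```python
def return_on_byte(value_extracted, storage_cost):
    """Return on Byte (ROB) = value extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


# Hypothetical datasets: (value extracted, storage cost) in the same currency unit.
datasets = {
    "order_transactions": (50_000.0, 1_000.0),
    "raw_web_logs": (2_000.0, 4_000.0),
}

rob = {name: return_on_byte(value, cost) for name, (value, cost) in datasets.items()}

# Rank datasets from highest to lowest ROB; the 1.0 threshold is an assumption.
for name, score in sorted(rob.items(), key=lambda kv: kv[1], reverse=True):
    label = "High ROB" if score >= 1.0 else "Low ROB"
    print(f"{name}: ROB = {score:.2f} ({label})")
```

Because ROB is a ratio, cheaper storage raises it even when the value extracted stays the same, which is one way to read the "changing data economics" in the slide title.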
Trends: Data Seen as a Strategic Asset
•   Companies leveraging data assets to
     •   Create new and differentiated products
           •   Product recommendation engines
     •   Increase revenues
           •   Optimize ad placement to improve click-thru
     •   Improve customer satisfaction / retention
           •   Analyze CDRs for dropped calls
The sexy job in the next ten years will be statisticians. The ability to take data—
to be able to understand it, to process it, to extract value from it, to visualize it, to
communicate it—that’s going to be a hugely important skill. Hal Varian : Chief
Economist, Google.



Big Data in the Enterprise




Why Now? Big Data Use-cases – User Behavior

• Location & Proximity Tracking
   • GPS in operational apps, security analysis, navigation & social media
   • New business opportunities for sales and services in proximity

• Ad Tracking
   • Dynamic changes in ad placement, color, size and wording
   • Improved click-through behavior

• Social CRM
   • Text analytics on huge array of unstructured social media
   • KPIs: share of voice, audience engagement, conversation reach, …

• Causal Factor Discovery in Retail
   • Deviations based on competition, weather, promos, holidays, events
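
One of the Social CRM KPIs listed above, share of voice, is commonly computed as a brand's mentions divided by all brand mentions in the sample. The sketch below assumes that definition and uses naive keyword matching; the brand names and posts are invented for illustration, and real text analytics would need far more robust entity detection.

```python
from collections import Counter

# Hypothetical brands and posts, for illustration only.
BRANDS = ["acme", "globex", "initech"]
posts = [
    "Loving my new Acme phone",
    "Globex support was slow today",
    "Acme vs Globex: which plan is cheaper?",
    "Initech just launched a new app",
    "Nothing to do with any brand",
]


def count_mentions(posts, brands):
    """Count how many posts mention each brand (case-insensitive substring match)."""
    counts = Counter({brand: 0 for brand in brands})
    for post in posts:
        text = post.lower()
        for brand in brands:
            if brand in text:
                counts[brand] += 1
    return counts


def share_of_voice(counts):
    """Each brand's mentions as a fraction of all brand mentions."""
    total = sum(counts.values())
    return {brand: (n / total if total else 0.0) for brand, n in counts.items()}


mentions = count_mentions(posts, BRANDS)
sov = share_of_voice(mentions)
for brand in BRANDS:
    print(f"{brand}: {mentions[brand]} mentions, share of voice {sov[brand]:.0%}")
```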


Why Now? “Hadoop-able” Use-cases – Sensors
• Building Sensors
   • Temperature, humidity, vibration and noise
   • Energy usage, security violations, failures in a/c, heat, plumbing

• In-flight Aircraft Sensors
   • Variables on engines, hydraulics, fuel & electrical systems
   • Real-time adaptive control, fuel usage, part failure prediction

• Smart Utility Meters – Electric Grid
   • One read-out per second per meter across entire customer base
   • Dynamic load balancing on grid, failure response, adaptive pricing

• Mobile Cell Tower Networks
   • Analyze call detail records (CDRs) to optimize cell tower placement
   • Improved user experience and network monetization
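
To get a feel for the smart-meter case (one read-out per second per meter), the arithmetic can be sketched as follows. The customer base of one million meters and the 100 bytes per reading are assumptions chosen for illustration, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_readings(num_meters, reads_per_second=1):
    """Number of readings produced per day across all meters."""
    return num_meters * reads_per_second * SECONDS_PER_DAY


def daily_volume_gb(num_meters, bytes_per_reading, reads_per_second=1):
    """Raw data volume per day in decimal gigabytes (1 GB = 10**9 bytes)."""
    total_bytes = daily_readings(num_meters, reads_per_second) * bytes_per_reading
    return total_bytes / 1e9


meters = 1_000_000   # assumed customer base
record_size = 100    # assumed bytes per reading

readings = daily_readings(meters)
volume = daily_volume_gb(meters, record_size)

print(f"Readings per day: {readings:,}")
print(f"Raw volume per day: {volume:,.0f} GB")
```

Even with these modest assumptions the raw feed reaches terabytes per day, which is why the talk files this use-case under "Hadoop-able".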

“Hadoop-able” Use-cases – Computing Deltas
• Commercial Seed Gene Sequencing
  • Analyzing the sequence, identifying genes and gene families
  • Baseline reference for the larger cotton crop genome

• Satellite Image Comparison
  • Overlay of images to create “hot spot” maps to show differences
  • Construction, destruction, changes due to disasters, encroachment

• CAT Scan Comparison
  • Images taken as “slices” of human body
  • Automatic diagnosis of medical issues and their prevalence

• Document Similarity Testing
  • Latent semantic analysis: “documents that agree with my doc”
  • Threat discovery, sentiment analysis and opinion polls
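
The document-similarity idea can be illustrated with a much simpler stand-in than latent semantic analysis: cosine similarity over raw term counts. This is not LSA (there is no dimensionality reduction step), only a minimal sketch of "find documents that agree with my doc". The sample documents are invented.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, for illustration only.
my_doc = "big data needs scalable processing"
corpus = {
    "doc_a": "scalable processing of big data on commodity hardware",
    "doc_b": "the weather was sunny over the weekend",
}

query = term_counts(my_doc)
scores = {name: cosine_similarity(query, term_counts(text)) for name, text in corpus.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: similarity {score:.2f}")
print(f"Closest to my doc: {best}")
```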

Big Data
Confluence of Big Transaction, Big Interaction and Big Data Processing



BIG TRANSACTION DATA
• Online Transaction Processing (OLTP)
• Online Analytical Processing (OLAP) & DW Appliances

BIG INTERACTION DATA
• Social Media Data
• Device Sensor Data
  • Call detail records, image, click stream data
  • Scientific, genomic
  • Machine/Device

BIG DATA PROCESSING
Big Transaction Data
OLTP and Analytic Databases


                   BIG TRANSACTION DATA


                 Online      Online Analytical
               Transaction     Processing
               Processing        (OLAP) &
                 (OLTP)       DW Appliances

               Oracle         Teradata
               DB2            Redbrick
               Britton-Lee    EssBase
               Ingres         Sybase IQ
               Informix       Netezza
               Sybase         Greenplum
               SQLServer      DataAllegro
                              Asterdata
                              Vertica
                              Paraccel
                              Hana




Big Transaction Data
Changing Economics of Computing From Buy To Rent




[Diagram labels: Mainframe; Custom Application (five instances); HR Application; CRM Application]
Big Interaction Data
Changing Role Of Computing From Transactions to Interactions




BIG INTERACTION DATA
• Social Media Data
• Device Sensor Data
  • Clickstream
  • Image/Text
  • Scientific
    • Genomic/Pharma
    • Medical
  • Machine/Device
    • Sensors/Meters/RFID Tags
    • CDR/Mobile
Big Interaction Data
From Operational Efficiency To Organizational Effectiveness


Relational Transactions (1970 - Current)
Business Management
• Business Analysis
• Operational Automation

Social Interactions (2008 - Current)
Brand Management
• Sentiment Analysis
• Proactive Customer Engagement
Big Interaction Data
How Do You Leverage Device Sensor Data?



                                      • Geo Encoding

                                      • Cell-phone Towers

                                      • Medical Sensors

                                      • RFID Tags

                                      • Edge Networks




Big Data Processing
Highly Scalable Processing Of All Data



BIG TRANSACTION DATA
• Online Transaction Processing (OLTP)
• Online Analytical Processing (OLAP) & DW Appliances

BIG INTERACTION DATA
• Social Media Data
• Device Sensor Data
  • Call detail records, image, click stream data
  • Scientific, genomic
  • Machine/Device

BIG DATA PROCESSING
Big Data Processing
What is Hadoop?




    SCRIPTING         SQL QUERY




                        PARALLEL




                      PERSISTENCE




Big Data Processing
What does Hadoop do?


• Cost effective scalability
   • Scale out on commodity hardware

• Support for processing all data types
   • Structured, Semi-structured and Unstructured data

• Extensibility
   • Open APIs to implement custom data processing logic

• Hadoop Challenges
   • Data movement into/out of Hadoop / HDFS
   • Requires specialized development skills
       •   Java, Hive, Pig, etc.
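
Hadoop's processing model is MapReduce: a map step emits key/value pairs, the framework groups values by key, and a reduce step aggregates each group. The sketch below simulates those three phases in a single Python process to show the shape of the "custom data processing logic" mentioned above. It does not use any Hadoop API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


# Input lines; on a real cluster these would be blocks of files stored in HDFS.
lines = ["big data big buzz", "data is the new plastic"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

for word, count in sorted(word_counts.items()):
    print(word, count)
```

On a cluster, the map and reduce functions run in parallel on many machines, which is where the scale-out on commodity hardware comes from.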




Ingest Data Into HDFS
• Support over 100 different data sources
• Integrated development environment with metadata and preview support
• Perform any pre-processing needed before ingestion
• Native HDFS Source and Target Support
Design and Execute Data Integration Logic on Hadoop
• Design integration logic for Hadoop in a graphical and metadata driven environment
• Configure where the integration logic should run – Hadoop or Native
Design and Execute Data Quality on Hadoop
Big Data Cleansing, Dedup, Unstructured Parsing
• Probabilistic or Deterministic Matching
• Address Validation and Geocoding enrichment across 260 countries
• Standardization and Reference Data Management
• Parsing of Unstructured Data/Text Fields for all types of data (customer/product/social/logs)
• DQ logic pushed down/run natively on Hadoop
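
The difference between deterministic and score-based matching can be sketched with the Python standard library. Deterministic matching compares standardized values for exact equality. The fuzzy variant below uses `difflib.SequenceMatcher` with a threshold as a simple stand-in for probabilistic matching; it is not a statistical record-linkage model, and it is not how Informatica implements matching. The 0.85 threshold and the sample names are assumptions.

```python
from difflib import SequenceMatcher


def standardize(name):
    """Standardize a name: lower-case, drop '.' and ',', collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def deterministic_match(a, b):
    """Match only when the standardized values are identical."""
    return standardize(a) == standardize(b)


def fuzzy_score(a, b):
    """Similarity in [0, 1] between the standardized values."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


def fuzzy_match(a, b, threshold=0.85):
    """Match when the similarity score reaches the (assumed) threshold."""
    return fuzzy_score(a, b) >= threshold


# Hypothetical record pairs, for illustration only.
pairs = [
    ("Acme Corp.", "ACME  corp"),
    ("Jon Smith", "John Smith"),
    ("Jon Smith", "Mary Jones"),
]
for left, right in pairs:
    print(
        f"{left!r} vs {right!r}: "
        f"deterministic={deterministic_match(left, right)}, "
        f"fuzzy={fuzzy_match(left, right)} (score {fuzzy_score(left, right):.2f})"
    )
```

Standardizing before comparing is what lets the exact rule catch the first pair, while the score-based rule is needed for the spelling variant in the second.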
Extract data from HDFS and Hive
• Extract from HDFS as a native source
• Extract from Hive as a native source
• Perform any post processing needed after extraction
• Persist and write Hadoop data into DW, HDFS or any target systems
Processing Big Data : What is missing?
• Support for graph/networked data
   • How does one visualize complex relationships?

• Data with dynamic schemas
   • Do the current patterns scale for very large number of columns?

• Are mappings the right paradigm?
• Ability to extract entities from unstructured data




References
• Why Software is Eating the World
   • Marc Andreessen, WSJ Aug 2011

• Evolving Role of EDW in Era of Big Data Analytics
   • Ralph Kimball, Kimball Group 2011

• Data Scientist: Sexiest Job of the 21st Century
   • Thomas H. Davenport & D.J. Patil, HBR Sept 2012

• Newly Emerging Best Practices for Big Data
   • Ralph Kimball, Kimball Group Oct 2012




Questions




Informatica & Data
      Verbs on Data – We do things to data!


    INFA = Data + [
      Archival | As a Service | Cleansing | Clustering | Consolidation |
      Conversion | De-duping | Exchange | Extraction | Federation |
      Hub | Identity | Integration | Life-cycle Management |
      Loading | Masking | Mastering | Matching | Migration | On Demand |
      Privacy | Profiling | Provisioning | Quality | Quality Assessment |
      Registry | Replication | Retirement | Services | Stewardship |
      Sub-setting | Synchronization | Test Management | Transformation |
      Validation | Virtualization | Warehousing |
]




                                                                            52
53

Why Data is Drowning the (IT) World?

  • 1.
    Why Data isDrowning the (IT) World? Sanjeev Kumar VP & MD, Informatica India Infovision 2012 Summit October 2012 1
  • 2.
    Agenda • Why theData Deluge? • Trends Affecting Data Growth • New Use-cases Enabled by Big Data 2
  • 3.
    Agenda • Why theData Deluge? • Trends Affecting Data Growth • New Use-cases Enabled by Big Data • Trends Underlying Big Data • Building-blocks for Managing Big Data • Q&A 3
  • 4.
    Data is theNew Plastic 4
  • 5.
    Where Are We?Computing Circa 2012! 5
  • 6.
    Where Are We?Computing Circa 2012! • Six decades into the Computer Revolution 6
  • 7.
    Where Are We?Computing Circa 2012! • Six decades into the Computer Revolution • Four decades since the invention of Microprocessor 7
  • 8.
    Where Are We?Computing Circa 2012! • Six decades into the Computer Revolution • Four decades since the invention of Microprocessor • Two decades into the rise of modern Internet 8
  • 9.
    Where Are We?Computing Circa 2012! • Six decades into the Computer Revolution • Four decades since the invention of Microprocessor • Two decades into the rise of modern Internet • Two billion people using the broadband Internet 9
  • 10.
    Where Are We?Computing Circa 2012! • Six decades into the Computer Revolution • Four decades since the invention of Microprocessor • Two decades into the rise of modern Internet • Two billion people using the broadband Internet Major businesses and industries running on software and delivered as online services* *”Why software is eating the world” Marc Andreessen, WSJ Aug 2011 10
  • 11.
    Trends: Exploding DataVolumes, “Big Data” Complex, Unstructured Relational Kilo – Mega – Giga – Terra – Peta – Exa – Zetta - Yotta • 2,500 Exabytes of new information in 2012 with Internet as primary driver • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “Zettabytes” this year Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009. . 11
  • 12.
    Big Data Buzz! • 16 Big Data “V”s; Original 3: Volume, Variety & Velocity 12
  • 13.
    Big Data Buzz! • 16 Big Data “V”s; Original 3: Volume, Variety & Velocity • 120+ Twitter accounts relating to Big Data 13
  • 14.
    Big Data Buzz! • 16 Big Data “V”s; Original 3: Volume, Variety & Velocity • 120+ Twitter accounts relating to Big Data • 9000 job search results for “data scientists” 14
  • 15.
    Big Data Buzz! • 16 Big Data “V”s; Original 3: Volume, Variety & Velocity • 120+ Twitter accounts relating to Big Data • 9000 job search results for “data scientists” • 70,000 Wikipedia “big data” hits per month 15
  • 16.
    Big Data Buzz! • 16 Big Data “V”s; Original 3: Volume, Variety & Velocity • 120+ Twitter accounts relating to Big Data • 9000 job search results for “data scientists” • 70,000 Wikipedia “big data” hits per month • 2,000,000 PDFs from search on “big data white paper” 16
  • 17.
    Big Data Buzz! • 16 Big Data “V”s; Original 3: Volume, Variety & Velocity • 120+ Twitter accounts relating to Big Data • 9000 job search results for “data scientists” • 70,000 Wikipedia “big data” hits per month • 2,000,000 PDFs from search on “big data white paper” • 112,000,000 Blog posts discussing big data 17
  • 18.
    Big Data Buzz! • 16 Big Data “V”s; Original 3: Volume, Variety & Velocity • 120+ Twitter accounts relating to Big Data • 9000 job search results for “data scientists” • 70,000 Wikipedia “big data” hits per month • 2,000,000 PDFs from search on “big data white paper” • 112,000,000 Blog posts discussing big data • 1,350,000,000 Google results for “What is big data?” Source IBM 2012 18
  • 19.
    Why Now? ExplodingData Volumes Proliferation of Increased consumption web connected devices of digital content Explosion in user generated content Internet of things 19
  • 20.
    Trends: Changing DataEconomics Return on Byte = value to be extracted from that byte / cost of storing that byte. High ROB Low ROB 20
  • 21.
    Trends : DataSeen as a Strategic Asset • Companies leveraging data assets to • Create new and differentiated products • Product recommendation engines • Increase revenues • Optimize ad placement to improve click-thru • Improve customer satisfaction / retention • Analyze CDRs for dropped calls The sexy job in the next ten years will be statisticians. The ability to take data— to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill. Hal Varian : Chief Economist, Google. 21
  • 22.
    Big Data inthe Enterprise 22
  • 23.
    Why Now? BigData Use-cases – User Behavior • Location & Proximity Tracking • GPS in operational apps, security analysis, navigation & social media • New business opportunities for sales and services in proximity 23
  • 24.
    Why Now? BigData Use-cases – User Behavior • Location & Proximity Tracking • GPS in operational apps, security analysis, navigation & social media • New business opportunities for sales and services in proximity • Ad Tracking • Dynamic changes in ad placement, color, size and wording • Improved click-through behavior 24
  • 25.
    Why Now? BigData Use-cases – User Behavior • Location & Proximity Tracking • GPS in operational apps, security analysis, navigation & social media • New business opportunities for sales and services in proximity • Ad Tracking • Dynamic changes in ad placement, color, size and wording • Improved click-through behavior • Social CRM • Text analytics on huge array of unstructured social media • KPI’s: share of voice, audience engagement, conversation reach, … 25
  • 26.
    Why Now? BigData Use-cases – User Behavior • Location & Proximity Tracking • GPS in operational apps, security analysis, navigation & social media • New business opportunities for sales and services in proximity • Ad Tracking • Dynamic changes in ad placement, color, size and wording • Improved click-through behavior • Social CRM • Text analytics on huge array of unstructured social media • KPI’s: share of voice, audience engagement, conversation reach, … • Causal Factor Discovery in Retail • Deviations based on competition, weather, promos, holidays, events 26
  • 27.
    Why Now? “Hadoop-able”Use-cases – Sensors • Building Sensors • Temperature, humidity, vibration and noise • Energy usage, security violations, failures in a/c, heat, plumbing 27
  • 28.
    Why Now? “Hadoop-able”Use-cases – Sensors • Building Sensors • Temperature, humidity, vibration and noise • Energy usage, security violations, failures in a/c, heat, plumbing • In-flight Aircraft Sensors • Variables on engines, hydraulics, fuel & electrical systems • Real-time adaptive control, fuel usage, part failure prediction 28
  • 29.
    Why Now? “Hadoop-able”Use-cases – Sensors • Building Sensors • Temperature, humidity, vibration and noise • Energy usage, security violations, failures in a/c, heat, plumbing • In-flight Aircraft Sensors • Variables on engines, hydraulics, fuel & electrical systems • Real-time adaptive control, fuel usage, part failure prediction • Smart Utility Meters – Electric Grid • One read-out per second per meter across entire customer base • Dynamic load balancing on grid, failure response, adaptive pricing 29
  • 30.
    Why Now? “Hadoop-able”Use-cases – Sensors • Building Sensors • Temperature, humidity, vibration and noise • Energy usage, security violations, failures in a/c, heat, plumbing • In-flight Aircraft Sensors • Variables on engines, hydraulics, fuel & electrical systems • Real-time adaptive control, fuel usage, part failure prediction • Smart Utility Meters – Electric Grid • One read-out per second per meter across entire customer base • Dynamic load balancing on grid, failure response, adaptive pricing • Mobile Cell Tower Networks • Analyze call-data-records(CDRs) to optimize cell tower placement • Improved user experience and network monetization 30
  • 31.
    “Hadoop-able” Use-cases –Computing Delta’s • Commercial Seed Gene Sequencing • Analyzing the sequence, identifying genes and gene families • Baseline reference for the larger cotton crop genome 31
  • 32.
    “Hadoop-able” Use-cases –Computing Delta’s • Commercial Seed Gene Sequencing • Analyzing the sequence, identifying genes and gene families • Baseline reference for the larger cotton crop genome • Satellite Image Comparison • Overlay of images to create “hot spot” maps to show differences • Construction, destruction, changes due to disasters, encroachment 32
  • 33.
    “Hadoop-able” Use-cases –Computing Delta’s • Commercial Seed Gene Sequencing • Analyzing the sequence, identifying genes and gene families • Baseline reference for the larger cotton crop genome • Satellite Image Comparison • Overlay of images to create “hot spot” maps to show differences • Construction, destruction, changes due to disasters, encroachment • CAT Scan Comparison • Images taken as “slices” of human body • Automatic diagnosis of medical issues and their prevalence 33
  • 34.
    “Hadoop-able” Use-cases –Computing Delta’s • Commercial Seed Gene Sequencing • Analyzing the sequence, identifying genes and gene families • Baseline reference for the larger cotton crop genome • Satellite Image Comparison • Overlay of images to create “hot spot” maps to show differences • Construction, destruction, changes due to disasters, encroachment • CAT Scan Comparison • Images taken as “slices” of human body • Automatic diagnosis of medical issues and their prevalence • Document Similarity Testing • Latent semantic analysis: “documents that agree with my doc” • Threat discovery, sentiment analysis and opinion polls 34
  • 35.
    Agenda
    • Why the Data Deluge?
    • Trends Affecting Data Growth
    • New Use-cases Enabled by Big Data
    • Trends Underlying Big Data
    • Building-blocks for Managing Big Data
    • Q&A
  • 36.
    Big Data
    Confluence of Big Transaction, Big Interaction and Big Data Processing
    • BIG TRANSACTION DATA
      • Online Transaction Processing (OLTP)
      • Online Analytical Processing (OLAP) & DW Appliances
    • BIG INTERACTION DATA
      • Social Media Data
      • Device Sensor Data
      • Call detail records, image, click stream data
      • Scientific, genomic
      • Machine/Device
    • BIG DATA PROCESSING
  • 37.
    Big Transaction Data
    OLTP and Analytic Databases
    BIG TRANSACTION DATA
    • Online Transaction Processing (OLTP): Oracle, DB2, Britton-Lee, Ingres, Informix, Sybase, SQLServer
    • Online Analytical Processing (OLAP) & DW Appliances: Teradata, Redbrick, EssBase, Sybase IQ, Netezza, Greenplum, DataAllegro, Asterdata, Vertica, Paraccel, Hana
  • 38.
    Big Transaction Data
    Changing Economics of Computing
    From Buy To Rent
    [Diagram: Mainframe; CRM Application; HR Application; Custom Applications]
  • 39.
    Big Interaction Data
    Changing Role Of Computing
    From Transactions to Interactions
    BIG INTERACTION DATA
    • Social Media Data: Social Media, Clickstream, Image/Text
    • Device Sensor Data
      • Scientific: Genomic/Pharma, Medical
      • Machine/Device: Sensors/Meters/RFID Tags, CDR/Mobile
  • 40.
    Big Interaction Data
    From Operational Efficiency To Organizational Effectiveness
    • Business Management (Relational Transactions, 1970 - Current)
      • Business Analysis
      • Operational Automation
    • Brand Management (Social Interactions, 2008 - Current)
      • Sentiment Analysis
      • Proactive Customer Engagement
  • 41.
    Big Interaction Data
    How Do You Leverage Device Sensor Data?
    • Geo Encoding
    • Cell-phone Towers
    • Medical Sensors
    • RFID Tags
    • Edge Networks
  • 42.
    Big Data Processing
    Highly Scalable Processing Of All Data
    • BIG TRANSACTION DATA
      • Online Transaction Processing (OLTP)
      • Online Analytical Processing (OLAP) & DW Appliances
    • BIG INTERACTION DATA
      • Social Media Data
      • Device Sensor Data
      • Call detail records, image, click stream data
      • Scientific, genomic
      • Machine/Device
    • BIG DATA PROCESSING
  • 43.
    Big Data Processing
    What is Hadoop?
    SCRIPTING SQL QUERY PARALLEL PERSISTENCE
  • 44.
    Big Data Processing
    What does Hadoop do?
    • Cost effective scalability
      • Scale out on commodity hardware
    • Support for processing all data types
      • Structured, Semi-structured and Unstructured data
    • Extensibility
      • Open APIs to implement custom data processing logic
    • Hadoop Challenges
      • Data movement into/out of Hadoop / HDFS
      • Requires specialized development skills
        • Java, Hive, Pig etc.
  • 45.
    Ingest Data Into HDFS
    • Support over 100 different data sources
    • Integrated development environment with metadata and preview support
    • Perform any pre processing needed before ingestion
    • Native HDFS Source and Target Support
  • 46.
    Design and Execute Data Integration Logic on Hadoop
    • Design integration logic for Hadoop in a graphical and metadata driven environment
    • Configure where the integration logic should run – Hadoop or Native
  • 47.
    Design and Execute Data Quality on Hadoop
    Big Data Cleansing, Dedup, Unstructured Parsing
    • Probabilistic or Deterministic Matching
    • Address Validation and Geocoding enrichment across 260 countries
    • Standardization and Reference Data Management
    • Parsing of Unstructured Data/Text Fields of all types of data (customer/product/social/logs)
    [Diagram: Address Validation; Matching; Standardize; Data Parsing]
    DQ logic pushed down/run natively ON Hadoop
  • 48.
    Extract data from HDFS and Hive
    • Extract from HDFS as a native source
    • Extract from Hive as a native source
    • Perform any post processing needed after extraction
    • Persist and write Hadoop data into DW, HDFS or any target systems
  • 49.
    Processing Big Data: What is missing?
    • Support for graph/networked data
      • How does one visualize complex relationships?
    • Data with dynamic schemas
      • Do the current patterns scale for very large number of columns?
      • Are mappings the right paradigm?
    • Ability to extract entities from unstructured data
  • 50.
    References
    • Why Software is Eating the World
      • Marc Andreessen, WSJ Aug 2011
    • Evolving Role of EDW in Era of Big Data Analytics
      • Ralph Kimball, Kimball Group 2011
    • Data Scientist: Sexiest Job of the 21st Century
      • Thomas H. Davenport & D.J. Patil, HBR Sept 2012
    • Newly Emerging Best Practices for Big Data
      • Ralph Kimball, Kimball Group Oct 2012
  • 51.
  • 52.
    Informatica & Data
    Verbs on Data – We do things to data!
    INFA = Data + [ Archival | As a Service | Cleansing | Clustering | Consolidation | Conversion | De-duping | Exchange | Extraction | Federation | Hub | Identity | Integration | Life-cycle Management | Loading | Masking | Mastering | Matching | Migration | On Demand | Privacy | Profiling | Provisioning | Quality | Quality Assessment | Registry | Replication | Retirement | Services | Stewardship | Sub-setting | Synchronization | Test Management | Transformation | Validation | Virtualization | Warehousing ]
  • 53.
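
The satellite image comparison use-case under “Computing Deltas” (overlaying images to produce a “hot spot” map of differences) can be illustrated with a small sketch. This is not taken from the deck: the grid representation, the `hot_spots` function name and the `threshold` parameter are all assumptions made for illustration, and a real Hadoop job would presumably split large images into tiles and process each tile in parallel.

```python
from typing import List

Grid = List[List[int]]  # hypothetical: one brightness value per pixel


def hot_spots(before: Grid, after: Grid, threshold: int) -> List[List[bool]]:
    """Mark every pixel whose value changed by more than `threshold`.

    Illustrative only: the function name, the grid type and the
    threshold rule are assumptions, not part of the original slides.
    """
    same_shape = len(before) == len(after) and all(
        len(row_b) == len(row_a) for row_b, row_a in zip(before, after)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")

    # Compare the two images pixel by pixel; True marks a "hot spot".
    return [
        [abs(pixel_a - pixel_b) > threshold for pixel_b, pixel_a in zip(row_b, row_a)]
        for row_b, row_a in zip(before, after)
    ]


if __name__ == "__main__":
    earlier = [[10, 10], [10, 10]]
    later = [[10, 50], [12, 10]]
    print(hot_spots(earlier, later, threshold=5))
```

Each pixel comparison is independent of every other, which is what makes this kind of delta computation easy to spread across many machines.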

Editor's Notes

  • #20 Proliferation of web connected devices (smartphone interactions with the web); explosion in user generated content (e.g. blogs, Twitter, Facebook etc.); increased consumption of digital content (Netflix, Hulu, Pandora etc.); Internet of things (smart-grid and smart-meters).
  • #37 Big Data means all data, including both transaction and interaction data, in sets whose size or complexity exceeds the ability of commonly used technologies to capture, manage and process at a reasonable cost and timeframe. In fact Big Data is the confluence of three technology trends:
      • Big Transaction Data: Massive growth of transaction data volumes - clarify transaction/analytical data from head on
      • Big Interaction Data: Explosion of interaction data such as social media, sensor technologies, call detail records, and other sources
      • Big Data Processing: New very large scale processing with Hadoop
    For the last 40 years, the IT industry has been focused on automating business processes by using relational databases to process transaction data. This data has become fragmented and locked within operational and analytical systems, both on premise and in the cloud. Data integration technology integrated these transactional data silos. Over time the volume of this transaction data has grown to outpace the capabilities of IT to effectively manage and process what has become “Big Transaction Data”.
    Today organizations are also confronted with an explosion of a new type of data called “Big Interaction Data” which poses new challenges and new opportunities. Gaining access to this data is critical to the empowerment of the enterprise to take advantage of new business opportunities. However, IT organizations are not adequately prepared to access, process, integrate and deliver this data. Combining Big Interaction Data with Big Transaction Data will unleash great new opportunities for the data-centric enterprise and drive competitive advantage.
  • #40 Need descriptions
  • #42 Enterprises want to leverage machine interaction data for predictive analytics (e.g. analyzing dropped calls in CDRs to predict whether a customer is likely to leave for a competing carrier). They also want to analyze RFID data for proactive inventory and logistics management and to improve operational efficiency. Similarly, utilities want to leverage smart meter data to actively manage the power grid. The technologies needed to leverage this data include the ability to parse and standardize the incoming data (DQ), augment it with customer, product and location master data (MDM and Data Services) to provide context, and correlate it with other events in order to proactively monitor it (CEP).
  • #44 Hadoop is a parallel data computing platform that can be scaled incrementally in a cost effective fashion. The core ideas for Hadoop originated at Google, which needed a cost effective and highly scalable infrastructure to deal with high volumes of web data and search queries. Hadoop is an Apache project whose early development was driven largely by Yahoo, and it is used extensively by Yahoo, Facebook, LinkedIn and others for big data processing. Compared to parallel databases, it does a better job of handling semi-structured and unstructured data.
  • #45 In order to process big data, one needs a platform that can scale incrementally as opposed to requiring a fork-lift upgrade. To provide agility, the platform should be able to ingest data without requiring it to be preprocessed first. It should be able to handle data of all kinds and shapes and provide an open, extensible way for users to express their data processing logic.
  • #46 1. Developer first loads customer (CRM), transaction (ERP) and Social Media (Facebook) data into HDFS.
  • #47 3. Developer designs MapReduce logic to understand customer mobile device purchase information and sentiment by age.
  • #51 In order to process big data, one needs a platform that can scale incrementally as opposed to requiring a fork-lift upgrade. To provide agility, the platform should be able to ingest data without requiring it to be preprocessed first. It should be able to handle data of all kinds and shapes and provide an open, extensible way for users to express their data processing logic.
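
Note #47 mentions designing MapReduce logic to understand customer sentiment by age. The following is a minimal sketch of that map, shuffle and reduce pattern in plain Python rather than with Hadoop's own APIs. The record fields (`age`, `sentiment`), the ten-year age bands and all function names are hypothetical, chosen only to show the shape of the computation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, float]  # hypothetical record: {"age": ..., "sentiment": ...}


def age_band(age: int) -> str:
    """Bucket an age into a ten-year band such as '20-29' (assumed grouping)."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, float]]:
    """Map: emit one (age band, sentiment score) pair per input record."""
    for record in records:
        yield age_band(int(record["age"])), float(record["sentiment"])


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle: group all values by key, as the framework would between phases."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[float]]) -> Dict[str, float]:
    """Reduce: average the sentiment scores collected for each age band."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


if __name__ == "__main__":
    sample = [
        {"age": 23, "sentiment": 1.0},
        {"age": 27, "sentiment": 0.0},
        {"age": 41, "sentiment": -1.0},
    ]
    print(reduce_phase(shuffle(map_phase(sample))))
```

On a real cluster the map and reduce functions would run on many nodes at once and the framework would perform the shuffle; here all three steps run in a single process purely to make the data flow visible.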