Big Data & Analytics
Keshav Tripathy, Bharti Consulting Inc.
Outline
• Big Data
• Gartner Hype Cycle 2012
• Large scale data processing
• Visual Analytics
• Chances and Challenges
• Discussions
Big Data V3
• Volume: Gigabyte (10^9), Terabyte (10^12), Petabyte (10^15), Exabyte (10^18),
Zettabyte (10^21)
• Variety: structured, semi-structured, unstructured; text, image, audio, video,
record
• Velocity: dynamic, sometimes time-varying
Big Data refers to datasets that grow so large that they are difficult to capture, store, manage, share, analyze and
visualize with typical database software tools.
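As a small illustration of the volume units above, the sketch below expresses a raw byte count in the largest decimal unit that fits. The helper name `humanize` and the sample values are assumptions made for this example, not part of the original slides.

```python
# Decimal (SI) byte units named on the slide, as powers of ten.
UNITS = [
    ("ZB", 10**21),
    ("EB", 10**18),
    ("PB", 10**15),
    ("TB", 10**12),
    ("GB", 10**9),
]


def humanize(num_bytes):
    """Express a byte count in the largest listed unit it reaches."""
    for name, size in UNITS:
        if num_bytes >= size:
            return f"{num_bytes / size:g} {name}"
    return f"{num_bytes} B"


# Hypothetical sample values, chosen only to exercise the helper.
for value in (7 * 10**12, 3 * 10**15, 500):
    print(humanize(value))
```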
Numbers
• How much data is there in the world?
• 800 Terabytes, 2000
• 160 Exabytes, 2006
• 500 Exabytes (Internet), 2009
• 2.7 Zettabytes, 2012
• 35 Zettabytes by 2020
• How much data is generated in ONE day?
• 7 TB, Twitter
• 10 TB, Facebook
Big data: The next frontier for innovation, competition, and productivity
McKinsey Global Institute 2011
Why Is Big Data Important?
Gartner Hype Cycle 2012
Large Scale Visual Analytics
• Definition: Visual analytics is the science of analytical reasoning facilitated by
interactive visual interfaces.
• People use visual analytics tools and techniques to
• Synthesize information and derive insight from massive, dynamic,
ambiguous, and often conflicting data
• Detect the expected and discover the unexpected
• Provide timely, defensible, and understandable assessments
• Communicate assessment effectively for action.
Inforviz Reference Model to Visual Analytics
Applications
• Terrorism and Responses
• Multimedia Visual Analytics
• Situation Surveillance and Awareness in Investigative Analysis
• Disease Visual Analytics for Disease Outbreak Prediction
• Financial Visual Analytics
• Cybersecurity Visual Analytics
• Visual Analytics for Investigative Analysis on Text Documents
Techniques and Technologies
• A wide variety of techniques and technologies has been developed and adapted for
• Data aggregation
• Data manipulation
• Data analysis
• Data visualization
• These techniques and technologies draw from several fields including
• Statistics
• Computer science
• Applied mathematics
• Economics.
Techniques and Applications
• Statistics: A/B testing (split testing/bucket testing), spatial analysis, predictive modeling: regression
• Machine Learning
• Unsupervised learning: cluster analysis
• Supervised learning: classification, support vector machines (SVM), ensemble learning
• Association rule learning
• Data Mining and Pattern Recognition: neural network, classification, clustering
• Natural language processing (NLP): sentiment analysis
• Dimension Reduction: PCA, MDS, SVD
• Data fusion and data integration: Visual Word
• Time series analysis: Combination of statistics and signal processing
• Simulation: Monte Carlo simulations, MRF
• Optimization: Genetic algorithms
• Visualization: Scientific Viz, Inforviz, Visual Analytics
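To make one of the listed techniques concrete, here is a minimal sketch of evaluating an A/B test with a two-proportion z-test. The function name and the conversion counts are hypothetical, and a pooled-variance z-test is only one common way to analyze such an experiment.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Hypothetical experiment: variant A converts 200/1000, variant B 250/1000.
z, p = two_proportion_z_test(conv_a=200, n_a=1000, conv_b=250, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone; the threshold to use is a design decision for the experiment.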
Technologies
• Database and Data warehouse
• Google File System and MapReduce: Bigtable
• Hadoop: HBase and MapReduce, open source Apache project
• Cassandra: An open source (free) DBMS, originally developed at Facebook and now an Apache Software Foundation project.
• Data warehouse: ETL (extract, transform, and load) tools and business intelligence tools.
• Business intelligence (BI): data warehouse, reporting, real-time management dashboards
• Cloud computing: Services, SOA, etc.
• Metadata: XML
• Stream processing
• R, SAS and SPSS
• Visualization: Tag cloud, Clustergram, History flow, ThemeRiver, Treemap
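The MapReduce model named in the list above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using word counting; it does not use the Hadoop API, and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big insight", "data visual analytics"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines, and the shuffle step moves data across the network; the data flow is the same.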
Origin of Information Visualization
InforViz Techniques
• Scatterplot and Scatterplot Matrix
• Hierarchies Visualization: Node-Link Diagrams, Sunburst, Treemap, Circle-packing
layouts
• Network Visualization: Force-Directed Layout, Arc Diagrams, Matrix Views
• Multidimensional Visualization/Parallel Coordinates
• Stacked Graphs
• Flow Maps
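As an illustration of one technique from the list above, the sketch below computes a treemap with the simple slice-and-dice layout: each node gets a rectangle whose area is proportional to its size, and the split direction alternates with depth. The tree structure, names and sizes are hypothetical, and leaf sizes are assumed to be positive.

```python
def subtree_size(node):
    """Total size of a node: its own size for a leaf, else the sum of children."""
    if "children" in node:
        return sum(subtree_size(c) for c in node["children"])
    return node["size"]


def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Assign each node a rectangle (x, y, w, h) proportional to its size."""
    if out is None:
        out = {}
    out[node["name"]] = (x, y, w, h)
    children = node.get("children", [])
    total = sum(subtree_size(c) for c in children)
    offset = 0.0
    for child in children:
        share = subtree_size(child) / total
        if depth % 2 == 0:
            # Even depth: place children side by side along the x axis.
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd depth: stack children along the y axis.
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Hypothetical hierarchy laid out in a 100 x 100 canvas.
tree = {
    "name": "root",
    "children": [
        {"name": "A", "size": 2},
        {"name": "B", "children": [
            {"name": "B1", "size": 1},
            {"name": "B2", "size": 1},
        ]},
    ],
}
rects = slice_and_dice(tree, 0, 0, 100, 100)
for name, rect in rects.items():
    print(name, rect)
```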
Scatterplot and Scatterplot Matrix
Tree Visualization (1)
Node-Link Diagrams
Sunburst
Tree Visualization (2)
Treemap
Circle-packing layouts
Network Visualization
Force-Directed Layout
Arc Diagrams
Matrix Views
Parallel Coordinates
Stacked Graphs
Flow Maps
Examples
Fraud Detection of Bank Wire Transactions
Displays and Views
A classical VA tool
GapMinder [Demo]
Smart Money Map [Demo]
A recent project
Chances and Challenges
• The basic techniques for large scale simulation and computing are ready
• However, large and time-consuming computing tasks need steering or
visualization of the intermediate computing results.
• Most simulation and computing tasks have to tune hundreds of parameters.
• Smart/intelligent data mining/data processing algorithms are ready
• However, most data mining algorithms have high computational complexity: N^2
rather than N log(N) or N
• How can automatic computing (machine) be combined with high-level intelligence (human) to gain
insight, and how can humans be involved in the computing?
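The complexity gap mentioned above matters at scale: for N = 10^6 items, N^2 is 10^12 operations while N log2(N) is roughly 2 x 10^7. The sketch below contrasts the two on a toy problem chosen for this example (the smallest gap between any two values); the problem and function names are illustrative assumptions, not taken from the slides.

```python
def min_gap_quadratic(values):
    """Compare every pair: about N * (N - 1) / 2 distance computations."""
    best = float("inf")
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            best = min(best, abs(values[i] - values[j]))
    return best


def min_gap_sorted(values):
    """Sort once (N log N); afterwards only adjacent values can be closest."""
    ordered = sorted(values)
    return min(b - a for a, b in zip(ordered, ordered[1:]))


data = [17, 3, 42, 8, 29, 11]
# Both approaches agree; only the amount of work differs.
print(min_gap_quadratic(data), min_gap_sorted(data))
```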
Recent Research Topics
• Unified Visual Analytics for Heterogeneous Data Sources (esp. Text)
• Structured and semi-structured data fusion framework
• Data indexing and similarity rank
• Visual analytics for high-dimensional heterogeneous data
• Domain Risk Management and Preventive Control by Sensor Data Collection and Data Mining
• Sensor techniques
• Data Warehouse
• Coordinated Views integrate visual analytic techniques
• Parallel/Distributed Computing Steering by Parameter Optimization and Visualization
• Parameter tuning and computing optimization
• Intermediate results visualization and task steering
• Markov Chain Monte Carlo (MCMC) Simulation
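For readers unfamiliar with MCMC, here is a minimal sketch of a random-walk Metropolis sampler, one of the simplest MCMC methods. The target density (a standard normal), the step size, the seed and the function name are all assumptions chosen for illustration; the intermediate samples it produces are the kind of result that a steering tool could visualize while the simulation runs.

```python
import math
import random


def metropolis(log_density, start, steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log density."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_density(proposal) - log_density(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)
    return samples


# Assumed target: standard normal, specified up to a constant.
samples = metropolis(lambda x: -0.5 * x * x, start=0.0, steps=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean ~ {mean:.2f}, variance ~ {var:.2f}")
```

The step size is exactly the kind of parameter the slide refers to: too small and the chain explores slowly, too large and most proposals are rejected.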
Questions and Thanks!
