Chapter 1
Introduction to Big Data Analytics
HeDs5221 - Big Data
Analytics in Health
By Eshete Derb Emiru (PhD)
esheted2010@gmail.com
Chapter 1: Introduction to Big Data Analytics
•What is Big Data?
•What makes Big Data different from other
related “buzzwords”?
•What are we going to cover in this course?
What’s Big Data?
Big Data Analytics 3
AlphaGo
Google Translate
Big Data: Some Examples
• Topic detection and tracking
• Trend analysis
• Social network analysis
• PageRank
• Predictive analytics
• Many others: healthcare, natural resources,
education, public sector, insurance,
transportation, finance and crime detection,
…
Google News
Google Trends
What is Big Data?
• Big data is a term for data sets that are so
large or complex that traditional data
processing application software is
inadequate to deal with them.
• Challenges include capture, storage,
analysis, data curation, search, sharing,
transfer, visualization, querying, updating
and information privacy. [source:
Wikipedia]
What is Big Data?
• “Big data is data whose scale, distribution,
diversity, and/or timeliness require the
use of new technical architectures and
analytics to enable insights that unlock
new sources of business value.”
• [source: J. Manyika et al., Big Data: The next
frontier for innovation, competition, and
productivity, McKinsey Global Institute,
2011]
Characteristics of Big Data
• META Group (now Gartner) defined data
growth challenges and opportunities as
being three-dimensional, i.e. increasing
volume, velocity, and variety [Doug
Laney, 2001]
– Volume: the quantity of generated and stored
data
– Velocity: the speed at which data is generated
and processed
– Variety: the type and nature of data
Characteristics of Big Data
• “Big data is high volume, high velocity,
and/or high variety information assets
that require new forms of processing to
enable enhanced decision making, insight
discovery and process optimization.”
[Gartner, 2012]
The Four V’s of Big Data
[source: IBM]
Characteristics of Big Data
• Five V’s:
– Volume: scale
• 2.5 EB generated per day (for comparison: Library of Congress, 300 TB)
– Velocity: streaming
• In 60 seconds: 350,000 tweets, 300 hours of YouTube video,
171 million emails, 350 GB sensor data from a jet engine
– Variety: different forms
• Structured, Semi-structured, Unstructured
– Veracity: uncertainty
• Quality or fidelity
– Value
• Higher veracity, lower processing time -> higher value
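To get a feel for the volume and velocity figures above, they can be converted to a common scale. The sketch below takes the slide's numbers at face value and assumes decimal units (1 EB = 1,000,000 TB) and that the 300 TB figure refers to the size of the Library of Congress collection; the function names are illustrative only.

```python
# Figures quoted on the slide; decimal units are an assumption.
DAILY_VOLUME_EB = 2.5
LIBRARY_OF_CONGRESS_TB = 300
TWEETS_PER_MINUTE = 350_000

TB_PER_EB = 1_000_000


def libraries_per_day(daily_eb: float, library_tb: float) -> float:
    """Number of Library-of-Congress-sized collections in one day's data."""
    return daily_eb * TB_PER_EB / library_tb


def per_second(count_per_minute: float) -> float:
    """Convert a per-minute rate into a per-second rate."""
    return count_per_minute / 60


loc_equivalents = libraries_per_day(DAILY_VOLUME_EB, LIBRARY_OF_CONGRESS_TB)
tweets_per_second = per_second(TWEETS_PER_MINUTE)
print(round(loc_equivalents), round(tweets_per_second))
```

Under these assumptions, one day of data corresponds to several thousand Library-of-Congress-sized collections, and the tweet stream alone arrives at several thousand items per second.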
Characteristics of Big Data
Data Structures
• Variety: different forms
– Structured: databases, spreadsheets, …
– Semi-structured: textual files such as Web
pages, XML, …
– Unstructured: text documents, images,
videos, …
• Data growth is increasingly unstructured
– Social media: Facebook, Twitter, …
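The three forms can be illustrated with a few lines of Python using only the standard library. This is a minimal sketch; the patient records and field names are invented for illustration and do not come from the course material.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row has the same fields.
table = io.StringIO("patient_id,age\n1,34\n2,58\n")
rows = list(csv.DictReader(table))

# Semi-structured: self-describing tags, but records may carry different fields.
doc = ET.fromstring(
    "<patients>"
    "<patient id='1'><age>34</age></patient>"
    "<patient id='2'><note>follow-up</note></patient>"
    "</patients>"
)
# findtext returns None when a record has no <age> element.
ages = [p.findtext("age") for p in doc.findall("patient")]

# Unstructured: free text; any structure has to be extracted by the analysis.
note = "Patient reports mild headache and fatigue since Monday."
word_count = len(note.split())

print(rows, ages, word_count)
```

The structured rows can be queried by column name directly, the XML records need a check for missing fields, and the free text offers nothing beyond raw tokens until further processing is applied.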
Differences from traditional
data analysis
• Distinct requirements
– Combining of multiple unrelated datasets
– Processing of large amounts of unstructured data
– Harvesting of hidden information in a time-sensitive
manner
• Newer techniques that leverage computational
resources
• Interdisciplinary
– Mathematics, statistics, computer science, subject matter
expertise
• Benefits
– Optimization, predictions, fault or fraud detection,
improved decision making, discoveries
Data analysis vs. Data analytics
• “Analysis is the separation of a whole into its
component parts, and analytics is the method of
logical analysis.” [source: Merriam-Webster
dictionary]
• “Analysis is really a heuristic activity, where
scanning through all the data the analyst gains
some insight.” [source: Quora.com]
• “Analytics is about applying a mechanical or
algorithmic process to derive the insights, for
example running through various data sets
looking for meaningful correlations between
them.” [source: Quora.com]
Related Terms
• Data science, predictive analytics
• Business intelligence, FinTech
• IoT, CPS, Industry 4.0
• Smart homes, smart cities
• Data mining, machine learning, artificial
intelligence
• Cloud computing, data-intensive computing,
parallel computing, distributed computing
• …
What is Data Science?
Data Engineering vs. Data
Analysis
• Data engineering: designing and building
infrastructure for integrating and managing data from
various resources
– MySQL, NoSQL, Hadoop, MapReduce
• Data analysis: querying and processing data,
providing reports, summarizing and visualizing data
– Statistics, visualization, Excel, SAS, SPSS, …
• Data science: applying statistics, machine learning and
analytic approaches to solve critical business
problems, and turning data into valuable and
actionable insights
– Advanced data analysis
– Data mining tools, machine learning, statistics, …
Big data vs. Business
Intelligence
• Business Intelligence (BI) is the set of
strategies, processes, applications, data,
products, technologies and technical
architectures which are used to support
the collection, analysis, presentation and
dissemination of business information.
[source: Wikipedia]
Data Mining in Business Intelligence
[Figure: pyramid of layers, with increasing potential to support business decisions toward the top]
• Layers, from top to bottom:
– Decision Making
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: Statistical Summary, Querying, and Reporting
– Data Preprocessing/Integration, Data Warehouses
– Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
• Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
[source: Han 2011]
Big Data vs. Financial
Technology
[source: Slideshare]
Big Data vs. IoT
• The Internet of things (IoT) is the
internetworking of physical devices,
vehicles (also referred to as "connected
devices" and "smart devices"), buildings,
and other items—embedded with
electronics, software, sensors, actuators,
and network connectivity that enable
these objects to collect and exchange data.
[source: Wikipedia]
Big Data vs. Industry 4.0
[source: Roland Berger]
Big Data vs. CPS
• A cyber-physical system (CPS) is a mechanism
controlled or monitored by computer-based
algorithms, tightly integrated with the internet and its
users.
• In cyber-physical systems, physical and software
components are deeply intertwined, each operating on
different spatial and temporal scales, exhibiting
multiple and distinct behavioral modalities, and
interacting with each other in a myriad of ways that
change with context.
– E.g.: smart grid, autonomous automobile systems, medical
monitoring, process control systems, robotics systems, and
automatic pilot avionics
– [source: Wikipedia]
Big Data vs. Data Mining
• Data mining is the computational process
of discovering patterns in large data sets
involving methods at the intersection of
artificial intelligence, machine learning,
statistics, and database systems. [source:
Wikipedia]
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
[source: Han 2011]
Big Data vs. Data-Intensive
Computing
• Data-intensive computing is a class of
parallel computing applications which use
a data parallel approach to process large
volumes of data typically terabytes or
petabytes in size and typically referred to
as big data. [source: Wikipedia]
Google Trends Comparison:
Big data, IoT, Cloud computing
Big data vs. AI, ML, and DL
Limits of Predictions
• Predictive analytics: technology that learns
from experience (data) to predict the future
behavior of individuals in order to drive
better decisions
• Accurate prediction is generally not possible
• But predictions need not be accurate to bring
value
– E.g. direct mail marketing
• The prediction effect: predicting better than
pure guessing delivers value
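The direct mail case can be made concrete with a small profit calculation. All numbers below (customer count, mailing cost, profit per response, response rates) are hypothetical and chosen only to illustrate the prediction effect; they are not taken from the course material.

```python
def campaign_profit(n_mailed, response_rate, cost_per_mail, profit_per_response):
    """Expected profit: revenue from responders minus the cost of all mailings."""
    return n_mailed * response_rate * profit_per_response - n_mailed * cost_per_mail


# Hypothetical baseline: mail everyone, 1% respond.
mass_mailing = campaign_profit(1_000_000, 0.01, 2.0, 220.0)

# Hypothetical model: mail only the 25% ranked most likely to respond;
# within that group 3% respond, so the model is still wrong 97% of the time.
targeted_mailing = campaign_profit(250_000, 0.03, 2.0, 220.0)

print(mass_mailing, targeted_mailing)
```

With these made-up figures the targeted campaign earns several times the profit of the mass mailing, although the prediction for any single customer remains far from accurate.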
Is Big Data the End of Theory?
• Critiques
– Big data requires “big judgement”
– If the systems dynamics of the future change,
the past can say little about the future
– Privacy
– Bias, subjective, shallow
– “If you believe in Big Data analytics, it’s time
to begin planning for a Hillary Clinton
presidency and all that entails.”
What are we going to cover in
this course?
• Data mining
– Frequent pattern mining
– Classification
– Clustering
• Parallel programming in distributed
platforms
– Scalability
– Hadoop, Spark
– MapReduce programming
KDD Process: A Typical View from ML and Statistics
• This is a view from typical machine learning and statistics communities
[Figure: Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing]
• Data pre-processing
– Data integration
– Normalization
– Feature selection
– Dimension reduction
• Data mining
– Pattern discovery
– Association & correlation
– Classification
– Clustering
– Outlier analysis
– …
• Post-processing
– Pattern evaluation
– Pattern selection
– Pattern interpretation
– Pattern visualization
[source: Han 2011]
Data Mining Function: Association
and Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in
your Walmart?
• Association, correlation vs. causality
– A typical association rule
• Diaper -> Beer [0.5%, 75%] (support, confidence)
– Are strongly associated items also strongly
correlated?
• How to mine such patterns and rules efficiently in
large datasets?
• How to use such patterns for classification,
clustering, and other applications?
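Support and confidence of a rule such as Diaper -> Beer can be computed directly from a list of transactions. The sketch below uses a tiny invented basket dataset; it also computes lift, one common measure of correlation between the two sides of a rule, which is an addition here and not named on the slide.

```python
# Toy transactions, invented for illustration.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"bread", "milk"},
    {"beer", "bread"},
    {"milk"},
    {"bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
rule_lift = lift({"diaper"}, {"beer"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests the items occur together more often than independence would predict; a rule can have high confidence and still have lift near 1 when the consequent is simply very common, which is why association and correlation are asked about separately.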
Data Mining Function:
Classification
• Classification and label prediction
– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
– Predict some unknown class labels
• Typical methods
– Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification,
pattern-based classification, logistic regression, …
• Typical applications:
– Credit card fraud detection, direct marketing,
classifying stars, diseases, web-pages, …
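The idea of constructing a model from training examples and then predicting unknown labels can be sketched with the simplest form of a decision tree, a one-level tree (decision stump) on a single numeric feature. The mileage values and the class names "guzzler" and "economy" are invented for illustration.

```python
def fit_stump(values, labels):
    """Find the threshold and label pair with the fewest training errors."""
    classes = sorted(set(labels))
    candidates = sorted(set(values))
    best = None  # (errors, threshold, label_below, label_above)
    # Candidate thresholds are midpoints between neighbouring feature values.
    for left, right in zip(candidates, candidates[1:]):
        threshold = (left + right) / 2
        for below in classes:
            for above in classes:
                errors = sum(
                    (below if v <= threshold else above) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    return best[1:]


def predict(stump, value):
    """Apply the learned threshold rule to a new value."""
    threshold, below, above = stump
    return below if value <= threshold else above


# Training examples: gas mileage (mpg) with a known class label.
mpg = [12, 15, 18, 22, 30, 35, 40, 45]
label = ["guzzler"] * 4 + ["economy"] * 4

stump = fit_stump(mpg, label)
print(stump, predict(stump, 20), predict(stump, 50))
```

Full decision trees apply the same search recursively over many features; the stump only shows the train-then-predict pattern.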
Data Mining Function: Cluster
Analysis
• Unsupervised learning (i.e., class labels are
unknown)
• Group data to form new categories (i.e.,
clusters), e.g., cluster houses to find
distribution patterns
• Principle: maximizing intra-class similarity
& minimizing inter-class similarity
• Many methods and applications
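One widely used clustering method is k-means, shown here in one dimension as a minimal sketch. The house prices and the initial centers are invented; k-means is chosen only as a familiar example and is not singled out by the slide.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# House prices in thousands; no class labels are given.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
print(centers, clusters)
```

Points within a cluster end up close to their own center (high intra-class similarity) and far from the other center (low inter-class similarity).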
Limitations of Data Analysis
• CPU: computing time to execute the
analysis
• I/O: how much data can be put in
memory per time unit
• Memory: how much data can be processed
at a time
Data to be analyzed
• Tall data: large number of cases
• Wide data: large number of features
• Tall and wide data: large number of both
cases and features
• Sparse data: large number of zero entries
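The shape and sparsity of a dataset can be measured before choosing how to store it. The sketch below uses a small invented matrix (rows are cases, columns are features) and a dictionary-of-keys layout as one simple way to store only the non-zero entries.

```python
def shape_and_sparsity(matrix):
    """Return (number of cases, number of features, fraction of zero entries)."""
    n_cases = len(matrix)
    n_features = len(matrix[0])
    zeros = sum(1 for row in matrix for x in row if x == 0)
    return n_cases, n_features, zeros / (n_cases * n_features)


def to_sparse(matrix):
    """Dictionary-of-keys: keep only non-zero entries as {(row, col): value}."""
    return {
        (i, j): x
        for i, row in enumerate(matrix)
        for j, x in enumerate(row)
        if x != 0
    }


dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

n_cases, n_features, sparsity = shape_and_sparsity(dense)
sparse = to_sparse(dense)
print(n_cases, n_features, sparsity, sparse)
```

For tall or wide data with mostly zero entries, storing only the non-zero values reduces memory use roughly in proportion to the sparsity.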
Algorithm to be used
• How complex is your algorithm?
• How many parameters are in your model?
• Are the optimization processes
parallelizable?
• Does your algorithm learn from all data or
small batches of data?
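The difference between learning from all data at once and learning from small batches can be shown with the simplest possible "model", a mean. This is a minimal sketch with invented data: the batch version needs only one batch in memory at a time yet arrives at the same result.

```python
def batches(data, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def full_mean(data):
    """Needs the whole dataset in memory at once."""
    return sum(data) / len(data)


def incremental_mean(data, batch_size):
    """Update a running mean one batch at a time."""
    count, mean = 0, 0.0
    for batch in batches(data, batch_size):
        count += len(batch)
        # Shift the running mean toward the new batch, weighted by its size.
        mean += (sum(batch) - len(batch) * mean) / count
    return mean


values = [2, 4, 6, 8, 10, 12]
all_at_once = full_mean(values)
in_batches = incremental_mean(values, batch_size=4)
print(all_at_once, in_batches)
```

Algorithms that can be written in this incremental form are much easier to apply to data that does not fit in memory.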
Possible solutions
• Scale up: performance improvement on a single
machine
– More memory, faster CPU, faster storage, using
GPUs, …
– E.g. CUDA, TensorFlow, Keras, …
• Scale out: performance improvement by
distributing computations
– Using outside resources: other CPUs, GPUs, storage
– E.g. Hadoop, Spark, …
• Scale up and out
– E.g. Distributed TensorFlow, …
Hadoop Architecture
MapReduce
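The MapReduce pattern can be sketched on a single machine as three steps: map each input to key-value pairs, group the pairs by key (shuffle), and reduce each group to one result. The word-count example below is a plain Python illustration of the pattern, not Hadoop code; the function names and documents are invented.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data analytics", "big data in health", "data mining"])
print(counts)
```

In a distributed framework the map calls run on different machines over different parts of the data, and the shuffle moves pairs across the network so that each reducer sees all values for its keys.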
Functional Programming
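The functional building blocks that MapReduce borrows its name from are available in Python as `map`, `filter`, and `functools.reduce`. A minimal sketch with invented sensor readings:

```python
from functools import reduce

readings = [3, 8, 12, 5, 20]

# map: apply a function to every element independently.
doubled = list(map(lambda x: x * 2, readings))

# filter: keep only the elements that satisfy a predicate.
large = list(filter(lambda x: x > 10, doubled))

# reduce: fold the remaining elements into a single value.
total = reduce(lambda acc, x: acc + x, large, 0)

print(doubled, large, total)
```

Because `map` and `filter` treat each element independently and do not modify shared state, the work can be split across machines without changing the result.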
Spark and Hadoop
TensorFlow
• An open-source library for machine
learning
– High-level API: Keras, …
– Low-level API: TensorFlow Core
• Prerequisites: algebra, Python
• Google Colab (Colaboratory) is an easy
way to learn and use TensorFlow
TensorFlow
Typical Workflows in TensorFlow
[Figure labels: Input data, Tensors, Model, Model training and testing, Dataflow Graph, devices, sessions]
Thanks for Your Attention!
