Data Science
 Overview of Data Science
 Definition of Data and Information
 Data Types and Representation
 Data Value Chain
 Data Acquisition
 Data Analysis
 Data Curating
 Data Storage
 Data Usage
 Basic Concepts of Big Data
Overview of Data Science
 Data science is the practice of mining large sets of raw data, both structured
and unstructured, to identify patterns and extract actionable insights from them.
 Data science deals with vast volumes of data, using modern tools and techniques to
find unseen patterns, derive meaningful information, and make business decisions.
 Data science is a blend of various fields such as probability, statistics, programming,
analysis, and cloud computing.
 Data Science is the extraction of actionable insights from raw data.
Data and Information
Data
 Raw facts, figures and statistics
 No contextual meaning
 Data can be in characters, numbers, images, words
Information
 Processed / organized data
 Exact meaning and organized context
 Organized and presented in context – value added to data
Data → (Context + Processing) → Information
Example of adding context to data:
 100
 100 Miles
 100 Miles is a far distance
 Difficult to walk 100 miles, but vehicle transport is okay
Measure of Data in Files – File Size
Name        Equal To            Size (in Bytes)
Bit         1 Bit               1/8
Nibble      4 Bits              1/2 (rare)
Byte        8 Bits              1
Kilobyte    1,024 Bytes         1,024
Megabyte    1,024 Kilobytes     1,048,576
Gigabyte    1,024 Megabytes     1,073,741,824
Terabyte    1,024 Gigabytes     1,099,511,627,776
Petabyte    1,024 Terabytes     1,125,899,906,842,624
Exabyte     1,024 Petabytes     1,152,921,504,606,846,976
Zettabyte   1,024 Exabytes      1,180,591,620,717,411,303,424
Yottabyte   1,024 Zettabytes    1,208,925,819,614,629,174,706,176
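Each row of the table is the previous row multiplied by 1,024. The following is a minimal sketch of that arithmetic, assuming the binary (base-1024) convention used on this slide; the names `UNITS`, `unit_size_bytes`, and `human_readable` are illustrative and not part of the original deck.

```python
# Unit names in the order used on the slide; each is 1024 times the previous one.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def unit_size_bytes(name: str) -> int:
    """Return the number of bytes in one unit, e.g. 1024**2 for 'Megabyte'."""
    return 1024 ** UNITS.index(name)


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for name in UNITS:
        if value < 1024 or name == UNITS[-1]:
            return f"{value:.2f} {name}s"
        value /= 1024


if __name__ == "__main__":
    # Reproduce the "Size (in Bytes)" column from Byte upward.
    for name in UNITS:
        print(f"{name:<10} {unit_size_bytes(name):,}")
    print(human_readable(5 * 1024 ** 3))
```

Python integers have arbitrary precision, so even the yottabyte value is computed exactly.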
Types of Data and Its Representation
Structured Data
 Predefined data models
 Stored in rows and columns
 Examples: Dates, Phone Numbers, Names
Semi-Structured Data
 Loosely organized into categories using meta tags
 Stored in tagged or keyed formats – HTML, XML, JSON
 Examples: Server Logs, Tweets organized by Hashtags
Unstructured Data
 No predefined data models
 Stored in various forms – image, audio, video, text
 Examples: Documents, Image Files, Emails & Messages
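To make the three categories concrete, here is a small sketch of how each kind of data is typically read in code. All sample values (names, phone numbers, the tweet) are invented for illustration and do not come from the original slides.

```python
import csv
import io
import json
import re

# Structured: a predefined model of rows and columns (here, CSV with a header).
structured_csv = (
    "name,phone,signup_date\n"
    "Asha,555-0101,2023-04-01\n"
    "Ravi,555-0102,2023-05-17\n"
)
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: no fixed table, but keys/tags organize the content (JSON).
semi_structured_json = (
    '{"user": "asha", "text": "Learning #datascience", "hashtags": ["datascience"]}'
)
tweet = json.loads(semi_structured_json)

# Unstructured: free text with no data model; any structure must be inferred.
unstructured_text = "Hi team, call me on 555-0101 after the 3pm meeting. Thanks, Asha"
phones = re.findall(r"\d{3}-\d{4}", unstructured_text)

print(rows[0]["phone"])    # direct column lookup
print(tweet["hashtags"])   # lookup by key
print(phones)              # pattern matching on raw text
```

The amount of work needed to get at the same fact (a phone number, a topic) grows as the structure decreases, which is why unstructured data usually needs text mining or similar techniques.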
Data Science
 Data science enables businesses to process huge amounts of structured and
unstructured Big Data to detect patterns
 Asking Alexa or Siri for a recommendation demands data science
 Operating a self-driving car
 Search engines
 Chatbots for customer service
Data Science Pre-Requisites
 Machine Learning
 Modeling
 Statistics
 Programming
 Databases
Data Science Lifecycle
 Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction - Gathering raw structured
and unstructured data
 Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture - Taking
the raw data and putting it in a form that can be used
 Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization - Data scientists
take the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in
predictive analysis
 Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining, Qualitative Analysis -
Performing the various analyses on the data
 Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making - Analysts
prepare the analyses in easily readable forms such as charts, graphs, and reports
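The five lifecycle stages can be sketched as a chain of small functions. This is only an illustration under simplifying assumptions: the data is hard-coded, the "analysis" is a plain average, and all function names are hypothetical.

```python
from statistics import mean


def capture():
    """Capture: gather raw records (hard-coded here; normally files, sensors, or APIs)."""
    return [("mon", "12"), ("tue", ""), ("wed", "15"), ("thu", "n/a"), ("fri", "18")]


def maintain(raw):
    """Maintain: cleanse the raw data, keeping only records with a numeric value."""
    return [(day, int(value)) for day, value in raw if value.isdigit()]


def process(clean):
    """Process: summarize the prepared data (count and range)."""
    values = [v for _, v in clean]
    return {"count": len(values), "min": min(values), "max": max(values)}


def analyze(clean):
    """Analyze: a simple descriptive analysis, the average value."""
    return mean(v for _, v in clean)


def communicate(summary, average):
    """Communicate: present the result in a readable form."""
    return (f"{summary['count']} valid records, "
            f"range {summary['min']}-{summary['max']}, average {average:.1f}")


clean = maintain(capture())
report = communicate(process(clean), analyze(clean))
print(report)
```

In a real project each stage would be far larger, but the hand-off between stages follows the same shape: the output of one stage is the input of the next.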
Data Science Applications
 Healthcare
Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases.
 Gaming
Video and computer games are now being created with the help of data science and that has taken the gaming experience to the
next level.
 Image Recognition
Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.
 Recommendation Systems
Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase, or browse on their
platforms.
 Logistics
Data Science is used by logistics companies to optimize routes to ensure faster delivery of products and increase operational
efficiency.
 Fraud Detection
Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.
Data Value Chain
Data Value Chain - The evolution of
data from collection to analysis,
dissemination, and the final impact of
data on decision making
Data Value Chain
 Data Capture & Acquisition
Collection of raw data from both internal and external sources. The first phase of data collection involves identifying what data to collect and
then establishing a process to do so (e.g., conducting a survey or retrieving automated IoT data). Decisions made here will affect the quality and
usability of data throughout its life-cycle.
 Data Processing & Cleansing
Cleaning data - identifying and correcting corrupt, inaccurate, or irrelevant data - as well as converting raw data into a format that is usable,
integratable and machine readable.
 Data Curation, Integration and Enrichment
Data curation and integration refer to the collection of processes required to merge data from multiple sources into one cohesive dataset.
During this process, data is also enriched, meaning that contextual metadata (the data that makes larger datasets discoverable) is added or
updated.
 Data Analysis
Data is analyzed and used to uncover trends, patterns and other insights that can enhance decision making.
 Data ROI & Monetization
The application of data analytics processes to solve real-world problems and, in a business setting, increase revenue.
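The cleansing, integration, enrichment, and analysis stages above can be sketched on a toy dataset. The two sources, their field names, and the helper functions below are assumptions made up for this example.

```python
# Two hypothetical sources describing the same customers.
crm_records = [
    {"id": 1, "name": " Asha ", "city": "Pune"},
    {"id": 2, "name": "Ravi", "city": None},   # missing city
    {"id": 2, "name": "Ravi", "city": None},   # duplicate record
]
order_records = [
    {"customer_id": 1, "amount": 250.0},
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": 75.5},
]


def cleanse(records):
    """Processing & cleansing: trim whitespace, fill missing values, drop duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["id"] in seen:
            continue
        seen.add(rec["id"])
        cleaned.append({"id": rec["id"],
                        "name": rec["name"].strip(),
                        "city": rec["city"] or "unknown"})
    return cleaned


def integrate(customers, orders):
    """Curation & integration: merge both sources into one dataset keyed by customer id."""
    totals = {}
    for order in orders:
        cid = order["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + order["amount"]
    return [{**c, "total_spent": totals.get(c["id"], 0.0)} for c in customers]


def enrich(dataset, source_name):
    """Enrichment: attach contextual metadata that makes the dataset discoverable."""
    return {"metadata": {"source": source_name, "rows": len(dataset)}, "rows": dataset}


dataset = enrich(integrate(cleanse(crm_records), order_records), "crm+orders")
# Analysis: find the customer with the highest total spend.
top = max(dataset["rows"], key=lambda row: row["total_spent"])
print(dataset["metadata"], top["name"], top["total_spent"])
```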
Big Data Value Chain
Data Acquisition → Data Analysis → Data Curation → Data Storage → Data Usage
Big Data Value Chain – Data Acquisition
Process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on
which data analysis can be carried out. Data acquisition is
one of the major big data challenges in terms of
infrastructure requirements. The infrastructure required to
support the acquisition of big data must deliver low,
predictable latency in both capturing data and in executing
queries; be able to handle very high transaction volumes,
often in a distributed environment; and support flexible and
dynamic data structures.
Big Data Value Chain – Data Analysis
Concerned with making the raw data acquired amenable to
use in decision-making as well as domain-specific usage. Data
analysis involves exploring, transforming and modelling data
with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential
from a business point of view.
Big Data Value Chain – Data Curation
Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation. Data curation
is responsible for improving the accessibility and quality of
data, ensuring that data are trustworthy, discoverable,
accessible, reusable, and fit their purpose.
Big Data Value Chain – Data Storage
Data storage is the persistence and management of data in a
scalable way that satisfies the needs of applications that
require fast access to the data. Relational Database
Management Systems (RDBMS) are widely used. NoSQL
technologies have been designed with scalability in
mind and present a wide range of solutions based on
alternative data models.
Big Data Value Chain – Data Usage
Covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity. Data usage in business
decision-making can enhance competitiveness through
reduction of costs, increased added value, or any other
parameter that can be measured against existing performance
criteria.
Project phases: Discover / Acquisition, Prepare, Plan, Model, Operationalize, Communicate Results
Basic Concepts of Big Data
 Big Data is characterized along five dimensions – Volume, Velocity, Variety, Value, Veracity
• Volume - Amount of data being generated
• Velocity - Speed at which data is being generated
• Variety - Diversity of data types
• Value – Worth of the data
• Veracity - Quality, accuracy, or trustworthiness of the data
Big Data – Impact of the 3 Vs
Volume (Amount of Data): Dealing with large scales of data within data
processing (e.g., Global Supply Chains, Global Financial Analysis, Large Hadron
Collider).
Velocity (Speed of Data): Dealing with high-frequency streams of incoming
real-time data (e.g., Sensors, Pervasive Environments, Electronic Trading,
Internet of Things).
Variety (Range of Data Types/Sources): Dealing with data using differing
syntactic formats (e.g., Spreadsheets, XML, DBMS), schemas, and meanings
(e.g., Enterprise Data Integration).
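Velocity in particular changes how code is written: data is handled one item at a time as it arrives, rather than loaded in full. A minimal sketch, assuming a made-up sensor feed and a fixed-size sliding window (the function name `rolling_average` is illustrative):

```python
from collections import deque


def rolling_average(stream, window_size=3):
    """Yield the average of the most recent `window_size` readings as each one arrives."""
    window = deque(maxlen=window_size)   # the oldest reading drops out automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)


# A generator stands in for a live sensor feed (values are invented).
sensor_feed = (value for value in [10, 20, 30, 40, 50])
averages = list(rolling_average(sensor_feed))
print(averages)
```

Only the last few readings are kept in memory, so the same function works whether the feed produces five values or runs indefinitely.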
Big Data Processing
The general categories of activities involved with big data processing are:
 Ingesting data into the system
 Persisting the data in storage
 Computing and Analyzing data
 Visualizing the results
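The four activities can be sketched end to end at toy scale. This example assumes an in-memory CSV string in place of a real data feed and an in-memory SQLite database in place of a distributed store; the sensor names and readings are invented.

```python
import csv
import io
import sqlite3

# Ingest: read incoming records (an in-memory CSV stands in for a real feed).
feed = "sensor,reading\na,10\nb,4\na,6\nb,8\nc,3\n"
records = [(row["sensor"], int(row["reading"]))
           for row in csv.DictReader(io.StringIO(feed))]

# Persist: store the records (in-memory SQLite stands in for big data storage).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, reading INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", records)

# Compute and analyze: aggregate the stored data per sensor.
totals = conn.execute(
    "SELECT sensor, SUM(reading) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()

# Visualize: a text bar chart is enough to show the idea.
chart = [f"{sensor} {'#' * total} {total}" for sensor, total in totals]
print("\n".join(chart))
```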
Sources of Big Data
Categories:
 from human activities
 from the physical world
 from computers
Examples:
 Internet data (emails, social media, and weblogs), network
data, mobile networks or telecoms, machine-to-machine data
or the IoT (sensor data), online transactions, medical records,
and open data (mostly published by governments).
 The data is mostly unstructured (such as text, audio, video) or
semi-structured (such as emails, tweets, weblogs).
Data Analytics
Step 1: Determine the criteria for grouping the data
Step 2: Collect the data
Step 3: Organize the data
Step 4: Clean the data
Step 5: Analyze the data and derive insights
Big Data Analytics
Big data analytics helps organizations
harness their data and use it to identify
new opportunities. That, in turn, leads
to smarter business moves, more
efficient operations, higher profits and
happier customers.
Big Data Analytics - Techniques
Data Mining / Analytics
Web Mining
Text Mining / Analytics
Predictive Analytics
Visual Analytics
Machine Learning / AI / Deep Learning
Mobile Analytics
Crowdsourcing
Big Data Analytics - Tools
 Hadoop
• For distributed storage of large datasets on computer clusters
• Designed to process large amounts of structured and unstructured data
• Provides large amounts of storage for all sorts of data along with the ability to handle virtually limitless
concurrent tasks
 MapReduce
• Programming model introduced by Google for processing massive amounts of data
• Software framework that enables developers to write programs that can process large amounts of unstructured
data
• It has two components:
 Map, which splits the input data across the nodes of a cluster for parallel processing
 Reduce, which collects all sub-results and combines them into the final result
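The Map and Reduce components can be illustrated with the classic word-count example. This is a single-process sketch of the idea, not Hadoop's API: the function names are hypothetical, and the grouping step in the middle (often called the shuffle) is done by the framework in a real deployment.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of input into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


chunks = ["big data needs big tools", "data tools process data"]
mapped = [map_phase(chunk) for chunk in chunks]   # would run in parallel on a cluster
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread across many machines without changing the logic.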
Big Data Analytics - Tools
 NoSQL
• Used in Big Data applications in clustered environments
• Provides high-speed access to unstructured or semi-structured data
• Provides capabilities to query and retrieve unstructured and semi-structured data
 MongoDB
• For managing data that is frequently changing or unstructured
• Flexible, highly scalable database designed for web applications
• Used to store data in mobile apps, product catalogs, and real-time applications
 Other tools include Hive, Cassandra, Spark, Tableau, Talend, and cloud computing
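The flexibility attributed to document-oriented NoSQL stores can be illustrated without a database server. The sketch below uses plain Python dictionaries to mimic a collection of documents; it is not MongoDB's API, and the `find` helper and product data are invented for the example.

```python
import json

# A "collection" of documents; unlike rows in a table, documents need not share fields.
products = [
    {"_id": 1, "name": "Laptop", "price": 900, "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]},
    {"_id": 3, "name": "E-book", "price": 5},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]


# Adding a new field to one document requires no schema change.
products[2]["format"] = "epub"

cheap = [doc["name"] for doc in products if doc["price"] < 20]
print(find(products, name="Laptop"))
print(cheap)
print(json.dumps(products[2]))
```

In a relational table, the `specs`, `sizes`, and `format` fields would each require a schema change or a separate table; here each document simply carries the fields it needs.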
Big Data Analytics - Applications
Internet of Things (IoT), Smart Grid, Science, Healthcare, Nursing, Business, Industry, Manufacturing, Public Agencies
Big Data Analytics - Benefits
 Leads to better decisions and improved insights and predictions. This can lead to
greater operational efficiency and productivity, and to reduced cost and risk
 Helps reduce the biases people have when making decisions based on limited information
 Allows data analysis to be built into processes, enabling automated decision-making
 Helps reduce product return rates and produce higher-quality products
 Improves the overall profitability of the business
 Helps social media companies and public and private agencies explore people's behavioral patterns
 Can potentially be used to drive economic growth in the developing world
Big Data Challenges
 Complexities
• Processing, storage, and transfer of a large scale of data
• Challenge to filter out the useless information without discarding useful information
 Privacy
• Risk of Data Leakage
• Privacy concerns continue to arise from users who outsource their private data to cloud storage
 Security
• Concerns over the impact that collecting, storing, and processing large amounts of data could have on security
• Security is a concern because of the variety and heterogeneity of Big Data
 Data Migration
• Transferring Big Data for distributed processing and storage
 Shortage of skilled personnel (data scientists)
Appendix
Data Science
Data Science
Data Science

More Related Content

PPTX
Introduction to Data Science.pptx
Vrishit Saraswat
 
PPTX
Introduction to Data Science
Laguna State Polytechnic University
 
PPTX
Data science
SwapnilDahake2
 
PDF
Data science presentation
MSDEVMTL
 
PPTX
Introduction to data science.pptx
SadhanaParameswaran
 
PPTX
introduction to data science
bhavesh lande
 
PPTX
Data science
SouravSadhukhan6
 
PDF
Introduction to data science
Tharushi Ruwandika
 
Introduction to Data Science.pptx
Vrishit Saraswat
 
Introduction to Data Science
Laguna State Polytechnic University
 
Data science
SwapnilDahake2
 
Data science presentation
MSDEVMTL
 
Introduction to data science.pptx
SadhanaParameswaran
 
introduction to data science
bhavesh lande
 
Data science
SouravSadhukhan6
 
Introduction to data science
Tharushi Ruwandika
 

What's hot (20)

PPTX
Mining single dimensional boolean association rules from transactional
ramya marichamy
 
PDF
Tools and techniques for data science
Ajay Ohri
 
PPT
Data mining slides
smj
 
PPTX
1. Data Analytics-introduction
krishna singh
 
PPTX
Introduction to Data Mining
DataminingTools Inc
 
PDF
Data Visualization in Data Science
Maloy Manna, PMP®
 
PPTX
Classification in data mining
Sulman Ahmed
 
PPT
Big data ppt
IDBI Bank Ltd.
 
PPTX
Introduction to Data Analytics
Utkarsh Sharma
 
PPT
01 Data Mining: Concepts and Techniques, 2nd ed.
Institute of Technology Telkom
 
PPTX
Introduction to data science
Sampath Kumar
 
PPTX
Introduction to Data Science
Srishti44
 
PPTX
Introduction of Data Science
Jason Geng
 
PPTX
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
Simplilearn
 
PPTX
Data mining
sayalipatil528
 
PPTX
Introduction to data analytics
Umasree Raghunath
 
PDF
Data mining & data warehousing (ppt)
Harish Chand
 
PPTX
Introduction to ML (Machine Learning)
SwatiTripathi44
 
PPTX
Web Mining & Text Mining
Hemant Sharma
 
PPTX
OLAP & DATA WAREHOUSE
Zalpa Rathod
 
Mining single dimensional boolean association rules from transactional
ramya marichamy
 
Tools and techniques for data science
Ajay Ohri
 
Data mining slides
smj
 
1. Data Analytics-introduction
krishna singh
 
Introduction to Data Mining
DataminingTools Inc
 
Data Visualization in Data Science
Maloy Manna, PMP®
 
Classification in data mining
Sulman Ahmed
 
Big data ppt
IDBI Bank Ltd.
 
Introduction to Data Analytics
Utkarsh Sharma
 
01 Data Mining: Concepts and Techniques, 2nd ed.
Institute of Technology Telkom
 
Introduction to data science
Sampath Kumar
 
Introduction to Data Science
Srishti44
 
Introduction of Data Science
Jason Geng
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
Simplilearn
 
Data mining
sayalipatil528
 
Introduction to data analytics
Umasree Raghunath
 
Data mining & data warehousing (ppt)
Harish Chand
 
Introduction to ML (Machine Learning)
SwatiTripathi44
 
Web Mining & Text Mining
Hemant Sharma
 
OLAP & DATA WAREHOUSE
Zalpa Rathod
 
Ad

Similar to Data Science (20)

PDF
Ch_2.pdf
DawitBirhanu13
 
PDF
Ch~2.pdf
andualemtemesgen3
 
PDF
chapter 2 Data Science.pdf emerging ecnology freshman course
tamratgintamo
 
PPTX
Chapter Two - Overview o g yuyjkgftdrrgty yufguif Data Science.pptx
TemesgenAsmamaw4
 
PDF
Big data analytics with Apache Hadoop
Suman Saurabh
 
PPTX
Big Data, NoSQL, NewSQL & The Future of Data Management
Tony Bain
 
PDF
CS3352-Foundations of Data Science Notes.pdf
Builders Engineering College
 
DOCX
Introduction to big data – convergences.
saranya270513
 
DOCX
Big data lecture notes
Mohit Saini
 
PPTX
Emerging Technology Chapter 2 Data Science
SolomonEndalu
 
PDF
Big Data Analytics M1.pdf big data analytics
nithishlkumar9194
 
PDF
Introduction to Big Data Analytics Unit 1 .pdf
MadhumithaN28
 
PPTX
Big data Analytics Fundamentals Chapter 1
karpagavalli38
 
PPTX
Bigdata Hadoop introduction
Sunitha Mutchintala
 
PPTX
Unit 1 Introduction to Data Analytics .pptx
vipulkondekar
 
PPSX
Intro to Data Science Big Data
Indu Khemchandani
 
PPTX
unit1 big data analysis description and defenition .pptx
abikishor767
 
PPTX
Data analytics introduction
amiyadash
 
PPTX
U - 2 Emerging.pptx
MulukenTamrat2
 
PPTX
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
YashiBatra1
 
Ch_2.pdf
DawitBirhanu13
 
chapter 2 Data Science.pdf emerging ecnology freshman course
tamratgintamo
 
Chapter Two - Overview o g yuyjkgftdrrgty yufguif Data Science.pptx
TemesgenAsmamaw4
 
Big data analytics with Apache Hadoop
Suman Saurabh
 
Big Data, NoSQL, NewSQL & The Future of Data Management
Tony Bain
 
CS3352-Foundations of Data Science Notes.pdf
Builders Engineering College
 
Introduction to big data – convergences.
saranya270513
 
Big data lecture notes
Mohit Saini
 
Emerging Technology Chapter 2 Data Science
SolomonEndalu
 
Big Data Analytics M1.pdf big data analytics
nithishlkumar9194
 
Introduction to Big Data Analytics Unit 1 .pdf
MadhumithaN28
 
Big data Analytics Fundamentals Chapter 1
karpagavalli38
 
Bigdata Hadoop introduction
Sunitha Mutchintala
 
Unit 1 Introduction to Data Analytics .pptx
vipulkondekar
 
Intro to Data Science Big Data
Indu Khemchandani
 
unit1 big data analysis description and defenition .pptx
abikishor767
 
Data analytics introduction
amiyadash
 
U - 2 Emerging.pptx
MulukenTamrat2
 
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
YashiBatra1
 
Ad

More from Prakhyath Rai (17)

PPTX
Software Engineering and Project Management - Activity Planning
Prakhyath Rai
 
PPTX
Software Engineering and Project Management - Introduction to Project Management
Prakhyath Rai
 
PPTX
Software Engineering and Project Management - Software Testing + Agile Method...
Prakhyath Rai
 
PPTX
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 
PPTX
Software Engineering - Introduction + Process Models + Requirements Engineering
Prakhyath Rai
 
PPTX
Ethics, Professionalism and Other Emerging Technologies
Prakhyath Rai
 
PPTX
Internet of Things (IoT)
Prakhyath Rai
 
PPTX
Artificial Intelligence
Prakhyath Rai
 
PPTX
Emerging Exponential Technologies - History & Introduction
Prakhyath Rai
 
PPTX
Preparation of Project
Prakhyath Rai
 
PPTX
Small Scale Industry
Prakhyath Rai
 
PPTX
Entrepreneurship
Prakhyath Rai
 
PPTX
Directing and Controlling
Prakhyath Rai
 
PPTX
Planning
Prakhyath Rai
 
PPTX
Introduction to Management
Prakhyath Rai
 
PPTX
Text MIning
Prakhyath Rai
 
PPTX
Text Mining Framework
Prakhyath Rai
 
Software Engineering and Project Management - Activity Planning
Prakhyath Rai
 
Software Engineering and Project Management - Introduction to Project Management
Prakhyath Rai
 
Software Engineering and Project Management - Software Testing + Agile Method...
Prakhyath Rai
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 
Software Engineering - Introduction + Process Models + Requirements Engineering
Prakhyath Rai
 
Ethics, Professionalism and Other Emerging Technologies
Prakhyath Rai
 
Internet of Things (IoT)
Prakhyath Rai
 
Artificial Intelligence
Prakhyath Rai
 
Emerging Exponential Technologies - History & Introduction
Prakhyath Rai
 
Preparation of Project
Prakhyath Rai
 
Small Scale Industry
Prakhyath Rai
 
Entrepreneurship
Prakhyath Rai
 
Directing and Controlling
Prakhyath Rai
 
Planning
Prakhyath Rai
 
Introduction to Management
Prakhyath Rai
 
Text MIning
Prakhyath Rai
 
Text Mining Framework
Prakhyath Rai
 

Recently uploaded (20)

DOCX
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
PPTX
Python-Application-in-Drug-Design by R D Jawarkar.pptx
Rahul Jawarkar
 
PPTX
Artificial Intelligence in Gastroentrology: Advancements and Future Presprec...
AyanHossain
 
PDF
The Minister of Tourism, Culture and Creative Arts, Abla Dzifa Gomashie has e...
nservice241
 
PPTX
Continental Accounting in Odoo 18 - Odoo Slides
Celine George
 
DOCX
Unit 5: Speech-language and swallowing disorders
JELLA VISHNU DURGA PRASAD
 
PDF
Health-The-Ultimate-Treasure (1).pdf/8th class science curiosity /samyans edu...
Sandeep Swamy
 
PPTX
CARE OF UNCONSCIOUS PATIENTS .pptx
AneetaSharma15
 
PDF
Virat Kohli- the Pride of Indian cricket
kushpar147
 
PDF
Antianginal agents, Definition, Classification, MOA.pdf
Prerana Jadhav
 
PPTX
Basics and rules of probability with real-life uses
ravatkaran694
 
PPTX
Gupta Art & Architecture Temple and Sculptures.pptx
Virag Sontakke
 
PPTX
Introduction to pediatric nursing in 5th Sem..pptx
AneetaSharma15
 
PPTX
Command Palatte in Odoo 18.1 Spreadsheet - Odoo Slides
Celine George
 
PPTX
Measures_of_location_-_Averages_and__percentiles_by_DR SURYA K.pptx
Surya Ganesh
 
PPTX
Virus sequence retrieval from NCBI database
yamunaK13
 
PDF
Review of Related Literature & Studies.pdf
Thelma Villaflores
 
PPTX
Care of patients with elImination deviation.pptx
AneetaSharma15
 
PPTX
Cleaning Validation Ppt Pharmaceutical validation
Ms. Ashatai Patil
 
PPTX
Information Texts_Infographic on Forgetting Curve.pptx
Tata Sevilla
 
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
Python-Application-in-Drug-Design by R D Jawarkar.pptx
Rahul Jawarkar
 
Artificial Intelligence in Gastroentrology: Advancements and Future Presprec...
AyanHossain
 
The Minister of Tourism, Culture and Creative Arts, Abla Dzifa Gomashie has e...
nservice241
 
Continental Accounting in Odoo 18 - Odoo Slides
Celine George
 
Unit 5: Speech-language and swallowing disorders
JELLA VISHNU DURGA PRASAD
 
Health-The-Ultimate-Treasure (1).pdf/8th class science curiosity /samyans edu...
Sandeep Swamy
 
CARE OF UNCONSCIOUS PATIENTS .pptx
AneetaSharma15
 
Virat Kohli- the Pride of Indian cricket
kushpar147
 
Antianginal agents, Definition, Classification, MOA.pdf
Prerana Jadhav
 
Basics and rules of probability with real-life uses
ravatkaran694
 
Gupta Art & Architecture Temple and Sculptures.pptx
Virag Sontakke
 
Introduction to pediatric nursing in 5th Sem..pptx
AneetaSharma15
 
Command Palatte in Odoo 18.1 Spreadsheet - Odoo Slides
Celine George
 
Measures_of_location_-_Averages_and__percentiles_by_DR SURYA K.pptx
Surya Ganesh
 
Virus sequence retrieval from NCBI database
yamunaK13
 
Review of Related Literature & Studies.pdf
Thelma Villaflores
 
Care of patients with elImination deviation.pptx
AneetaSharma15
 
Cleaning Validation Ppt Pharmaceutical validation
Ms. Ashatai Patil
 
Information Texts_Infographic on Forgetting Curve.pptx
Tata Sevilla
 

Data Science

  • 2. Data Science  Overview of Data Science  Definition of Data and Information  Data Types and Representation  Data Value Chain  Data Acquisition  Data Analysis  Data Curating  Data Storage  Data Usage  Basic Concepts of Big Data
  • 3. Overview of Data Science  Data science is the practice of mining large data sets of raw data, both structured and unstructured, to identify patterns and extract actionable insight from them.  Data Science deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions.  Data Science is a blend of various fields like Probability, Statistics, Programming, Analysis, Cloud Computing, etc.;  Data Science is the extraction of actionable insights from raw data.
  • 4. Data Information Data  Raw facts, figures and statistics  No contextual meaning  Data can be in characters, numbers, images, words Information  Processed / Organized Data  Exact meaning and organized context  Organized and presented in context – Value added to data Context + Processing
  • 5. 100 100 Miles Difficult to walk 100 Miles but Vehicle transport is okay 100 Miles is a Far Distance
  • 6. Measure of Data in Files – File Size Name Equal To Size(In Bytes) Bit 1 Bit 1/8 Nibble 4 Bits 1/2 (rare) Byte 8 Bits 1 Kilobyte 1024 Bytes 1024 Megabyte 1, 024 Kilobytes 1, 048, 576 Gigabyte 1, 024 Megabytes 1, 073, 741, 824 Terrabyte 1, 024 Gigabytes 1, 099, 511, 627, 776 Petabyte 1, 024 Terabytes 1, 125, 899, 906, 842, 624 Exabyte 1, 024 Petabytes 1, 152, 921, 504, 606, 846, 976 Zettabyte 1, 024 Exabytes 1, 180, 591, 620, 717, 411, 303, 424 Yottabyte 1, 024 Zettabytes 1, 208, 925, 819, 614, 629, 174, 706, 176
  • 8. Types of Data and it’s Representation Structured Data Semi-Structured Data Unstructured Data  Predefined data models  Stored in Rows and Columns  Examples: Dates, Phone Number, Names  No predefined data models  Stored in various forms – image, audio, video, text  Examples: Documents, Image Files, Emails & Messages  Loosely organized into categories using meta tags  Stored in abstract and figures – HTML, XML, JSON  Examples: Server Logs, Tweets organized by Hashtags
  • 11. Data Science  Data science enables businesses to Process huge amounts of structured and unstructured Big Data to detect patterns  Alexa or Siri for a recommendation demands data science  Operating a self-driving car  Search Engine  Chatbot for customer service
  • 12. Data Science Pre-Requisites Machine Learning Modeling Statistics Programming Databases
  • 13. Data Science Lifecycle  Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction - Gathering raw structured and unstructured data  Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture - Taking the raw data and putting it in a form that can be used  Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization - Data scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in predictive analysis  Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining, Qualitative Analysis - - Performing the various analyses on the data  Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making - Analysts prepare the analyses in easily readable forms such as charts, graphs, and reports
  • 14. Data Science Applications  Healthcare Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases.  Gaming Video and computer games are now being created with the help of data science and that has taken the gaming experience to the next level.  Image Recognition Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.  Recommendation Systems Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase, or browse on their platforms.  Logistics Data Science is used by logistics companies to optimize routes to ensure faster delivery of products and increase operational efficiency.  Fraud Detection Banking and financial institutions use data science and related algorithms to detect fraudulent transactions.
  • 15. Data Value Chain Data Value Chain - The evolution of data from collection to analysis, dissemination, and the final impact of data on decision making
  • 16. Data Value Chain  Data Capture & Acquisition Collection of raw data from both internal and external sources. The first phase of data collection involves identifying what data to collect and then establishing a process to do so (i.e., conducting a survey or retrieving automated IoT data). Decisions made here will affect the quality and usability of data throughout its life-cycle  Data Processing & Cleansing Cleaning data - identifying and correcting corrupt, inaccurate, or irrelevant data - as well as converting raw data into a format that is usable, integratable and machine readable.  Data Curation, Integration and Enrichment Data curation and integration refers to the collection of processes required to merge data from multiple sources into one, cohesive dataset. During this process, data is also enriched, meaning that contextual metadata (the data that makes larger datasets discoverable) is added or updated.  Data Analysis Data is analyzed and used to uncover trends, patterns and other insights that can enhance decision making.  Data ROI & Monetization The application of data analytics processes to solve real-world problems and, in a business setting, increase revenue.
  • 17. Big Data Value Chain Data Acquisition Data Analysis Data Curating Data Storage Data Usage
  • 18. Big Data Value Chain – Data Acquisition Process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out. Data acquisition is one of the major big data challenges in terms of infrastructure requirements. The infrastructure required to support the acquisition of big data must deliver low, predictable latency in both capturing data and in executing queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.
  • 19. Big Data Value Chain – Data Analysis Concerned with making the raw data acquired amenable to use in decision-making as well as domain-specific usage. Data analysis involves exploring, transforming and modelling data with the goal of highlighting relevant data, synthesizing and extracting useful hidden information with high potential from a business point of view.
  • 20. Big Data Value Chain – Data Curation Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation. Data curation is responsible for improving the accessibility and quality of data, ensuring that data are trustworthy, discoverable, accessible, reusable, and fit their purpose.
  • 21. Big Data Value Chain – Data Storage Data Storage is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data. Relational Database Management Systems (RDBMS) are majorly used. NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
  • 22. Big Data Value Chain – Data Usage Covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity. Data usage in business decision-making can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
  • 23. Project Phase (diagram): Discover / Acquisition, Prepare, Plan, Model, Operationalize, Communicate Results
  • 24. Basic Concepts of Big Data  Big Data: High Degrees of Dimensions – Volume, Variety, Velocity, Value, Veracity • Volume - Amount of data that is being generated • Velocity - Speed at which the data is being generated • Variety - Diversity of data types • Value - Worth of the data • Veracity - Quality, accuracy, or trustworthiness of the data
  • 25. Big Data – Impact of the 3 Vs Volume (Amount of Data): Dealing with large scales of data within data processing (e.g., Global Supply Chains, Global Financial Analysis, Large Hadron Collider). Velocity (Speed of Data): Dealing with high-frequency streams of incoming real-time data (e.g., Sensors, Pervasive Environments, Electronic Trading, Internet of Things). Variety (Range of Data Types/Sources): Dealing with data using differing syntactic formats (e.g., Spreadsheets, XML, DBMS), schemas, and meanings (e.g., Enterprise Data Integration).
  • 26. Big Data Processing The general categories of activities involved with big data processing are:  Ingesting data into the system  Persisting the data in storage  Computing and Analyzing data  Visualizing the results
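The four activities above (ingest, persist, compute, visualize) can be sketched end to end on a toy scale. This is only an illustration under stated assumptions: the event data is invented, Python's built-in `sqlite3` module with an in-memory database stands in for a real storage layer, and a text bar chart stands in for a visualization tool.

```python
import sqlite3

# Hypothetical (source, size) events; invented for illustration.
EVENTS = [("web", 120), ("mobile", 80), ("web", 60), ("iot", 200)]


def ingest(events):
    """Ingest: pull raw events into the system (here, just materialize them)."""
    return list(events)


def persist(conn, rows):
    """Persist: write the ingested rows to storage."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (source TEXT, size INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()


def compute(conn):
    """Compute/analyze: total size per source."""
    cur = conn.execute(
        "SELECT source, SUM(size) FROM events GROUP BY source ORDER BY source"
    )
    return dict(cur.fetchall())


def visualize(totals, scale=20):
    """Visualize: render one text bar per source ('#' per `scale` units)."""
    return "\n".join(
        f"{source:<7}{'#' * (total // scale)} {total}"
        for source, total in totals.items()
    )


conn = sqlite3.connect(":memory:")
persist(conn, ingest(EVENTS))
totals = compute(conn)
chart = visualize(totals)
print(chart)
```

Keeping each activity in its own function mirrors the slide's categories; in a real system each stage would typically be a separate, independently scalable component.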
  • 27. Sources of Big Data Categories:  from human activities  from the physical world  from computers Example:  Internet data (emails, social media, and weblogs), network data, mobile networks or telecoms, machine-to-machine data or the IoT (sensor data), online transactions, medical records, and open data (mostly by governments).  Unstructured (such as text, audio, video) or semi-structured (such as emails, tweets, weblogs).
  • 28. Data Analytics Step 1: Determine the criteria for grouping the data Step 2: Collect the data Step 3: Organize the data Step 4: Clean the data Step 5: Analyze the data and derive insights
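As a rough illustration of these five steps, the sketch below groups hypothetical sales rows by region and reports an average per group. The data, the `group_key` criterion, and the choice of the mean as the "insight" are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


# Step 1: determine the grouping criterion (assumed here: normalized region).
def group_key(row):
    return row["region"].strip().lower()


# Step 2: collect the data (hypothetical rows, values still raw strings).
collected = [
    {"region": "North", "sales": "100"},
    {"region": "north ", "sales": "140"},
    {"region": "South", "sales": ""},    # missing value
    {"region": "South", "sales": "90"},
]

# Step 3: organize the rows into groups.
groups = defaultdict(list)
for row in collected:
    groups[group_key(row)].append(row["sales"])

# Step 4: clean the data by dropping empty values and converting to numbers.
cleaned = {
    region: [float(v) for v in values if v.strip()]
    for region, values in groups.items()
}

# Step 5: analyze and derive an insight (average sales per region).
insights = {region: mean(values) for region, values in cleaned.items() if values}
print(insights)
```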
  • 29. Big Data Analytics Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.
  • 30. Big Data Analytics - Techniques Data Mining / Analytics; Web Mining; Text Mining / Analytics; Predictive Analytics; Visual Analytics; Machine Learning / AI / Deep Learning; Mobile Analytics; Crowdsourcing
  • 31. Big Data Analytics - Tools  Hadoop • For distributed storage of large datasets on computer clusters • Designed to process large amounts of structured and unstructured data • Provides large amounts of storage for all sorts of data along with the ability to handle virtually limitless concurrent tasks  MapReduce • Programming model introduced by Google for processing massive amounts of data • Software framework that enables developers to write programs that can process large amounts of unstructured data • It has two components:  Map, which distributes the input data across several nodes of a cluster for parallel processing  Reduce, which collects all sub-results to produce the final result
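The Map and Reduce components can be illustrated with the classic word-count example. The sketch below is a single-process simulation in plain Python, not Hadoop's API: the function names are hypothetical, and the intermediate `shuffle` step (grouping values by key between Map and Reduce) is part of the standard model but is not mentioned on the slide.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the sub-results for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document (split) would be mapped on a different
    # node in parallel; here the mappers simply run one after another.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big insight", "data value chain"])
print(counts)
```

Because each call to `map_phase` depends only on its own input split, the map work can be spread across machines, which is the property that lets the model scale.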
  • 32. Big Data Analytics - Tools  NoSQL • Used in Big Data applications in clustered environments • Provides high-speed access to unstructured or semi-structured data • Provides capabilities to query and retrieve unstructured and semi-structured data  MongoDB • For managing data that is frequently changing or unstructured • Flexible, highly scalable database designed for web applications • Used to store data in mobile apps, product catalogs, and real-time applications  Other tools include Hive, Cassandra, Spark, Tableau, Talend, and cloud computing
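To make the idea of schema-flexible, document-style storage more concrete, here is a small sketch that imitates a document collection with plain Python dictionaries. It does not use MongoDB or any NoSQL driver; the `catalog` data and the `find` helper are hypothetical and only show how documents in one collection may carry different fields and still be queried.

```python
# Hypothetical product catalog: each document has its own set of fields.
catalog = [
    {"_id": 1, "name": "Phone", "price": 300, "tags": ["mobile"]},
    {"_id": 2, "name": "Laptop", "price": 900, "specs": {"ram_gb": 16}},
    {"_id": 3, "name": "Case", "price": 15},
]


def find(collection, criteria):
    """Return documents whose fields equal every key/value in `criteria`."""
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


laptops = find(catalog, {"name": "Laptop"})
# Documents without a given field are simply skipped, no schema change needed.
with_specs = [doc["_id"] for doc in catalog if "specs" in doc]
cheap = [doc["name"] for doc in catalog if doc["price"] < 100]
print(laptops, with_specs, cheap)
```

A relational table would need a fixed set of columns for `tags` and `specs`; here each document simply includes the fields it has.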
  • 33. Big Data Analytics - Applications Internet of Things (IoT) Smart Grid Science Healthcare Nursing Business Industry Manufacturing Public Agencies
  • 34. Big Data Analytics - Benefits  Leads to better decisions and improved insights and predictions. This can lead to greater operational efficiency and productivity, and to reduced cost and risk  Reduces the biases people have when making decisions based on limited information  Allows data analysis to be built into processes, enabling automated decision-making  Helps reduce rates of return and produce high-quality products  Improves the overall profitability of the business  Helps social media, public, and private agencies explore behavioral patterns of people  Can potentially be used to drive economic growth in the developing world
  • 35. Big Data Challenges  Complexities • Processing, storage, and transfer of data at a large scale • Challenge of filtering out useless information without discarding useful information  Privacy • Risk of data leakage • Privacy concerns continue to arise from users who outsource their private data to cloud storage  Security • Concerns over the impact that collecting, storing, and processing large amounts of data could have on security • Security is a concern because of the variety and heterogeneity of Big Data  Data Migration • Transferring Big Data for distributed processing and storage  Shortage of human resources (data scientists)
  • 36. References  http://www.dataeconomy.eu/data-value-chain/#page-content  https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_4  https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_5  https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_6  https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_7  https://rd-springer-com.eu1.proxy.openathens.net/chapter/10.1007/978-3-319-21569-3_8  https://www.ibm.com/cloud/blog/structured-vs-unstructured-data