BECKER COLLEGE
Introduction to Big Data and
Data Science
Prof Feyzi R. Bagirov
Becker College
Agenda
• What is Big Data?
• What is Data Science?
• Who are Data Scientists?
• What do Data Scientists do?
• What are the job prospects for Data Scientists?
• How happy are Data Scientists with their jobs?
• Becker’s BS in Data Science
• Becker’s Big Data Analytics concentration
What is Big Data?
How much data do we use?
• Every day, people send 150 billion new email messages
• Every 4 minutes, a terabyte of data (72 hours of video) is uploaded to YouTube
• Facebook’s databases ingest 500 terabytes of new data per day
• The CERN Large Hadron Collider generates 1 petabyte per second
• Sensors from a Boeing 787 jet create 40 terabytes of data per hour
• An Oil & Gas off-shore rig operation generates 8 terabytes a day
• A self-driving car generates 1 gigabyte per second
• General Electric gas turbines generate 500 gigabytes per day
• The proposed Square Kilometer Array telescope will generate an exabyte of data
per day
• 90% of the data in the world today has been created in the last two years alone
• 80% of data captured today is unstructured
4,000,000,000,000,000,000,000 bytes
Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta
How much data do we use?
According to IBM, 90% of the data in the world today was created in the last 2
years alone.
“Big Data: Getting Ready For The 2013 Big Bang”, Forbes Magazine, May 1, 2013
In 2013, the world will produce 4 zettabytes (4,000,000,000,000,000,000,000 bytes, or 4 million petabytes) of new data.
Gartner, 2013
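The unit arithmetic behind the 4 zettabyte figure can be checked with a short sketch. It assumes the decimal (SI) convention, where each prefix is a factor of 1,000; the helper names `bytes_per` and `convert` are illustrative and not part of the original slides.

```python
# SI (decimal) prefixes in ascending order: each step is a factor of 1,000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta"]


def bytes_per(prefix: str) -> int:
    """Return the number of bytes in one <prefix>byte (decimal convention)."""
    return 1000 ** (PREFIXES.index(prefix) + 1)


def convert(amount: float, src: str, dst: str) -> float:
    """Convert an amount of data from one decimal unit to another."""
    return amount * bytes_per(src) / bytes_per(dst)


four_zb_in_bytes = 4 * bytes_per("zetta")
four_zb_in_pb = convert(4, "zetta", "peta")

print(f"{four_zb_in_bytes:,} bytes")
print(f"{four_zb_in_pb:,.0f} petabytes")
```

Under the binary convention (factors of 1,024) the numbers would differ slightly, so the choice of convention is worth stating whenever such figures are quoted.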
Definition of Big Data
• Big Data – tools that process and analyze
complex data at speeds and scales that were
previously not cost-effective.
History of Big Data
• 18,000 BCE – Humans use tally sticks to record data for the first time, to track trading activity and record inventory
• 2,400 BCE – The abacus is developed and the first libraries are built in Babylonia
• 300 BCE – The Library of Alexandria is the world’s largest storage center
• 100-200 BCE – Antikythera, the first mechanical computer, is developed in Greece
• 1663 – John Graunt conducts the first statistical analysis experiments to curb the spread of bubonic plague in Europe
• 1865 – The term “Business Intelligence” is first used
• 1928 – Fritz Pfleumer creates a method of storing data magnetically, which forms the basis of modern digital data storage
• 1965 – The US government plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape
• 1969 – Birth of the ARPANET, which later led to the creation of the Internet (October 29, 1969, 22:30)
• 1970 – Relational database model developed by IBM mathematician Edgar F. Codd. Everyone, not just computer scientists, can use databases.
• 1989 – Early use of the term Big Data in a magazine article by Erik Larson
• 1991 – Birth of the WWW. Anyone can upload their own data
History of Big Data
• 1996 – The price of digital storage makes it more cost-effective than paper
• 1997 – Google launches the world’s most popular search engine
• 1997 – First use of the term Big Data in an academic paper
• 2001 – The 3 Vs of Big Data (Volume, Velocity and Variety) are defined by Doug Laney
• 2005 – Hadoop, an open source Big Data framework, is developed
• 2009 – The average US company with over 1,000 employees is storing more than 200 TB of data, according to the McKinsey Global Institute
• 2010 – Every two days, as much data is being created as was created from the beginning of human civilization to the year 2003 (Eric Schmidt, Google)
• 2011 – By 2018, the US will face a shortfall of 140,000-190,000 data scientists (McKinsey)
• 2014 – Mobile internet use overtakes desktop for the first time
• 2015 – The Internet of Things is being adopted by industries
• 2020 – Some 30 billion objects may be connected to the Internet of Things
4 V’s of Big Data
• Volume – a Terabyte? a Petabyte? More?...
• Variety – a Web Log? A Twitter feed? A
YouTube video?
• Velocity – New data comes every hour?
Minute? Second?
• Veracity – how much do I trust this data?
40%? 100%? 0%?
History of Big Data
IBM delivers an HDD, weighing
over a ton, storing 5 MB of data
(September 1956)
How Big is Big?
4,000,000,000,000,000,000,000 bytes
Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta
Unstructured vs Structured
Unstructured Data
• Refers to information that does not have a
pre-defined data model or is not organized in
a pre-defined manner.
• Examples: social network feeds, customer
reviews or comments, YouTube videos, etc.
Structured Data
• Refers to information that has a
pre-defined data model and is organized in
a pre-defined manner.
• Examples: rows in a relational database table, spreadsheet records.
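The difference between the two kinds of data can be illustrated with a small sketch. The order records and the review text below are made-up sample data, not taken from the slides. With a schema, a question such as "what is the total order value?" is a one-line lookup by column name. Without one, even a simple question needs ad hoc text processing.

```python
import csv
import io

# Structured: every row follows a pre-defined schema (fixed, named columns).
structured_csv = (
    "order_id,customer,amount_usd\n"
    "1001,alice,19.99\n"
    "1002,bob,5.50\n"
)
rows = list(csv.DictReader(io.StringIO(structured_csv)))
total = sum(float(row["amount_usd"]) for row in rows)

# Unstructured: free text with no schema. Here a naive keyword count
# stands in for the kind of processing such data requires.
review = "Great pizza, fast delivery. Would order again, great value!"
great_mentions = review.lower().count("great")

print(f"total order value: {total:.2f}")
print(f"mentions of 'great': {great_mentions}")
```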
Structured or Unstructured?
What is Data Science?
*http://en.wikipedia.org/wiki/Data_science
• 1960 – The term "data science" (originally used interchangeably with
"datalogy") has existed for over thirty years and was used initially as a
substitute for computer science by Peter Naur in 1960.
• 2002 – The International Council for Science's Committee on Data for
Science and Technology started the Data Science Journal
• 2004 – Usama Fayyad became the first CDO at Yahoo.
• 2008 – DJ Patil and Jeff Hammerbacher coined the term “data scientist”
to define their jobs at LinkedIn and Facebook, respectively
What is Data Science?
Source: http://www.datasciencecentral.com/profiles/blogs/data-scientist-core-skills
Math & Statistics
• Discrete
• Finite
• Linear Algebra
• Multivariate
Computer Science
• Programming
• Business Intelligence
Soft Skills
• Oral Communications
• Creativity
• Project Management
• Team play
• Presentation
What’s in the name?
Data Science vs Data Analytics vs …
• Business Intelligence – covers data analysis and relies heavily on aggregation, focusing on business information
• Statistics – the study of collection, analysis, interpretation, presentation and organization of data.
• Data Mining – a technique that focuses on modeling and knowledge discovery for predictive rather than prescriptive
purposes
• Data Analytics – a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful
information, suggesting conclusions, and supporting decision-making.
• Business Analytics - practices for continuous iterative exploration and investigation of past business performance to gain
insight and drive business planning
– Descriptive Analytics – analyzes the past performance and understands that performance by mining historical data to look for the
reasons behind past success or failure
– Predictive Analytics - encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that
analyze current and historical facts to make predictions about future or otherwise unknown events.
– Prescriptive Analytics - automatically synthesizes big data, multiple disciplines of mathematical sciences and computational sciences,
and business rules, to make predictions and then suggests decision options to take advantage of the predictions.
• Data Science – an interdisciplinary field about processes and systems to extract knowledge or insights from data in various
forms, either structured or unstructured, which is a continuation of some of the data analysis fields, such as statistics, data
mining, and predictive analytics.
• https://en.wikipedia.org/wiki/Data_science
• https://en.wikipedia.org/wiki/Data_analysis
Who are Data Scientists?
What Do Data Scientists Do?
In a nutshell a data scientist creates data products. This can mean a lot of
things but we can generalize as having the ability to create interfaces for
people and machines that use data of any kind.
Responsibilities vary a lot. It can be running experiments, creating
interfaces using machine learning, providing insights from complex
datasets.
Data scientists work with hypotheses. For instance the experiments we run
at Minclip are becoming full fledged randomised controlled trials but I think
that is the most similar case. I believe the term scientist appeared when
data itself became a field of study. The way machine learning treats data is
highly empirical. The process of improving and validating a model, while
not using the traditional statistical methods of scientific research, is
nevertheless highly empirical, skeptical and pragmatic. Sometimes more than
some papers that are published.
• Quora http://qr.ae/RUWYc8
What Do Data Scientists Do?
• “There are multiple communities of data scientists throughout the
Amazon offices which are easily approachable”
• “They mostly work on the vertical like ad space optimization or marketing.
People have in depth understanding of domain and some of the best
minds in the industry”
• “There is a Data Science Toolkit, which contains almost every kind of tool
for Data Scientists… Biggest data warehouse (Datanet) to play with,
Extended internal wiki of almost every possible topic in the universe of
Data; mentorship of data science wizards”
– Quora, http://qr.ae/RUPSv4
What Do Data Scientists Do?
• Netflix Prize – was an open competition for
the best collaborative filtering algorithm to
predict user rating for films, based on previous
ratings without any other information about
users or films.
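As a rough illustration of what "collaborative filtering" means here, the sketch below predicts a missing rating from other users' ratings alone, using no information about the users or films themselves. It is a minimal user-based variant with cosine similarity. The ratings table and function names are hypothetical, and it is not the method used by any Netflix Prize entrant.

```python
from math import sqrt

# Hypothetical ratings: user -> {film: rating on a 1-5 scale}.
ratings = {
    "ann": {"film_a": 5, "film_b": 4, "film_c": 1},
    "ben": {"film_a": 4, "film_b": 5, "film_c": 2, "film_d": 5},
    "cara": {"film_a": 1, "film_b": 2, "film_c": 5, "film_d": 1},
}


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity computed over the films both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[f] * v[f] for f in shared)
    norm_u = sqrt(sum(u[f] ** 2 for f in shared))
    norm_v = sqrt(sum(v[f] ** 2 for f in shared))
    return dot / (norm_u * norm_v)


def predict(user: str, film: str) -> float:
    """Similarity-weighted average of other users' ratings for the film."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or film not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        num += sim * their_ratings[film]
        den += sim
    return num / den if den else 0.0


prediction = predict("ann", "film_d")
print(f"predicted rating for ann on film_d: {prediction:.2f}")
```

Because ann's past ratings resemble ben's far more than cara's, the prediction is pulled toward ben's rating for the unseen film.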
What Do Data Scientists Do?
On 9/21/2009, the $1 million prize was awarded to the BellKor’s Pragmatic Chaos team, which improved
prediction accuracy by 10.06%
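An improvement figure like the 10.06% above is a relative reduction in prediction error; the Netflix Prize scored entries by root mean squared error (RMSE). The sketch below shows the calculation on made-up ratings. The numbers are illustrative only and are not the competition's data.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared / len(actual))


def relative_improvement(baseline_error, new_error):
    """Fractional reduction in error relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


actual = [4, 3, 5, 2]
baseline = [3, 3, 3, 3]  # hypothetical baseline: always predict 3
improved = [4, 3, 4, 3]  # hypothetical improved predictions

baseline_rmse = rmse(baseline, actual)
improved_rmse = rmse(improved, actual)
gain = relative_improvement(baseline_rmse, improved_rmse)

print(f"baseline RMSE {baseline_rmse:.4f}, improved RMSE {improved_rmse:.4f}")
print(f"relative improvement {gain:.1%}")
```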
What Do Data Scientists Do?
• We work on core ML, on computer vision, on computational photography and on language
technologies.
• In computer vision we have a system that processes every single image and video uploaded
to Facebook, totaling well over 1B items per day. We predict the content of an image for
example in order to generate captions for the blind, or to automatically detect and take down
offensive content, improve media search results, automate visual captcha among many other
use cases.
• In language technology, one thing we are trying to do is eliminate language barriers on
Facebook. In order to do this we translate over 2B posts every single day, with over 1800
language directions representing more than 40 unique languages.
• In core ML, we focus on researching and shipping large scale and realtime ML/AI algorithms
for some of the biggest ML applications in the world. Whenever a user logs into Facebook,
these models are used to rank news feed stories (1B users every day, 1.5K stories per user
per day on average), ads, search results (1B+ queries a day), trending news, friend
recommendations and even rank notifications that a user receives, or rank the comments on
a post.
– Quora (http://qr.ae/RZ3JBx)
What Do Data Scientists Do?
• There are multiple analytics teams at Facebook
• The team of Data Scientists working on Ads is probably the largest and most centralized
analytics team at Facebook
• Our goal is to come up with data backed insights which will result in informing the product
road-map or move key metrics that our product teams track. We sometimes also build
infrastructure (less common in my world) that are used by other Data Scientists and
engineers. We work in close concert with Engineering and Product and we often wear
Engineering or Product management hats in addition to our Data Scientist responsibilities.
We spend our time in:
– Analyzing and designing experiments to optimize product features or move key metrics
– Data mining/analysis to come up with business opportunities to pursue or product
feature suggestions or sometimes to understand metric movements.
– Building production ML models (though this is mostly done by SW Engineering)
• The multidisciplinary nature of the role, access to one of the largest troves of data, brilliant
colleagues and ability to create a huge impact in a very short time period make this an
exciting job.
– Quora (http://qr.ae/RUPJbx)
What Do Data Scientists Do?
• Predicting the past – let's say you want to determine the gender of Jason Lemkin.
If you are a human, that's easy (hint: he's a man). If you are a computer, it is more
difficult. But you might have a large dataset of genders and first names and see
that 99% of Jasons are men so your algorithm says he is a man. This would be
much more difficult with me ("Auren" is a more gender neutral name) and so you
might not be confident enough to make a gender pronouncement and thus might
need more data (like doing natural language processing on articles about me that
refer to me as "he" and "him").
• Predicting the future – figuring out what posts should be shown to the right
person.
– Quora: http://qr.ae/RUgn33
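The "predicting the past" example can be sketched as a lookup with a confidence threshold: answer only when the reference data is lopsided enough, and otherwise report that more data is needed. The name counts, the 0.95 threshold and the function name below are all hypothetical, chosen to mirror the 99% figure in the quote.

```python
# Hypothetical counts per first name from a reference dataset.
NAME_COUNTS = {
    "jason": {"male": 990, "female": 10},
    "auren": {"male": 55, "female": 45},
}


def infer_label(first_name: str, threshold: float = 0.95):
    """Return (label, confidence); label is None when confidence is too low."""
    counts = NAME_COUNTS.get(first_name.lower())
    if not counts:
        return None, 0.0
    label, count = max(counts.items(), key=lambda kv: kv[1])
    confidence = count / sum(counts.values())
    return (label if confidence >= threshold else None), confidence


for name in ("Jason", "Auren"):
    label, confidence = infer_label(name)
    verdict = label if label else "not confident, need more data"
    print(f"{name}: {verdict} (confidence {confidence:.2f})")
```

Returning `None` rather than the most likely label keeps the uncertain case explicit, so a caller can fall back to other evidence such as the text analysis mentioned in the quote.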
What Do Data Scientists Do?
• Airbnb wrangles a lot of data—roughly 11 petabytes. Much of it, such as a guest’s
lodging preferences and whether a host likes to be continuously booked or prefers
having a few days free between visitations, helps the online marketplace’s search
algorithm determine the most likely match between guest and host.
• Preferences of this sort fall into one of four data categories:
– Behavioral, which describes user behavior as they interact with the Airbnb website;
– Dimensional, which covers user attributes including access device used, language and location;
– Sentiment, which reflects lodging reviews, ratings and survey results;
– Imputed, which infers user behaviors, such as “this guest always travels to big cities, whereas this
other guest always travels to small coastal towns.”
• To collect, process and analyze all this data, Airbnb relies on a team of about 100
people. These include around 20 engineers who support the computing
infrastructure and Newman's 80-person data science team.
– http://www.information-management.com/news/big-data-analytics/how-airbnb-uses-big-data-to-better-match-guests-rooms-10028582-1.html
What Do Data Scientists Do?
• Data captured through all its channels – text message, Twitter, Pebble, Android, Amazon Echo – to name
just a fraction – is fed into the Domino’s Information Management Framework. There it’s combined with
enrichment data from a large number of third party sources such as the United States Postal Service as
well as geocode information, demographic and competitor data, to allow in depth customer segmentation.
• “We have the ability to not only look at a consumer as an individual and assess their buying patterns, but
also look at the multiple consumers residing within a household, understand who is the dominant buyer,
who reacts to our coupons, and, foremost, understand how they react to the channel that they’re coming
to us on.”
– http://www.forbes.com/sites/bernardmarr/2016/04/06/big-data-driven-decision-making-at-dominos-pizza/#5c668fd4647f
What Do Data Scientists Do?
(Finance)
Source: Hortonworks
What Do Data Scientists Do?
(Government)
• Fraud, Waste and Abuse (FWA)
– Fraud and abuse occur when there are loopholes
created by complex interactions between business
controls, regulatory requirements and day-to-day
processes. Recognizing these control-point loopholes
is hard, and manual review is difficult.
Source: KPMG
What Do Data Scientists Do?
(Government)
• FWA in Other Sectors
Source: KPMG
What Do Data Scientists Do?
(Game industry)
• Data Analysts/Scientists in Games are
concerned with how to:
– Engage the gamer
– Monetize the gamer
What Do Data Scientists Do?
(Game industry)
• Pre-launch data simulation
– Simulating loot drop rules and preferences in Call
of Duty before launching the game
Source: Activision
What Do Data Scientists Do?
(Game industry)
• In-Game analytics:
– Why are people leaving?
– Investigating churn, building a churn prediction
model and influencing behavior before players quit
Source: Activision
What Do Data Scientists Do?
(Game industry)
• Game Feature Research:
Source: Activision
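A pre-launch loot-drop simulation of the kind mentioned above could look roughly like the following Monte Carlo sketch. The drop table, rarity names and function name are invented for illustration and say nothing about how Activision actually models its games.

```python
import random

# Hypothetical drop table: item rarity -> probability per loot roll.
DROP_RATES = {"common": 0.80, "rare": 0.18, "legendary": 0.02}


def simulate_drops(n_rolls: int, seed: int = 0) -> dict:
    """Simulate n_rolls loot rolls and count how often each rarity drops."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    items = list(DROP_RATES)
    weights = list(DROP_RATES.values())
    counts = dict.fromkeys(items, 0)
    for item in rng.choices(items, weights=weights, k=n_rolls):
        counts[item] += 1
    return counts


n_rolls = 100_000
counts = simulate_drops(n_rolls)
observed = {item: n / n_rolls for item, n in counts.items()}

for item, share in observed.items():
    print(f"{item}: {share:.3%} observed vs {DROP_RATES[item]:.3%} configured")
```

Running many simulated rolls before launch lets a designer check that observed drop frequencies match the intended rules, and then vary the table to see how rarity affects what players receive.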
What Do Data Scientists Do?
(Non profit)
Use-case: DataKind.org
Source: DataKind
What are the job prospects?
[By 2018] “The United States alone faces a shortage of
140,000 to 190,000 people with deep analytical skills
as well as 1.5 million managers
and analysts to analyze big data and make decisions
based on their findings.”
• http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
What are the job prospects?
• http://www.indeed.com/salary?q1=%22Data+Scientist%22&l1=
What are the job prospects?
• https://www.glassdoor.com/Best-Jobs-in-America-LST_KQ0,20.htm
What are the job prospects?
• https://www.dezyre.com/article/data-scientist-salary-report-of-100-top-tech-companies-/218
How Happy Are Data Scientists?
Machine Learning Developers are Happy!
StackOverflow survey
Bachelor of Science in Data Science
• Building Foundations
• 120 credits
• Foundations in:
– Math
– Statistics and Multivariate Statistics
– Machine Learning
– Computer Programming
– Practicum
Q&A?

More Related Content

What's hot (20)

PDF
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
Jean-François Gagné
 
PPTX
IBM QRadar BB & Rules
Muhammad Abdel Aal
 
PDF
Redo log
PaweOlchawa1
 
PDF
MySQL Security
Ted Wennmark
 
PDF
Building an open data platform with apache iceberg
Alluxio, Inc.
 
PDF
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
Chris Hoyean Song
 
PDF
Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...
HostedbyConfluent
 
PPTX
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
confluent
 
PPTX
Introduction to Splunk Presentation (DevOps)
Knoldus Inc.
 
PDF
MariaDB MaxScale
MariaDB plc
 
PDF
The Full MySQL and MariaDB Parallel Replication Tutorial
Jean-François Gagné
 
PDF
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
Joshua Shinavier
 
PDF
MySQL Performance Schema in 20 Minutes
Sveta Smirnova
 
PDF
Maxscale switchover, failover, and auto rejoin
Wagner Bianchi
 
PDF
Introducing Databricks Delta
Databricks
 
PPTX
Solr vs. Elasticsearch - Case by Case
Alexandre Rafalovitch
 
PPTX
Zero to Snowflake Presentation
Brett VanderPlaats
 
PDF
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Databricks
 
PPTX
From distributed caches to in-memory data grids
Max Alexejev
 
PDF
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Databricks
 
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
Jean-François Gagné
 
IBM QRadar BB & Rules
Muhammad Abdel Aal
 
Redo log
PaweOlchawa1
 
MySQL Security
Ted Wennmark
 
Building an open data platform with apache iceberg
Alluxio, Inc.
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
Chris Hoyean Song
 
Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Ku...
HostedbyConfluent
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
confluent
 
Introduction to Splunk Presentation (DevOps)
Knoldus Inc.
 
MariaDB MaxScale
MariaDB plc
 
The Full MySQL and MariaDB Parallel Replication Tutorial
Jean-François Gagné
 
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
Joshua Shinavier
 
MySQL Performance Schema in 20 Minutes
Sveta Smirnova
 
Maxscale switchover, failover, and auto rejoin
Wagner Bianchi
 
Introducing Databricks Delta
Databricks
 
Solr vs. Elasticsearch - Case by Case
Alexandre Rafalovitch
 
Zero to Snowflake Presentation
Brett VanderPlaats
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Databricks
 
From distributed caches to in-memory data grids
Max Alexejev
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Databricks
 

Similar to Introduction to Big Data and Data Science (20)

PDF
365 Data Science
IvanHo572682
 
PPTX
Data Science
Rabin BK
 
PPTX
Introduction to Data Science 5-13.pptx
devakisharma1
 
PDF
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London
 
PPTX
Introduction to Data Science - Overview and application
AyyappanGurusamySiva
 
PPTX
Introduction to Data Science\
Rajuyadav887963
 
PPTX
Introduction to Data Science
Rajuyadav887963
 
PPTX
Introduction to Data Science
Rajuyadav887963
 
PPTX
Introduction to Data Science Presentation
SwarnaSLcse
 
PPTX
Introduction to Data Science 5-13.pptx
Nilesh Raj
 
PPT
Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg
Shiv Shakti Ghosh
 
PPTX
intro to data science Clustering and visualization of data science subfields ...
jybufgofasfbkpoovh
 
PPTX
NumPy_ SciPy_ _ DatiiiikaFrames (2).pptx
smartashammari
 
PPTX
Introduction to Data Science
SarmiHarsha
 
PPTX
Introduction to Data Science 5-13.pptx
Aravind Reddy
 
PDF
Data Science: lesson01_intro-to-ds-and-ml.pdf
alhashediyemen
 
PDF
00-01 DSnDA.pdf
SugumarSarDurai
 
PDF
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Dr. Sunil Kr. Pandey
 
PPTX
Big Data and the Art of Data Science
Andrew Gardner
 
PPTX
Data science
DeekshaSrivas
 
365 Data Science
IvanHo572682
 
Data Science
Rabin BK
 
Introduction to Data Science 5-13.pptx
devakisharma1
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London
 
Introduction to Data Science - Overview and application
AyyappanGurusamySiva
 
Introduction to Data Science\
Rajuyadav887963
 
Introduction to Data Science
Rajuyadav887963
 
Introduction to Data Science
Rajuyadav887963
 
Introduction to Data Science Presentation
SwarnaSLcse
 
Introduction to Data Science 5-13.pptx
Nilesh Raj
 
Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg
Shiv Shakti Ghosh
 
intro to data science Clustering and visualization of data science subfields ...
jybufgofasfbkpoovh
 
NumPy_ SciPy_ _ DatiiiikaFrames (2).pptx
smartashammari
 
Introduction to Data Science
SarmiHarsha
 
Introduction to Data Science 5-13.pptx
Aravind Reddy
 
Data Science: lesson01_intro-to-ds-and-ml.pdf
alhashediyemen
 
00-01 DSnDA.pdf
SugumarSarDurai
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Dr. Sunil Kr. Pandey
 
Big Data and the Art of Data Science
Andrew Gardner
 
Data science
DeekshaSrivas
 
Ad

Recently uploaded (20)

PDF
apidays Singapore 2025 - How APIs can make - or break - trust in your AI by S...
apidays
 
PDF
Research Methodology Overview Introduction
ayeshagul29594
 
PDF
Optimizing Large Language Models with vLLM and Related Tools.pdf
Tamanna36
 
PDF
apidays Singapore 2025 - Trustworthy Generative AI: The Role of Observability...
apidays
 
PDF
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 
PPTX
How to Add Columns and Rows in an R Data Frame
subhashenia
 
PDF
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
 
PPTX
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
 
PDF
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
PPTX
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
 
PPTX
apidays Singapore 2025 - Generative AI Landscape Building a Modern Data Strat...
apidays
 
PPTX
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
 
PPTX
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
PPTX
Feb 2021 Ransomware Recovery presentation.pptx
enginsayin1
 
PDF
Technical-Report-GPS_GIS_RS-for-MSF-finalv2.pdf
KPycho
 
PPTX
What Is Data Integration and Transformation?
subhashenia
 
PPTX
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
 
PPTX
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
PDF
A GraphRAG approach for Energy Efficiency Q&A
Marco Brambilla
 
apidays Singapore 2025 - How APIs can make - or break - trust in your AI by S...
apidays
 
Research Methodology Overview Introduction
ayeshagul29594
 
Optimizing Large Language Models with vLLM and Related Tools.pdf
Tamanna36
 
apidays Singapore 2025 - Trustworthy Generative AI: The Role of Observability...
apidays
 
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 
How to Add Columns and Rows in an R Data Frame
subhashenia
 
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
 
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
 
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
 
apidays Singapore 2025 - Generative AI Landscape Building a Modern Data Strat...
apidays
 
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
 
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
Feb 2021 Ransomware Recovery presentation.pptx
enginsayin1
 
Technical-Report-GPS_GIS_RS-for-MSF-finalv2.pdf
KPycho
 
What Is Data Integration and Transformation?
subhashenia
 
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
 
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
A GraphRAG approach for Energy Efficiency Q&A
Marco Brambilla
 
Ad

Introduction to Big Data and Data Science

  • 1. BECKER COLLEGE Introduction to Big Data and Data Science Prof Feyzi R. Bagirov Becker College
  • 2. Agenda • What is Big Data? • What is Data Science? • Who are Data Scientists? • What do Data Scientists do? • What are the job perspectives for Data Scientists? • How happy are Data Scientists with their jobs • Becker’s BS in Data Science • Becker’s Big Data Analytics concentration
  • 3. What is Big Data?
  • 4. How much data do we use • Everyday, people send 150 billion new email messages • Every 4 minutes, a terabyte of data (72 hours of video) is uploaded to YouTube • Facebook’s databases ingest 500 terabytes of new data per day • The CERN Large Hadron Collider generates 1 petabyte per second • Sensors from a Boeing 787 jet create 40 terabytes of data per hour • An Oil & Gas off-shore rig operation generates 8 terabytes a day • A self-driving car generates 1 gigabyte per second • General Electric gas turbines generates 500 gigabytes per day • The proposed Square Kilometer Array telescope will generate an exabyte of data per day • 90% of the data in the world today has been created in the last two years alone • 80% of data captured today is unstructured 4,000,000,000,000,000,000,000 bytes Zeta Mega KiloGigaTeraPetaExa
  • 5. How much data do we use According to IBM, 90% of the data in the world today was created in the last 2 years alone. “Big Data: Getting Ready For The 2013 Big Bang”, Forbes Magazine, May 1, 2013 4,000,000,000,000,000,000,000 bytes
  • 6. 4,000,000,000,000,000,000,000 bytes Zeta Mega KiloGigaTeraPetaExa In 2013, the World will produce a 4 zetabytes (or 4 million petabytes) of new data. Gatner, 2013
  • 7. Definition of Big Data • Big Data – tools that process and analyze complex data at speeds and scales that were previously not cost-effective.
  • 8. History of Big Data Humans use tally sticks to record data for the first time to track trading activity and record inventory 18,000 century BCE 2,400 century BCE The abacus is developed and the first libraries are built in Babylonia 300 century BCE The Library of Alexandria is the World’s Largest Storage Center 100-200 century BCE Antikythera – the first mechanical computer is developed in Greece 1663 John Graunt conducts the first statistical analysis experiments to curb the spread of bubonic plague in Europe 1865 The Term “Business Intelligence” is used first 1928 Fritz Pfleumer creates a method of storing data magnetically, which forms the basis of modern digital data storage 1965 The US Gov plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape 1965 Relational Database model developed by IBM mathematici an Edgar F. Codd. Everyone can have an ability to use databases, not just computer scientists. 1969 Early use of term Big Data in magazine article by Erik Larson 1991 Birth of the WWW. Anyone can upload their own data Birth of the ARPANET, that later led to the creation of Internet (October 29, 1969 22:30) 1989
  • 9. History of Big Data 1996 The price of digital storage makes it more cost- effective than paper 1997 Google launched the World’s most popular search engine 1997 First use of the term Big Data in an academic paper 2001 3 Vs of Big Data – Volume, Velocity and Variety - defined by Dough Laney 2005 Hadoop – an open source Big Data framework is developed 2009 The average US company with over 1000 employees is storing more than 200 Tb of data, according McKinsey Global Institute Every two days, as much data is being created, as was from the beginning of human civilization to the year 2003 (Eric Schmidt, Google) 2010 2011 By 2018, the US will face a shortfall of 140- 190,000 data scientists (McKinsey) 2014 Mobile internet use overtakes desktop for the first time 2015 Internet of Things is being adopted by industries 2020 Some 30 billion objects may be connected to the Internet of Things
  • 11. 4 V’s of Big Data
  • 12. 4 V’s of Big Data • Volume – a Terabyte? a Petabyte? More?... • Variety – a Web Log? A Tweeter feed? A YouTube video? • Velocity – New data comes every hour? Minute? Second? • Veracity – how much do I trust this data? 40%? 100%? 0%?
  • 13. History of Big Data IBM delivers an HDD, weighing over a ton, storing 5 Mb of data (September, 1956)
  • 15. How Big is Big? 4,000,000,000,000,000,000,000 bytes Zeta Mega KiloGigaTeraPetaExa
  • 17. Unstructured Data • Refers to information that does not have a pre-defined data model or is not organized in a pre-defined manner. • Examples: social network feeds, customer reviews or comments, YouTube videos, etc.
  • 18. Structured Data • Refers to information that does not have a pre-defined data model or is not organized in a pre-defined manner.
  • 23. What is Data Science?
  • 24. 24 What is Data Science? *http://en.wikipedia.org/wiki/Data_science • 1960-The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. • 2002-The International Council for Science: Committee on Data for Science and Technology started the Data Science Journal • 2004-Usama Fayyad became the first CDO at Yahoo. • 2008-DJ Patil and Jeff Hammerbacher coined the term “data scientist” to define their jobs at Linkedin and Facebook, respectively
  • 26. What is Data Science? Math & Statistics • Discrete • Finite • Linear Algebra • Multivariate Computer Science • Programming • Business Intelligence Soft Skills • Oral Communications • Creativity • Project Management • Team play • Presentation
  • 28. Data Science vs Data Analytics vs … • Business Intelligence – covers data analysis and relies heavily on aggregation, focusing on business information • Statistics – the study of collection, analysis, interpretation, presentation and organization of data. • Data Mining – a techniques that focuses on modeling and knowledge discovery for predictive rather than prescriptive purposes • Data Analytics – a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. • Business Analytics - practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning – Descriptive Analytics – analyzes the past performance and understands that performance by mining historical data to look for the reasons behind past success or failure – Predictive Analytics - encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events. – Prescriptive Analytics - automatically synthesizes big data, multiple disciplines of mathematical sciences and computational sciences, and business rules, to make predictions and then suggests decision options to take advantage of the predictions. • Data Science – an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields, such as statistics, data mining, and predictive analytics. • https://en.wikipedia.org/wiki/Data_science • https://en.wikipedia.org/wiki/Data_analysis
  • 29. Who are Data Scientists?
  • 30. Who are Data Scientists?
  • 31. Who are Data Scientists?
  • 32. What Do Data Scientists Do?
  • 33. What Do Data Scientists Do? In a nutshell, a data scientist creates data products. This can mean a lot of things, but we can generalize it as having the ability to create interfaces for people and machines that use data of any kind. Responsibilities vary a lot. It can be running experiments, creating interfaces using machine learning, or providing insights from complex datasets. Data scientists work with hypotheses. For instance, the experiments we run at Minclip are becoming full-fledged randomised controlled trials, but I think that is the most similar case. I believe the term scientist appeared when data itself became a field of study. The way machine learning treats data is highly empirical. The process of improving and validating a model, while not using the traditional statistical methods of scientific research, is nevertheless highly empirical, skeptical and pragmatic. Sometimes more so than some papers that are published. • Quora http://qr.ae/RUWYc8
  • 34. What Do Data Scientists Do? • “There are multiple communities of data scientists throughout the Amazon offices which are easily approachable” • “They mostly work on a vertical like ad space optimization or marketing. People have in-depth understanding of the domain and are some of the best minds in the industry” • “There is a Data Science Toolkit, which contains almost every kind of tool for Data Scientists… Biggest data warehouse (Datanet) to play with, extended internal wiki on almost every possible topic in the universe of Data; mentorship of data science wizards” – Quora, http://qr.ae/RUPSv4
  • 35. What Do Data Scientists Do? • Netflix Prize – was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about users or films.
  • 36. What Do Data Scientists Do? On 9/21/2009, $1 million was awarded to the BellKor's Pragmatic Chaos team, which improved on the accuracy of Netflix's own prediction algorithm by 10.06%
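As a rough illustration of the collaborative filtering idea behind the Netflix Prize, the sketch below predicts a missing rating from other users' ratings alone, with no other information about users or films. The toy `ratings` data, the function names `similarity` and `predict`, and the choice of cosine similarity are assumptions made for illustration; this is not the method used by the winning team.

```python
from math import sqrt

# Toy data (hypothetical): user -> {film: rating on a 1-5 scale}
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def similarity(u, v):
    """Cosine similarity computed over the films both users have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][f] * ratings[v][f] for f in shared)
    norm_u = sqrt(sum(ratings[u][f] ** 2 for f in shared))
    norm_v = sqrt(sum(ratings[v][f] ** 2 for f in shared))
    return dot / (norm_u * norm_v)

def predict(user, film):
    """Similarity-weighted average of other users' ratings for the film."""
    num = den = 0.0
    for other in ratings:
        if other == user or film not in ratings[other]:
            continue
        weight = similarity(user, other)
        num += weight * ratings[other][film]
        den += weight
    return num / den if den else None

# Ann has not rated film D; her tastes resemble Bob's more than Cat's,
# so the prediction leans toward Bob's rating of D.
predicted = predict("ann", "D")
```

Here the prediction for Ann is pulled toward Bob's high rating because their past ratings agree more closely than Ann's and Cat's do.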
  • 37. What Do Data Scientists Do? • We work on core ML, on computer vision, on computational photography and on language technologies. • In computer vision we have a system that processes every single image and video uploaded to Facebook, totaling well over 1B items per day. We predict the content of an image, for example, in order to generate captions for the blind, or to automatically detect and take down offensive content, improve media search results, automate visual captcha among many other use cases. • In language technology, one thing we are trying to do is eliminate language barriers on Facebook. In order to do this we translate over 2B posts every single day, with over 1800 language directions representing more than 40 unique languages. • In core ML, we focus on researching and shipping large scale and realtime ML/AI algorithms for some of the biggest ML applications in the world. Whenever a user logs into Facebook, these models are used to rank news feed stories (1B users every day, 1.5K stories per user per day on average), ads, search results (1B+ queries a day), trending news, friend recommendations and even rank notifications that a user receives, or rank the comments on a post. – Quora (http://qr.ae/RZ3JBx)
  • 38. What Do Data Scientists Do? • There are multiple analytics teams at Facebook • A team of Data Scientists works on Ads and is probably the largest and most centralized analytics team at Facebook • Our goal is to come up with data-backed insights which will inform the product road-map or move key metrics that our product teams track. We sometimes also build infrastructure (less common in my world) that is used by other Data Scientists and engineers. We work in close concert with Engineering and Product and we often wear Engineering or Product management hats in addition to our Data Scientist responsibilities. We spend our time in: – Analyzing and designing experiments to optimize product features or move key metrics – Data mining/analysis to come up with business opportunities to pursue or product feature suggestions or sometimes to understand metric movements. – Building production ML models (though this is mostly done by SW Engineering) • The multidisciplinary nature of the role, access to one of the largest troves of data, brilliant colleagues and ability to create a huge impact in a very short time period make this an exciting job. – Quora (http://qr.ae/RUPJbx)
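The "analyzing and designing experiments" responsibility can be sketched with a simple A/B test comparison. The example below is a minimal illustration, assuming a two-proportion z-test and made-up conversion counts; the function name `two_proportion_z_test` and the numbers are hypothetical and do not describe how any particular team analyzes its experiments.

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and variant (B).

    Returns the z statistic and the two-sided p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical experiment: 10,000 users per arm
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
```

A small p-value suggests the difference in the tracked metric is unlikely to be noise alone; in practice the sample size and the metric would be fixed before the experiment starts.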
  • 39. What Do Data Scientists Do? • Predicting the past – let's say you want to determine the gender of Jason Lemkin. If you are a human, that's easy (hint: he's a man). If you are a computer, it is more difficult. But you might have a large dataset of genders and first names and see that 99% of Jasons are men, so your algorithm says he is a man. This would be much more difficult with me ("Auren" is a more gender-neutral name), and so you might not be confident enough to make a gender pronouncement and thus might need more data (like doing natural language processing on articles about me that refer to me as "he" and "him"). • Predicting the future – figuring out what posts should be shown to the right person. – Quora: http://qr.ae/RUgn33
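The "predicting the past" example above can be sketched as a lookup with a confidence threshold. The 99% figure for "Jason" comes from the slide; the counts for "Auren", the `name_counts` table, the function name `predict_gender`, and the 0.95 threshold are assumptions for illustration only.

```python
# Hypothetical dataset: first name -> (number of men, number of women)
name_counts = {
    "jason": (990, 10),  # 99% men, as in the slide's example
    "auren": (55, 45),   # made-up counts for a more gender-neutral name
}

def predict_gender(first_name, threshold=0.95):
    """Return 'male' or 'female' only when the data is confident enough.

    Returns None when the name is unknown or the split is too even,
    signalling that more data (e.g. pronouns in articles) is needed.
    """
    counts = name_counts.get(first_name.lower())
    if counts is None:
        return None
    men, women = counts
    p_male = men / (men + women)
    if p_male >= threshold:
        return "male"
    if 1 - p_male >= threshold:
        return "female"
    return None

jason = predict_gender("Jason")
auren = predict_gender("Auren")
```

Returning None rather than guessing mirrors the slide's point: when the evidence is weak, the sensible next step is to gather more data.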
  • 40. What Do Data Scientists Do? • Airbnb wrangles a lot of data—roughly 11 petabytes. Much of it, such as a guest’s lodging preferences and whether a host likes to be continuously booked or prefers having a few days free between visitations, helps the online marketplace’s search algorithm determine the most likely match between guest and host. • Preferences of this sort fall into one of four data categories: – Behavioral, which describes user behavior as they interact with the Airbnb website; – Dimensional, which covers user attributes including access device used, language and location; – Sentiment, which reflects lodging reviews, ratings and survey results; – Imputed, which infers user behaviors, such as “this guest always travels to big cities, whereas this other guest always travels to small coastal towns.” • To collect, process and analyze all this data, Airbnb relies on a team of about 100 people. These include around 20 engineers who support the computing infrastructure and Newman's 80-person data science team. – http://www.information-management.com/news/big-data-analytics/how-airbnb-uses-big-data-to-better-match-guests-rooms-10028582-1.html
  • 41. What Do Data Scientists Do? • Data captured through all its channels – text message, Twitter, Pebble, Android, Amazon Echo – to name just a fraction – is fed into the Domino’s Information Management Framework. There it’s combined with enrichment data from a large number of third party sources such as the United States Postal Service as well as geocode information, demographic and competitor data, to allow in depth customer segmentation. • “We have the ability to not only look at a consumer as an individual and assess their buying patterns, but also look at the multiple consumers residing within a household, understand who is the dominant buyer, who reacts to our coupons, and, foremost, understand how they react to the channel that they’re coming to us on.” – http://www.forbes.com/sites/bernardmarr/2016/04/06/big-data-driven-decision-making-at-dominos-pizza/#5c668fd4647f
  • 42. What Do Data Scientists Do? (Finance) Source: Hortonworks
  • 43. What Do Data Scientists Do? (Government) • Fraud, Waste and Abuse (FWA) – Fraud and abuse occur when there are loopholes created by complex interactions between business controls, regulatory requirements and day-to-day processes. Recognizing these control point loopholes is hard, and manual review is difficult. Source: KPMG
  • 44. What Do Data Scientists Do? (Government) • Fraud, Waste and Abuse (FWA) – Fraud and abuse occur when there are loopholes created by complex interactions between business controls, regulatory requirements and day-to-day processes. Recognizing these control point loopholes is hard, and manual review is difficult. Source: KPMG
  • 45. What Do Data Scientists Do? (Government) • FWA in Other Sectors Source: KPMG
  • 46. What Do Data Scientists Do? (Game industry) • Data Analysts/Scientists in Games are concerned with how to: – Engage the gamer – Monetize the gamer
  • 47. What Do Data Scientists Do? (Game industry) • Pre-launch data simulation – Simulating loot drop rules and preference in Call of Duty before launching the game Source: Activision
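A pre-launch loot drop simulation can be sketched as a small Monte Carlo experiment. The drop table, the function names, and the question being asked (how many crates a player opens before the first legendary item) are all hypothetical; nothing here reflects Activision's actual rules or tooling.

```python
import random

# Hypothetical drop table: rarity -> probability per opened crate
DROP_RATES = {"common": 0.80, "rare": 0.18, "legendary": 0.02}

def crates_until_legendary(rng):
    """Open crates until a legendary item drops; return how many were opened."""
    rarities = list(DROP_RATES)
    weights = list(DROP_RATES.values())
    opened = 0
    while True:
        opened += 1
        if rng.choices(rarities, weights=weights)[0] == "legendary":
            return opened

def average_crates(num_players=5_000, seed=0):
    """Average crates-to-first-legendary across simulated players."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    total = sum(crates_until_legendary(rng) for _ in range(num_players))
    return total / num_players

avg = average_crates()
```

With a 2% drop rate the theoretical mean is 1 / 0.02 = 50 crates, so the simulated average should land close to that; designers can change the table and rerun the simulation before any player sees the rules.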
  • 48. What Do Data Scientists Do? (Game industry) • In-Game analytics: – Why are people leaving? – Investigating churn, building a churn prediction model and influencing behavior before players quit Source: Activision
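A churn prediction model can be sketched, in its simplest form, as a logistic regression on a single feature. The training data, the feature (days since a player's last session), and the names `train` and `churn_probability` are assumptions for illustration; a production model would use many more features and a proper library.

```python
from math import exp

# Hypothetical training data: days since a player's last session, and
# whether that player ultimately quit (1) or stayed (0).
days_inactive = [0, 1, 1, 2, 3, 7, 8, 10, 12, 14]
churned = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
SCALE = 14.0  # rescale the feature to the range [0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train(xs, ys, lr=0.5, steps=2000):
    """Fit weight and bias by batch gradient descent on the log loss."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x / SCALE + b) - y
            grad_w += err * x / SCALE
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

w, b = train(days_inactive, churned)

def churn_probability(days):
    """Estimated probability that a player inactive for `days` will quit."""
    return sigmoid(w * days / SCALE + b)
```

Players whose estimated probability crosses a chosen cutoff could then be targeted with an intervention (for example an in-game reward) before they quit.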
  • 49. What Do Data Scientists Do? (Game industry) • Game Feature Research Source: Activision
  • 50. What Do Data Scientists Do? (Non profit) Use-case: DataKind.org Source: DataKind
  • 51. What are the job perspectives? [By 2018] “The United States alone faces a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts to analyze big data and make decisions based on their findings.” • http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
  • 52. What are the job perspectives? • http://www.indeed.com/salary?q1=%22Data+Scientist%22&l1=
  • 53. What are the job perspectives? • https://www.glassdoor.com/Best-Jobs-in-America-LST_KQ0,20.htm
  • 54. What are the job perspectives? • https://www.dezyre.com/article/data-scientist-salary-report-of-100-top-tech-companies-/218
  • 55. How Happy Are Data Scientists? Machine Learning Developers are Happy! StackOverflow survey
  • 56. Bachelor of Science in Data Science • Building Foundations • 120 credits • Foundations in: – Math – Statistics and Multivariate Statistics – Machine Learning – Computer Programming – Practicum