Where Is Your Data?: An Introduction to Problems and Bottlenecks in Data Systems
John Joo, Program Director
David Drummond, Program Director
Insight Data Engineering
Program mentors are data engineers from
top technology companies including:
Goals
• Understand the different components of the
tech stack at a high level.
• Understand the hardware bottlenecks that
dictate the tech stack.
• Understand the tech stacks that are generally
used for different types of companies, and why.
Computing basics
• Various ports (I/O): up to ~10 GB/s
• CPU (processor): ~1 GHz
• Hard Drive (storage): ~250 GB
• RAM (memory): ~8 GB
Network Processing Storage
What does this look like for a
business?
Data @ Point of Sale
• 1 Transaction → 2 kB
• What did Customer buy?
• How much did Customer
spend?
• When did Customer make
this transaction?
Daily Data @ Individual Store
• ~50,000 transactions / store /
day → 100 MB
• Servers at back of store
• What items were sold today?
• What was our revenue for
today?
• How much was refunded today?
• What do we need to do to
restock for tomorrow?
Yearly Data @ Individual Store
• 20 million transactions → 40 GB /
year
• What are some seasonal trends in
purchased items?
• How should we target our coupons or
advertisements to local customers?
• Who were the most efficient
employees?
• Should the store’s hours change
depending on the time of year?
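The per-store volumes above follow from simple multiplication. The sketch below reproduces that arithmetic, assuming decimal units (1 kB = 1,000 bytes) and a store that is open 365 days a year; both are assumptions, not statements from the slides. The function names are illustrative.

```python
# Back-of-the-envelope sizing using the figures quoted on the slides.
KB = 1_000          # assumption: decimal units
MB = 1_000 * KB
GB = 1_000 * MB

TRANSACTION_BYTES = 2 * KB       # slides: 1 transaction -> 2 kB
TRANSACTIONS_PER_DAY = 50_000    # slides: ~50,000 transactions / store / day


def daily_bytes(transactions_per_day=TRANSACTIONS_PER_DAY, size=TRANSACTION_BYTES):
    """Bytes generated by one store in one day."""
    return transactions_per_day * size


def yearly_bytes(days=365):
    """Bytes generated by one store in a year (assumes it is open every day)."""
    return daily_bytes() * days


print(daily_bytes() / MB)   # 100.0 MB per day
print(yearly_bytes() / GB)  # 36.5 GB per year, which the slides round to 40 GB
```

A year of data for a single store still fits on the ~250 GB drive from the computing-basics slide, which is why a server at the back of the store is enough at this scale.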
Yearly Data @ All Stores
• 7 billion transactions → 10 TB / year
• Requires data centers
• What national sales campaigns should we
run? Ads, coupons, commercials, web.
• What should the CEO's compensation
be?
• Where should we open Supercenters,
Discount Stores, Neighborhood Stores,
Walmart Expresses?
• What music should we play in the stores?
Complete Historic Data @ All Stores
• 16 years (1992 - 2008)
• 1 trillion transactions → 2.5 PB
• Data centers
  • “Area 71” in Caverna, Missouri: 125,000 square feet, 460 TB
  • Colorado Springs: 210,000 square feet, $100 million
Bottlenecks in Data Systems
Proper data system design should consider
these limiting bottlenecks:
• Loading data into the CPU and memory
• Finding data on the disk
• Moving data across the network
Bottlenecks: Loading Data
• All data that is processed must be loaded into the CPU
[Figure: Disk Storage, Memory, CPU; axes: Price, Speed]
• Solution: Distributed computing with ample memory
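As a rough illustration of why the solution is distributed, the sketch below estimates how many machines it would take to hold a data set entirely in memory. It uses the ~8 GB RAM and 10 TB / year figures from earlier slides; the `usable_fraction` parameter and the function name are assumptions added for illustration, and real systems also need room for the operating system, replicas, and working space.

```python
TB = 10**12  # assumption: decimal units
GB = 10**9


def machines_needed(dataset_bytes, ram_bytes_per_machine, usable_fraction=1.0):
    """Smallest number of machines whose combined usable RAM holds the data set."""
    usable = int(ram_bytes_per_machine * usable_fraction)
    if usable <= 0:
        raise ValueError("usable memory per machine must be positive")
    # Ceiling division using integers only.
    return -(-dataset_bytes // usable)


# One year of data for all stores (10 TB) on machines with 8 GB of RAM each.
print(machines_needed(10 * TB, 8 * GB))                       # 1250
# Same data set if only half of each machine's RAM is available for data.
print(machines_needed(10 * TB, 8 * GB, usable_fraction=0.5))  # 2500
```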
Bottlenecks: Finding Data
• Finding a new file on disk (known as random seeks)
[Figure: actuator arm with head that reads from disk; labels: Beginning of Desired File, End of Desired File]
• Solution: SSD and structuring data in the order it is accessed
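A simple cost model shows why random seeks hurt on a spinning disk. The sketch below charges a fixed cost per seek plus a transfer cost per byte. The 10 ms seek time and 100 MB/s throughput are assumed round numbers for a hard drive, not figures from the slides; the workload is one store's day of 2 kB transactions.

```python
def read_seconds(num_seeks, total_bytes, seek_seconds=0.010, bytes_per_second=100e6):
    """Estimated read time: a fixed cost per seek plus time to transfer the bytes.

    seek_seconds and bytes_per_second are assumed values for a spinning disk.
    """
    return num_seeks * seek_seconds + total_bytes / bytes_per_second


NUM_TRANSACTIONS = 50_000   # slides: one store, one day
TRANSACTION_BYTES = 2_000   # slides: 2 kB per transaction
total = NUM_TRANSACTIONS * TRANSACTION_BYTES

# Stored contiguously in access order: one seek, then a single sequential read.
sequential = read_seconds(1, total)
# Scattered across the disk: one seek per transaction.
random_access = read_seconds(NUM_TRANSACTIONS, total)

print(f"sequential: {sequential:.2f} s")
print(f"random:     {random_access:.2f} s")
```

Under these assumptions the sequential read takes about one second and the scattered read takes several minutes, even though the same bytes are transferred. Both parts of the solution attack the seek term: an SSD shrinks the per-seek cost, and laying data out in access order reduces the number of seeks.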
Bottlenecks: Moving Data
• Moving data from machine to machine over a network
• Solution: Keeping data close to the processors
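One way to keep data close to the processors is to compute a summary where the data lives and send only the result. The sketch below compares the bytes that would cross the network in each case for a daily revenue question. The record layout (`id`, `amount_cents`), the JSON encoding, and the function names are all hypothetical, chosen only to make the comparison concrete.

```python
import json


def bytes_if_shipping_raw(transactions):
    """Bytes sent if every transaction record crosses the network."""
    return sum(len(json.dumps(t).encode("utf-8")) for t in transactions)


def aggregate_locally(transactions):
    """Sum revenue at the store and return the total plus the size of that message."""
    total = sum(t["amount_cents"] for t in transactions)
    payload = json.dumps({"revenue_cents": total}).encode("utf-8")
    return total, len(payload)


# Hypothetical day of sales at one store: 50,000 transactions of $19.99 each.
transactions = [{"id": i, "amount_cents": 1_999} for i in range(50_000)]

raw_bytes = bytes_if_shipping_raw(transactions)
total, summary_bytes = aggregate_locally(transactions)

print(f"ship raw records: {raw_bytes} bytes")
print(f"ship the summary: {summary_bytes} bytes (revenue = {total} cents)")
```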
Bottlenecks: Example
• Processing a 2 kB transaction in memory, sequentially and randomly on disk, or across the network
100:1, 200:1, 50:1
Tech Stacks for Companies
Depending on your growth plans:
• Single system with small data
• Distributed data center with large data
• Renting computers for flexibility
Small Firms with Small Data
• Example: Small medical firm with slow growth
• Pros: Easy to maintain, data locality, inexpensive
• Cons: Difficult to grow quickly, risky, not ideal for analysis
Large Firms with Stable Growth
• Example: Facebook with steadily growing data centers
• Pros: Economies of scale, redundancy, innovative design
• Cons: Upfront capital, dedicated maintenance
• >100 PB of Data
• 7 PB / Day
• 1 kW / TB
• ~$20 / TB / Month
Start-Ups with Exponential Growth
• Example: Airbnb - rents processing and storage from AWS
• Pros: Scales easily, no maintenance, no upfront capital
• Cons: Expensive in the long run, dependent on the provider
• 50 GB / Day
• $20-50 / TB / Mo
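The two figures above can be combined into a rough monthly storage bill. The sketch below assumes that data accumulates at a constant 50 GB / day, that nothing is deleted, that pricing is flat per TB, and that compute and transfer charges are ignored. None of these assumptions come from the slides, and the function names are illustrative.

```python
GB_PER_TB = 1_000  # assumption: decimal units


def stored_tb(gb_per_day, days):
    """Total data held after `days` of constant growth with no deletion."""
    return gb_per_day * days / GB_PER_TB


def monthly_storage_cost(tb, dollars_per_tb_month):
    """Monthly bill for holding `tb` terabytes at a flat price per TB."""
    return tb * dollars_per_tb_month


tb_after_one_year = stored_tb(gb_per_day=50, days=365)
low = monthly_storage_cost(tb_after_one_year, 20)
high = monthly_storage_cost(tb_after_one_year, 50)

print(tb_after_one_year)  # 18.25 TB held after one year
print(low, high)          # 365.0 912.5 dollars per month
```

Because the bill grows with everything stored so far, the cost keeps rising as long as the data does, which is one reading of "expensive in the long run".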
Start-Ups with Exponential Growth
• Example: Netflix - AWS fails on Christmas Eve
• Con: You can rent the computers, but you own the failure
Data Pipeline
• Ingestion: Gathering data in a reliable way
• File System: Storing the unstructured data redundantly
• Batch Processing: Processing the data in large batches at the data center
• Realtime Processing: Processing live streaming data reliably
• Database: Organizing data for quick access
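The pipeline stages can be sketched as a chain of small functions. This is a toy, single-process illustration under stated assumptions: the `store,amount` line format, the in-memory "file system", and all function names are hypothetical, and the realtime branch is left out. Production systems use dedicated tools for each stage.

```python
from collections import defaultdict


def ingest(raw_lines):
    """Ingestion: parse incoming records and skip malformed ones."""
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue
        store, amount = parts
        try:
            yield store, int(amount)
        except ValueError:
            continue


def store_redundantly(records, replicas=2):
    """File system: keep several independent copies of the raw records."""
    records = list(records)
    return [list(records) for _ in range(replicas)]


def batch_process(records):
    """Batch processing: total revenue per store over the whole data set."""
    totals = defaultdict(int)
    for store, amount in records:
        totals[store] += amount
    return dict(totals)


raw = ["store_1,500", "store_2,250", "garbage", "store_1,100"]
copies = store_redundantly(ingest(raw))
# Database: results keyed by store for quick lookup.
database = batch_process(copies[0])
print(database["store_1"])  # 600
```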
Conclusion
• Understand the different components of the
tech stack at a high level
• Understand the hardware bottlenecks that
dictate the tech stack
• Understand the tech stacks that are generally
used for different types of companies, and why
