HANDLING BIGGER DATA
What to do if your data’s too big
Data nerding
Your 5-7 things
❑ Bigger data
❑ Much bigger data
❑ Much bigger data storage
❑ Bigger data science teams
BIGGER DATA
Or, ‘data that’s a bit too big’
First, don’t panic
Computer storage
❑ 250 GB internal hard drive: (hopefully) permanent storage. The place you’re storing photos, data, etc.
❑ 16 GB RAM: temporary storage. The place read_csv loads your dataset into.
❑ 2 TB external hard drive: a handy place to keep bigger data files.
Gigabytes, Terabytes etc.
Name | Size in bytes | Contains (roughly)
Byte | 1 | 1 character (‘a’, ‘1’ etc.)
Kilobyte | 1,000 | Half a printed page
Megabyte | 1,000,000 | 1 novella. 5 MB = complete works of Shakespeare
Gigabyte | 1,000,000,000 | 1 high-fidelity symphony recording; 10 m of shelved books
Terabyte | 1,000,000,000,000 | All the x-ray films in a large hospital; 10 = Library of Congress collection; 2.6 = Panama Papers leak
Petabyte | 1,000,000,000,000,000 | 2 = all US academic libraries; 10 = 1 hour’s output from the SKA telescope
Exabyte | 1,000,000,000,000,000,000 | 5 = all words ever spoken by humans
Zettabyte | 1,000,000,000,000,000,000,000 |
Yottabyte | 1,000,000,000,000,000,000,000,000 | Current storage capacity of the Internet
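A quick way to get a feel for these sizes is to convert raw byte counts into the units in the table. The sketch below is illustrative only: `human_size` and `fits_in_ram` are hypothetical helper names, and the `overhead` factor of 2 is an assumed rule of thumb for how much memory a loaded dataset might need, not a measured figure.

```python
# Decimal (SI) units, as in the table: each step up is a factor of 1,000.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_size(n_bytes):
    """Format a byte count using decimal units (1 KB = 1,000 bytes)."""
    size = float(n_bytes)
    for unit in UNITS:
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


def fits_in_ram(file_bytes, ram_bytes, overhead=2.0):
    """Rough check; assumes loaded data needs `overhead` times the file size."""
    return file_bytes * overhead <= ram_bytes


ram = 16 * 1000**3  # a laptop with 16 GB of RAM
print(human_size(2_600_000_000_000))   # the Panama Papers figure from the table
print(fits_in_ram(5 * 1000**3, ram))   # a 5 GB file
print(fits_in_ram(20 * 1000**3, ram))  # a 20 GB file
```

If the check comes back False, that is the cue to try the chunking and parallel approaches rather than a plain read_csv.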
Things to Try: Too Big
❑ Read data in ‘chunks’:
csv_chunks = pandas.read_csv('myfile.csv', chunksize=10000)
❑ Divide and conquer in your code:
csv_chunks = pandas.read_csv('myfile.csv', skiprows=range(1, 10001), chunksize=10000)  # skip the first 10,000 data rows but keep the header
❑ Use parallel processing
❑ E.g. the Dask library
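Reading in chunks only helps if each chunk is processed and then thrown away. Here is a minimal sketch of that pattern, computing a mean without ever holding the whole file in memory. The small generated file, the `value` column and the chunk size of 25 are stand-ins chosen so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in file so the sketch is runnable; in practice this
# would be a CSV that is too large to load in one go.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "myfile.csv")
pd.DataFrame({"value": range(1, 101)}).to_csv(csv_path, index=False)

total = 0
row_count = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# rather than one big DataFrame.
for chunk in pd.read_csv(csv_path, chunksize=25):
    total += chunk["value"].sum()
    row_count += len(chunk)

mean_value = total / row_count
print(row_count, mean_value)
```

Only running totals survive between chunks, so memory use depends on the chunk size rather than the file size.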
Things to Try: Too Slow
❑ Use %timeit to find where the speed problems are
❑ Use compiled Python (e.g. the Numba library)
❑ Use C code (via Cython)
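%timeit is an IPython/Jupyter magic; outside a notebook, the standard library's timeit module does the same job. The sketch below times two hypothetical implementations of the same calculation. The function names and the repeat count of 200 are illustrative choices, and the actual timings will vary by machine.

```python
import timeit


def loop_sum_of_squares(n):
    """Sum i*i for i in 0..n-1 with an explicit Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def formula_sum_of_squares(n):
    """Same result from the closed-form formula, with no loop at all."""
    return (n - 1) * n * (2 * n - 1) // 6


# Check both versions agree before comparing their speed.
same_answer = loop_sum_of_squares(1000) == formula_sum_of_squares(1000)

# timeit.timeit runs the callable `number` times and returns total seconds.
loop_time = timeit.timeit(lambda: loop_sum_of_squares(1000), number=200)
formula_time = timeit.timeit(lambda: formula_sum_of_squares(1000), number=200)
print(f"loop: {loop_time:.5f}s  formula: {formula_time:.5f}s")
```

Measure first, then decide whether a better algorithm is enough or whether Numba or Cython is worth the extra dependency.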
MUCH BIGGER DATA
Or, ‘What if it really doesn’t fit?’
Volume, Velocity, Variety
Much Faster Datastreams
Twitter firehose:
❑ Firehose averages 6,000 tweets per second
❑ Record is 143,199 tweets in one second (Aug 3rd 2013, Japan)
❑ Twitter public streams = 1% of Firehose stream
Google index (2013):
❑ 30 trillion unique pages on the internet
❑ Google index = 100 petabytes (100 million gigabytes)
❑ 100 billion web searches a month
❑ Search returned in about ⅛ second
Distributed systems
❑ Store the data on multiple ‘servers’:
❑ Big idea: Distributed file systems
❑ Replicate data (server hardware breaks more often than you think)
❑ Do the processing on multiple servers:
❑ Lots of code does the same thing to different pieces of data
❑ Big idea: Map/Reduce
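The Map/Reduce idea can be sketched on a single machine: run the same function over each piece of data independently (map), then combine the partial results (reduce). This is a toy word count in plain Python, not Hadoop's or Spark's API; the `chunks` list stands in for pieces of data held on different servers.

```python
from collections import Counter
from functools import reduce

# Each string stands in for a piece of data held on a different server.
chunks = [
    "big data is big",
    "data is everywhere",
    "big ideas",
]


def map_chunk(text):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(text.split())


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# The map calls do not depend on each other, so a real cluster
# could run them on separate servers at the same time.
partial_counts = map(map_chunk, chunks)
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts)
```

Because the map step never looks at another chunk, adding more servers means handling more chunks at once, and only the small partial counts have to travel over the network.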
Parallel Processors
❑ Laptop: 4 cores, 16 GB RAM, 256 GB disk
❑ Workstation: 24 cores, 1 TB RAM
❑ Clusters: as big as you can imagine…
Distributed filesystems
Your typical rack server...
Map/Reduce: Crowdsourcing for computers
Distributed Programming Platforms
Hadoop
❑ HDFS: distributed filesystem
❑ MapReduce engine: processing
Spark
❑ In-memory processing
❑ Because moving data around is the biggest bottleneck
Typical (Current) Ecosystem
HDFS
Spark
Python
R
SQL
Tableau
Publisher
Data warehouse
Anaconda comes with this…
Parallel Python Libraries
❑ Dask
❑ Datasets look like NumPy arrays, pandas DataFrames
❑ df.groupby(df.index).value.mean()
❑ Direct access into HDFS, S3 etc
❑ PySpark
❑ Also has DataFrames
❑ Connects to Spark
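The groupby expression on this slide is ordinary pandas syntax, which is the point: Dask aims to mirror the DataFrame API. The sketch below runs it in plain pandas on a tiny made-up frame, so no cluster is needed. With a Dask DataFrame, evaluation is lazy and you would typically need an extra `.compute()` call to get the result.

```python
import pandas as pd

# A tiny stand-in dataset; the index labels act as the grouping key.
df = pd.DataFrame(
    {"value": [1.0, 3.0, 10.0, 20.0]},
    index=["a", "a", "b", "b"],
)

# Same expression as on the slide: mean of `value` for each index label.
means = df.groupby(df.index).value.mean()
print(means)
```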
MUCH BIGGER DATA STORAGE
Or, ‘Where do we put all this stuff?’
SQL Databases
❑ Row/column tables
❑ Keys
❑ SQL query language
❑ Joins etc (like Pandas)
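Python ships with SQLite, so tables, keys and joins can be tried without installing a database server. The table names, columns and rows below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE talks (
    talk_id INTEGER PRIMARY KEY,
    author_id INTEGER REFERENCES authors(author_id),
    title TEXT NOT NULL
);
INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO talks VALUES
    (10, 1, 'Bigger data'),
    (11, 1, 'Much bigger data'),
    (12, 2, 'Data storage');
""")

# Join the two tables on the shared key, much like pandas merge().
rows = conn.execute("""
    SELECT authors.name, COUNT(*) AS n_talks
    FROM talks
    JOIN authors ON talks.author_id = authors.author_id
    GROUP BY authors.name
    ORDER BY authors.name
""").fetchall()
print(rows)
```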
ETL (Extract - Transform - Load)
❑ Extract
❑ Extract data from multiple sources
❑ Transform
❑ Convert data into database formats (e.g. SQL)
❑ Load
❑ Load data into database
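The three ETL steps can be sketched end to end with the standard library. Everything here is hypothetical: two in-memory "sources" with different delimiters and column names, and an SQLite table as the target. Real pipelines add validation, logging and error handling.

```python
import csv
import io
import sqlite3

# Two hypothetical sources that describe the same thing in different formats.
source_a = io.StringIO("page,visits\n/home,120\n/about,30\n")
source_b = io.StringIO("url;hits\n/home;80\n/contact;15\n")


def extract(handle, delimiter):
    """Extract: read raw rows from one source as dictionaries."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def transform(rows, page_key, count_key):
    """Transform: map source-specific columns onto one (page, visits) shape."""
    return [(row[page_key], int(row[count_key])) for row in rows]


def load(conn, records):
    """Load: insert the cleaned records into the database table."""
    conn.executemany("INSERT INTO visits (page, visits) VALUES (?, ?)", records)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, visits INTEGER)")
load(conn, transform(extract(source_a, ","), "page", "visits"))
load(conn, transform(extract(source_b, ";"), "url", "hits"))

totals = dict(conn.execute("SELECT page, SUM(visits) FROM visits GROUP BY page"))
print(totals)
```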
Data warehouses
NoSQL Databases
❑ Not forced into row/column
❑ Lots of different types
❑ Key/value: can add feature without rewriting tables
❑ Graph: stores nodes and edges
❑ Column: useful if you have a lot more reads than writes
❑ Document: general-purpose. MongoDB is commonly used.
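To see why "not forced into row/column" matters, here is a toy stand-in for a key/value or document store built from a Python dict and JSON. It is not the API of MongoDB or any real NoSQL database; `put`, `get` and the sample records are invented for illustration.

```python
import json

store = {}  # key/value store stand-in: each key maps to a serialized document


def put(key, document):
    """Save a document under a key; no table schema is declared anywhere."""
    store[key] = json.dumps(document)


def get(key):
    """Fetch a document back as a Python dict."""
    return json.loads(store[key])


put("user:1", {"name": "Ada"})
# A new field appears on one record only; nothing has to be rewritten.
put("user:2", {"name": "Grace", "languages": ["COBOL"]})

# Graph-style data: nodes plus the edges between them.
edges = {"user:1": ["user:2"], "user:2": []}

print(get("user:1"), get("user:2"), edges)
```

In an SQL table the new `languages` field would mean altering the table for every row; here only the record that has the field carries it.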
Data Lakes
BIGGER DATA SCIENCE TEAMS
Or, ‘Who does this stuff?’
Big Data Work
❑ Data Science
❑ Data Analysis
❑ Data Engineering
❑ Data Strategy
Big Data Science Teams
❑ Usually seen:
❑ Project manager
❑ Business analysts
❑ Data Scientists / Analysts: insight from data
❑ Data Engineers / Developers: data flow implementation, production systems
❑ Sometimes seen:
❑ Data Architect: data flow design
❑ User Experience / User Interface developer / Visual designer
Data Strategy
❑ Why should data be important here?
❑ Which business questions does this place have?
❑ What data does/could this place have access to?
❑ How much data work is already here?
❑ Who has the data science gene?
❑ What needs to change to make this place data-driven?
❑ People (training, culture)
❑ Processes
❑ Technologies (data access, storage, analysis tools)
❑ Data
Data Analysis
❑ What are the statistics of this dataset?
❑ E.g. which pages are popular
❑ Usually on already-formatted data, e.g. Google Analytics results
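A "which pages are popular" question often needs nothing more than counting. This sketch uses a made-up list of page views; real data would come from an analytics export.

```python
from collections import Counter
from statistics import mean

# Hypothetical page-view log, one entry per visit.
page_views = ["/home", "/about", "/home", "/contact", "/home", "/about"]

popularity = Counter(page_views)                    # views per page
top_page, top_views = popularity.most_common(1)[0]  # the most-viewed page
views_per_page = mean(popularity.values())          # average views per page

print(top_page, top_views, views_per_page)
```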
Data Science
❑ Ask an interesting question
❑ Get the data
❑ Explore the data
❑ Model the data
❑ Communicate and visualize your results
Data Engineering
❑ Big data storage
❑ SQL, NoSQL
❑ warehouses, lakes
❑ Cloud computing architectures
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ Big data analytics
❑ Distributed programming platforms
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ etc.
EXERCISES
Or, ‘Trying some of this out’
Exercises
❑ Use pandas read_csv() to read a datafile in chunks
LEARNING MORE
Or, ‘books’
READING
“Books are a uniquely portable magic” – Stephen King
THANK YOU
sjterp@thoughtworks.com

Session 10 handling bigger data
