Session 10 handling bigger data

HANDLING BIGGER DATA
What to do if your data’s too big
Data nerding

Your 5-7 things
❑ Bigger data
❑ Much bigger data
❑ Much bigger data storage
❑ Bigger data science teams

BIGGER DATA
Or, ‘data that’s a bit too big’

Computer storage
❑ 250 GB internal hard drive: (hopefully) permanent storage. The place you’re storing photos, data etc.
❑ 16 GB RAM: temporary storage. The place read_csv loads your dataset into.
❑ 2 TB external hard drive: a handy place to keep bigger datafiles.

Gigabytes, Terabytes etc.
Name       Size in bytes                      Contains (roughly)
Byte       1                                  1 character (‘a’, ‘1’ etc.)
Kilobyte   1,000                              Half a printed page
Megabyte   1,000,000                          1 novella; 5 MB = complete works of Shakespeare
Gigabyte   1,000,000,000                      1 high-fidelity symphony recording; 10 m of shelved books
Terabyte   1,000,000,000,000                  All the x-ray films in a large hospital; 10 TB = Library of Congress collection; 2.6 TB = Panama Papers leak
Petabyte   1,000,000,000,000,000              2 PB = all US academic libraries; 10 PB = 1 hour’s output from SKA telescope
Exabyte    1,000,000,000,000,000,000          5 EB = all words ever spoken by humans
Zettabyte  1,000,000,000,000,000,000,000
Yottabyte  1,000,000,000,000,000,000,000,000  Current storage capacity of the Internet
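
The unit sizes above can be turned into a quick "will it fit?" check. This is a minimal sketch using the decimal (1,000-based) units from the table. The function names and the overhead multiplier are hypothetical, chosen for illustration; how much memory a loaded dataset really needs depends on its column types.

```python
# Decimal units, as in the table: each step up is a factor of 1,000.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_size(n_bytes):
    """Return a rough human-readable size using decimal units."""
    size = float(n_bytes)
    for unit in UNITS:
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000

def fits_in_ram(file_bytes, ram_bytes, overhead=3.0):
    """Guess whether a file can be loaded whole.

    overhead is an assumed rule-of-thumb multiplier, not a measured figure.
    """
    return file_bytes * overhead <= ram_bytes

print(human_size(2_600_000_000_000))                 # the Panama Papers figure
print(fits_in_ram(5_000_000_000, 16_000_000_000))    # 5 GB file, 16 GB RAM
print(fits_in_ram(10_000_000_000, 16_000_000_000))   # 10 GB file, 16 GB RAM
```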

Things to Try: Too Big
❑ Read data in ‘chunks’:
csv_chunks = pandas.read_csv('myfile.csv', chunksize=10000)
❑ Divide and conquer in your code, e.g. start after the first 10,000 data rows (the range keeps the header row):
csv_chunks = pandas.read_csv('myfile.csv', skiprows=range(1, 10001), chunksize=10000)
❑ Use parallel processing
❑ E.g. the Dask library
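
A minimal sketch of the chunked approach, assuming the goal is a single statistic (here a mean) over a file too big to load at once. The in-memory CSV text and the column name `value` are made up so the example runs without a real file; with real data you would pass a filename instead.

```python
import io

import pandas as pd

# Stand-in for a large file on disk: 25 rows of made-up data.
csv_text = "value\n" + "\n".join(str(i) for i in range(25))

total = 0
n_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading everything into one DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    total += chunk["value"].sum()
    n_rows += len(chunk)

mean_value = total / n_rows
print(n_rows, mean_value)
```

Only one chunk is held in memory at a time, so this pattern works for any statistic that can be built up from running totals.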

Things to try: Too Slow
❑ Use %timeit (in IPython/Jupyter) to find where the speed problems are
❑ Use compiled Python (e.g. the Numba library)
❑ Use C code (via Cython)
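
Outside a notebook, the standard-library timeit module does the same job as %timeit. The sketch below times two ways of summing a range; both functions are illustrative, and the actual timings will vary by machine.

```python
import timeit

def slow_sum(n):
    """Add the numbers up one at a time in a Python loop."""
    total = 0
    for i in range(n):
        total += i
    return total

def fast_sum(n):
    """Let the built-in sum do the looping."""
    return sum(range(n))

# Run each candidate 20 times and report the total elapsed seconds.
slow_time = timeit.timeit(lambda: slow_sum(100_000), number=20)
fast_time = timeit.timeit(lambda: fast_sum(100_000), number=20)
print(f"loop: {slow_time:.4f}s  built-in sum: {fast_time:.4f}s")
```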

MUCH BIGGER DATA
Or, ‘What if it really doesn’t fit?’

Much Faster Datastreams
Twitter firehose:
❑ Firehose averages 6,000 tweets per second
❑ Record is 143,199 tweets in one second (Aug 3rd 2013, Japan)
❑ Twitter public streams = 1% of Firehose stream
Google index (2013):
❑ 30 trillion unique pages on the internet
❑ Google index = 100 petabytes (100 million gigabytes)
❑ 100 billion web searches a month
❑ Search returned in about ⅛ second

Distributed systems
❑ Store the data on multiple ‘servers’:
❑ Big idea: Distributed file systems
❑ Replicate data (server hardware breaks more often than you think)
❑ Do the processing on multiple servers:
❑ Lots of code does the same thing to different pieces of data
❑ Big idea: Map/Reduce
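
The Map/Reduce idea can be sketched on one machine in plain Python. This is a word-count illustration only, not how Hadoop or Spark implement it: the list of strings is hypothetical and stands in for pieces of data held on different servers.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each string stands in for a piece stored on a different server.
pieces = [
    "big data is big",
    "data moves slowly",
    "big servers break",
]

def map_count(text):
    """Map step: count the words in one piece, independently of the others."""
    return Counter(text.split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# The same code runs on every piece; on a cluster these calls would run in parallel.
partial_counts = map(map_count, pieces)
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts.most_common(2))
```

Because each map call needs only its own piece, the work can be spread over as many servers as there are pieces; only the small partial counts have to travel.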

Parallel Processors
❑Laptop: 4 cores, 16 GB RAM, 256 GB disk
❑Workstation: 24 cores, 1 TB RAM
❑Clusters: as big as you can imagine…

Map/Reduce: Crowdsourcing for computers

Distributed Programming Platforms
Hadoop
❑ HDFS: distributed filesystem
❑ MapReduce engine: processing
Spark
❑ In-memory processing
❑ Because moving data around is the biggest bottleneck

Typical (Current) Ecosystem
HDFS
Spark
Python
R
SQL
Tableau
Publisher
Data warehouse

Parallel Python Libraries
❑ Dask
❑ Datasets look like NumPy arrays and pandas DataFrames
❑ df.groupby(df.index).value.mean()
❑ Direct access into HDFS, S3 etc
❑ PySpark
❑ Also has DataFrames
❑ Connects to Spark

MUCH BIGGER DATA STORAGE
Or, ‘Where do we put all this stuff?’

SQL Databases
❑ Row/column tables
❑ Keys
❑ SQL query language
❑ Joins etc (like Pandas)
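
A minimal sketch of tables, keys and a join, using Python's built-in sqlite3 module with a throwaway in-memory database. The table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        book_id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(author_id)
    );
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO books VALUES (10, 'First', 1), (11, 'Second', 1), (12, 'Third', 2);
""")

# Join the two tables on the shared key, then count books per author.
rows = conn.execute("""
    SELECT authors.name, COUNT(*) AS n_books
    FROM books JOIN authors ON books.author_id = authors.author_id
    GROUP BY authors.name
    ORDER BY authors.name
""").fetchall()
print(rows)
conn.close()
```

The join plays the same role as a merge between two pandas DataFrames on a shared column.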

ETL (Extract - Transform - Load)
❑ Extract
❑ Extract data from multiple sources
❑ Transform
❑ Convert data into database formats (e.g. SQL tables)
❑ Load
❑ Load data into database
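
The three ETL steps can be sketched with the standard library alone. Everything here is hypothetical: two tiny in-memory CSV sources that disagree on temperature units, a transform that converts both to Celsius, and an in-memory SQLite table as the load target.

```python
import csv
import io
import sqlite3

# Extract: two made-up sources with different conventions.
source_a = io.StringIO("city,temp_c\nOslo,4.5\nLima,19.0\n")
source_b = io.StringIO("city,temp_f\nCairo,86.0\n")

def extract(handle):
    """Read every row of a CSV source as a dict."""
    return list(csv.DictReader(handle))

def transform(row):
    """Return (city, temperature in Celsius) whatever the source unit."""
    if "temp_c" in row:
        return row["city"], float(row["temp_c"])
    return row["city"], round((float(row["temp_f"]) - 32) * 5 / 9, 1)

records = [transform(r) for src in (source_a, source_b) for r in extract(src)]

# Load: write the cleaned records into one database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temperatures (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO temperatures VALUES (?, ?)", records)
conn.commit()
loaded = conn.execute(
    "SELECT city, temp_c FROM temperatures ORDER BY city"
).fetchall()
print(loaded)
conn.close()
```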

NoSQL Databases
❑ Not forced into row/column
❑ Lots of different types
❑ Key/value: can add a feature without rewriting tables
❑ Graph: stores nodes and edges
❑ Column: useful if you have a lot more reads than writes
❑ Document: general-purpose. MongoDB is commonly used.
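
To give a feel for these data shapes, here is a rough sketch using plain Python structures as stand-ins. No real NoSQL database is involved, and the keys, fields and names are all invented for illustration.

```python
import json

# Key/value: a dict stands in for the store; values are opaque JSON strings.
kv_store = {}
kv_store["user:1"] = json.dumps({"name": "Ann"})
# A later record carries a new field; earlier records need no changes.
kv_store["user:2"] = json.dumps({"name": "Bob", "languages": ["Python", "R"]})

# Document: each record is a nested document that can be queried by field.
documents = [json.loads(value) for value in kv_store.values()]
python_users = [
    doc["name"] for doc in documents if "Python" in doc.get("languages", [])
]

# Graph: nodes plus the edges between them, here "who follows whom".
follows = {"Ann": ["Bob"], "Bob": []}
followers_of_bob = [person for person, out in follows.items() if "Bob" in out]

print(python_users, followers_of_bob)
```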

BIGGER DATA SCIENCE TEAMS
Or, ‘Who does this stuff?’

Big Data Work
❑ Data Science
❑ Data Analysis
❑ Data Engineering
❑ Data Strategy

Big Data Science Teams
❑ Usually seen:
❑ Project manager
❑ Business analysts
❑ Data Scientists / Analysts: insight from data
❑ Data Engineers / Developers: data flow implementation, production systems
❑ Sometimes seen:
❑ Data Architect: data flow design
❑ User Experience / User Interface developer / Visual designer

Data Strategy
❑ Why should data be important here?
❑ Which business questions does this place have?
❑ What data does/could this place have access to?
❑ How much data work is already here?
❑ Who has the data science gene?
❑ What needs to change to make this place data-driven?
❑ People (training, culture)
❑ Processes
❑ Technologies (data access, storage, analysis tools)
❑ Data

Data Analysis
❑ What are the statistics of this dataset?
❑ E.g. which pages are popular
❑ Usually on already-formatted data, e.g. Google Analytics results

Data Science
❑ Ask an interesting question
❑ Get the data
❑ Explore the data
❑ Model the data
❑ Communicate and visualize your results

Data Engineering
❑ Big data storage
❑ SQL, NoSQL
❑ warehouses, lakes
❑ Cloud computing architectures
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ Big data analytics
❑ Distributed programming platforms
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ etc.

EXERCISES
Or, ‘Trying some of this out’

Exercises
❑ Use pandas read_csv() to read a datafile in chunks

LEARNING MORE
Or, ‘books’

READING
“Books are a uniquely portable magic” – Stephen King

THANK YOU
sjterp@thoughtworks.com
