Demystifying Data Engineering
Data engineering
• Software engineering with an emphasis on dealing with large amounts of data
• A “specialty” of software engineering
Why now?
• There has always been value in scale, but it was previously too difficult or expensive
• Economic and technological advances now make these scales accessible
Enable others to answer questions on a dataset within latency constraints
Data engineering
• Distributed systems: consensus, consistency, availability, etc.
• Parallel processing
• Databases
• Queuing
Data engineering
• Human-fault tolerance
• Metrics and monitoring
• Multi-tenancy
BackType
• When I joined:
• Comment search by keyword
• Comment search by user
• Basic stats on commenters
• Link search on Twitter
BackType
Kyoto Cabinet
Custom workers
Custom crawlers
BackType
• Inflexible
• Prone to corruption
• Heavy operational burden
• Not scalable
• Not fault-tolerant
BackType
• Enable asking any question (with high latency)
• Allows exploration and experimentation
• Establishes human-fault tolerance
[Diagram: Collector ×4]
ElephantDB
• Export results of MapReduce pipelines for querying
• Low-latency querying, but out of date by many hours
• Incredibly simple
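The idea on this slide can be sketched roughly as follows. This is not ElephantDB's actual API; it is a minimal illustration, assuming a batch job has already produced a dictionary of results. The names `shard_for`, `build_shards`, and `ReadOnlyIndex` are hypothetical. Each batch run builds a fresh set of immutable shards, so reads stay simple and fast but only reflect the last completed batch.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the key (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def build_shards(results: dict, num_shards: int) -> list:
    """Partition the batch output into per-shard dictionaries."""
    shards = [{} for _ in range(num_shards)]
    for key, value in results.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


class ReadOnlyIndex:
    """Serves lookups from shards built by the last batch run; no writes."""

    def __init__(self, shards: list):
        self._shards = shards

    def get(self, key: str, default=None):
        shard = self._shards[shard_for(key, len(self._shards))]
        return shard.get(key, default)


# Stand-in for the output of a batch pipeline.
batch_results = {"http://example.com/a": 2, "http://example.com/b": 5}
shards = build_shards(batch_results, num_shards=4)
index = ReadOnlyIndex(shards)
```

Because nothing is ever updated in place, publishing a newer batch amounts to building new shards and constructing a new `ReadOnlyIndex`.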
Data engineering
• Infrastructure
• Data pipelines
• Abstractions
Data pipeline example
Tweets (S3) -> Normalize URLs -> Compute hour bucket -> Sum by hour/url -> Emit ElephantDB indexes
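A single-process sketch of this batch pipeline might look like the following. It assumes each tweet has been reduced to a `(timestamp, url)` pair with the timestamp in epoch seconds, and it uses a deliberately simple normalization rule; both are assumptions for illustration, not details from the slides. The function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def hour_bucket(epoch_seconds: int) -> int:
    """Number of whole hours since the epoch."""
    return epoch_seconds // 3600


def sum_by_hour_and_url(tweets) -> dict:
    """Count occurrences of each (hour bucket, normalized URL) pair."""
    counts = Counter()
    for timestamp, url in tweets:
        counts[(hour_bucket(timestamp), normalize_url(url))] += 1
    return dict(counts)


# Stand-in for tweets read from storage.
tweets = [
    (3600, "http://Example.com/a/"),
    (3700, "http://example.com/a#top"),
    (7300, "http://example.com/a"),
]
index = sum_by_hour_and_url(tweets)
# index == {(1, "http://example.com/a"): 2, (2, "http://example.com/a"): 1}
```

In a real batch system the same three steps would run as distributed jobs over the full dataset, and the resulting counts would be exported for querying.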
Data pipeline example
Tweets (Kafka) -> Normalize URLs -> Compute hour bucket -> Update hour/url bucket -> Cassandra
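The streaming variant differs mainly in the last step: instead of summing over the whole dataset, each incoming tweet updates its bucket in place. A minimal sketch, assuming an in-memory class stands in for the mutable store and a plain list stands in for the message stream (`HourUrlStore` and `process_tweet` are hypothetical names, and the normalization is intentionally crude):

```python
from collections import defaultdict


class HourUrlStore:
    """In-memory stand-in for a mutable database of per-hour URL counts."""

    def __init__(self):
        self._counts = defaultdict(int)

    def increment(self, hour: int, url: str, amount: int = 1) -> None:
        self._counts[(hour, url)] += amount

    def get(self, hour: int, url: str) -> int:
        return self._counts.get((hour, url), 0)


def process_tweet(store: HourUrlStore, timestamp: int, url: str) -> None:
    """Normalize the URL, compute the hour bucket, and update the count."""
    normalized = url.strip().lower().rstrip("/")  # crude rule, for illustration
    store.increment(timestamp // 3600, normalized)


store = HourUrlStore()
stream = [
    (3600, "http://example.com/a/"),
    (3650, "http://EXAMPLE.com/a"),
    (7300, "http://example.com/b"),
]
for timestamp, url in stream:
    process_tweet(store, timestamp, url)
```

Results are visible as soon as each message is processed, at the cost of maintaining mutable state that the batch version avoids.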
Abstraction example
MapReduce Cascading Cascalog
Infrastructure
• HDFS
• MapReduce
• Kafka
• Storm
• Spark
• Cassandra
• HBase
• ElephantDB
• Zookeeper
Streaming compute team at Twitter
• Started streaming compute team at Twitter
• One shared Storm cluster for entire company
Multi-tenancy
• Independent applications on same cluster
• Topologies should not affect one another
Resource allocation
• Topologies should be given an appropriate amount of resources
Initial approach
• Use Mesos to provide resource guarantees
• Users include resources needed as part of topology submission
Solution
• Implement new scheduler which gives production topologies dedicated hardware
• Only Storm team can configure production topologies
• Left-over machines are used as failover or for in-development topologies
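The allocation policy described here can be illustrated with a small sketch. This is not Storm's scheduler interface; `schedule` is a hypothetical function that assumes machines are interchangeable and that the production configuration maps each topology name to the number of dedicated machines it needs.

```python
def schedule(machines: list, production: dict):
    """Give each production topology dedicated machines; return the rest.

    `production` maps a topology name to the number of machines it requires.
    Returns (assignment, leftover), where leftover machines form the shared
    pool for failover and in-development topologies.
    """
    free = list(machines)
    assignment = {}
    for topology, needed in production.items():
        if needed > len(free):
            raise ValueError(f"not enough machines for topology {topology!r}")
        assignment[topology] = free[:needed]
        free = free[needed:]
    return assignment, free


machines = ["m1", "m2", "m3", "m4", "m5"]
# In the setup described on the slide, only the Storm team would edit this.
production = {"ads": 2, "search": 2}
assignment, leftover = schedule(machines, production)
# assignment == {"ads": ["m1", "m2"], "search": ["m3", "m4"]}
# leftover == ["m5"]
```

Because no machine appears in two assignments, one production topology cannot take resources from another, which is the isolation property the multi-tenancy slides call for.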
Data Engineering vs Data Science
• Well-defined problems
• No special statistics skills required
• Larger scope
• Not just analytics
Open source
• Almost all major Big Data tools are open source (e.g. Hadoop, Storm, Spark, Kafka, Cassandra, HBase)
• Many have commercial support
Open source
• Very important for recruiting data engineers
• Strong developers want to work at places where they can be involved with open source
Open source
• Develop a technology brand for the company (in conjunction with a tech blog)
• Creating a popular open source project can give you access to lots of strong engineers
Open source
• Identify strong engineers in the community you may want to recruit
• Learn best practices and get help from the people who know the tools the best
• *Do not* expect to get “free work” on your projects
Ideal data engineer
• Strong software engineering skills
  • Abstraction
  • Testing
  • Version control
  • Refactoring
• Strong algorithm skills
• Good at digging into open source code
• Good at stress testing
Finding strong data engineers
• Standard “coding on the whiteboard” interviews are nearly useless
• Use take-home projects to gauge general programming ability
• Best of all is to see projects that require data engineering
Questions?
