Introduction to Data Engineering
Hadi Fadlallah
About Me
…
…
…
…
Just a Data Engineer!
4/6/2022 2
Plan
• What is Data Engineering?
• Data Engineer vs. Data Scientist vs. Data Analyst
• Understanding Data Management (Data Layers, DQS, MDS, Provenance)
• Distributed Computing
• Designing Data Pipelines (Choosing Paradigm / Technologies)
• Data Engineer Jobs / Required Skills
• Helpful Tips
• Online Courses
What is Data Engineering?
• Data Engineering is the act of:
  • Collecting data
  • Transforming (…) data
  • Validating data
“Making data consumable”
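The three activities above can be sketched as a tiny pipeline. This is a minimal illustration rather than a prescribed implementation: the inline CSV sample, the field names (`name`, `age`), and the function names `collect`, `transform`, and `validate` are all assumptions made for the example.

```python
import csv
import io

# Hypothetical raw input; in practice this might come from a file, API, or database.
RAW_CSV = """name,age
 Alice ,34
Bob,not-a-number
carol,29
"""


def collect(source: str) -> list[dict]:
    """Collect: read raw rows from the source without altering them."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows: list[dict]) -> list[dict]:
    """Transform: normalize names and convert ages to integers where possible."""
    cleaned = []
    for row in rows:
        age_text = row["age"].strip()
        cleaned.append({
            "name": row["name"].strip().title(),
            "age": int(age_text) if age_text.isdigit() else None,
        })
    return cleaned


def validate(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Validate: separate consumable rows from rejected ones."""
    valid = [r for r in rows if r["name"] and r["age"] is not None]
    rejected = [r for r in rows if r not in valid]
    return valid, rejected


valid_rows, rejected_rows = validate(transform(collect(RAW_CSV)))
print(valid_rows)     # rows that are ready to be consumed
print(rejected_rows)  # rows that need attention before they can be used
```

Keeping the rejected rows instead of silently dropping them is one possible design choice; it makes the validation step auditable.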
What is a Data Engineer?
• AI/ML Engineer
• BI Developer
• Data Analyst
• Database Administrator
• Report Developer
• Data Developer
• Data Architect
• Data Integration Specialist
• ETL Developer
• Data Scientist
Data Engineer vs. Data Scientist
Source: https://elu.nl/careers-in-data-science-data-analyst-vs-data-engineer-vs-data-scientist/
Understanding Data Layers
• Data Sources
• Data Ingestion
• Raw Data Storage
• Data Processing
• Processed Data Storage
• Data Visualization
Source: https://www.sharpersoftware.com/data/dqs
Source: https://www.dataentryoutsourced.com/blog/ever-changing-master-data-management/
Data Provenance
Source: https://www.canto.com/blog/data-lineage/
Data Wrangling vs. Data Pre-processing
Source: https://medium.com/swlh/data-pre-processing-data-wrangling-4a6a8624e747
What are Distributed Systems?
• 1 Cluster = many machines (nodes)
• Parallel computing
• Scalability
• Fault-tolerance (High availability)
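The idea of splitting work across nodes and surviving a failure can be imitated on a single machine. The sketch below is only an analogy, not a real cluster: threads stand in for nodes, and the names `process_partition` and `run_with_retry`, the workload, and the simulated failure are all assumptions invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workload: sum the numbers 1..100, split into 4 partitions.
DATA = list(range(1, 101))
PARTITIONS = [DATA[i:i + 25] for i in range(0, len(DATA), 25)]

# Track attempts per partition so one simulated "node failure" can be injected.
attempts = {index: 0 for index in range(len(PARTITIONS))}


def process_partition(index: int) -> int:
    """Work done by one 'node': sum its partition. Partition 2 fails once."""
    attempts[index] += 1
    if index == 2 and attempts[index] == 1:
        raise RuntimeError("simulated node failure")
    return sum(PARTITIONS[index])


def run_with_retry(index: int, max_attempts: int = 3) -> int:
    """Fault tolerance: re-run a failed partition instead of failing the whole job."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return process_partition(index)
        except RuntimeError as error:
            last_error = error
    raise last_error


# Parallel computing: partitions are handed to a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(run_with_retry, range(len(PARTITIONS))))

total = sum(partial_sums)
print(partial_sums, total)
```

In this analogy, scalability corresponds to adding partitions and workers without changing `process_partition`; real distributed frameworks handle scheduling, data placement, and recovery in far more elaborate ways.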
• Data ingestion
• Data storage
• Data processing
Designing Data Pipelines
Understanding Data & Requirements
Source: https://firebrand.training/uk/learn/pmp/course-material/project-scope-management/collect-requirements
Choosing a Paradigm
Source: https://www.xplenty.com/blog/etl-vs-elt/
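The source above contrasts ETL with ELT. As a rough sketch of the difference, the example below loads the same records both ways, using Python's built-in `sqlite3` module as a stand-in for a data warehouse. The table names, the sample orders, and the cents-to-currency transformation are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical extracted records: (order_id, amount_in_cents, status).
EXTRACTED = [(1, 1250, "paid"), (2, 300, "cancelled"), (3, 4999, "paid")]

conn = sqlite3.connect(":memory:")

# --- ETL: transform in application code, then load only the finished result.
transformed = [
    (order_id, cents / 100.0)
    for order_id, cents, status in EXTRACTED
    if status == "paid"
]
conn.execute("CREATE TABLE etl_orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO etl_orders VALUES (?, ?)", transformed)

# --- ELT: load the raw records first, then transform inside the database.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, cents INTEGER, status TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", EXTRACTED)
conn.execute(
    """
    CREATE TABLE elt_orders AS
    SELECT order_id, cents / 100.0 AS amount
    FROM raw_orders
    WHERE status = 'paid'
    """
)

etl_rows = conn.execute("SELECT * FROM etl_orders ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM elt_orders ORDER BY order_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(etl_rows, elt_rows, raw_count)
```

Both paths end with the same cleaned table. The practical difference in this sketch is that the ELT path also keeps the raw records in the database, so they can be transformed again later without re-extracting them.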
Source: https://www.quora.com/Why-and-when-should-I-use-NoSQL-instead-of-SQL
Source: https://www.talend.com/resources/customer-360-data-hub/
Choosing the RIGHT Technologies
Relational databases
Source: https://www.stambia.com/en/solutions/by-technology/integration-to-database/relational-database-rdbms
Data Integration
Source: https://deevita.com/top-data-integration-tools/
NoSQL databases
Source: https://dock.ie/data.html
Big Data ecosystem
Source: https://sranka.wordpress.com/2014/01/29/big-data-technologies/
Data Engineer Jobs
Stack Overflow: https://stackoverflow.com/jobs?q=data+engineer
LinkedIn: https://www.linkedin.com/jobs/search/?geoId=92000001&keywords=data%20engineer&location=Remote
Required Skills
2. SQL, Python
3. Pandas library
4. AWS, Azure
5. Relational databases (SQL Server and MySQL)
6. Hadoop, Spark, Hive
7. NoSQL databases (MongoDB, Neo4j, and Cassandra)
…
1st Required Skill
Source: https://towardsdatascience.com/most-in-demand-tech-skills-for-data-engineers-58f4c1ca25ab
First, make it run correctly (in an acceptable way). Then start improving performance.
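One way to apply this tip is to keep the first, obviously correct version as a reference and check every faster version against it. The sketch below is a hypothetical illustration; the deduplication task and both function names are assumptions, not something taken from the slides.

```python
import random


def dedupe_simple(values: list[int]) -> list[int]:
    """Step 1: a straightforward version that is easy to verify (quadratic time)."""
    result = []
    for value in values:
        if value not in result:  # scans the result list on every iteration
            result.append(value)
    return result


def dedupe_fast(values: list[int]) -> list[int]:
    """Step 2: same behavior, but membership checks use a set (roughly linear time)."""
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


# Use the simple version as the reference when checking the optimized one.
random.seed(0)
sample = [random.randint(0, 50) for _ in range(1000)]
assert dedupe_fast(sample) == dedupe_simple(sample)
print(dedupe_fast([3, 1, 3, 2, 1]))
```

The optimized function is only trusted because it reproduces the reference output; if the two ever disagree, the simple version is the one to believe.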
Again, make sure that you understand the data and the requirements well!
DON’T use Big Data technologies just because you have a huge volume of data.
Follow best practices
Online Courses
• Coursera:
  • Google Cloud - Data Engineering, Big Data, and Machine Learning on GCP Specialization
  • UC San Diego - Big Data Specialization
• Udacity:
  • Data Engineering nanodegree
• DataCamp:
  • Data Engineer with Python Track
• IBM – CognitiveClass.ai
  • Free data science and data engineering courses
• Udemy:
  • Data Science A-Z™: Real-Life Data Science Exercises Included
Thank you
https://hadifadl.github.io
