Data.Engineers.Toolkit
Tools for Cloud Data Engineering
The Most Popular Tools in Use Today across
10 Skillsets in Data Engineering
Business Platform Success
We help our clients build their global
platforms on scalable data platforms
with our Playbook, Framework, and
Knowledge base.
Who we help Succeed
ARCHITECT
noun: architect; chief builder
verb: architect; design or make (COMPUTING)
“We create and manage global platforms that run on
Cassandra and related technologies.”
5 Things We Love: Scalable Fast Data
Without Datastax
With Datastax
The Landscape of Cloud Data Engineering
Query, BI, Data Warehouse, DataOps, DevOps, Data Engineering, SQL, NoSQL, Queues, Data Lake
SQL - The foundation of data
engineering. Still very relevant.
SQL / Relational Databases in Data Engineering
Tools
1. MySQL - The most popular open source relational database in use; MariaDB is a compatible fork.
2. PostgreSQL - Increasingly used to replace Oracle.
3. Microsoft SQL Server - Still relevant. Not going anywhere.
4. Oracle - Big companies use this. Still relevant.
Factors
1. Popularity - Very popular because most software, commercial or open source, runs on relational databases.
2. Function - What SQL databases offer for ACID transactions is currently hard to beat in NoSQL.
3. Staying Power - Open source, commercial, and cloud options. No reason to see it disappearing.
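The ACID point under Function can be illustrated with a minimal sketch. It uses SQLite from the Python standard library only because it needs no server; the accounts table, the names, and the transfer helper are hypothetical and not from the talk. A transfer that fails halfway is rolled back as a whole.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic unit of work."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        # Credit first, so a failed debit leaves partial work to undo.
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and is committed

try:
    transfer(conn, "alice", "bob", 500)  # debit would break the CHECK constraint
except sqlite3.IntegrityError:
    pass                                 # the credit to bob was rolled back too

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

Only the first transfer changes the balances; the second leaves no trace, which is the all-or-nothing behavior that most NoSQL stores do not offer across multiple records by default.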
NoSQL - The foundation for big
data applications. Lots of variants.
NoSQL / Non-relational DBs in Data Engineering
Tools
1. MongoDB - In use everywhere, due to its popularity in the Node.js world.
2. Redis - Needed not only for apps but also in the process of data engineering.
3. DynamoDB - Easy to get started. Lots of apps on AWS run on it.
4. Cassandra - In use by the largest companies for critical operations.
Factors
1. Popularity - Popular because it is easy to get started.
2. Function - Each has its own special reason to be useful.
3. Staying Power - The many variants, implementations, and managed services for these DBs show that enough people need them to sustain these additional service markets.
Data Lakes on HDFS - The
standard for storage and retrieval of
files - structured, unstructured,
semi-structured, or binary.
Data Lakes on HDFS / S3 Distributed File Storage
Tools
1. HDFS - The Hadoop Distributed File System; its API is widely supported for distributed file access.
2. Amazon S3 - Reachable from Hadoop tools through connectors such as S3A; the S3 object API is also a standard now.
3. Google Cloud Storage - Does what S3 does on Google Cloud.
4. Azure Blob Storage - Does what S3 does on Azure.
Factors
1. Popularity - Popularized by big data and by clouds needing their own distributed file storage.
2. Function - Use as object storage (key:value), to store raw files, or to store structured data for later use in query engines.
3. Staying Power - Responsible for the massive storage of all “cold” data that doesn’t need to be in a database. The HDFS and S3 interfaces are now universal.
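The key:value idea under Function can be sketched with a toy in-memory store. The ObjectStore class and the key names below are hypothetical; this is not the S3 or HDFS API, only the shape of it: flat keys map to bytes, and "folders" are just shared key prefixes.

```python
class ObjectStore:
    """Toy in-memory stand-in for an object store: flat keys mapped to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str = "") -> list:
        # There are no real directories; listing is a filter on key prefixes.
        return sorted(k for k in self._objects if k.startswith(prefix))

store = ObjectStore()
store.put("raw/events/2021-01-01.json", b'{"event": "signup"}')
store.put("raw/events/2021-01-02.json", b'{"event": "login"}')
store.put("curated/users.csv", b"id,name\n1,Ada\n")

raw_keys = store.list("raw/events/")
print(raw_keys)
```

Raw files and curated, structured files live side by side under different prefixes, which is how a data lake keeps data around for query engines to read later.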
Streams/Queues - Adding “Real-time” processing into the mix.
Streams / Queues in Data Engineering
Tools
1. RabbitMQ - Lots of use in business; works well until it doesn’t.
2. Apache Kafka - Full ecosystem, plus variants that support the Kafka protocol.
3. Amazon SQS - Easy to get started with and use in AWS. Other clouds offer similar services.
Factors
1. Popularity - Popular because of the rise of real-time use cases in business platforms.
2. Function - Used to store “everything” that’s happening, as well as focused “events” that trigger processes.
3. Staying Power - Demand in the market, and current users continue to grow their use cases.
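The two uses named under Function, keeping everything that happens and reacting to focused events, can be sketched with the standard library queue. This is an in-process stand-in, not RabbitMQ, Kafka, or SQS; the event shapes and the welcome_emails list are hypothetical.

```python
import queue
import threading

events = queue.Queue()
event_log = []        # keeps "everything" that happened, in arrival order
welcome_emails = []   # side effect triggered only by a focused event type

def consumer():
    while True:
        event = events.get()
        if event is None:              # sentinel: producer is done
            break
        event_log.append(event)
        if event["type"] == "signup":  # focused event triggers a process
            welcome_emails.append(event["user"])

worker = threading.Thread(target=consumer)
worker.start()

# The producer publishes events without knowing who consumes them.
for event in [
    {"type": "page_view", "user": "ada"},
    {"type": "signup", "user": "ada"},
    {"type": "page_view", "user": "bob"},
]:
    events.put(event)

events.put(None)
worker.join()
print(len(event_log), welcome_emails)
```

The producer and consumer only share the queue, which is the decoupling that makes these tools useful between systems.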
Data Engineering - The actual
work.
Popular Data Engineering Tools
Tools
1. Apache Spark - The most popular big data engineering toolkit. Supports Python, Scala, Java, R, and C#.
2. dbt - A new tool but very powerful. Abstracts database engineering into SQL.
3. Fivetran - Commercial tool for visually managing data flows.
4. Stitch - Similar to Fivetran, with many connectors and the open Singer framework.
Factors
1. Popularity - Different reasons for popularity. Commercial tools save tons of time.
2. Function - Lets you consolidate and standardize all flows into a single system.
3. Staying Power - Apache Spark is a core part of cloud offerings. Stitch and Fivetran are popular at large companies. dbt is new but growing well.
Data Operations - Managing In /
Out / Around
Data Operations in Data Engineering
Tools
1. Airflow - Many connectors to manage complex data flows.
2. Jenkins - Used for CI/CD; can run linear pipelines.
3. Prefect - New but powerful tool in Python.
4. Argo - Does CI/CD, but its workflow engine is also useful; Kubeflow Pipelines runs on it.
Factors
1. Popularity - Traction in big and small companies.
2. Function - Lets you orchestrate complex workflows of tasks as a directed acyclic graph (DAG).
3. Staying Power - Airflow is future-proof on Kubernetes, Argo is the new kid in Kubernetes, and Jenkins is in use at many companies.
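What "workflows of tasks (DAG)" means can be shown without any orchestrator. The sketch below uses graphlib from the Python standard library (3.9+) to turn task dependencies into a valid run order; the task names are hypothetical, and real tools such as Airflow add scheduling, retries, and connectors on top of this idea.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join"},
    "refresh_dashboard": {"load_warehouse"},
}

# static_order() yields every task after all of its dependencies.
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)
```

The two extract tasks have no dependency on each other, so an orchestrator is free to run them in parallel; a cycle in the graph would raise graphlib.CycleError instead of producing an order.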
Data Warehouse - Running SQL at
large scale.
Data Warehouse - Analytics Across Data
Tools
1. Redshift - Widely used because of AWS.
2. BigQuery - Well-integrated query engine in Google Cloud.
3. Snowflake - Does a bit of data engineering as well as acting as a query engine.
4. MS SQL / Oracle - Commercial DBs that offer a data warehouse configuration.
Factors
1. Popularity - Warehousing conventions such as dimensions and facts have been around for a while.
2. Function - After bringing data together and relating it, you can run massive SQL queries.
3. Staying Power - The theory isn’t going anywhere. Technologies may change, but the core concept is solid.
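The dimensions-and-facts convention can be sketched at toy scale. SQLite stands in for a warehouse here purely for convenience; the dim_product and fact_sales tables and their rows are hypothetical. The query shape, joining a fact table to a dimension and aggregating, is the same one a warehouse runs over far more rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension: descriptive attributes, one row per product.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    -- Fact: one row per sale, pointing at the dimension.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        amount     REAL NOT NULL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games'), (3, 'books');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 2, 25.0), (3, 3, 5.0), (4, 2, 15.0);
    """
)

revenue_by_category = conn.execute(
    """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
print(revenue_by_category)
```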
Query - Virtualizing data through
standard query engines.
Query Engines - Analytics Across Data Sources
Tools
1. Apache Hive - Available in the Hadoop ecosystem, or as variants from cloud vendors.
2. Spark SQL / Hive - Like Hive but on Spark.
3. PrestoDB - Open source data virtualization; can run on Spark and works with Hive.
4. Denodo - Commercial data virtualization; can run on Spark.
Factors
1. Popularity - Hive is a standard that works across systems such as Spark and Hadoop. Presto is popular. Denodo is coming up.
2. Function - Separates storage from query. “Virtualizes” queries.
3. Staying Power - The idea of separating storage from query has now been implemented in Snowflake and Redshift. These will stick.
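Separating storage from query can be sketched in a few lines. The data below stays in its original formats (a CSV string and a JSON Lines string stand in for files in a lake), and one query function reads both at query time. All names are hypothetical, and this is only the idea behind engines like Hive or Presto, not how they are implemented.

```python
import csv
import io
import json

# "Storage": data left in its original formats, never loaded into a database.
CSV_SOURCE = "id,city\n1,Paris\n2,Lima\n"
JSONL_SOURCE = '{"id": 3, "city": "Oslo"}\n{"id": 4, "city": "Lima"}\n'

def scan_csv(text):
    """Yield rows from CSV text as dicts with a common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "city": row["city"]}

def scan_jsonl(text):
    """Yield rows from JSON Lines text as dicts."""
    for line in text.splitlines():
        yield json.loads(line)

def query(scans, predicate):
    """'Query': apply one filter across every source at read time."""
    return [row for scan in scans for row in scan() if predicate(row)]

lima_rows = query(
    [lambda: scan_csv(CSV_SOURCE), lambda: scan_jsonl(JSONL_SOURCE)],
    lambda row: row["city"] == "Lima",
)
print(lima_rows)
```

Adding a new source means adding a scan function; the stored data and the query logic do not have to change.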
Business Intelligence -
Visualization and dashboarding data
for consumers.
Business Intelligence Tools for Data Engineers
Tools
1. Tableau - Very popular since they give people community access.
2. Looker - Commercial-grade tool; expect a good UI.
3. Redash - Powerful open source tool for data professionals to make reports and dashboards.
4. Metabase - Easy-to-use tool for people who are not admins or DBAs.
Factors
1. Popularity - BI is HUGE. Learning it is not just about the tool. Tools are always coming and going.
2. Function - Allows non-programmers to discover, analyze, and create visualizations and reports that other non-technical people can consume.
3. Staying Power - Tableau will stick around. Open source Redash is now supported by Databricks.
DevOps - Infrastructure/Software
Configuration/Large Scale Admin
DevOps Tools for Data Engineering
Tools
1. Terraform - Manage different clouds with one language.
2. Prometheus / Grafana - The O.G. of time series system data visualization.
3. Ansible - A better way to organize the commands you need to run: setup, configure, run ad-hoc commands.
More Tools
1. Docker - Customize your image.
2. Kubernetes - Run your cluster.
3. Argo - CI/CD for containers in Kubernetes land.
4. Jenkins - General-purpose CI/CD; can be used to run other tools.
Any Questions?
Create and
manage global
data platforms.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
Knowledge
Playbook.anant.us
Blog.anant.us
Cassandra.link
Cassandra.tools
Let’s talk.
Service Catalog
Cassandra
Spark
Kafka
Airflow
DevOps
DataOps
Training
Data Engineering
DevOps
DataOps
(Apprentice)


Data Engineer's Lunch #55: Get Started in Data Engineering
