Data Engineer's Lunch #55: Get Started in Data Engineering

Data.Engineers.Toolkit
Tools for Cloud Data Engineering
The Most Popular Tools in Use Today across
10 Skillsets in Data Engineering

Business Platform Success
We help our clients build their global
platforms on scalable data platforms
with our Playbook, Framework, and
Knowledge base.

ARCHITECT
noun: architect; chief builder
verb: architect; design or make (COMPUTING)
“We create and manage global platforms that run on
Cassandra and related technologies.”

Things We Love : Scalable Fast Data
Without Datastax
With Datastax

The Landscape of Cloud Data Engineering
Query
BI
Data Warehouse
DataOps
DevOps
Data Engineering
SQL
NoSQL
Queues
Data Lake

SQL - The foundation of data
engineering. Still very relevant.

SQL / Relational Databases in Data Engineering
Tools
1. MySQL - The most popular open source relational database, along with its fork MariaDB.
2. PostgreSQL - Increasingly used to replace Oracle.
3. Microsoft SQL Server - Still relevant. Not going anywhere.
4. Oracle - Big companies use this. Still relevant.
Factors
1. Popularity - Very popular because most software, commercial or open source, runs on relational databases.
2. Function - What SQL databases offer for ACID transactions is still hard to match in NoSQL.
3. Staying Power - Open source, commercial, and cloud options. No reason to expect it to disappear.
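
The ACID point above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for any relational database; the accounts table, the names in it, and the transfer helper are hypothetical and only serve the illustration. The second transfer fails halfway through, and the database rolls back the half that had already been applied.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failing debit has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

After the failed transfer, bob's balance does not include the 500 that was credited inside the aborted transaction. That all-or-nothing behavior is the "A" in ACID.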

NoSQL - The foundation for big
data applications. Lots of variants.

NoSQL / Non-relational DBs in Data Engineering
Tools
1. MongoDB - In use everywhere due to its popularity in the Node.js world.
2. Redis - Needed not only for apps but also within the data engineering process itself.
3. DynamoDB - Easy to get started. Lots of AWS apps run on DynamoDB.
4. Cassandra - In use by the largest companies for critical operations.
Factors
1. Popularity - Popular because they are easy to get started with.
2. Function - Each has its own special reason to be useful.
3. Staying Power - The many variants, implementations, and managed services for these DBs show that enough people need them to sustain those additional markets of services.

Data Lakes on HDFS - The
standard for storage and retrieval of
files - structured, unstructured,
semi-structured, or binary.

Data Lakes on HDFS / S3 Distributed File Storage
Tools
1. HDFS - The Hadoop Distributed File System, a widely supported interface for distributed file access.
2. Amazon S3 - Readable from Hadoop tools through the S3A connector, and the S3 object API is itself a standard now.
3. Google Cloud Storage - Does what S3 does on Google Cloud.
4. Azure Blob Storage - Does what S3 does on Azure.
Factors
1. Popularity - Popularized by big data and by clouds needing their own distributed file storage.
2. Function - Use as object storage (key:value), to store raw files, or to hold structured data for later use in query engines.
3. Staying Power - Responsible for the massive storage of all “cold” data that doesn’t need to be in a database. The HDFS and S3 standards are now universal.

Streams/Queues - Adding “Real-
time” processing into the mix.

Streams / Queues in Data Engineering
Tools
1. RabbitMQ - Lots of use in business, works well until it doesn’t.
2. Apache Kafka - Full ecosystem, plus variants that support the Kafka protocol.
3. Amazon SQS - Easy to get started with and use in Amazon. Similar services exist in other clouds.
Factors
1. Popularity - Popular because of the rise of real-time use cases in business platforms.
2. Function - Used to store “everything” that’s happening, as well as focused “events” that trigger processes.
3. Staying Power - Different reasons for staying power: demand in the market, and current users continue to grow their use cases.
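
The idea of events on a queue triggering processes can be sketched in-process with Python's standard library. This is only a stand-in for a real broker such as RabbitMQ, Kafka, or SQS and does not use any of their APIs; the event shapes ("order_placed", "page_view") and the worker function are hypothetical.

```python
import queue
import threading

events = queue.Queue()  # in-memory stand-in for a message broker
processed = []          # side effects produced by the consumer


def worker():
    """Consume events until the None sentinel arrives."""
    while True:
        event = events.get()
        try:
            if event is None:
                break
            # Only one event type triggers a process; the rest are ignored.
            if event["type"] == "order_placed":
                processed.append(f"invoice for order {event['order_id']}")
        finally:
            events.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

# Producer side: publish a few events, then the shutdown sentinel.
for order_id in (1, 2, 3):
    events.put({"type": "order_placed", "order_id": order_id})
events.put({"type": "page_view", "path": "/home"})
events.put(None)

events.join()    # wait until every event has been handled
consumer.join()
print(processed)
```

The producer never calls the invoicing logic directly. It only publishes events, which is what lets real systems add or scale consumers independently.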

Data Engineering - The actual
work.

Popular Data Engineering Tools
Tools
1. Apache Spark - The most popular big data engineering toolkit. Python, Scala, Java, R, C#.
2. dbt - New tool but very powerful. Abstracts database engineering into SQL.
3. Fivetran - Commercial tool for visually managing data flows.
4. Stitch - Similar to Fivetran, with many connectors and the open Singer framework.
Factors
1. Popularity - Different reasons for popularity. Commercial tools save tons of time.
2. Function - Lets you consolidate and standardize all flows into a single system.
3. Staying Power - Apache Spark is a core part of cloud offerings. Stitch and Fivetran are popular at large companies. dbt is new but has good growth.

Data Operations - Managing In /
Out / Around

Data Operations in Data Engineering
Tools
1. Airflow - Many connectors to manage complex data flows.
2. Jenkins - Used for CI/CD; can run linear pipelines.
3. Prefect - New but powerful tool in Python.
4. Argo - Does CI/CD, but its Workflows engine is useful too; Kubeflow Pipelines runs on it.
Factors
1. Popularity - Traction in big and small companies.
2. Function - Lets you orchestrate complex workflows of tasks as a directed acyclic graph (DAG).
3. Staying Power - Airflow is future-proof on Kubernetes, Argo is the new kid in Kubernetes, and Jenkins is in use in many companies.
Data Warehouse - Running SQL at
large scale.

Data Warehouse - Analytics Across Data
Tools
1. Redshift - Widely used because it is part of Amazon.
2. BigQuery - Well-integrated query engine in Google.
3. Snowflake - Does a bit of data engineering as well as acting as a query engine.
4. Microsoft SQL Server / Oracle - Commercial DBs have a data warehouse configuration.
Factors
1. Popularity - Warehousing conventions (dimensions, facts) have been around for a while.
2. Function - After bringing data together and relating it, you can run massive SQL queries.
3. Staying Power - The theory isn’t going anywhere. Technologies may change, but the core concept is solid.
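
The dimensions-and-facts convention mentioned above can be sketched at toy scale. The example uses Python's built-in sqlite3 module rather than a real warehouse; the table names (dim_product, dim_date, fact_sales) and the data are hypothetical. A fact table holds the measurements, dimension tables describe them, and an analytic query joins and aggregates across both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions: descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    -- Fact: one row per sale, pointing at the dimensions.
    CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount INTEGER);

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2021), (11, 2022);
    INSERT INTO fact_sales VALUES (1, 10, 20), (1, 11, 30), (2, 11, 50), (2, 11, 5);
    """
)

# Typical warehouse query: total sales by category and year.
rows = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
print(rows)
```

Redshift, BigQuery, and Snowflake accept queries of this same shape; what changes is the scale and the storage underneath.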

Query - Virtualizing data through
standard query engines.

Query Engines - Analytics Across Data Sources
Tools
1. Apache Hive - Available in the Hadoop ecosystem or as variants from cloud vendors.
2. Spark SQL / Hive - Like Hive but on Spark.
3. PrestoDB - Open source data virtualization; can run on Spark, works with Hive.
4. Denodo - Commercial data virtualization; can run on Spark.
Factors
1. Popularity - Hive is a standard and works in different systems like Spark and Hadoop. Presto is popular. Denodo is coming up.
2. Function - Separates storage from query. “Virtualizes” queries.
3. Staying Power - The theory of separating storage from query has now been implemented in Snowflake and Redshift. These will stick.
Business Intelligence -
Visualization and dashboarding data
for consumers.

Business Intelligence tools for Data Engineers
Tools
1. Tableau - Very popular since they give people community access.
2. Looker - Commercial-grade tool; expect a good UI.
3. Redash - Powerful open source tool for data professionals to make reports and dashboards.
4. Metabase - Easy-to-use tool for non-admin / non-DBA types.
Factors
1. Popularity - BI is HUGE. Learning it is not just about the tool. Tools are always coming and going.
2. Function - Allows non-programmers to discover, analyze, and create visualizations and reports that other non-technical people can consume.
3. Staying Power - Tableau will stick around. Open source Redash is now supported by Databricks.

DevOps - Infrastructure/Software
Configuration/Large Scale Admin

DevOps Tools for Data Engineering
Tools
1. Terraform - Manage different clouds with one language.
2. Prometheus / Grafana - The O.G. of time series system data visualization.
3. Ansible - Better organizes the commands that need to be run: setup, configure, run ad-hoc commands.
More Tools
1. Docker - Customize your image.
2. Kubernetes - Run your cluster.
3. Argo - CI/CD for containers in Kubernetes land.
4. Jenkins - General-purpose CI/CD; can use this to run other tools.

Create and
manage global
data platforms.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

Knowledge
Playbook.anant.us
Blog.anant.us
Cassandra.link
Cassandra.tools
Let’s talk.
Service Catalog
Cassandra
Spark
Kafka
Airflow
DevOps
DataOps
Training
Data Engineering
DevOps
DataOps
(Apprentice)
