Data engineering supports the movement and transformation of data. As companies rely on huge amounts of data to gain insights and drive innovation, the demand for data engineers continues to grow.
For data professionals, diving into data engineering projects offers a wealth of opportunities. Hands-on challenges sharpen your technical skills and provide a tangible portfolio to showcase your knowledge and experience.
In this article, I have curated a selection of data engineering projects designed to help you advance your skills and confidently tackle real-world data challenges!
Why Work on Data Engineering Projects?
Building a solid understanding of data engineering through theory and practice is important. If you’re reading this article, you may already know this, but here are three specific reasons to dive into these projects:
Building technical skills
Data engineering projects provide hands-on experience with technologies and methodologies. You'll develop proficiency in programming languages, database management, big data processing, and cloud computing. These technical skills are fundamental to data engineering roles and highly transferable across the tech industry.
Portfolio development
Creating a portfolio of data engineering projects demonstrates your practical abilities to potential employers. You provide tangible evidence of your capabilities by showcasing implementations of data pipelines, warehouse designs, and optimization solutions.
A strong portfolio sets you apart in the job market and complements your resume with real-world accomplishments.
Learning tools and technologies
The data engineering field employs a diverse array of tools and technologies. Working on projects exposes you to data processing frameworks, workflow management tools, and visualization platforms.
This practical experience keeps you current with industry trends and enhances adaptability in an evolving technological landscape.
Data Engineering Projects for Beginners
These projects aim to introduce the main tools used by data engineers. Start here if you are new to data engineering or need a refresher.
Project 1: ETL pipeline with open data (CSV to Parquet to BigQuery)
This project entails building an ETL pipeline using a publicly available dataset, such as weather or transportation data. You will extract the data from a raw CSV file, clean and transform it using Python, and load the transformed data into Google BigQuery.
To make this project truly modern, try using Polars for your transformations instead of the traditional Pandas library. Polars is significantly faster and becoming a favorite tool in the data engineering community. Additionally, before loading the data into the cloud, practice converting it into Parquet format. Parquet is a columnar storage format that is far more efficient than CSV and is the standard for big data storage.
This project is excellent for beginners as it introduces core ETL concepts—data extraction, transformation, and loading—while giving exposure to cloud tools like BigQuery and critical file formats.
You'll also learn how to interact with cloud data warehouses, a core skill in modern data engineering, using simple tools like Python and the BigQuery API. For an introduction, review the beginner’s guide to BigQuery.
As for the data, you can select an available dataset from either Kaggle or data.gov.
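To make the workflow concrete, here is a minimal sketch of the pipeline in Python. It assumes the polars and google-cloud-bigquery packages are installed and that your Google Cloud credentials are configured; the file names, column names, and table ID are placeholders to adapt to your chosen dataset:

```python
import polars as pl
from google.cloud import bigquery

# Extract: read the raw CSV (placeholder file name).
raw = pl.read_csv("weather_raw.csv")

# Transform: example cleaning steps -- adjust to your dataset's columns.
clean = (
    raw
    .drop_nulls()
    .rename({col: col.strip().lower() for col in raw.columns})
)

# Convert to Parquet, a columnar format that is smaller and faster to query than CSV.
clean.write_parquet("weather_clean.parquet")

# Load: push the Parquet file into a BigQuery table you own (placeholder table ID).
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
with open("weather_clean.parquet", "rb") as f:
    job = client.load_table_from_file(f, "my-project.weather.daily", job_config=job_config)
job.result()  # block until the load job finishes
```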
Resources
Here are some resources, including GitHub repositories and tutorials, that provide step-by-step guidance:
YouTube videos and tutorials:
- Polars tutorial: Our tutorial compares Pandas and Polars libraries, helping you to understand why data engineers are switching to Polars for large datasets.
- ETL Batch Pipeline with Cloud Storage, Dataflow, and BigQuery: This video showcases a complete use case of an ETL batch pipeline deployed on Google Cloud, illustrating the extraction, transformation, and loading stages into BigQuery.
GitHub repositories:
- End-to-End Data Pipeline: This repository demonstrates a fully automated pipeline that extracts data from CSV files, transforms it using Python and dbt, and loads it into Google BigQuery.
- ETL Pipeline with Airflow and BigQuery: This project showcases an ETL pipeline orchestrated with Apache Airflow that automates the extraction of data from CSV files, transformation using Python, and loading into BigQuery.
Courses:
- ETL and ELT in Python: Learn more about ETL processes in Python, covering foundational concepts and practical implementations to build data pipelines.
- Understanding Modern Data Architecture: This course offers a comprehensive overview of modern data architecture, focusing on best practices for moving and structuring data in cloud-based systems like BigQuery.
Skills developed
- Extracting data from CSV with Python.
- Transforming and cleaning data with Polars or Pandas.
- Working with columnar file formats like Parquet.
- Loading data into BigQuery with Python and SQL.
Project 2: Weather data pipeline with Python and PostgreSQL
This project introduces aspiring data engineers to the fundamental process of building a data pipeline, focusing on three core aspects: data collection, cleansing, and storage.
Using Python, you’ll fetch weather conditions and forecasts from Open-Meteo, a completely free API that requires no API key. Once the weather data is collected, you’ll process the raw JSON, which may involve converting temperature units, handling missing values, or standardizing location names. Finally, you’ll store the cleansed data in a PostgreSQL database.
Modern Twist (Recommended): Instead of installing PostgreSQL directly on your computer, try running it in a Docker container. This keeps your machine clean and shows employers that you understand containerization, a near-essential skill in modern data engineering.
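Here is a minimal sketch of the collect-clean-store loop, assuming PostgreSQL is reachable on localhost (for example, a container started with `docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=secret postgres:16`) and that the requests and psycopg2 packages are installed. The coordinates, table name, and connection details are placeholders:

```python
import requests
import psycopg2

# Extract: Open-Meteo needs no API key -- the coordinates here are placeholders (Berlin).
resp = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={"latitude": 52.52, "longitude": 13.41, "hourly": "temperature_2m"},
    timeout=30,
)
resp.raise_for_status()
hourly = resp.json()["hourly"]

# Transform: pair timestamps with readings and drop missing values.
rows = [
    (ts, temp)
    for ts, temp in zip(hourly["time"], hourly["temperature_2m"])
    if temp is not None
]

# Load: store the cleansed rows in PostgreSQL (connection details are assumptions).
conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS hourly_temperature (
            observed_at TIMESTAMP PRIMARY KEY,
            temperature_c REAL
        )
    """)
    cur.executemany(
        "INSERT INTO hourly_temperature VALUES (%s, %s) ON CONFLICT DO NOTHING",
        rows,
    )
conn.close()
```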
Resources
Here are some valuable resources to help you with this specific stack:
Documentation:
- Open-Meteo Docs: The documentation is excellent and includes a URL builder so you can see the data structure before you write any code.
GitHub repositories:
- Weather and Air Quality ETL Pipeline: This repository demonstrates an ETL pipeline that extracts weather and air quality data from public APIs, transforms it into a clean, analyzable format, and loads it into a PostgreSQL database.
- Weather Data Integration Project: An end-to-end ETL pipeline that extracts weather data, transforms it, and loads it into a PostgreSQL database.
Courses:
- Creating PostgreSQL Databases: This course offers a comprehensive guide to PostgreSQL, covering essential skills for creating, managing, and optimizing databases—a critical step in the weather data pipeline.
- Data Engineer in Python: This skill track covers foundational data engineering skills, including data collection, transformation, and storage, providing a strong start for building pipelines in Python.
Skills developed
- Using Python to write data pipeline applications.
- Collecting data from external sources (APIs).
- Docker basics (spinning up a database container).
- Setting up databases and writing SQL to store data.
Project 3: London transport analysis
This project offers an excellent starting point for aspiring data engineers. It introduces you to working with real-world data from a major public transport network that handles over 1.5 million daily journeys.
The project's strength lies in its use of industry-standard data warehouse solutions like Snowflake, Amazon Redshift, Google BigQuery, or Databricks. These platforms are crucial in modern data engineering, allowing you to efficiently process and analyze large datasets.
By analyzing transport trends, popular methods, and usage patterns, you'll learn how to extract meaningful insights from large datasets - a core competency in data engineering.
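If you choose Snowflake, querying the warehouse from Python takes only a few lines with the snowflake-connector-python package. The account details below are placeholders, and the table and column names are hypothetical, so adapt them to however you load the transport dataset:

```python
import snowflake.connector

# Connection parameters are placeholders -- substitute your own account details.
conn = snowflake.connector.connect(
    account="your_account_identifier",
    user="your_user",
    password="your_password",
    warehouse="COMPUTE_WH",
    database="TRANSPORT",
    schema="PUBLIC",
)

# Hypothetical table and columns: rank transport modes by total journeys.
query = """
    SELECT journey_type, SUM(journeys_millions) AS total_journeys_millions
    FROM journeys
    GROUP BY journey_type
    ORDER BY total_journeys_millions DESC
"""

cur = conn.cursor()
try:
    cur.execute(query)
    for journey_type, total in cur.fetchall():
        print(journey_type, round(total, 2))
finally:
    cur.close()
    conn.close()
```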
Resources
Here are some resources, including guided projects and courses, that provide step-by-step guidance:
Guided projects:
- Exploring London’s Travel Network: This guided project teaches you how to analyze London's public transport data, helping you explore trends, popular routes, and usage patterns. You'll gain experience with large-scale data analysis using real-world data from a major public transport network.
Courses:
- Data Warehousing Concepts: This course covers essential data warehousing principles, including architectures and use cases for platforms like Snowflake, Redshift, and BigQuery. It's an excellent foundation for implementing large-scale data storage and processing solutions.
Skills developed
- Exploring and understanding the data before writing queries against it.
- Working with large datasets.
- Understanding big data concepts.
- Working with data warehouses and big data tools, like Snowflake, Redshift, BigQuery, or Databricks.
Intermediate Data Engineering Projects
These projects focus on skills like writing better code and integrating different data platforms. These technical skills are essential for contributing to an existing tech stack and working as part of a larger team.
Project 4: Performing a code review
This project is all about reviewing the code of another data engineer. While it may not be as hands-on with the technology as some other projects, being able to review others’ code is an important part of growing as a data engineer.
Reading and reviewing code is just as important a skill as writing code. After understanding foundational data engineering concepts and practices, you can apply them when reviewing others’ code to ensure it follows best practices and to catch potential bugs.
Resources
Here are some valuable resources, including projects and articles, that provide step-by-step guidance:
Guided projects:
- Performing a Code Review: This guided project offers hands-on experience in code review, simulating the code review process as if you were a senior data professional. It’s an excellent way to practice identifying potential bugs and ensuring best practices are followed.
Articles:
- How to Do a Code Review: This resource provides recommendations on conducting code reviews effectively, based on extensive experience, and covers various aspects of the review process.
Skills developed
- Reading and evaluating code written by other data engineers
- Finding bugs and logic errors when reviewing code
- Providing feedback on code in a clear and helpful manner
Project 5: Building a retail data pipeline
In this project, you'll build a complete ETL pipeline with Walmart's retail data. You'll retrieve data from various sources, including SQL databases and Parquet files, apply transformation techniques to prepare and clean the data, and finally load it into an easily accessible format.
This project is excellent for building on foundational data engineering knowledge because it covers essential skills like extracting data from multiple formats, transforming it for meaningful analysis, and loading it for efficient storage and access. It reinforces concepts like handling diverse data sources, optimizing data flows, and maintaining scalable pipelines.
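As a sketch of what the pipeline might look like with pandas (assuming pyarrow is installed for Parquet support), here is one possible flow; the database, table, file names, and column names are hypothetical stand-ins for whichever sources your version of the project uses:

```python
import sqlite3
import pandas as pd

# Extract: pull from two different source formats.
with sqlite3.connect("walmart_sales.db") as conn:              # hypothetical database
    sales = pd.read_sql("SELECT * FROM grocery_sales", conn)   # hypothetical table
extra = pd.read_parquet("extra_data.parquet")                  # hypothetical file

# Transform: join the sources, fix types, and handle missing values.
merged = sales.merge(extra, on="index", how="left")            # assumed join key
merged["date"] = pd.to_datetime(merged["date"], errors="coerce")
clean = merged.dropna(subset=["date", "weekly_sales"])         # assumed columns

# Aggregate something analysts care about, e.g., average sales per month.
monthly = (
    clean.assign(month=clean["date"].dt.month)
         .groupby("month", as_index=False)["weekly_sales"]
         .mean()
)

# Load: write both outputs to an easily accessible format.
clean.to_parquet("clean_data.parquet", index=False)
monthly.to_csv("agg_data.csv", index=False)
```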
Resources
Here are some valuable resources, including guided projects and courses, that provide step-by-step guidance:
Guided projects:
- Building a Retail Data Pipeline: This guided project takes you through constructing a retail data pipeline using Walmart’s retail data. You’ll learn to retrieve data from SQL databases and Parquet files, transform it for analysis, and load it into an accessible format.
Courses:
- Database Design: A solid understanding of database design is essential when working on data pipelines. This course covers the basics of designing and structuring databases, which is valuable for handling diverse data sources and optimizing storage.
Skills developed
- Designing data pipelines for real-world use cases.
- Extracting data from multiple sources and different formats.
- Cleaning and transforming data from different formats to improve its consistency and quality.
- Loading this data into an easily accessible format.
Project 6: Factors influencing student performance with SQL
In this project, you'll analyze a comprehensive database focused on various factors that impact student success, such as study habits, sleep patterns, and parental involvement. By crafting SQL queries, you'll investigate the relationships between these factors and exam scores, exploring questions like the effect of extracurricular activities and sleep on academic performance.
This project builds data engineering skills by enhancing your ability to manipulate and query databases effectively.
You'll develop skills in data analysis, interpretation, and deriving insights from complex datasets, essential for making data-driven decisions in educational contexts and beyond.
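For a flavor of the kind of query you'll write, here is a minimal sketch using Python's built-in sqlite3 module; the database file, table name, and column names are hypothetical, so map them onto the schema you actually work with:

```python
import sqlite3

# Hypothetical schema: a student_performance table with study, sleep, and score columns.
QUERY = """
SELECT
    hours_studied,
    AVG(exam_score)  AS avg_exam_score,
    AVG(sleep_hours) AS avg_sleep_hours,
    COUNT(*)         AS n_students
FROM student_performance
GROUP BY hours_studied
ORDER BY hours_studied;
"""

conn = sqlite3.connect("student_performance.db")  # assumed local database file
for row in conn.execute(QUERY):
    print(row)
conn.close()
```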
Resources
Here are some resources, including guided projects and courses, that provide step-by-step guidance:
Guided projects:
- Factors that Fuel Student Performance: This guided project enables you to explore the influence of various factors on student success by analyzing a comprehensive database. You’ll use SQL to investigate relationships between study habits, sleep patterns, and academic performance, gaining experience in data-driven educational analysis.
Courses:
- Data Manipulation in SQL: A strong foundation in SQL data manipulation is key for this project. This course covers SQL techniques for extracting, transforming, and analyzing data in relational databases, equipping you with the skills to handle complex datasets.
Skills developed
- Writing and optimizing SQL queries to retrieve and manipulate data effectively.
- Analyzing complex datasets to identify trends and relationships.
- Formulating hypotheses and interpreting results based on data.
Project 7: High-performance local analytics with DuckDB
While the previous project focused on writing queries, this project focuses on performance and architecture. You will use DuckDB, a modern "in-process" database, to analyze a dataset that would be too slow or heavy for standard tools like Excel or Pandas.
You will take a large public dataset (like the NYC Taxi Trip Data or Citibike Data), convert it into the industry-standard Parquet format, and run complex aggregation queries. You will learn how "Columnar Storage" allows you to query millions of rows in a fraction of a second on your own laptop, without needing to install a server.
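Here is a minimal sketch of both steps with the duckdb Python package: converting a raw CSV to Parquet and then aggregating straight from the Parquet file. The file names are placeholders, and the column names are assumptions based on the yellow taxi schema:

```python
import duckdb

con = duckdb.connect()  # in-process database, no server to install

# One-time conversion: rewrite the raw CSV as Parquet (placeholder file names).
con.execute("""
    COPY (SELECT * FROM read_csv_auto('yellow_tripdata_2023-01.csv'))
    TO 'yellow_tripdata_2023-01.parquet' (FORMAT PARQUET)
""")

# Query the Parquet file directly -- column names are assumptions from the taxi schema.
result = con.execute("""
    SELECT
        passenger_count,
        COUNT(*)           AS trips,
        AVG(trip_distance) AS avg_distance,
        AVG(total_amount)  AS avg_fare
    FROM 'yellow_tripdata_2023-01.parquet'
    GROUP BY passenger_count
    ORDER BY trips DESC
""").fetchdf()

print(result)
```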
This project is impressive to employers because it shows you are keeping up with the latest trends in the "Modern Data Stack."
Resources
Here are resources to help you build this high-performance project:
Data sources:
- NYC Taxi & Limousine Commission: Use the "Yellow Taxi Trip Records" for a robust, real-world dataset that is perfect for testing speed.
Documentation:
- DuckDB "SQL on Parquet": Read the official guide on how to query Parquet files directly. This is the core skill of this project.
Skills developed
- Understanding Columnar Storage (Parquet) vs. Row Storage (CSV).
- Using DuckDB for serverless, high-speed SQL.
- Benchmarking query performance.
- Working with "larger-than-memory" datasets on a local machine.
Advanced Data Engineering Projects
One hallmark of an advanced data engineer is the ability to create pipelines that can handle a multitude of data types in different technologies. These projects focus on expanding your skill set by combining multiple advanced data engineering tools to create scalable data processing systems.
Project 8: Cleaning a dataset with PySpark
Using an advanced tool like PySpark, you can build pipelines that take advantage of Apache Spark's distributed processing capabilities to clean and transform data at scale.
Before you attempt to build a project like this, it's important to complete an introductory course to understand the fundamentals of PySpark. This foundational knowledge will enable you to fully utilize this tool for effective data extraction, transformation, and loading.
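As a starting point, here is a minimal PySpark sketch that reads a raw orders CSV, applies a few common cleaning steps, and writes the result to Parquet; the file path and column names are placeholders to swap for your own dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_cleaning").getOrCreate()

# Extract: read the raw orders file (placeholder path).
orders = spark.read.csv("orders_raw.csv", header=True, inferSchema=True)

# Transform: typical cleaning steps -- adjust column names to your dataset.
cleaned = (
    orders
    .dropDuplicates(["order_id"])                               # assumed key column
    .na.drop(subset=["order_id", "order_date"])                 # drop incomplete rows
    .withColumn("product", F.lower(F.trim(F.col("product"))))   # normalize text
    .withColumn("order_date", F.to_date("order_date"))          # fix the date type
    .filter(F.col("quantity") > 0)                              # remove bad records
)

# Load: write the cleaned data back out in a columnar format.
cleaned.write.mode("overwrite").parquet("orders_clean.parquet")

spark.stop()
```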
Resources
Here are some valuable resources, including guided projects, courses, and tutorials, that provide step-by-step guidance:
Guided projects:
- Cleaning an Orders Dataset with PySpark: This guided project walks you through cleaning an e-commerce orders dataset using PySpark, helping you understand how to extract, transform, and load data in a scalable way with Apache Spark.
Courses:
- Introduction to PySpark: This course provides an in-depth introduction to PySpark, covering essential concepts and techniques for effectively working with large datasets in Spark. It's an ideal starting point for building a strong foundation in PySpark.
Tutorials:
- PySpark Tutorial: Getting Started with PySpark: This tutorial introduces the core components of PySpark, guiding you through the setup and fundamental operations so you can confidently start building data pipelines with PySpark.
Skills developed
- Expanding experience with PySpark
- Cleaning and transforming data for stakeholders
- Ingesting large batches of data
- Deepening knowledge of Python in ETL processes
Project 9: Data modeling with dbt and BigQuery
A popular and powerful modern tool for data engineers is dbt (Data Build Tool), which allows data engineers to follow a software development approach. It offers intuitive version control, testing, boilerplate code generation, lineage, and environments. dbt can be combined with BigQuery or other cloud data warehouses to store and manage your datasets.
This project will allow you to create pipelines in dbt, generate views, and link the final data to BigQuery.
Resources
Here are some valuable resources, including courses and video tutorials, that provide step-by-step guidance:
YouTube videos:
- End to End Modern Data Engineering with dbt: In this video, CodeWithYu provides a comprehensive walkthrough of setting up and using dbt with BigQuery, covering the steps for building data pipelines and generating views. It’s a helpful guide for beginners learning to combine dbt and BigQuery in a data engineering workflow.
Courses:
- Introduction to dbt: This course introduces the fundamentals of dbt, covering basic concepts like Git workflows, testing, and environment management. It’s an excellent starting point for using dbt effectively in data engineering projects.
Skills developed
- Modeling data with dbt
- Working with BigQuery as a cloud data warehouse
- Creating SQL-based transformations
- Applying software engineering best practices (version control, testing, and documentation) to data engineering
Project 10: Airflow and Snowflake ETL using S3 storage and BI in Tableau
In this project, you'll use Airflow to pull data from an API and move it into Snowflake via an Amazon S3 bucket. The goal is to handle the ETL in Airflow and the analytical storage in Snowflake.
This is an excellent project because it connects multiple data sources and cloud services, all orchestrated with Airflow. With so many moving parts, it closely resembles a real-world data architecture. It also touches on business intelligence (BI) by adding visualizations in Tableau.
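To give you a skeleton to build from, here is a minimal Airflow DAG (Airflow 2.4+ syntax) with two tasks: one pulls from an API and drops the raw JSON into S3 with boto3, and one runs a COPY INTO statement in Snowflake with snowflake-connector-python. The API URL, bucket, stage, table, and credentials are all placeholders, and it assumes an external stage already points at the bucket:

```python
from datetime import datetime
import json

import boto3
import requests
import snowflake.connector
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3(**context):
    # Placeholder API endpoint -- swap in the API you choose.
    data = requests.get("https://example.com/api/markets", timeout=30).json()
    boto3.client("s3").put_object(
        Bucket="my-etl-bucket",  # placeholder bucket name
        Key=f"raw/markets_{context['ds']}.json",
        Body=json.dumps(data),
    )


def load_to_snowflake(**context):
    # Assumes an external stage (@raw_stage) already points at the S3 bucket.
    conn = snowflake.connector.connect(
        account="your_account", user="your_user", password="your_password",
        warehouse="COMPUTE_WH", database="ANALYTICS", schema="RAW",
    )
    conn.cursor().execute(
        "COPY INTO raw_markets FROM @raw_stage "
        f"FILES = ('raw/markets_{context['ds']}.json') "
        "FILE_FORMAT = (TYPE = 'JSON')"
    )
    conn.close()


with DAG(
    dag_id="api_to_snowflake",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)
    extract >> load
```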
Resources
Here are some valuable resources, including courses and video tutorials, that provide step-by-step guidance:
YouTube videos:
- Data Pipeline with Airflow, S3, and Snowflake: In this video, Seattle Data Guy demonstrates how to use Airflow to pull data from the PredictIt API, load it into Amazon S3, perform Snowflake transformations, and create Tableau visualizations. This end-to-end guide is ideal for understanding the integration of multiple tools in a data pipeline.
Courses:
- Introduction to Apache Airflow in Python: This course provides an overview of Apache Airflow, covering essential concepts such as DAGs, operators, and task dependencies. It's a great foundation for understanding how to structure and manage workflows in Airflow.
- Introduction to Snowflake: This course introduces Snowflake, a powerful data warehousing solution. It covers managing data storage, querying, and optimization. It’s perfect for gaining foundational knowledge before working with Snowflake in data pipelines.
- Data Visualization in Tableau: This course covers essential Tableau skills for data visualization, allowing you to turn data into insightful visuals—a core step for interpreting data pipeline outputs.
Skills developed
- Practice creating DAGs in Airflow
- Practice connecting to an API in Python
- Practice storing data in Amazon S3 buckets
- Moving data from Amazon to Snowflake for analysis
- Simple visualization of data in Tableau
- Creating a comprehensive, end-to-end data platform
Project 11: Hacker News ETL in AWS using Airflow
This project tackles a complex data pipeline with multiple steps using advanced data processing tools in the AWS ecosystem.
Instead of dealing with restricted social media APIs, you will use the Hacker News API, which is completely free and open. You will set up Apache Airflow to extract top stories and comments, transform the data to flatten the nested JSON structures, and load it into the cloud.
The architecture follows a standard "Modern Data Stack" pattern (a minimal extract-and-load sketch follows the list):
- Extract: Airflow triggers a Python script to fetch data from the Hacker News API.
- Load: The raw JSON data is dumped into an Amazon S3 bucket (your "Data Lake").
- Transform: You will use AWS Glue to crawl the data and create a schema.
- Analyze: Finally, you will use Amazon Athena to run SQL queries directly on your S3 data (serverless analysis) or load it into Amazon Redshift for warehousing.
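Here is a minimal sketch of the extract and load steps against the real Hacker News endpoints; the bucket name and key layout are placeholders, and in the full project you would call these functions from an Airflow task rather than a standalone script:

```python
import json

import boto3
import requests

HN_API = "https://hacker-news.firebaseio.com/v0"
BUCKET = "my-hackernews-lake"  # placeholder bucket name


def fetch_top_stories(limit=50):
    """Fetch the current top story items from the public Hacker News API."""
    ids = requests.get(f"{HN_API}/topstories.json", timeout=30).json()[:limit]
    return [
        requests.get(f"{HN_API}/item/{item_id}.json", timeout=30).json()
        for item_id in ids
    ]


def load_raw_to_s3(stories, run_date):
    """Dump raw JSON into the data lake, partitioned by run date (assumed layout)."""
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"hacker_news/raw/dt={run_date}/top_stories.json",
        Body="\n".join(json.dumps(s) for s in stories),  # newline-delimited JSON suits Glue/Athena
    )


if __name__ == "__main__":
    load_raw_to_s3(fetch_top_stories(), run_date="2024-01-01")
```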
Resources
Here are some resources, including courses and video tutorials, that provide step-by-step guidance:
Documentation:
- Hacker News API: The official documentation is simple and hosted on GitHub. It teaches you how to traverse the "Item IDs" to find stories and comments.
GitHub repositories:
- News Data Pipeline with Airflow & AWS: Look for repositories that demonstrate "Airflow to S3" pipelines. You can adapt these by simply changing the API endpoint from "NewsAPI" to "Hacker News."
- dlt (Data Load Tool) Hacker News Demo: The team at dltHub has a great blog post and repo specifically about pulling Hacker News data into data warehouses. This is a great modern alternative reference.
Courses and tutorials:
- Introduction to AWS: This course provides a solid foundation in AWS, covering essential concepts and tools. Understanding the basics of AWS services like S3, Glue, Athena, and Redshift will be crucial for successfully implementing this project.
- AWS Glue & Athena: Look for tutorials specifically on "crawling JSON data in S3 with Glue" to understand how to turn your raw files into queryable tables.
Skills developed
- Orchestration: creating complex DAGs in Airflow to manage dependencies.
- API interaction: recursively fetching nested data (comments within stories) from a public API.
- Data lake: storing raw, partitioned data in Amazon S3.
- Serverless SQL: using AWS Glue to catalog data and Amazon Athena to query it without a database server.
- Infrastructure: managing AWS permissions (IAM) to allow Airflow to talk to S3.
Project 12: Building a real-time data pipeline with PySpark, Kafka, and Redshift
In this project, you’ll create a robust, real-time data pipeline using PySpark, Apache Kafka, and Amazon Redshift to handle high volumes of data ingestion, processing, and storage.
The pipeline will capture data from various sources in real time, process and transform it using PySpark, and load the transformed data into Redshift for further analysis. Additionally, you’ll implement monitoring and alerting to ensure data accuracy and pipeline reliability.
This project is an excellent opportunity to build foundational skills in real-time data processing and handling big data technologies, such as Kafka for streaming and Redshift for cloud-based data warehousing.
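Here is a minimal PySpark Structured Streaming sketch that reads JSON events from a Kafka topic and appends each micro-batch to Redshift over JDBC. The topic name, event schema, JDBC URL, and credentials are placeholders, and it assumes the Spark-Kafka connector and a Redshift-compatible JDBC driver are on Spark's classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka_to_redshift").getOrCreate()

# Assumed event schema -- adapt it to whatever your producers send.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

# Read a stream of JSON messages from Kafka (placeholder brokers and topic).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)


def write_to_redshift(batch_df, batch_id):
    # Append each micro-batch via JDBC; connection details are placeholders.
    (batch_df.write.format("jdbc")
        .option("url", "jdbc:redshift://my-cluster:5439/dev")
        .option("dbtable", "public.transactions")
        .option("user", "awsuser")
        .option("password", "password")
        .mode("append")
        .save())


query = (
    events.writeStream
    .foreachBatch(write_to_redshift)
    .option("checkpointLocation", "/tmp/checkpoints/kafka_to_redshift")
    .start()
)
query.awaitTermination()
```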
Resources
Here are some resources, including courses and video tutorials, that provide step-by-step guidance:
YouTube videos:
- Building a Real-Time Data Pipeline with PySpark, Kafka, and Redshift: This video by Darshil Parmar guides you through building a complete real-time data pipeline with PySpark, Kafka, and Redshift. It includes steps for data ingestion, transformation, and loading, and also covers monitoring and alerting techniques to ensure pipeline performance.
Courses:
- Introduction to Apache Kafka: This course covers the basics of Apache Kafka, a crucial component for real-time data streaming in this project. It provides an overview of Kafka’s architecture and how to implement it in data pipelines.
- Streaming Concepts: This course introduces the fundamental concepts of data streaming, including real-time processing and event-driven architectures. It’s an ideal resource for gaining foundational knowledge before building real-time pipelines.
Summary Table of Data Engineering Projects
Here is a summary of the data engineering projects from above to give you a quick reference to the different projects:
| Project Name | Level | Skills Developed | Tools & Technologies |
|---|---|---|---|
| 1. ETL Pipeline with Open Data | Beginner | Data extraction, cleaning, and loading; Working with columnar formats; Cloud data warehousing. | Python, Polars (or Pandas), Google BigQuery, Parquet, CSV |
| 2. Weather Data Pipeline | Beginner | API data collection; Data cleansing; Containerization basics; SQL storage. | Python, Open-Meteo API, PostgreSQL, Docker, SQL |
| 3. London Transport Analysis | Beginner | Large-scale data analysis; Big data concepts; Query context understanding. | Snowflake, Amazon Redshift, BigQuery, or Databricks |
| 4. Performing a Code Review | Intermediate | Code evaluation; Bug detection; Logic error identification; Peer feedback. | Code Review Tools (General), Git |
| 5. Building a Retail Data Pipeline | Intermediate | Pipeline design; Multi-source extraction; Data consistency; Optimization. | SQL, Parquet, Python, Database Tools |
| 6. Factors Influencing Student Performance | Intermediate | Complex SQL querying; Trend identification; Hypothesis testing; Data interpretation. | SQL (Relational Databases) |
| 7. High-performance Local Analytics | Intermediate | Columnar vs. Row storage; Serverless SQL; Benchmarking; Local big data processing. | DuckDB, Parquet, NYC Taxi/Citibike Data |
| 8. Cleaning a Dataset with PySpark | Advanced | Distributed computing; Large-scale data ingestion; ETL with Spark. | PySpark, Apache Spark, Python |
| 9. Data Modeling with dbt | Advanced | Data modeling; Software engineering best practices (version control, testing, documentation); SQL transformations. | dbt (Data Build Tool), Google BigQuery, Git |
| 10. Airflow & Snowflake ETL | Advanced | DAG creation; API connection; Cloud storage integration; Business Intelligence (BI). | Apache Airflow, Amazon S3, Snowflake, Tableau, Python |
| 11. Hacker News ETL in AWS | Advanced | Orchestration; Handling nested JSON; Data Lakes; Serverless SQL; Infrastructure management. | Apache Airflow, AWS S3, AWS Glue, AWS Athena, AWS Redshift |
| 12. Real-time Data Pipeline | Advanced | Real-time data streaming; High-volume ingestion; Monitoring & alerting; Event-driven architecture. | PySpark, Apache Kafka, Amazon Redshift |
Conclusion
This article presented twelve projects, ranging from beginner to advanced, to help you practice your data engineering skills.
Focus on understanding the fundamental concepts behind how each tool works; this will enable you to showcase these projects in your job search and explain them confidently. Be sure to review any concepts you find challenging.
Along with building a project portfolio, I recommend taking the Professional Data Engineer in Python track and working towards obtaining a data engineering certification. This can be a valuable addition to your resume, as it demonstrates your commitment to completing relevant coursework.
FAQs
What skills do I need to start working on data engineering projects?
For beginner-level projects, basic programming knowledge in Python or SQL and an understanding of data basics (like cleaning and transforming) are helpful. Intermediate and advanced projects often require knowledge of specific tools, like Apache Airflow, Kafka, or cloud-based data warehouses like BigQuery or Redshift.
How can data engineering projects help in building my portfolio?
Completing data engineering projects allows you to showcase your ability to work with data at scale, build robust pipelines, and manage databases. Projects that cover end-to-end workflows (data ingestion to analysis) demonstrate practical skills to potential employers and are highly valuable for a portfolio.
Are cloud tools like AWS and Google BigQuery necessary for data engineering projects?
While not strictly necessary, cloud tools are highly relevant to modern data engineering. Many companies rely on cloud-based platforms for scalability and accessibility, so learning tools like AWS, Google BigQuery, and Snowflake can give you an edge and align your skills with industry needs.
How do I choose the right data engineering project for my skill level?
Start by assessing your knowledge and comfort with core tools. For beginners, projects like data cleaning or building a basic ETL pipeline in Python are great. Intermediate projects might involve databases and more complex queries, while advanced projects often integrate multiple tools (e.g., PySpark, Kafka, Redshift) for real-time or large-scale data processing.