Ever had a “green” pipeline in Prefect… and an empty dashboard the next morning? That’s what happens when observability is an afterthought. In the rush to deliver, many teams build one big monolithic flow ‒ simple to start, painful to scale. But when it fails, debugging turns into log archeology. One table breaks, the whole job fails, and you’re left guessing what went wrong. There’s a smarter way to build ‒ granular, focused flows. Breaking pipelines into smaller, independent deployments gives you visibility where it matters. 🔗Read more on breaking down deployments for improved efficiency: https://lnkd.in/dhWmHmch How do you keep observability front and center in your data pipelines? #DataEngineering #DataOps #Observability #ETL #DataReliability #DataPlatforms
How to avoid log archeology in data pipelines
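For illustration, here is a minimal sketch of what "granular, focused flows" can look like in Prefect. It assumes Prefect's @flow/@task decorators; the table names, flow names, and the nightly schedule are made up, and serve() is only available in recent Prefect releases (older setups would use `prefect deploy` or deployment objects instead).

```python
from prefect import flow, task


@task(retries=2, retry_delay_seconds=60)
def load_table(table_name: str) -> None:
    # The extract/transform/load logic for one table would live here.
    print(f"loading {table_name}")


# Instead of one monolithic flow that loops over every table,
# each table (or small group of tables) gets its own focused flow.
@flow(name="load-orders")
def load_orders():
    load_table("orders")


@flow(name="load-customers")
def load_customers():
    load_table("customers")


if __name__ == "__main__":
    # One deployment per flow: if the orders load fails, the customers load
    # still runs, and the dashboard points at exactly which flow broke.
    load_orders.serve(name="load-orders-nightly", cron="0 2 * * *")
```

The payoff is observability: each deployment has its own run history, schedule, and failure state, so a broken table no longer hides inside one giant green-or-red job.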
More Relevant Posts
I was recently in a conversation with a group of data engineers and data architects, and one question kept coming up: "How do we handle real-time insights without disrupting our existing warehouses and lakes?" 🤔

It’s a common challenge. Most architectures today are excellent at storing and analyzing historical data. 🏛️ Data warehouses manage structured analytics, and lakes give you flexible, large-scale storage. 💾 But when your business needs to act on data as it arrives, waiting for nightly/hourly ETL jobs isn’t enough. ⚡

That’s where real-time databases like CrateDB complement your stack. Sitting alongside your warehouse and lake, they let you:

⚡ Run instant queries on live data (milliseconds)
🤖 Perform real-time analytics for operations and AI
🚨 Detect anomalies or trends the moment they happen (no need to wait for minutes or hours)

It’s not about replacing what already works, it’s about adding a real-time layer so your data stack can support both historical analysis and immediate action.

I explored this in more detail in my latest blog: 🔗 https://lnkd.in/eAuPMQAt

#RealTimeAnalytics #DataEngineering #CrateDB #StreamingData #AI #DataArchitecture
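As a rough sketch of what "querying live data" looks like in practice, here is a small example using CrateDB's Python client (the `crate` package). The host, credentials, the sensor_readings table, and the exact SQL are placeholders for illustration, not a definitive schema.

```python
from crate import client

# Host, credentials, and the sensor_readings table are placeholders.
connection = client.connect("http://localhost:4200", username="crate")
cursor = connection.cursor()

# Query data that arrived seconds ago -- no nightly ETL job in between.
cursor.execute(
    """
    SELECT device_id, AVG(temperature) AS avg_temp
    FROM sensor_readings
    WHERE ts > NOW() - INTERVAL '5 minutes'
    GROUP BY device_id
    ORDER BY avg_temp DESC
    LIMIT 10
    """
)
for device_id, avg_temp in cursor.fetchall():
    print(device_id, avg_temp)

cursor.close()
connection.close()
```

The point of the sketch: the query runs against data as it lands, while the warehouse and lake keep serving the historical workloads they already handle well.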
I learned a lot from this article about how to design data organizations with dbt. It shows that the key point is not to choose between centralization or decentralization, but to think carefully about what should be managed centrally and what should be handled by each team. I also learned the importance of defining clear ownership of datasets and creating continuous improvement loops. This article gave me a new perspective on how to treat data as a real product and build better data organizations. https://lnkd.in/eeXH872S
🚫 The 3 Data Engineering Myths That Refuse to Die

After 10 years in data engineering, I’ve seen tools, trends, and even job titles change… But a few myths just refuse to go away. Let’s bust a few 👇

1️⃣ More data = better insights
Nope. More data often means more noise. Good insights come from better questions and cleaner pipelines, not petabytes of uncurated logs. Sometimes, deleting half your data gives you twice the clarity.

2️⃣ ETL is dead because of ELT
This one makes me smile every time. We didn’t “kill” ETL — we just moved the ‘T’. Whether you transform before or after loading doesn’t matter as much as how well you model and govern your data. ELT didn’t end ETL; it evolved it.

3️⃣ Data Engineers only move data
If you’re still thinking of DEs as just pipeline plumbers, you’re missing the point. Modern data engineers are architects, optimizers, and enablers of analytics and AI. We design ecosystems that help data scientists, analysts, and decision-makers move faster — with trust in the data.

💭 Final thought: Data engineering isn’t about building the biggest pipelines or chasing the newest frameworks. It’s about making data useful, reliable, and actionable.

What’s another myth you keep hearing in your team or org? Let’s put them to rest. 👇

#DataEngineering #BigData #ETL #Analytics #DataQuality #CloudData
Delta Lake = Trustworthy Data at Scale

In data engineering, speed is useless without reliability. When you’re running hundreds of ETL jobs every day, what really matters isn’t how fast your pipeline runs — it’s whether you can trust the data at the end of it. That’s where Delta Lake changes the game.

Unlike traditional data lakes that can leave you with partial writes, duplicate records, or inconsistent reads, Delta Lake brings ACID transactions to big data. What does that mean for you?

Atomicity: Your job either completes fully or not at all — no half-written files.
Consistency: The table’s schema and state are always valid after each write.
Isolation: Multiple jobs can run safely without overwriting each other’s data.
Durability: Once committed, your data stays consistent even after restarts or crashes.

A Real-World Example
Imagine a Spark job writing millions of rows into your Sales_Fact table. Halfway through, a cluster node crashes. In a traditional Parquet setup — that means corrupted files and broken partitions. With Delta Lake, the transaction log ensures the write either rolls back or commits fully. No data loss. No inconsistencies. No 3 AM debugging calls.

The Bottom Line
Delta Lake turns your data lake into a data system you can trust. It’s not just about faster queries — it’s about confidence in every dashboard, report, and machine learning model built on top.
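For readers who want to see the mechanics, here is a minimal PySpark sketch of that Sales_Fact scenario using the open-source delta-spark package. The input path, table name, and region column are illustrative, and the session configuration follows the standard delta-spark quickstart rather than any particular platform's setup.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Standard delta-spark session setup (assumes the delta-spark package is installed).
builder = (
    SparkSession.builder.appName("sales-fact-load")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Illustrative input path; any DataFrame works here.
incoming = spark.read.parquet("s3://landing/sales/2024-06-01/")

# The append is a single atomic commit recorded in the table's _delta_log:
# if the job dies halfway through, readers never see half-written files.
incoming.write.format("delta").mode("append").saveAsTable("sales_fact")

# Readers always get the last fully committed snapshot of the table.
spark.table("sales_fact").groupBy("region").count().show()
```

The transaction log (_delta_log) is what turns the pile of Parquet files into a table you can trust: a write is only visible once its commit file lands.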
A critical data pipeline fails. Your first stop? 𝘛𝘈𝘚𝘒_𝘏𝘐𝘚𝘛𝘖𝘙𝘠 and a maze of timestamps. What if you could just go to the target table, click 𝘓𝘪𝘯𝘦𝘢𝘨𝘦, and see the exact task and upstream sources that caused the failure?

With 𝗟𝗶𝗻𝗲𝗮𝗴𝗲 𝗳𝗼𝗿 𝘀𝘁𝗼𝗿𝗲𝗱 𝗽𝗿𝗼𝗰𝗲𝗱𝘂𝗿𝗲𝘀 𝗮𝗻𝗱 𝘁𝗮𝘀𝗸𝘀 generally available in Snowflake, now you can.

✅ You can now use the lineage graph to see when data movement from a source to a target object was the direct result of a task.
✅ When you select the arrow connecting two objects in the Snowsight UI, a panel opens with details about the task that ran the operation. This beats digging through query logs any day.

This simple feature solves some major challenges for data engineers:

𝗥𝗲𝗮𝗹 𝗜𝗺𝗽𝗮𝗰𝘁 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀: The lineage graph already shows upstream and downstream dependencies. Now, by seeing the tasks involved, you can more accurately assess the blast radius of a code change before you deploy it.

𝗙𝗮𝘀𝘁𝗲𝗿 𝗥𝗼𝗼𝘁 𝗖𝗮𝘂𝘀𝗲 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀: When a table has bad data, you can now visually trace back not just what table it came from, but precisely which task was responsible for that transformation.

𝗖𝗹𝗲𝗮𝗿 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴: For complex DAGs, this provides a visual map of your data flow and orchestration. It's the ultimate documentation for getting a handle on existing pipelines or onboarding new teammates.

❌ This isn't just a UI facelift. It’s a fundamental improvement that 𝗶𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗲𝘀 𝗼𝘂𝗿 𝗼𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻 𝗹𝗮𝘆𝗲𝗿 𝘄𝗶𝘁𝗵 𝗼𝘂𝗿 𝗱𝗮𝘁𝗮 𝗴𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲 𝗮𝗻𝗱 𝗼𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝘁𝗼𝗼𝗹𝗶𝗻𝗴. 🔥 It gives us a more complete picture of our data's journey.

#Snowflake #DataEngineering #DataLineage #Snowsight #DataObservability

Raja Pino Zheng Chandrasekharan Ananth Damien Nicole Jeffrey Mona Dwarak
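For contrast, this is roughly what the "old way" looks like when done programmatically: pulling TASK_HISTORY through the Snowflake Python connector and scanning for the failing task. The connection parameters and the task name are placeholders for illustration.

```python
import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

cur = conn.cursor()
# Dig through recent runs of one task and read the error messages by hand --
# exactly the timestamp archaeology the lineage graph now replaces.
cur.execute(
    """
    SELECT name, state, error_message, scheduled_time, completed_time
    FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
        TASK_NAME => 'LOAD_SALES_FACT',
        RESULT_LIMIT => 20
    ))
    ORDER BY scheduled_time DESC
    """
)
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```

With task lineage in Snowsight, the same answer is one click on the arrow between the source and target objects.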
Extremely useful. Code is where all of the business logic is embedded, and organizations and engineers have struggled to create and manage knowledge within that code. This lineage feature is a great start!
The Data Engineering Landscape is Evolving 🔄

After years building data systems, here's what I'm seeing change (and what's staying the same):

Some Things Never Change: Simple formats like CSV, Excel, and JSON keep winning because they're accessible to everyone. Postgres remains the go-to starting point before teams consider "real" data warehouses. And we still can't agree on what "data pipeline" actually means.

The Real Problems: The hardest challenge is no longer where is the data — it's what to do with the data. Building a pipeline is easy; maintaining one is the real battle. Data quality, contracts, and governance are now the bottleneck. Schema evolution has become the new API design.

What's Changing Fast: BI dashboards are being replaced by LLM-driven conversational analytics. SQL generation will be dominated by LLMs sooner than expected. Notebooks remain great for exploration, but fragile for production pipelines.

Uncomfortable Truths: Streaming is powerful but costly — batch wins more often than people admit. Most "modern data stack" complexity solves problems 90% of companies don't actually have. Makefile + cron jobs are still the first schedulers that matter in practice.

The Mental Shift: We're moving from pure engineering to product thinking. It's about designing for end users, committing to SLAs, and delivering business outcomes—not just keeping jobs green. Success metrics are changing too. Business velocity matters more than infrastructure uptime. And regulatory requirements around privacy and data residency now shape architecture decisions as much as scale ever did.

#DataEngineering #DataScience #ModernDataStack #Analytics #TechTrends
Are your data platform's "best practices" actually holding you back? This article is a critical look at how conventional wisdom, like "collect everything" and "handing data engineers full autonomy over ingestion logic," can create technical debt and increase complexity. https://lnkd.in/evmwrCsw
Discover 2026’s leading AI tools for data engineering and learn how they streamline ETL, automate governance, enhance data quality, and accelerate insights across enterprise pipelines. Read more: https://lnkd.in/dpVMKw_C #DataEngineering #AITools #ArtificialIntelligence #DataAnalytics #MachineLearning #DigitalTransformation #AgileInfoways
The Rise of Metadata-Driven Data Engineering

Let’s be honest — our pipelines are getting out of control. Thousands of tables. Hundreds of DAGs. Dozens of tools. And when something breaks, the first question everyone asks is:
👉 “Where did this data even come from?”

That’s where metadata-driven data engineering changes the game. Instead of managing pipelines through code alone, we’re now managing them through context — information about the data itself: lineage, quality, ownership, schema evolution, business meaning.

Here’s what this shift looks like in practice 👇

⚙️ Yesterday:
ETL scripts scattered across repos
No visibility into dependencies
Manual documentation nobody updates

💡 Today:
Tools like OpenMetadata, DataHub, and Atlan automatically track lineage
Pipelines adapt based on metadata (e.g., schema changes trigger transformations)
Data governance becomes operational, not an afterthought

And the best part? When metadata becomes a first-class citizen, data engineers can finally focus on designing systems, not chasing data drift.

This is the future:
➡️ Pipelines that describe themselves.
➡️ Quality rules that evolve automatically.
➡️ Discovery that’s instant, not tribal.

Metadata is no longer documentation. It’s automation.

What do you think — are we moving toward a world where metadata drives pipelines instead of pipelines generating metadata? 👇 Drop your thoughts — I’d love to hear how your team handles this shift.

#DataEngineering #Metadata #DataGovernance #ETL #DataOps #ModernDataStack
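To make "schema changes trigger transformations" concrete, here is a tool-agnostic sketch in plain Python. The fetch functions are stand-ins (hard-coded here so the sketch runs on its own), not the real client APIs of OpenMetadata, DataHub, or Atlan; the idea is simply that the pipeline consults recorded metadata before it runs and adapts to drift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    name: str
    dtype: str


def fetch_registered_schema(table: str) -> list[ColumnMeta]:
    # Stand-in for a call to your metadata store; hard-coded for the sketch.
    return [ColumnMeta("order_id", "bigint"), ColumnMeta("amount", "numeric")]


def fetch_live_schema(table: str) -> list[ColumnMeta]:
    # Stand-in for introspecting the source (information_schema, source API, ...).
    return [
        ColumnMeta("order_id", "bigint"),
        ColumnMeta("amount", "numeric"),
        ColumnMeta("currency", "text"),
    ]


def plan_run(table: str) -> str:
    registered = {c.name: c.dtype for c in fetch_registered_schema(table)}
    live = {c.name: c.dtype for c in fetch_live_schema(table)}

    added = live.keys() - registered.keys()
    removed = registered.keys() - live.keys()
    retyped = {c for c in registered.keys() & live.keys() if registered[c] != live[c]}

    if removed or retyped:
        # Breaking drift: stop and alert the owner recorded in the metadata store.
        return "halt_and_notify_owner"
    if added:
        # Additive drift: evolve the target schema, then run the transformation.
        return "evolve_schema_then_run"
    return "run_as_usual"


print(plan_run("orders"))  # -> evolve_schema_then_run
```

The specific decisions (halt, evolve, run) are illustrative; the shift the post describes is that this decision is driven by lineage, ownership, and schema metadata rather than by a hard-coded script that breaks silently.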