Gordon Murray’s Post

Staff AWS Systems Engineer | Modernizing & Rebuilding Cloud Infrastructure | Terraform | Automation | Security

Two years ago I built a small CDC pipeline using Flink and Hudi, then mostly forgot about it. I noticed the repo still gets a few regular visits and clones, so tonight I updated it to Flink 1.19.1 and Hudi 1.0.2. Hudi is a good choice if you want your data lake to behave a bit more like a database: one that can handle updates and deletes and keep things consistent. It’s a complete, working example with Docker Compose, MariaDB CDC, MinIO instead of S3, and a Flink SQL job handling real-time updates. Nice to see it’s still helping a few people out there. Repo: https://lnkd.in/ex7zMpzd
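
For readers new to CDC, here is a minimal Python sketch of the "behave like a database" idea mentioned in the post: change events are applied in order against a table keyed by record key, so updates overwrite and deletes remove. It does not use Flink or Hudi APIs and is not taken from the repo. The `ChangeEvent` class, the `apply_changes` function, and the `"c"`/`"u"`/`"d"` operation codes are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One change captured from the source database (hypothetical shape)."""
    op: str                      # "c" = create, "u" = update, "d" = delete
    key: int                     # record key identifying the row
    row: Optional[dict] = None   # new row image; None for deletes


def apply_changes(table: dict, events: list) -> dict:
    """Return a new table with the events applied in order."""
    result = dict(table)
    for ev in events:
        if ev.op in ("c", "u"):
            # Upsert: insert the row or overwrite the existing one.
            result[ev.key] = ev.row
        elif ev.op == "d":
            # Delete: remove the row if it is present.
            result.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op!r}")
    return result


events = [
    ChangeEvent("c", 1, {"id": 1, "name": "alice"}),
    ChangeEvent("c", 2, {"id": 2, "name": "bob"}),
    ChangeEvent("u", 1, {"id": 1, "name": "alicia"}),
    ChangeEvent("d", 2),
]
snapshot = apply_changes({}, events)
print(snapshot)
```

A table format with record keys plays roughly the role of the dictionary here, which is why keyed upserts and deletes are possible on top of files in object storage.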


More Relevant Posts

Databricks

1,062,922 followers
3w
Exciting news for the open data community: Variant, the native data type for semi-structured data, is now ratified in the Apache Parquet™ community — with support across Delta Lake, Apache Iceberg™, and Apache Spark™. Variant brings a unified, open standard to how the lakehouse stores and queries flexible data—making it faster, simpler, and more consistent across formats and engines. https://lnkd.in/gESP97Nm
Sriram Reddy

Co-Founder @ Byte Analytics | Ex-Microsoft | Big Data & ML Specialist
3w
Exciting milestone for the open data ecosystem! 🚀 The ratification of Variant as a native data type in Apache Parquet™ is a big step forward for managing semi-structured data. With unified support across Delta Lake, Apache Iceberg™, and Apache Spark™, this truly strengthens interoperability in the open lakehouse world. A major leap toward simplifying how we store, process, and query flexible data — making analytics faster and more consistent across platforms. #DataEngineering #ApacheParquet #DeltaLake #ApacheSpark #OpenSource #Lakehouse
Anji Palla

Lead Architect at Fractal Analytics || MLOps || LLMOps || Azure || AWS
3w
The Databricks Variant type sets a new standard for speed and efficiency with semi-structured data, offering up to 8x faster performance compared to traditional JSON string storage.
- What stands out most is both the remarkable query speed and the reduced storage requirements: Variant uses 22% less storage than plain strings, saving significant time and cost.
- Real-world benchmarks show dramatic gains: ETL jobs that once took hours now finish in minutes, and 1TB queries dropped from over 4 hours to just 20 minutes.
This combination of ultra-fast querying and lower storage overhead makes Variant a clear leader in big data analytics.
Dave Herrald

security leader and storyteller | adviser | former Splunk SURGe | Boss of the SOC (BOTS) co-creator | former Google | Google Cybersecurity Certificate contributing author/instructor | former CISO | GIAC GSE #79
3w
Huge news for anyone building a Security Lakehouse. Security data like logs, events, and telemetry from various sources (endpoints, cloud infra, SaaS) is inherently semi-structured. A single security event might contain dozens of nested fields, which can change frequently with new product versions or changes to logging config.
Before Variant, handling this required kludgey workarounds:
👎 Storing all the JSON as a massive, opaque string. This is slow to query and wastes compute power.
👎 Trying to force a rigid schema on the data. This leads to brittle pipelines that constantly break when a new field appears.
The new Variant type solves this by providing a unified, open standard for storing this kind of data natively and efficiently within Parquet. This means you can now:
🔥 Simplify Ingestion: Security pipelines become more resilient, as you don't need to preemptively flatten or strictly validate every piece of semi-structured data.
🔥 Accelerate Investigations: You can query nested or evolving fields much faster without complex JSON parsing at query time. Quicker queries mean faster threat detection and response.
🔥 Reduce Costs: More efficient storage and faster queries often translate directly into lower compute costs for your security platform.
This move brings the flexibility needed for modern security data alongside the high performance and open standards of the Lakehouse architecture.
Wang Ryder

Account Executive
3w
[New Release] Variant is a really good feature for managing semi-structured data like JSON and XML with Delta and Iceberg!
Sylvain Chambon

Senior Solutions Architect, FSI at Databricks
3w
JSON and similar document formats rule the application world, but they are very inefficient for analytics. Bridging the gap required clever engineering and federating multiple open source communities. On the technical details, Variant itself is a no-brainer (a binary encoding of JSON with up-front offsets to allow efficient skipping and traversal, similar in concept if not in detail to BSON and others), but shredding (the ability to extract common fields into their own column chunks) is the game changer for query performance at scale.
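
To make those two ideas concrete, here is a toy Python sketch. It is not the actual Parquet Variant binary layout; the byte format, and the `encode`, `get_field`, and `shred` names, are assumptions invented for illustration. The first part writes a small directory of offsets ahead of the values, so a reader can jump to one field without parsing the others. The second part pulls a chosen set of common fields out into plain columns and leaves the rest as a residual object.

```python
import json
import struct


def encode(obj: dict) -> bytes:
    """Toy layout: [dir length][field count + (key, offset, length)...][values]."""
    directory = bytearray(struct.pack("<H", len(obj)))
    payload = bytearray()
    for key, value in obj.items():
        key_bytes = key.encode("utf-8")
        value_bytes = json.dumps(value).encode("utf-8")
        directory += struct.pack("<H", len(key_bytes)) + key_bytes
        # Record where this value starts in the payload and how long it is.
        directory += struct.pack("<II", len(payload), len(value_bytes))
        payload += value_bytes
    return struct.pack("<I", len(directory)) + bytes(directory) + bytes(payload)


def get_field(buf: bytes, wanted: str):
    """Walk only the directory, then decode just the requested value."""
    (dir_len,) = struct.unpack_from("<I", buf, 0)
    payload_start = 4 + dir_len
    pos = 4
    (count,) = struct.unpack_from("<H", buf, pos)
    pos += 2
    for _ in range(count):
        (key_len,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        key = buf[pos:pos + key_len].decode("utf-8")
        pos += key_len
        offset, length = struct.unpack_from("<II", buf, pos)
        pos += 8
        if key == wanted:
            start = payload_start + offset
            return json.loads(buf[start:start + length])
    return None  # field not present


def shred(rows: list, common_fields: list):
    """Split rows into per-field columns plus a residual object per row."""
    columns = {name: [] for name in common_fields}
    residual = []
    for row in rows:
        for name in common_fields:
            columns[name].append(row.get(name))
        residual.append({k: v for k, v in row.items() if k not in common_fields})
    return columns, residual


event = {"user": "alice", "status": 200, "tags": ["a", "b"]}
buf = encode(event)
print(get_field(buf, "status"))

columns, residual = shred(
    [{"user": "a", "status": 200, "x": 1}, {"user": "b", "status": 404}],
    ["user", "status"],
)
print(columns, residual)
```

In this sketch a query that only filters on `status` reads one small column and never touches the residual objects, which is the intuition behind why shredding matters so much at scale.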
Alexander Tsuman

Senior Data/Software Engineer, TL
3w
🚀 Exciting update for the data community! Variant, the new native data type for semi-structured data, has been ratified in the Apache Parquet™ ecosystem — unifying support across Delta Lake, Apache Iceberg™, and Apache Spark™. Variant makes it dramatically easier to store, query, and analyze flexible data like JSON, telemetry, and logs without complex transformations. It introduces schema-on-read efficiency, type inference, and nested field indexing, allowing engines to access data directly with consistent semantics across formats. Early benchmarks show 8x–30x faster performance with new shredding and column projection optimizations — a major step toward simplifying how lakehouses handle semi-structured data at scale. A big win for open data standards and interoperability. 👉 Read more: Introducing Variant – Databricks Blog #Databricks #ApacheParquet #DeltaLake #ApacheSpark #ApacheIceberg #DataEngineering #OpenSource #Lakehouse #BigData #DataPerformance
FengYuan Zhang

Medical beauty institutional investors
3w
A huge milestone for the open data ecosystem! 🌍 The ratification of Variant as a native Parquet data type marks a major step forward in unifying semi-structured data handling across Delta Lake, Apache Iceberg, and Apache Spark. This will simplify pipelines, improve performance, and accelerate innovation across open lakehouse architectures.
Open Data Blend

428 followers
1w
Apache Hudi just made working with append-only data in lakehouses even easier. With the new automatic record key generation, you no longer need to manually specify primary key fields when creating tables. Record keys are critical for enabling updates, deletes, record-level indexing, and change data capture. Hudi’s first-class record key support ensures unique, stable identifiers across distributed workloads, while handling concurrent writes efficiently. 💡 Hudi continues to set the standard for a database-like experience in modern lakehouses. Read on for more: https://lnkd.in/evAcZdS7 #ApacheHudi #DataLakehouse #DataEngineering #OpenSource
Open Data Blend

428 followers
1mo
Engineering a Time Series Database with Open Source: Rebuilding InfluxDB 3 in Rust + Apache Arrow
InfluxDB 3 marks a complete rebuild of the core database engine, engineered to handle infinite cardinality, blazing-fast analytics, and SQL-first querying.
🔑 Key Takeaways:
⚡ Unlimited scale: no more cardinality limits
💾 Cheaper storage: tiered object storage for historical data
🛠️ SQL-first: query time series data with SQL for seamless integration
🦀 Built in Rust: fearless concurrency, memory safety, and top-tier performance
📊 Powered by the FDAP stack: Apache Arrow Flight, DataFusion, Arrow & Parquet
Read on for more: https://lnkd.in/e2C_NMBa #TimeSeriesDatabase #RustProgramming #RealTimeAnalytics #OpenSource