Two years ago I built a small CDC pipeline using Flink and Hudi, then mostly forgot about it. I noticed the repo still gets a few regular visits and clones, so tonight I updated it to Flink 1.19.1 and Hudi 1.0.2. Hudi is a good choice if you want your data lake to behave a bit more like a database: one that can handle updates and deletes and keep things consistent. The repo is a complete, working example with Docker Compose, MariaDB CDC, MinIO instead of S3, and a Flink SQL job handling real-time updates. Nice to see it's still helping a few people out there. Repo: https://lnkd.in/ex7zMpzd
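For anyone wondering what "behave a bit more like a database" means in practice, here is a rough Python sketch of the idea: change events are keyed by a primary key, so a later update or delete replaces or removes the earlier row instead of appending a duplicate. This is a toy, in-memory illustration only. It is not code from the repo and not how Hudi or Flink implement it, and the `ChangeEvent` type and the `c`/`u`/`d` op codes are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical CDC event: op is "c" (create), "u" (update) or "d" (delete)."""
    op: str
    key: int
    row: Optional[dict] = None


def apply_changes(table: dict, events: list) -> dict:
    """Return a new snapshot with the events applied in order, keyed by primary key."""
    result = dict(table)
    for ev in events:
        if ev.op in ("c", "u"):
            # Upsert: the latest row for a key wins.
            result[ev.key] = ev.row
        elif ev.op == "d":
            # Delete: drop the key if it is present.
            result.pop(ev.key, None)
        else:
            raise ValueError(f"unknown op: {ev.op!r}")
    return result


events = [
    ChangeEvent("c", 1, {"id": 1, "name": "alice"}),
    ChangeEvent("c", 2, {"id": 2, "name": "bob"}),
    ChangeEvent("u", 1, {"id": 1, "name": "alicia"}),
    ChangeEvent("d", 2),
]
snapshot = apply_changes({}, events)
print(snapshot)  # {1: {'id': 1, 'name': 'alicia'}}
```

An append-only lake would keep all four events as separate rows. A table format with record keys lets readers see only the final state per key.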
Gordon Murray’s Post
More Relevant Posts
Exciting news for the open data community: Variant, the native data type for semi-structured data, is now ratified in the Apache Parquet™ community — with support across Delta Lake, Apache Iceberg™, and Apache Spark™. Variant brings a unified, open standard to how the lakehouse stores and queries flexible data—making it faster, simpler, and more consistent across formats and engines. https://lnkd.in/gESP97Nm
Exciting milestone for the open data ecosystem! 🚀 The ratification of Variant as a native data type in Apache Parquet™ is a big step forward for managing semi-structured data. With unified support across Delta Lake, Apache Iceberg™, and Apache Spark™, this truly strengthens interoperability in the open lakehouse world. A major leap toward simplifying how we store, process, and query flexible data — making analytics faster and more consistent across platforms. #DataEngineering #ApacheParquet #DeltaLake #ApacheSpark #OpenSource #Lakehouse
The Databricks Variant type sets a new standard for speed and efficiency with semi-structured data, offering up to 8x faster performance compared to traditional JSON string storage.
- What stands out most is both the remarkable query speed and the reduced storage requirements: Variant uses 22% less storage than plain strings, saving significant time and cost.
- Real-world benchmarks show dramatic gains: ETL jobs that once took hours now finish in minutes, and 1TB queries dropped from over 4 hours to just 20 minutes.
This combination of ultra-fast querying and lower storage overhead makes Variant a clear leader in big data analytics.
Huge news for anyone building a Security Lakehouse.
Security data like logs, events, and telemetry from various sources (endpoints, cloud infra, SaaS) is inherently semi-structured. A single security event might contain dozens of nested fields, which can change frequently with new product versions or changes to logging config.
Before Variant, handling this required kludgey workarounds:
👎 Storing all the JSON as a massive, opaque string. This is slow to query and wastes compute power.
👎 Trying to force a rigid schema on the data. This leads to brittle pipelines that constantly break when a new field appears.
The new Variant type solves this by providing a unified, open standard for storing this kind of data natively and efficiently within Parquet. This means you can now:
🔥 Simplify Ingestion: Security pipelines become more resilient, as you don't need to preemptively flatten or strictly validate every piece of semi-structured data.
🔥 Accelerate Investigations: You can query nested or evolving fields much faster without complex JSON parsing at query time. Quicker queries mean faster threat detection and response.
🔥 Reduce Costs: More efficient storage and faster queries often translate directly into lower compute costs for your security platform.
This move brings the flexibility needed for modern security data alongside the high performance and open standards of the Lakehouse architecture.
[New Release] Variant is a really good feature for managing semi-structured data like JSON and XML with Delta and Iceberg!
JSON and similar document formats rule the application world but are very inefficient for analytics. Bridging the gap required clever engineering and federating multiple open source communities. On the technical details, Variant itself is a no-brainer (a binary encoding of JSON with up-front offsets to allow efficient skipping and traversal, similar in concept if not in detail to BSON and others), but shredding (the ability to extract common fields in a column chunk) is the game changer for query performance at scale.
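To make the shredding idea concrete, here is a toy Python sketch: fields that appear in most records are pulled out into their own columns, and whatever is left stays behind as a per-record residual. It is only an illustration of the concept. It does not follow the actual Parquet Variant binary layout or shredding rules, and the `shred` function, its `min_fraction` threshold, and the sample records are all made up for the example.

```python
import json
from collections import Counter


def shred(records, min_fraction=0.5):
    """Split records into columns for common top-level fields plus a residual.

    A field is treated as "common" if it appears in at least min_fraction of
    the records (an assumed heuristic, not the real specification).
    """
    counts = Counter(key for rec in records for key in rec)
    common = sorted(k for k, n in counts.items() if n / len(records) >= min_fraction)
    # One list per common field; a missing value is stored as None in this toy.
    columns = {k: [rec.get(k) for rec in records] for k in common}
    # Everything else stays as an opaque JSON string per record.
    residual = [
        json.dumps({k: v for k, v in rec.items() if k not in common}, sort_keys=True)
        for rec in records
    ]
    return columns, residual


records = [
    {"user": "a", "status": 200, "debug": {"x": 1}},
    {"user": "b", "status": 404},
    {"user": "c", "status": 200, "trace": "t1"},
]
columns, residual = shred(records)

# A filter on a shredded field reads one column and never parses the residual.
ok_count = sum(1 for s in columns["status"] if s == 200)
print(columns)   # {'status': [200, 404, 200], 'user': ['a', 'b', 'c']}
print(residual)  # ['{"debug": {"x": 1}}', '{}', '{"trace": "t1"}']
print(ok_count)  # 2
```

The point of the sketch is the access pattern: a query on `status` touches a single plain column, while rare fields like `debug` still round-trip through the residual.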
🚀 Exciting update for the data community! Variant, the new native data type for semi-structured data, has been ratified in the Apache Parquet™ ecosystem — unifying support across Delta Lake, Apache Iceberg™, and Apache Spark™. Variant makes it dramatically easier to store, query, and analyze flexible data like JSON, telemetry, and logs without complex transformations. It introduces schema-on-read efficiency, type inference, and nested field indexing, allowing engines to access data directly with consistent semantics across formats. Early benchmarks show 8x–30x faster performance with new shredding and column projection optimizations — a major step toward simplifying how lakehouses handle semi-structured data at scale. A big win for open data standards and interoperability. 👉 Read more: Introducing Variant – Databricks Blog #Databricks #ApacheParquet #DeltaLake #ApacheSpark #ApacheIceberg #DataEngineering #OpenSource #Lakehouse #BigData #DataPerformance
A huge milestone for the open data ecosystem! 🌍 The ratification of Variant as a native Parquet data type marks a major step forward in unifying semi-structured data handling across Delta Lake, Apache Iceberg, and Apache Spark. This will simplify pipelines, improve performance, and accelerate innovation across open lakehouse architectures.
Apache Hudi just made working with append-only data in lakehouses even easier. With the new automatic record key generation, you no longer need to manually specify primary key fields when creating tables. Record keys are critical for enabling updates, deletes, record-level indexing, and change data capture. Hudi’s first-class record key support ensures unique, stable identifiers across distributed workloads, while handling concurrent writes efficiently. 💡 Hudi continues to set the standard for a database-like experience in modern lakehouses. Read on for more: https://lnkd.in/evAcZdS7 #ApacheHudi #DataLakehouse #DataEngineering #OpenSource
Engineering a Time Series Database with Open Source: Rebuilding InfluxDB 3 in Rust + Apache Arrow
InfluxDB 3 marks a complete rebuild of the core database engine, engineered to handle infinite cardinality, blazing-fast analytics, and SQL-first querying.
🔑 Key Takeaways:
⚡ Unlimited scale: no more cardinality limits
💾 Cheaper storage: tiered object storage for historical data
🛠️ SQL-first: query time series data with SQL for seamless integration
🦀 Built in Rust: fearless concurrency, memory safety, and top-tier performance
📊 Powered by the FDAP stack: Apache Arrow Flight, DataFusion, Arrow & Parquet
Read on for more: https://lnkd.in/e2C_NMBa #TimeSeriesDatabase #RustProgramming #RealTimeAnalytics #OpenSource