Tinybird’s Post

Here's how we scaled our ClickHouse streaming ingestion service when compute-compute separation wasn't enough ↓

🚨 Problem
Even dedicated ClickHouse writer instances with compute-compute separation hit limits under extremely high-throughput ingestion. At petabyte scale, a single writer becomes the bottleneck, and that was creating problems for our streaming ingestion service.

😫 The Scale Challenge
We had customers ingesting massive data volumes during traffic spikes. Best practice is to isolate writes to a dedicated replica with compute-compute separation, but what happens when even that isn't enough?

✅ Solution
Multi-writer mode distributes ingestion across multiple ClickHouse instances within the same cluster. We manually route traffic by data source or workspace to different writers, which is simple and efficient.

🔧 Technical Approach
We preferred static routing over dynamic routing for reliability and simplicity. Routing rules are configured at the data source/workspace level for explicit control. Predictable, testable, safe. (A sketch of what such rules could look like follows the post.)

⚙️ Varnish + Custom VMOD
We built a custom Varnish extension with a `backend_by_index()` function. Instead of targeting a specific replica, it routes to the "Nth healthy instance," so we get automatic failover if a writer goes down. (A second sketch below shows how this could be wired in.)

📈 Real Impact
One customer was hitting memory limits daily during peak ingestion. With multi-writer mode, we moved their workspace to a different replica and eliminated the bottleneck.

💡 Key Insight
We prioritized simplicity and operational safety over complex load balancing. Static routing with intelligent failover beats unpredictable dynamic routing.

Read the full implementation details in the blog post. Link in comments.
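To make static routing concrete, here's a minimal VCL sketch of a workspace-level routing rule. The `X-Workspace-Id` header, the backend names, and the hosts/ports are illustrative assumptions, not our production configuration:

```vcl
vcl 4.1;

# Hypothetical writer backends (hosts and ports are placeholders).
backend writer0 { .host = "writer-0.internal"; .port = "8123"; }
backend writer1 { .host = "writer-1.internal"; .port = "8123"; }

sub vcl_recv {
    # Static rule: pin one heavy workspace to its own writer;
    # everything else stays on the default writer.
    if (req.http.X-Workspace-Id == "heavy-workspace") {
        set req.backend_hint = writer1;
    } else {
        set req.backend_hint = writer0;
    }
}
```

Because the rules are declared explicitly per workspace, a routing change is a config diff you can review and test, not an emergent property of a load-balancing algorithm.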

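And a sketch of how `backend_by_index()` could plug into that VCL. Only the function name comes from the post; the VMOD import name (`routing`), the exact signature, and the probe settings are assumptions for illustration. The actual implementation details are in the blog post:

```vcl
vcl 4.1;

import routing;  # hypothetical import name for the custom VMOD

# Health probe so only passing writers count as "healthy" (settings are illustrative).
probe writer_health {
    .url = "/ping";
    .interval = 2s;
    .timeout = 1s;
    .window = 5;
    .threshold = 3;
}

backend writer0 { .host = "writer-0.internal"; .port = "8123"; .probe = writer_health; }
backend writer1 { .host = "writer-1.internal"; .port = "8123"; .probe = writer_health; }

sub vcl_recv {
    # "Nth healthy instance": the index selects from the set of writers currently
    # passing their probe, not from a fixed list. If a writer drops out of the
    # healthy set, traffic fails over to another writer with no rule change.
    if (req.http.X-Workspace-Id == "heavy-workspace") {
        set req.backend_hint = routing.backend_by_index(1);
    } else {
        set req.backend_hint = routing.backend_by_index(0);
    }
}
```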