Users complain about slow page loads, API response delays, and soaring bandwidth costs, yet few realize the bottleneck may be gzip, a compression algorithm that has served faithfully for decades. Once sufficient, gzip now struggles to keep pace with today's dynamic content and high-concurrency demands. That's why we've integrated the modern compression algorithm zstd into OpenResty Edge: it delivers higher compression ratios and faster transmission at lower CPU overhead. Curious how to adopt next-generation compression in OpenResty Edge? This article has the answers you're looking for: https://lnkd.in/gS8zrGKj
How zstd improves OpenResty Edge's performance
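To make the gzip-vs-zstd trade-off concrete, here is a minimal Go sketch (not OpenResty Edge's implementation; the payload and helper names are made up for illustration) that compresses the same data with the standard library's gzip and with the klauspost/compress zstd package, printing output size and elapsed time for each:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"time"

	"github.com/klauspost/compress/zstd"
)

// compressGzip returns the gzip-compressed size of data.
func compressGzip(data []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(data)
	w.Close()
	return buf.Len()
}

// compressZstd returns the zstd-compressed size of data.
func compressZstd(data []byte) int {
	var buf bytes.Buffer
	w, _ := zstd.NewWriter(&buf) // error ignored for brevity
	w.Write(data)
	w.Close()
	return buf.Len()
}

func main() {
	// A repetitive payload stands in for a typical JSON/HTML response.
	data := bytes.Repeat([]byte(`{"user":"alice","status":"ok"}`), 4096)

	start := time.Now()
	gz := compressGzip(data)
	fmt.Printf("gzip: %d -> %d bytes in %v\n", len(data), gz, time.Since(start))

	start = time.Now()
	zs := compressZstd(data)
	fmt.Printf("zstd: %d -> %d bytes in %v\n", len(data), zs, time.Since(start))
}

On repetitive web-style payloads like this, zstd typically produces comparable or smaller output in less CPU time, which is the effect the post describes.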
-
How Are OpenTelemetry and Fluent Bit Related? Learn about Fluent Bit's relationship to OpenTelemetry and its evolution in capturing not only logs but also local metrics such as CPU, memory, and storage use. By Phil Wilkins, thanks to Chronosphere.
-
𝗪𝗲 𝗼𝗻𝗰𝗲 𝘁𝗵𝗼𝘂𝗴𝗵𝘁 𝗰𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻 𝘄𝗼𝘂𝗹𝗱 𝗺𝗮𝗸𝗲 𝗲𝘃𝗲𝗿𝘆𝘁𝗵𝗶𝗻𝗴 𝗳𝗮𝘀𝘁𝗲𝗿. Spoiler: it didn't.
We were saving bandwidth like champs. But then latency crept up, CPU usage hit the roof, and users started noticing slower responses.
That's when it hit me: compression is a trade-off, not a magic fix. You're trading CPU time for bandwidth. Sometimes it's worth it. Sometimes it's a disaster.
Here's what I've learned from that mess:
• If your system is CPU-bound, compression will hurt more than it helps.
• Tiny payloads? Don't bother. The compression headers alone can make them bigger.
• Stop using gzip for everything. LZ4 is better for real-time traffic; Brotli wins for static assets.
• Adaptive compression is the real deal: let your system decide when to compress based on data size and latency budget (see the sketch below).
And for the love of clean architecture, measure everything. Don't just benchmark compression speed; measure end-to-end latency, including decompression on the client side. Saving 10 KB sounds great until it adds 50 ms to your API response.
The takeaway? Compression vs. latency isn't about picking sides. It's about context. Figure out your constraints, pick your battles, and maybe skip that "compress everything" toggle next time.
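As a rough illustration of that adaptive idea, here is a minimal Go sketch; the 1 KB threshold, the gzip.BestSpeed level, and the helper name maybeCompress are assumptions for the example, not recommendations:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// minCompressSize is an assumed threshold: payloads smaller than this are
// sent as-is, since compression overhead can exceed the savings.
const minCompressSize = 1024

// maybeCompress gzips the payload only when it is large enough to plausibly
// repay the CPU cost. It returns the bytes to send and whether they are
// compressed, so the caller can set Content-Encoding accordingly.
func maybeCompress(payload []byte) ([]byte, bool) {
	if len(payload) < minCompressSize {
		return payload, false
	}
	var buf bytes.Buffer
	w, _ := gzip.NewWriterLevel(&buf, gzip.BestSpeed) // favor latency over ratio
	w.Write(payload)
	w.Close()
	// If compression did not actually shrink the payload, skip it.
	if buf.Len() >= len(payload) {
		return payload, false
	}
	return buf.Bytes(), true
}

func main() {
	small := []byte(`{"ok":true}`)
	large := bytes.Repeat([]byte("the same log line over and over "), 200)

	for _, p := range [][]byte{small, large} {
		out, compressed := maybeCompress(p)
		fmt.Printf("in=%d bytes, out=%d bytes, compressed=%v\n", len(p), len(out), compressed)
	}
}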
-
Bloom Filter Go optimization journey: 3s -> 440ms -> 675ms(!!) -> 66µs
A Go optimization story. I just published a deep dive on the (sometimes painful) process of optimizing my Go Bloom filter. It's a real-world pprof war story, with a 6,595x speedup at the end.
It started with a benchmark that took 3 seconds per op. Spoiler: the benchmark was lying. Profiling showed 79% of CPU time was in fmt.Sprintf. After fixing the test, the real baseline was 440ms. pprof showed the new bottleneck: runtime.mallocgc and runtime.mapassign_fast64. I was allocating a map on every single Add/Contains call.
* My first "fix" (map pooling with for...delete) was a disaster. It was 1.5x slower (~675ms). Lesson learned: zero allocations ≠ fast.
* The real win came from a data structure change: replacing the map[uint64] with a direct-access array (a minimal sketch of the idea follows below). This gave O(1) indexing and zero allocations, bringing the time down to 66 microseconds.
The journey taught me some critical lessons about performance:
1. Your benchmark is lying: my first profile showed 79% of CPU time was in fmt.Sprintf, not my code. Always pprof your tests!
2. Zero allocations ≠ fast: my map-pooling "fix" achieved 1 alloc/op but was 1.5x slower. We traded a GC bottleneck for a worse CPU bottleneck (the O(N) for...delete).
3. Data structure > allocations: the real win wasn't just fewer allocations, it was replacing the map with a direct-access array. This single change gave O(1) access and eliminated 74% of CPU overhead.
I documented the entire saga, with all the profiles, failed attempts, and benchmark charts. If you're into Go performance, profiling, or optimization, this one's for you.
You can read the full story here: https://lnkd.in/dCzTvbFS
#golang #performance #optimization #pprof #softwareengineering #simd #go
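The post links to the full write-up; as a stand-in, here is a minimal Go sketch (not the author's code) of the general shape of that change: a Bloom filter whose Add/Contains index directly into a preallocated bit array instead of building a map on every call. The sizes and double-hashing scheme are illustrative assumptions.

package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter backed by a preallocated bit array
// ([]uint64), so lookups index directly into memory instead of allocating
// a map per operation.
type bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    int    // number of hash functions
}

func newBloom(m uint64, k int) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hash derives two 64-bit hashes; k positions are built from them
// via double hashing.
func (b *bloom) hash(data []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(data)
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31 // cheap second hash derived from the first
	return h1, h2
}

func (b *bloom) Add(data []byte) {
	h1, h2 := b.hash(data)
	for i := 0; i < b.k; i++ {
		p := (h1 + uint64(i)*h2) % b.m
		b.bits[p/64] |= 1 << (p % 64) // set bit p directly, O(1), no map
	}
}

func (b *bloom) Contains(data []byte) bool {
	h1, h2 := b.hash(data)
	for i := 0; i < b.k; i++ {
		p := (h1 + uint64(i)*h2) % b.m
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1<<20, 4)
	f.Add([]byte("hello"))
	fmt.Println(f.Contains([]byte("hello")), f.Contains([]byte("world"))) // true, almost certainly false
}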
-
This blog explains how a high-traffic payment service in Go was CPU-bound, so its most intensive endpoints were rewritten in Rust for better performance and efficiency. The new Rust implementation doubled traffic capacity and cut CPU usage, saving nearly $300,000 annually.
-
React Tip: Pass a Function to useState, Don't Call It
We've all written code like this before:
const [value, setValue] = useState(getInitialValue());
It looks innocent, but this line hides a subtle performance pitfall. When you call useState(getInitialValue()), that function executes on every render, not just the first one. That's because React re-runs your component function on every render, and the argument to useState() is evaluated before React decides whether to reuse the existing state. If getInitialValue() does something expensive, like computing, parsing, or reading from storage, you're wasting CPU cycles on every render.
Instead, pass the function itself so React calls it lazily, only when initializing the state:
function getInitialValue() {
  console.log("getInitialValue() called");
  // imagine something slow here...
  return Math.random();
}

function ExampleComponent() {
  const [value, setValue] = useState(getInitialValue);
  return (
    <div>
      <p>Value: {value}</p>
      <button onClick={() => setValue(v => v + 1)}>Increment</button>
    </div>
  );
}
If you open the console, you'll see "getInitialValue() called" logged only on the initial render, no matter how many times the component re-renders. https://lnkd.in/gPA6WT8d
-
std::variant has been around since 2017 and it still never ceases to amaze me. It promises to hold objects of a fixed set of types (fundamental types, user-defined types, etc.) in an internal buffer big enough for the largest type. In other words: it *directly* holds the underlying object.
The classic way of holding heterogeneous objects belonging to an inheritance hierarchy in a vector is via an (owning) pointer to the base class:
class Base {};
class Derived1 : public Base {};
class Derived2 : public Base {};
...
class DerivedN : public Base {};
using MyCollection = std::vector<std::unique_ptr<Base>>;
But even though the contents of the std::vector may fit in a cache line, the underlying objects lie elsewhere in memory: better if they come from a custom allocator, worse if they were just new'd randomly (which also leads to memory fragmentation).
***So, every time we try to access the underlying objects, we almost certainly nuke the CPU caches***
Enter std::variant: we can directly hold the underlying heterogeneous objects in a vector as:
using MyCollection = std::vector<std::variant<Derived1, Derived2, ..., DerivedN>>;
This paves the way for the objects to fit in cache lines, since there is no indirection now. As a result, once we access one object, adjacent objects get pulled into the CPU caches owing to data locality, leading to performance gains on subsequent accesses. The gains are more pronounced when the objects are small, since more of them pack into a single cache line.
So don't just reach for std::variant to look and feel modern: you're getting free performance gains under the hood, given the way CPU caches work :)
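The locality argument itself is language-agnostic. As a rough Go analogy (hypothetical types, not the C++ from the post): iterating a slice of values keeps adjacent objects on shared cache lines, while a slice of heap pointers reintroduces the indirection the post warns about.

package main

import "fmt"

// point is a small hypothetical type; small objects pack many per cache line.
type point struct{ x, y int64 }

// sumValues walks a contiguous slice of values: adjacent elements share
// cache lines, so most accesses hit data already pulled into cache.
func sumValues(ps []point) int64 {
	var s int64
	for i := range ps {
		s += ps[i].x + ps[i].y
	}
	return s
}

// sumPointers walks a slice of pointers: each element may live anywhere on
// the heap, so each access risks a cache miss (the indirection problem).
func sumPointers(ps []*point) int64 {
	var s int64
	for _, p := range ps {
		s += p.x + p.y
	}
	return s
}

func main() {
	const n = 1 << 20
	vals := make([]point, n)
	ptrs := make([]*point, n)
	for i := range vals {
		vals[i] = point{x: int64(i), y: int64(i)}
		ptrs[i] = &point{x: int64(i), y: int64(i)}
	}
	fmt.Println(sumValues(vals), sumPointers(ptrs)) // same sums, very different memory behavior
}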
-
Reliability and large storage capacity are just two benefits that multilayer ceramic capacitors (MLCCs) and polymer tantalum capacitors bring to AI systems. We take a look at the role that MLCCs and polymer tantalum capacitors have in AI servers here: http://arw.li/60417zgCB
-
𝘆𝗼𝘂 𝘁𝗵𝗶𝗻𝗸 𝗼𝗻𝗹𝘆 𝗵𝘂𝗺𝗮𝗻𝘀 𝗵𝗮𝘃𝗲 𝗳𝗿𝗶𝗰𝘁𝗶𝗼𝗻𝘀? here are some cool examples of deep system conflict:
• GC compaction invalidates CPU caches and TLBs, spoiling temporal locality. under memory pressure, they blame each other
• CPUs reorder instructions for throughput, but exceptions (signals, page faults) demand a precise architectural state, forcing the CPU to roll back execution and undo any gain from reordering
• the JIT optimizes based on the currently visible bytecode. when new classes or modules load (e.g. OSGi), everything is invalidated, forcing deoptimization storms
• NUMA optimizes for locality, paging optimizes for availability. locality vanishes once the OS swaps or migrates pages, defeating NUMA's purpose
• GCC's vectorizer expects predictable loops. polymorphic code breaks those assumptions, forcing de-vectorization and falling back to slow paths
• instrumentation injects extra logic, monkey patches rewrite functions, profilers skew timing. every probe alters code paths, object lifetimes, and optimization. end result: the observer mutates the observed
• hardware prefetchers assume linear access. in bad cases, they pull stale or useless data, polluting caches
• simultaneous multithreading boosts CPU utilization, but shared execution units cause random stalls, which is deadly for real-time workloads
• branch prediction boosts performance, but some security mitigations (e.g. fence instructions for Spectre) nullify the advantage, trading gigahertz for safety
bottom line: conflicts in systems arise from unreconciled assumptions (machine) and incoherent creativity (man)
-
Our friends at Signal Integrity Journal ran an interesting article by Samtec’s Andrew Josephson, “ManyPoint Networks: A System Co-Design Framework for 448 Gbps AI Fabrics and Beyond.” This article introduces a hardware-centric definition of compute cluster bisection bandwidth as a performance metric for AI-scale 448 Gbps systems. Unlike traditional abstractions, this metric is grounded in physical interconnect layout and IO port availability, enabling system architects to evaluate bandwidth provisioning through real, bidirectional link paths. https://lnkd.in/gPEXRCWv