Microservices Resilient Engineering - Java meetup.pptx

Resilient
Microservices
at Scale
Stories and Strategies from Our Journey

Hi 👋
Elia Rohana
Tech Leader at Yotpo Email Group
● 5 years at Yotpo
● 14 years of experience
● Experienced in building scalable and reliable systems

What is Yotpo?
eCommerce Retention Marketing Platform
NUMBERS
● Raised: $400M
● YoY Growth: 20%
● Customers: 100K
● Employees: 1K
INVESTORS
PRODUCTS
● Reviews
● Loyalty
● SMS
● Email

What can go wrong?
🔌
Latency issues
Slow service
😓
Network failures
DNS, routing issues
⚠️
Unavailable / unresponsive
Service is down or overloaded

Resiliency
The capability of a system to handle unexpected situations
🪄
Seamlessly
without the user even noticing
🦢
Gracefully
degrading service with fallbacks in place
🤖
Automatically
recovering, as if nothing ever happened

Resiliency
Why do we need our applications to be resilient? 🤔
Failure is inevitable ⚙️
In a complex system, such as a distributed system with a microservice architecture, failure is the normal case.
Customer satisfaction 😊
Ensuring our applications remain reliable keeps customers happy and engaged.
Quiet production on-calls 📞🤫
Reducing failures and incidents means fewer late-night calls and a more peaceful work-life balance for engineers.

Timeout
Ensure every network call has a
timeout
Large timeouts can impact client experience and
system throughput
Timeout Breakdown
● Connection Timeout: Time to complete a
TCP handshake
● Read Timeout: Time to wait for data to be
read

Timeout
Best Practices 📋
1. Measure your HTTP client metrics and configure timeouts accordingly
2. Set a very low connection timeout (~100 ms) for internal service communication
3. Configure timeout per client
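
A minimal sketch of per-client timeouts using the JDK's built-in `java.net.http` client (not necessarily the HTTP client used in the talk). The class name, method names, and any URL passed in are hypothetical. Note that `HttpRequest.timeout` bounds the wait for the response, which is the closest JDK equivalent of a read timeout.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

class TimeoutExample {
    // Connection timeout: upper bound on establishing the connection.
    // 100 ms follows the suggestion above for internal service calls.
    static HttpClient internalClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(100))
                .build();
    }

    // Response timeout: upper bound on waiting for the response to arrive.
    // It is set per request, so each backend can get its own value.
    static HttpRequest getRequest(String url, Duration responseTimeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(responseTimeout)
                .GET()
                .build();
    }
}
```

Sending the request with `internalClient().send(...)` would throw `HttpConnectTimeoutException` or `HttpTimeoutException` when the respective limit is exceeded.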

Retry & Limits
Best Practice
● Limit retries to 3-5 attempts
● Retry transient errors (e.g., 502, 503, 504,
timeouts)
● Only retry idempotent operations
● Use exponential backoff for retries
● Monitor retry count and total request time
to avoid excessive retries
● Use TimeLimiter to enforce a timeout for the
entire operation, including retries.
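
The retry rules above can be sketched in plain Java. This is an illustration, not the Resilience4j API: `RetryWithBackoff`, its parameters, and the `isTransient` predicate are hypothetical names, and the sketch does not enforce an overall time limit across attempts.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

class RetryWithBackoff {
    // Runs the operation up to maxAttempts times, doubling the wait
    // between attempts. Only errors accepted by isTransient are retried.
    static <T> T call(Supplier<T> operation, int maxAttempts, long initialBackoffMillis,
                      Predicate<RuntimeException> isTransient) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                // Give up on the last attempt or on a non-transient error.
                if (attempt >= maxAttempts || !isTransient.test(e)) {
                    throw e;
                }
            }
            try {
                Thread.sleep(backoff);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while backing off", ie);
            }
            backoff *= 2; // exponential backoff
        }
    }
}
```

The caller remains responsible for passing only idempotent operations, since the same call may run several times.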
Handling Network Issues

Fallbacks
/ A Better Alternative to Failing /
Fallbacks can be valuable, depending on business
needs.
Usually combined with other patterns
Fallback Techniques
● Hide feature
● Use default or alternative option
● Fallback to a Cache
● If no fallback -> fail fast
Limitations
● Write or batch operation
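
As an illustration of the "fallback to a cache" and "use default" techniques for read operations, here is a small plain-Java sketch. `CachedFallback` and its fields are hypothetical names, and the sketch assumes the primary source never returns null.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class CachedFallback<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();
    private final Function<K, V> primary;
    private final V defaultValue;

    CachedFallback(Function<K, V> primary, V defaultValue) {
        this.primary = primary;
        this.defaultValue = defaultValue;
    }

    // Tries the primary source first and remembers each successful answer.
    // On failure, serves the last good value, or the default if none exists.
    V get(K key) {
        try {
            V value = primary.apply(key);
            lastGood.put(key, value);
            return value;
        } catch (RuntimeException e) {
            return lastGood.getOrDefault(key, defaultValue);
        }
    }
}
```

A stale cached value is only acceptable when the business case tolerates it, which is why this approach suits reads better than writes.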
⚠️

But what happens when a service is completely down or unresponsive?
🚨

Under high load on the system, we will keep retrying until…
🚨

Circuit Breaker
A circuit breaker protects
systems by halting calls to
failing services, preventing
cascading failures, and
ensuring stability through fast
failures and fallback responses

Circuit Breakers
/ Calls to a broken service fail fast /
● Time-based sliding window
● Count-based sliding window
Best Practices 👍
● Configure for slow or failed services
● Configure per operation with proper settings for every
backend/operation
● Define fallback when applicable
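
To make the count-based sliding window concrete, here is a simplified plain-Java sketch of a circuit breaker. It is not the Resilience4j implementation: the class, its parameters, and the injected clock are hypothetical, and in the half-open state it lets every caller through rather than a limited number of trial calls.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final boolean[] window;          // ring buffer, true = failed call
    private final double failureRateThreshold;
    private final long openMillis;           // how long to stay open
    private final LongSupplier clock;        // injected so time can be faked
    private State state = State.CLOSED;
    private int recorded, next, failureCount;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long openMillis, LongSupplier clock) {
        this.window = new boolean[windowSize];
        this.failureRateThreshold = failureRateThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() { return state; }

    // Fails fast to the fallback while open; otherwise runs the action.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!tryAcquire()) return fallback.get();
        try {
            T result = action.get();
            onResult(false);
            return result;
        } catch (RuntimeException e) {
            onResult(true);
            return fallback.get();
        }
    }

    private synchronized boolean tryAcquire() {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) return false;
            state = State.HALF_OPEN; // wait elapsed: allow a trial call
        }
        return true;
    }

    private synchronized void onResult(boolean failed) {
        if (state == State.HALF_OPEN) {
            if (failed) open(); else close();
            return;
        }
        if (state == State.OPEN) return; // call started before the breaker opened
        if (recorded == window.length) {
            if (window[next]) failureCount--; // evict the oldest outcome
        } else {
            recorded++;
        }
        window[next] = failed;
        if (failed) failureCount++;
        next = (next + 1) % window.length;
        if (recorded == window.length
                && (double) failureCount / recorded >= failureRateThreshold) {
            open();
        }
    }

    private void open() { state = State.OPEN; openedAt = clock.getAsLong(); reset(); }
    private void close() { state = State.CLOSED; reset(); }
    private void reset() { Arrays.fill(window, false); recorded = next = failureCount = 0; }
}
```

While the breaker is open the backend receives no traffic at all, which gives a struggling service room to recover.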

Now what happens when a service is too slow but doesn’t fail?
🐢🐢🐢

Bulkhead
/ Protect against latency issues /
Limits the number of concurrent operations against a specific resource, so that one slow dependency cannot “flood” the entire service and cause cascading failures

Bulkhead
Resource Isolation
● Dedicated resources (threads or semaphores) for each
service or task.
● Prevents one service from consuming all resources.
Two Bulkhead Types
● Semaphore Bulkhead: Limits concurrent requests.
● Thread Pool Bulkhead: Uses a bounded thread pool
with a queue for overflow.
Handling Overload
● Requests beyond the limit are rejected (Semaphore) or
queued/rejected (Thread Pool).
● Protects system stability and prevents resource
starvation.
/ how it works /
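
A semaphore bulkhead can be sketched with `java.util.concurrent.Semaphore`. The class and method names here are hypothetical, and this is an illustration rather than the Resilience4j implementation; it rejects immediately instead of waiting for a permit.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the action in the calling thread if a permit is free;
    // otherwise returns the rejection result without calling the backend.
    <T> T execute(Supplier<T> action, Supplier<T> onRejected) {
        if (!permits.tryAcquire()) {
            return onRejected.get();
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each backend its own `SemaphoreBulkhead` instance is what provides the isolation: a slow dependency can exhaust only its own permits.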

It is possible to combine the patterns

Semaphore Bulkhead Threadpool Bulkhead

Feature | Threadpool Bulkhead | Semaphore Bulkhead
Execution | Tasks run in separate threads | Tasks run in the calling thread
Concurrency Control | Thread pool size and queue capacity | Semaphore permits
Task Queueing | Supports task queueing (limited by capacity) | No queueing; tasks block or reject
Overhead | Higher (thread pool management) | Lower (simple semaphore)
Use Case | Asynchronous or I/O-heavy workloads | Synchronous or lightweight tasks
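
For comparison, a thread pool bulkhead can be sketched with a bounded `ThreadPoolExecutor`. The names are hypothetical and this is not the Resilience4j implementation: tasks run on dedicated threads, overflow waits in a bounded queue, and anything beyond that is rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class ThreadPoolBulkhead implements AutoCloseable {
    private final ThreadPoolExecutor pool;

    ThreadPoolBulkhead(int threads, int queueCapacity) {
        // Fixed-size pool with a bounded queue; AbortPolicy rejects overflow.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Returns a future that is already failed when the bulkhead is full.
    <T> CompletableFuture<T> submit(Supplier<T> task) {
        try {
            return CompletableFuture.supplyAsync(task, pool);
        } catch (RejectedExecutionException e) {
            return CompletableFuture.failedFuture(e);
        }
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

With one thread and a queue capacity of one, a third task submitted while the first is still running is rejected straight away.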

Bulkhead
/ Helps protect against latency issues /
Best Practices 🏆
● Prioritize critical consumers over standard ones
● Define the right limits for the bulkhead
● Monitor & Adjust the limits
● Define fallbacks where possible; otherwise return 503

Rate Limiter
/ Controlling Traffic to Protect Services /
● The pattern applies on both the server and the client side
● Limits the number of calls to a certain service
● Spreads the load over time
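
One simple way to limit calls per period is a fixed-window counter, sketched below in plain Java. The class name, parameter names, and injected clock are hypothetical; production libraries such as Resilience4j offer more refined limiters.

```java
import java.util.function.LongSupplier;

class FixedWindowRateLimiter {
    private final int limitForPeriod;
    private final long periodMillis;
    private final LongSupplier clock; // injected so time can be faked in tests
    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int limitForPeriod, long periodMillis, LongSupplier clock) {
        this.limitForPeriod = limitForPeriod;
        this.periodMillis = periodMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Returns true if the call may proceed in the current window.
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= periodMillis) {
            windowStart = now; // a new period starts: refresh the budget
            used = 0;
        }
        if (used >= limitForPeriod) {
            return false;
        }
        used++;
        return true;
    }
}
```

On the client side a rejected call can be delayed until the next window; on the server side it is typically answered with HTTP 429.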

Queue Based
Communication
Buffers tasks to smooth heavy loads, decouples consumer and producer, and ensures reliable communication
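
The buffering idea can be illustrated in-process with a bounded `BlockingQueue`. This is only a stand-in for a real message broker such as Kafka, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BufferedWorkQueue {
    private final BlockingQueue<String> queue;

    BufferedWorkQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: never blocks; returns false when the buffer is full
    // so the producer can apply backpressure or shed load.
    boolean publish(String task) {
        return queue.offer(task);
    }

    // Consumer side: takes at most maxBatch tasks at its own pace,
    // independent of how fast the producer published them.
    List<String> drain(int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }
}
```

A burst from the producer is absorbed by the buffer, while the consumer keeps working through it in batches of a size it can handle.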

Kafka Error
Handling
Available with Spring Kafka
● Map transient and non-transient errors
● Make the system resilient to transient errors, meaning automatic recovery with no manual actions
● The transient error handler is implemented with SeekToCurrentErrorHandler, which makes sure the consumer is never kicked out of the consumer group
● Retry count is configurable
● Spring Kafka Error Handling

Resilience4j
● Easy to use
● Everything is configurable
● Integrated with Spring Boot
● Supports the reactive stack
● Supports many resiliency patterns
● Integrated with Micrometer, exposing many useful metrics
● Functional support
● Annotation support
Spring Cloud Circuit Breaker
A resiliency library abstraction on top of Resilience4j

Key Takeaways for Reliable Systems
Wrapping Up
Adopt Resiliency Patterns
Implement retries, timeouts, circuit
breakers, and bulkheads to protect
your services from failures and
overloads.
Test for failures
Validate your system’s ability to handle
real-world failure scenarios by
thoroughly testing resiliency patterns.
Monitor and Alert
Set up robust observability with
dashboards, alerts, and metrics to
detect and address issues early.
