Resilient
Microservices
at Scale
Stories and Strategies from Our Journey
● 5 years at Yotpo
● 14 years of experience
● Experienced in building scalable and
reliable systems
Hi
Elia Rohana
Tech Leader at Yotpo Email Group
👋
Raised
$400M
YoY Growth
20%
Customers
100K
Employees
1K
NUMBERS
INVESTORS
PRODUCTS
Reviews Loyalty SMS Email
What is Yotpo?
eCommerce Retention Marketing Platform
Microservices Resilient Engineering - Java meetup.pptx
🔌
Latency Issues
Slow service
What can go wrong?
😓
Network failures
DNS, routing issues
⚠️
Unavailable / unresponsive
Service is down, overloaded
🪄
Seamlessly
without the user even noticing
Resiliency
The capability of a system to handle unexpected situations
🦢
Gracefully
degrading service with fallbacks in place
🤖
Automatically
recovering, as if nothing ever happened
Resiliency
Why do we need
our applications
to be
resilient? 🤔
Failure is inevitable
In a complex system such as a
distributed system with microservice
architecture, failure is the normal case.
⚙️
Customer
satisfaction 😊
Ensuring our applications remain
reliable keeps customers happy and
engaged
Quiet production
Oncalls 📞🤫
Reducing failures and incidents means
fewer late-night calls and a more
peaceful work-life balance for
engineers.
Resiliency Patterns
🌱
Timeout
Ensure every network call has a
timeout
Large timeouts can impact client experience and
system throughput
Timeout Breakdown
● Connection Timeout: Time to complete a
TCP handshake
● Read Timeout: Time to wait for data to be
read
Timeout
Best Practices 📋
1. Measure your HTTP client metrics and
configure timeouts accordingly
2. Set a very low connection timeout (~100 ms)
for internal service communication
3. Configure timeouts per client
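As a rough illustration of the timeout practices above, here is a minimal sketch using the JDK's built-in `java.net.http` client. The class name `TimeoutSketch`, the 2-second value, and the idea of one client per downstream service are assumptions made for the example, not something the slides prescribe. Note that `HttpRequest.timeout` bounds the wait for the response rather than each individual socket read, so it only approximates the "read timeout" described in the slides.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class TimeoutSketch {
    // Hypothetical values: derive real ones from measured client latency.
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(100);
    static final Duration RESPONSE_TIMEOUT = Duration.ofSeconds(2);

    // One client per downstream service, so each can carry its own timeouts.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT) // time allowed to establish the connection
                .build();
    }

    // Every request gets an explicit upper bound on how long we wait for a response.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(RESPONSE_TIMEOUT)
                .GET()
                .build();
    }
}
```

Sending `newRequest(...)` through `newClient()` would then fail with a timeout exception instead of blocking a thread indefinitely.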
Retry & Limits
Best Practice
● Limit retries to 3-5 attempts
● Retry transient errors (e.g., 502, 503, 504,
timeouts)
● Only retry idempotent operations
● Use exponential backoff for retries
● Monitor retry count and total request time
to avoid excessive retries
● Use TimeLimiter to enforce a timeout for the
entire operation, including retries.
Handling Network Issues
Retry Example
🔄
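The original retry example was an image and is not recoverable from this text. As a stand-in, the following is a minimal plain-Java sketch of the retry practices listed earlier: a bounded number of attempts, retrying only errors classified as transient, and exponential backoff. The names `RetrySketch` and `callWithRetry` and the predicate-based error classification are assumptions for illustration; this is not Resilience4j's API, and it does not include the overall time limit that the slides suggest enforcing with TimeLimiter.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

final class RetrySketch {
    /** Calls the action up to maxAttempts times, doubling the wait after each transient failure. */
    static <T> T callWithRetry(Supplier<T> action, int maxAttempts, long initialBackoffMillis,
                               Predicate<RuntimeException> isTransient) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // Give up on non-transient errors or when the attempt budget is spent.
                if (!isTransient.test(e) || attempt == maxAttempts) {
                    throw e;
                }
            }
            sleep(backoff);
            backoff *= 2; // exponential backoff
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }
}
```

The action passed in should be idempotent, since it may run several times before a result is returned.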
Fallbacks
/ A Better Alternative to Failing /
Fallbacks can be valuable, depending on business
needs.
Usually combined with other patterns
Fallback Techniques
● Hide feature
● Use default or alternative option
● Fallback to a Cache
● If no fallback -> fail fast
Limitations
● Write or batch operation
⚠️
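The "fallback to a cache" and "use default" techniques can be sketched as a small wrapper around a read operation. This is only an illustration under assumptions: the class name `CachedFallback`, the last-known-good map, and the treatment of any `RuntimeException` as a failure are all hypothetical choices, and values are assumed to be non-null. It applies to reads only, in line with the limitation noted for write or batch operations.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class CachedFallback<K, V> {
    private final Map<K, V> lastKnownGood = new ConcurrentHashMap<>();
    private final Function<K, V> primary;
    private final V defaultValue;

    CachedFallback(Function<K, V> primary, V defaultValue) {
        this.primary = primary;
        this.defaultValue = defaultValue;
    }

    /** Returns the fresh value, else the last cached value, else the default. */
    V get(K key) {
        V fresh;
        try {
            fresh = primary.apply(key);
        } catch (RuntimeException e) {
            // Primary failed: degrade gracefully instead of propagating the error.
            return lastKnownGood.getOrDefault(key, defaultValue);
        }
        lastKnownGood.put(key, fresh); // remember the last successful answer
        return fresh;
    }
}
```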
But what happens when
a service is completely
down or unresponsive?
🚨
Under high load on the
system, we will keep
retrying until…
🚨 🚨 🚨
💀
Circuit Breaker
Circuit Breaker
A circuit breaker protects
systems by halting calls to
failing services, preventing
cascading failures, and
ensuring stability through fast
failures and fallback responses
Circuit Breakers
/ Calls to a broken service
fail fast /
● Time-based sliding window
● Count-based sliding window
Best Practices 👍
● Configure for slow or failed services
● Configure per operation with proper settings for every
backend/operation
● Define fallback when applicable
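To make the closed, open, and half-open states concrete, here is a simplified count-based sliding-window breaker in plain Java. It is a sketch under assumptions and not how Resilience4j implements the pattern: the class name, the constructor parameters, the injected clock, and the rule that a single trial call decides the half-open state are all choices made for this example. It tracks failures only and leaves out the slow-call detection that the best practices above also recommend.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // number of recent calls to evaluate
    private final double failureRateThreshold; // e.g. 0.5 opens at 50% failures
    private final long openMillis;             // how long to stay open before a trial call
    private final LongSupplier clockMillis;    // injected so tests can control time
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    CircuitBreakerSketch(int windowSize, double failureRateThreshold,
                         long openMillis, LongSupplier clockMillis) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openMillis = openMillis;
        this.clockMillis = clockMillis;
    }

    synchronized State state() { return state; }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!tryAcquire()) {
            return fallback.get(); // open: fail fast without touching the backend
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private synchronized boolean tryAcquire() {
        if (state == State.OPEN) {
            if (clockMillis.getAsLong() - openedAt < openMillis) return false;
            state = State.HALF_OPEN; // wait elapsed: let a trial call through
        }
        return true;
    }

    private synchronized void record(boolean failed) {
        if (state == State.HALF_OPEN) {
            if (failed) open(); else { state = State.CLOSED; window.clear(); }
            return;
        }
        if (state == State.OPEN) return; // result of a call admitted before opening
        window.addLast(failed);
        if (window.size() > windowSize) window.removeFirst();
        if (window.size() == windowSize) {
            long failures = window.stream().filter(f -> f).count();
            if ((double) failures / windowSize >= failureRateThreshold) open();
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = clockMillis.getAsLong();
        window.clear();
    }
}
```

One breaker instance per backend operation matches the "configure per operation" advice, since each instance keeps its own window and thresholds.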
Now what Happens When a
Service Is Too Slow but
Doesn’t Fail?
🐢🐢
🐢
Bulkhead
/ Protect against latency issues /
Limits the number of concurrent operations against a
specific resource so as not to “flood” the entire service,
avoiding cascading failures
Bulkhead
Resource Isolation
● Dedicated resources (threads or semaphores) for each
service or task.
● Prevents one service from consuming all resources.
Two Bulkhead Types
● Semaphore Bulkhead: Limits concurrent requests.
● Thread Pool Bulkhead: Uses a bounded thread pool
with a queue for overflow.
Handling Overload
● Requests beyond the limit are rejected (Semaphore) or
queued/rejected (Thread Pool).
● Protects system stability and prevents resource
starvation.
/ how it works /
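A semaphore bulkhead can be sketched in a few lines of plain Java. The class name `SemaphoreBulkhead` and the choice to reject immediately (rather than wait briefly for a permit) are assumptions for this example; a production library would typically offer a configurable wait time as well.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action in the calling thread if a permit is free, otherwise rejects. */
    <T> T call(Supplier<T> action, Supplier<T> onRejected) {
        if (!permits.tryAcquire()) {
            return onRejected.get(); // bulkhead full: reject instead of piling up
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always hand the permit back, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

The `onRejected` supplier is where a fallback, or a 503 response, would be produced.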
It is possible to combine the patterns
Semaphore Bulkhead Threadpool Bulkhead
Feature | Threadpool Bulkhead | Semaphore Bulkhead
Execution | Tasks run in separate threads | Tasks run in the calling thread
Concurrency Control | Thread pool size and queue capacity | Semaphore permits
Task Queueing | Supports task queueing (limited by capacity) | No queueing; tasks block or reject
Overhead | Higher (thread pool management) | Lower (simple semaphore)
Use Case | Asynchronous or I/O-heavy workloads | Synchronous or lightweight tasks
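For comparison with the semaphore variant, a thread pool bulkhead can be approximated with the JDK's `ThreadPoolExecutor`: a fixed number of threads, a bounded queue for overflow, and rejection once both are full. The class name `ThreadPoolBulkhead` and the use of `AbortPolicy` are assumptions for this sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ThreadPoolBulkhead {
    private final ThreadPoolExecutor pool;

    ThreadPoolBulkhead(int threads, int queueCapacity) {
        this.pool = new ThreadPoolExecutor(
                threads, threads,                         // fixed-size pool
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),  // bounded overflow queue
                new ThreadPoolExecutor.AbortPolicy());    // reject when pool and queue are full
    }

    /** Runs the task on the pool; throws RejectedExecutionException when saturated. */
    <T> Future<T> submit(Callable<T> task) {
        return pool.submit(task);
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Because tasks run on the pool's threads, a slow dependency can exhaust only this pool, not the caller's request threads.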
Bulkhead
/ Helps protect against latency issues /
Best Practices 🏆
● Prioritize critical consumers over standard ones
● Define the right limits for the bulkhead
● Monitor & Adjust the limits
● Define fallbacks if possible; if not, return 503
Rate Limiter
/ Controlling Traffic to Protect Services /
● The pattern applies on both the server and
the client side
● Limits the number of calls to a given service
● Spreads the load over time
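The idea of limiting calls per period can be illustrated with a fixed-window limiter in plain Java. This is a deliberately simple sketch under assumptions: the class name, the injected millisecond clock, and the fixed-window algorithm are choices for the example. A fixed window caps calls per period but still allows a burst at the start of each window, and the sketch does not model waiting for a permit with a timeout.

```java
import java.util.function.LongSupplier;

final class FixedWindowRateLimiter {
    private final int limitForPeriod;
    private final long periodMillis;
    private final LongSupplier clockMillis; // injected so tests can control time
    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int limitForPeriod, long periodMillis, LongSupplier clockMillis) {
        this.limitForPeriod = limitForPeriod;
        this.periodMillis = periodMillis;
        this.clockMillis = clockMillis;
        this.windowStart = clockMillis.getAsLong();
    }

    /** Returns true if the call may proceed in the current window. */
    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        if (now - windowStart >= periodMillis) {
            windowStart = now; // a new period begins: reset the budget
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false; // budget exhausted until the next period
    }
}
```

With `new FixedWindowRateLimiter(5, 1000, System::currentTimeMillis)`, at most five calls are admitted in each one-second window.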
Queue Based
Communication
Buffers tasks to smooth heavy loads,
decouples consumer and producer, and
ensures communication reliability
Kafka Error
Handling
Available with Spring Kafka
● Map transient and non-transient errors
● Make the system resilient to transient errors, meaning
automatic recovery with no manual actions
● Transient error handler is implemented with
SeekToErrorHandler. It makes sure the consumer is
never kicked out of the consumer group
● Retry count is configurable
● Spring Kafka Error Handling
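The mapping of transient versus non-transient errors can be sketched independently of any Kafka API. The code below is plain Java and does not use Spring Kafka: `TransientException`, the `Outcome` enum, and the decision to skip non-transient messages are hypothetical choices for illustration, since the slides do not say what happens to messages that fail permanently. In a real consumer, the retry step would correspond to seeking back to the failed offset so the same record is polled again.

```java
import java.util.function.Consumer;

final class ErrorClassificationSketch {
    /** Hypothetical marker for errors expected to heal on their own (timeouts, dependency down). */
    static class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    enum Outcome { PROCESSED, SKIPPED_NON_TRANSIENT, RETRIES_EXHAUSTED }

    /** Processes one message, retrying transient failures up to maxRetries extra times. */
    static <M> Outcome handle(M message, Consumer<M> processor, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                processor.accept(message);
                return Outcome.PROCESSED;
            } catch (TransientException e) {
                // Transient: fall through and process the same message again.
            } catch (RuntimeException e) {
                // Non-transient (a "poison pill"): retrying cannot help, so stop here.
                return Outcome.SKIPPED_NON_TRANSIENT;
            }
        }
        return Outcome.RETRIES_EXHAUSTED;
    }
}
```

A backoff between attempts is omitted for brevity; in practice the retry delay would need to stay within the consumer's poll interval so the consumer keeps its place in the group.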
Spring Cloud
Circuit Breaker
Resilience4j
● Easy to use
● Everything is configurable
● Integrated in Spring Boot
● Supports the reactive stack
● Supports many resiliency patterns
● Integrated with Micrometer -
exposes many useful metrics
● Functional support
● Annotation support
A resiliency library abstraction
on top of Resilience4j
Key Takeaways for Reliable Systems
Wrapping Up
Adopt Resiliency Patterns
Implement retries, timeouts, circuit
breakers, and bulkheads to protect
your services from failures and
overloads.
Test for failures
Validate your system’s ability to handle
real-world failure scenarios by
thoroughly testing resiliency patterns.
Monitor and Alert
Set up robust observability with
dashboards, alerts, and metrics to
detect and address issues early.
Thank you!

Microservices Resilient Engineering - Java meetup.pptx

  • 1. Resilient Microservices at Scale Stories and Strategies from Our Journey
  • 2. ● 5 years at Yotpo ● 14 years of experience ● Experienced in building scalable and reliable systems Hi Elia Rohana Tech Leader at Yotpo Email Group 👋
  • 3. Raised $400M YoY Growth 20% Customers 100K Employees 1K NUMBERS INVESTORS PRODUCTS Reviews Loyalty SMS Email What is Yotpo? eCommerce Retention Marketing Platform
  • 6. 🔌 Latency Issues Slow service What can go wrong? 😓 Network failures DNS, routing issues ⚠️ Unavailable / unresponsive Service is down, overloaded
  • 10. 🪄 Seamlessly without the user even noticing Resiliency The capability of a system to handle unexpected situations 🦢 Gracefully degrading service with fallbacks in place 🤖 Automatically recovering, as if nothing ever happened
  • 11. Resiliency Why do we need our applications to be resilient? 🤔 Failure is inevitable In a complex system such as a distributed system with microservice architecture, failure is the normal case. ⚙️ Customer satisfaction 😊 Ensuring our applications remain reliable keeps customers happy and engaged Quiet production Oncalls 📞🤫 Reducing failures and incidents means fewer late-night calls and a more peaceful work-life balance for engineers.
  • 13. Timeout Ensure every network call has a timeout Large timeouts can impact client experience and system throughput Timeout Breakdown ● Connection Timeout: Time to complete a TCP handshake ● Read Timeout: Time to wait for data to be read
  • 14. Timeout Best Practices 📋 1. Measure your HTTP client metrics and configure timeouts accordingly 2. Set a very low connection timeout (~100 ms) for internal service communication 3. Configure timeouts per client
  • 15. Retry & Limits Best Practice ● Limit retries to 3-5 attempts ● Retry transient errors (e.g., 502, 503, 504, timeouts) ● Only retry idempotent operations ● Use exponential backoff for retries ● Monitor retry count and total request time to avoid excessive retries ● Use TimeLimiter to enforce a timeout for the entire operation, including retries. Handling Network Issues
  • 20. Fallbacks / A Better Alternative to Failing / Fallbacks can be valuable, depending on business needs. Usually combined with other patterns Fallback Techniques ● Hide feature ● Use default or alternative option ● Fallback to a Cache ● If no fallback -> fail fast Limitations ● Write or batch operation ⚠️
  • 22. But what happens when a service is completely down or unresponsive ? 🚨
  • 24. Under high load on the system, we will keep retrying until… 🚨 🚨 🚨
  • 25. 💀
  • 27. Circuit Breaker A circuit breaker protects systems by halting calls to failing services, preventing cascading failures, and ensuring stability through fast failures and fallback responses
  • 31. Circuit Breakers / Calls to a broken service fail fast / ● Time-based sliding window ● Count-based sliding window Best Practices 👍 ● Configure for slow or failed services ● Configure per operation with proper settings for every backend/operation ● Define fallback when applicable
  • 33. Now what Happens When a Service Is Too Slow but Doesn’t Fail? 🐢🐢 🐢
  • 37. Bulkhead / Protect against latency issues / Limits the number of concurrent operations against a specific resource so as not to “flood” the entire service, avoiding cascading failures
  • 38. Bulkhead Resource Isolation ● Dedicated resources (threads or semaphores) for each service or task. ● Prevents one service from consuming all resources. Two Bulkhead Types ● Semaphore Bulkhead: Limits concurrent requests. ● Thread Pool Bulkhead: Uses a bounded thread pool with a queue for overflow. Handling Overload ● Requests beyond the limit are rejected (Semaphore) or queued/rejected (Thread Pool). ● Protects system stability and prevents resource starvation. / how it works /
  • 41. It is possible to combine the patterns
  • 43. Feature Threadpool Bulkhead Semaphore Bulkhead Execution Tasks run in separate threads Tasks run in the calling thread Concurrency Control Thread pool size and queue capacity Semaphore permits Task Queueing Supports task queueing (limited by capacity) No queueing; tasks block or reject Overhead Higher (thread pool management) Lower (simple semaphore) Use Case Asynchronous or I/O-heavy workloads Synchronous or lightweight tasks
  • 44. Bulkhead / Helps protect against latency issues / Best Practices 🏆 ● Prioritize critical consumers over standard ones ● Define the right limits for the bulkhead ● Monitor & Adjust the limits ● Define fallbacks if possible; if not, return 503
  • 45. Rate Limiter / Controlling Traffic to Protect Services / ● The pattern applies on both the server and the client side ● Limits the number of calls to a given service ● Spreads the load over time
  • 47. Queue Based Communication Buffers tasks to smooth heavy loads, decouples consumer and producer, and ensures communication reliability
  • 49. Kafka Error Handling Available with Spring Kafka ● Map transient and non-transient errors ● Make the system resilient to transient errors, meaning automatic recovery with no manual actions ● Transient error handler is implemented with SeekToErrorHandler. It makes sure the consumer is never kicked out of the consumer group ● Retry count is configurable ● Spring Kafka Error Handling
  • 52. Spring Cloud Circuit Breaker Resilience4j ● Easy to use ● Everything is configurable ● Integrated in Spring Boot ● Supports the reactive stack ● Supports many resiliency patterns ● Integrated with Micrometer - exposes many useful metrics ● Functional support ● Annotation support A resiliency library abstraction on top of Resilience4j
  • 53. Key Takeaways for Reliable Systems Wrapping Up Adopt Resiliency Patterns Implement retries, timeouts, circuit breakers, and bulkheads to protect your services from failures and overloads. Test for failures Validate your system’s ability to handle real-world failure scenarios by thoroughly testing resiliency patterns. Monitor and Alert Set up robust observability with dashboards, alerts, and metrics to detect and address issues early.

Editor's Notes

  • #1: Hello everyone, and thank you for coming. Today I'm going to talk about how we built a microservices system that handles a scale of 100 thousand requests per minute, a system that sends tens of millions of emails a day to end customers. The real challenge was building the system to be resilient, with a minimum of production incidents, and today I'll talk about how we got there.
  • #2: 14 years of experience. Five years ago Yotpo decided to develop a new product, the email product, and to open a Yotpo branch in the north, in Yokneam.
  • #3: Yotpo is a marketing platform that helps online businesses grow and increase sales through its various products, such as email, SMS, reviews, and loyalty & referrals.
  • #4: Let's go back two months, to the Black Friday and Cyber Monday sales event. It is the most important event for our customers, since it is the most profitable event for any online store. You can see Yotpo's crazy growth from year to year. This is our money time and we can't let our customers down: we need to make sure that tens of millions of emails reach shoppers with zero downtime. These are our real numbers.
  • #5: Present a simplified architecture of our system. Show that there is both synchronous and asynchronous communication. There are also integrations with services outside Yotpo. All the services run on Kubernetes. There is an execution flow that runs in the background and runs campaigns of millions of emails, and it is very important that the emails go out quickly and with zero errors.
  • #7: Here you can see that one of the services has gone down and stops responding.
  • #8: The template service also stops responding. When tens of thousands of requests are sent to the service that went down, the Tomcat server's thread pool starts to fill up and it can't handle any more requests. All of this happens very quickly, and it also affects the system's throughput, because we wait a long time for that service to respond.
  • #9: Then we reach a state where all the services in the chain go down, all because one service stopped responding or crashed. This is what is called a cascading failure, or the domino effect: one problem causes a chain of problems or crashes in a microservices architecture. So if we want a system that handles a large scale of requests and is also stable, we need to avoid and handle these situations, and that is what I'm going to explain now.
  • #11: We want to be prepared for unexpected situations and for problems in production.
  • #25: With a high number of requests, the timeouts propagate to the render engine. During sync execution, the render engine's resources, such as threads, database connections, and memory, are blocked, which might cause a partial failure or bring the service down entirely. Excessive retries hurt performance because every operation takes much longer, and the Tomcat threads are blocked, so the application stops responding.
  • #27: One of our challenges was in the render engine, because it is a service that consumes many other services to collect data for rendering an email (give examples). When one service or another collapsed under the load of hundreds of thousands of requests in a short time, it caused problems in our service as well. We wanted to prevent that with a circuit breaker, so that our service does not go down too, then the service that calls us, and so on. Overall, we want to avoid a cascading failure, where a problem in one service causes a problem across all Yotpo services. What happens if the retries are exhausted and the service is down? There is no sense in continuing to try. The circuit breaker pattern can be used in conjunction with other patterns, such as retry, fallback, and timeout, to enhance fault tolerance in systems. It prevents cascading failures in case of transient errors.
  • #30: In half-open mode, we retry until certain thresholds are reached. If the calls were successful, the state goes back to closed; otherwise it goes back to open.
  • #37: "In a distributed system, one slow or overloaded service can consume all the resources, leaving no capacity for other services. This can lead to cascading failures where the whole system becomes unavailable."
  • #38: "In a distributed system, one slow or overloaded service can consume all the resources, leaving no capacity for other services. This can lead to cascading failures where the whole system becomes unavailable."
  • #40: Give an example where we return 503 in case of failure.
  • #41: A fallback can be defined for every operation.
  • #44: “The Bulkhead pattern is a type of application design that is tolerant of failure. In a bulkhead architecture, elements of an application are isolated into pools so that if one fails, the others will continue to function. It's named after the sectioned partitions (bulkheads) of a ship's hull. If the hull of a ship is compromised, only the damaged section fills with water, which prevents the ship from sinking.”
  • #46: 5 requests per 1 second, and a timeout duration of 500 milliseconds
  • #47: I’m not going to dive deep into the pros and cons of async communication, but I want to emphasize additional resiliency patterns on top of it
  • #49: Although async communication is more stable than sync communication, errors can still happen, and we need the system to be resilient, meaning no manual intervention. Not all errors are equal; we have different types of errors. Transient ones, such as service not available, database not available, or timeouts: these involve critical resources that the service cannot process a message without. Other errors, such as poison pills: a NullPointerException or other exceptions.
  • #50: Talk about coreHttpException; in practice this is our own mapping.
  • #53: We want to be prepared for unexpected situations and for problems in production.