"Scaling in space and time with Temporal", Andriy Lupa.pdf
What is Temporal.io?
https://www.youtube.com/watch?v=0BdLUnay9ok&ab_channel=fwdays
#Short description based on documentation
What is Temporal Workflow?
Add screen
What is Day.io?
- Collects clock-in/clock-out events (punches) from mobile and tablet devices, integrations, and web widgets.
- For each employee and each shift, calculates the basic metrics for salary payments: worked time, rest time, night-shift time, overtime, and many more.
- Updates aggregated metrics for the payroll period report.
- Generates files with custom formatting from the stored JSON data.
Requirements
- Bursty load of events that trigger recalculations
- Deduplication and throttling. Previously we had huge queues and processed every event one by one, although we knew there were duplicates.
- Persisting results at different stages
- Optimisations inside the calculation pipeline
- Scalability
- Retry logic
- Transactions
Where is the load?
- ~2M events per day
- Synchronous calculations:
  - >30 metrics for a given period (from - to)
  - >15 metrics that depend on the previous day (from - now): changing a day from 1 year ago triggers recalculation of the entire year!
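The difference between the two kinds of metrics can be sketched as a date-range calculation. This is only an illustration: the function name `recalculation_range` and the `chained` flag are hypothetical, and it assumes that a period-scoped metric is recomputed for exactly the (from - to) range while a metric that depends on the previous day is recomputed from the changed day up to today.

```python
from datetime import date


def recalculation_range(
    period_start: date, period_end: date, today: date, chained: bool
) -> tuple[date, date]:
    """Return the inclusive date range that has to be recalculated.

    `chained` marks metrics that depend on the previous day (assumption:
    every day after the changed one is affected, up to `today`).
    """
    if chained:
        # The change propagates day by day, so the range ends at today.
        return period_start, today
    # Period-scoped metrics only cover the requested (from - to) range.
    return period_start, period_end


def days_in_range(start: date, end: date) -> int:
    """Number of days in an inclusive range."""
    return (end - start).days + 1
```

With these assumptions, editing a single day from one year back yields a range of about 366 days for a chained metric, against one day for a period-scoped one.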
High-level setup
- 5 Kafka consumers
- In-house Temporal cluster on top of Postgres
- 2 workers for high-priority workflows
- 15 workers for low-priority workflows
Workflow, version 1
- A single workflow run per employee and period (enforced by Temporal)
- Policy: terminate the running workflow if a new one with the same ID arrives
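The "terminate the running one when a new one with the same ID arrives" policy can be imitated in plain Python with asyncio. This sketch does not use the Temporal SDK: `SingleRunRegistry`, `workflow_id`, `calculate`, and the ID format are all hypothetical names chosen for illustration, and task cancellation merely stands in for what the Temporal server does.

```python
import asyncio


def workflow_id(employee_id: str, period: str) -> str:
    """Hypothetical ID scheme: one ID per employee and period."""
    return f"{employee_id}:{period}"


class SingleRunRegistry:
    """Keeps at most one running task per workflow ID (illustrative only)."""

    def __init__(self) -> None:
        self._running: dict[str, asyncio.Task] = {}

    async def start(self, wid: str, coro) -> asyncio.Task:
        previous = self._running.get(wid)
        if previous is not None and not previous.done():
            # A newer event supersedes the calculation in progress.
            previous.cancel()
            # Wait until the old run has actually stopped.
            await asyncio.gather(previous, return_exceptions=True)
        task = asyncio.create_task(coro)
        self._running[wid] = task
        return task


async def calculate(results: list, label: str, seconds: float) -> None:
    await asyncio.sleep(seconds)  # stands in for the metric calculations
    results.append(label)


async def demo() -> list:
    results: list = []
    registry = SingleRunRegistry()
    wid = workflow_id("emp-1", "2024-05")
    await registry.start(wid, calculate(results, "first", 0.2))
    await asyncio.sleep(0)  # give the first run a chance to begin
    second = await registry.start(wid, calculate(results, "second", 0.01))
    await second
    return results
```

Running `asyncio.run(demo())` leaves only the second run's result in the list, because the first run is interrupted before it finishes.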
Outputs:
- Faster than the old setup, because we started interrupting calculations when the same event arrived again
- Temporal cluster was at ~100% CPU every time we had even a small load
- Workers were at ~100% CPU
- Scaling didn’t help
Next steps:
- Reduce the number of workflows
- Fine-tune configurations
Workflow, version 2
- A single workflow run per employee (enforced by Temporal)
- Notify the running workflow about each consumed event for the same employee
- In-memory queue in each workflow, with deduplication logic for intersecting periods
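One way to picture the per-workflow queue is a list of pending periods in which any newly arrived period is merged with the ones it intersects. The sketch below is an assumption about how such deduplication could look, not the implementation from the talk; `PeriodQueue` and its methods are hypothetical, and periods are treated as inclusive date ranges.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PeriodQueue:
    """Pending recalculation periods for one employee (illustrative only)."""

    pending: list[tuple[date, date]] = field(default_factory=list)

    def push(self, start: date, end: date) -> None:
        """Add a period, merging it with every pending period it intersects."""
        kept = []
        for s, e in self.pending:
            if s <= end and start <= e:
                # Intersecting periods collapse into one wider period.
                start, end = min(start, s), max(end, e)
            else:
                kept.append((s, e))
        kept.append((start, end))
        self.pending = sorted(kept)

    def pop(self):
        """Take the earliest pending period, or None if the queue is empty."""
        return self.pending.pop(0) if self.pending else None
```

Pushing the same period twice, or two overlapping periods, leaves a single entry in the queue, so the calculation runs once for the merged range instead of once per event.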
Outputs:
- Faster for clients again, this time also because of deduplication
- Number of operations on Temporal dropped by ~30%
- Temporal cluster was still at ~100% CPU at peak time
- Scaling started working, but only x2-2.5 at most, not as promised
Next step: we cannot reduce the number of workflows any further, so let’s reduce the number of activities
Workflow, version 3
- Merged all the calculation, persisting, and event-sending activities into a single activity
Outputs:
- Faster for clients again, because each workflow processes fewer system micro-tasks
- Temporal cluster finally not hitting 100% CPU
- Scaling improved from x2-2.5 to x5-7, but still not what we were promised
Next step: we can merge the last 2 activities, but that is not a game changer. Let’s decrease latency inside the cluster by replacing the database
Replacing Postgres with Cassandra
Benchmark results for 12K workflows:

Metric                            Postgres-based    Cassandra-based
Total time                        ~45 minutes       ~15 minutes
Workflow starts, per sec          ~8                ~40
History pods latency, P99         ~6 ms, bursty     ~0.5 ms
GetTasks request rate, per sec    ~4                ~75
Other workflow types
- Long-running workflows that send alerts with a delay
- Background workflow for processing bulk operations
- Cron jobs
- Orchestrating data migrations
Next steps
- Merge our consecutive workflows to use the power of signals and avoid duplicated runs of business logic
- Use Local Activities
- Move more cron jobs and business transactions into Temporal
- Implement Human-in-the-Loop business processes such as onboarding
Pros:
- Retry logic and transactions. Your sagas will look like pure functions.
- Durable execution of long-running flows.
- Safe cron jobs, especially long-running or iterative ones.
- Triggers and workers can be implemented in different languages.
- Easy to set up an in-house cluster using Kubernetes.
- Great UI visualisation tools.
- Great Slack community, documentation, and open-source code.
Lessons learned
Temporal is a nice tool if you know what you need.
Temporal is NOT a nice tool if you DON’T know what you need.
Cons:
- Not very performant for short-running workflows with a huge number of activities.
- Postgres is enough for flat load. Use Cassandra for spikes.
- Bugs with Cassandra.
- CLI batch operations on workflows are very slow.
- Expensive to keep a cluster if workflows are not running all the time.
- High entry barrier.
Sources
Temporal.io Slack - https://t.mp/slack
Official documentation - https://docs.temporal.io
Blog - https://temporal.io/blog
Samples - https://github.com/temporalio/samples-typescript (also for Go, Python, Java, .NET)
Q&A
