Serverless for HPC
Luciano Mammino
fourTheorem
@loige
👋 Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let’s connect:
🌎 loige.co
🐦 @loige
🎥 loige
🧳 lucianomammino
Middy Framework
SLIC Starter - Serverless Accelerator
SLIC Watch - Observability Plugin
Business focused technologists.
Accelerated Serverless | AI as a Service | Platform Modernisation
We host a podcast about AWS and Cloud computing
🔗 awsbites.com
🎬 YouTube Channel
🎙 Podcast
📅 Episodes every week
@loige
#CLOUDDAY2022
Get the slides: fth.link/cd22
Agenda
● The 6 Rs of Cloud Migration
● A serverless case study
○ The problem space and types of workflows
○ Original on-premises implementation
○ The PoC
○ The final production version
○ The components of a serverless job scheduler
○ Challenges & Limits
The 6 Rs of Cloud Migrations
🗑 🕸 🚚
Retire Retain Rehost
🏗 📐 💰
Replatform Refactor Repurchase
A case study
Case study on AWS blog:
fth.link/awshpc
The workloads - Risk Rollup
🏦 Financial modeling to understand the portfolio of risk
🧠 Internal, custom-built risk model on all reinsurance deals
⚙ HPC (High-Performance Computing) workload
🗄 ~45TB data processed
⏱ 2-3 rollups per day (6-8 hours each!)
The workloads - Deal Analytics
⚡ Near real-time deal pricing using the same risk model
🗃 Lower data volumes
🔁 High frequency of execution – up to 1,000 per day
Original on-prem implementation
Challenges
🐢 Long execution times, constraining business agility
🥊 Competing workloads
📈 Limits our ability to support portfolio growth
😩 Can’t deliver new features
🧾 Very high total cost of ownership
Thinking Big
💭 Imagine a solution that would …
1. Offer a dramatic increase in performance
2. Provide consistent run times
3. Support more executions, more often
4. Support future portfolio growth and new capabilities – 15x data volumes
The Goal ⚽
Run a Risk Rollup in 1 hour!
Architecture Options for Compute/Orchestration
AWS Lambda
Amazon SQS AWS Step Functions
AWS Fargate
[Annotation, partly lost in extraction: reduce the […] to simple, scalable, event-driven components]
POC Architecture
AWS Batch
S3
Step Functions
Lambda
SQS
Measure Everything! 📏
⏱ Built metrics in from the start
AWS metrics we wish existed out of the box:
- Number of running containers
- Success/failure counts
🎨 Custom metrics:
- Scheduler overhead
- Detailed timings (job duration, I/O time, algorithm steps)
🛠 Using CloudWatch, EMF
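To make the custom-metrics idea concrete, here is a minimal sketch of a CloudWatch Embedded Metric Format (EMF) record for one job timing. The namespace, metric name, and dimension used here are hypothetical and not taken from the project. When a JSON line with this shape is written to a log stream that CloudWatch Logs ingests (for example, stdout of a Lambda function), CloudWatch extracts the metric from it.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions, timestamp_ms=None):
    """Build one CloudWatch Embedded Metric Format (EMF) log record.

    `dimensions` maps dimension names to string values. All names passed in
    by the caller are illustrative; nothing here is specific to the project.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension name.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the top level, keyed by its name.
        metric_name: value,
    }
    # Dimension values also live at the top level of the record.
    record.update(dimensions)
    return record


if __name__ == "__main__":
    # Hypothetical example: a modelling job that took 1.52 seconds.
    line = emf_record(
        "Demo/RiskRollup", "JobDuration", 1520, "Milliseconds",
        {"JobType": "modelling"},
    )
    print(json.dumps(line))
```

Emitting metrics as log lines keeps the hot path free of extra API calls, which matters when thousands of containers report timings at once.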
Measure Everything! 📏
👍 Rollup in 1 hour
☁ Running on AWS Batch
👎 Cluster utilisation was <50%
✅ Goal success
🤔 Understanding of what needs to be addressed next!
Beyond the PoC
Production: optimise for unique workload characteristics
Job Plan
In reality, not all jobs are alike!
Horizontal scaling 🚀
1000s of jobs
Duration: 1 second – 45 minutes
Scaling horizontally = splitting jobs
Jobs split according to their complexity/duration
Resulting in >1 million jobs
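As a rough illustration of splitting jobs by duration, the sketch below divides one job into equal sub-jobs so that none is expected to run longer than a target. The 60-second target, the field names, and the even split are assumptions made for the example; the slides do not describe the actual splitting rules.

```python
import math


def split_job(job_id, estimated_seconds, target_seconds=60):
    """Split one job into equal sub-jobs of at most `target_seconds` each.

    `target_seconds` is a hypothetical tuning knob, not a value from the talk.
    """
    parts = max(1, math.ceil(estimated_seconds / target_seconds))
    return [
        {
            "id": f"{job_id}/{index}",
            "part": index,
            "of": parts,
            "estimated_seconds": estimated_seconds / parts,
        }
        for index in range(parts)
    ]
```

With these assumptions a 1-second job stays whole, while a 45-minute job becomes 45 sub-jobs of about a minute each, which is how a plan with thousands of jobs can grow into a much larger number of small, evenly sized ones.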
Moving to production 🚢
Scope
Actual End to End overview
Modelling Worker
Compute Services
AWS Fargate:
● Scales to 1000s of tasks (containers)
● Little management overhead
● Up to 4 vCPUs and 30GB Memory
● Up to 200GB ephemeral storage
AWS Lambda:
● Scales to 1000s of function containers (in seconds!)
● Very little management overhead
● Up to 6 vCPUs and 10GB Memory
● Up to 10GB ephemeral storage
It wasn’t always this way!
Store all the things in S3!
The source of truth for:
● Input Data (JSON, Parquet)
● Intermediate Data (Parquet)
● Results (Parquet)
● Aggregates (Parquet)
Input data: 20GB
Output data: ~1 TB
Reads and writes: 10,000s of objects per second.
Scheduling and Orchestration
✅ We have our cluster (Fargate or Lambda)
✅ We have a plan! (list of jobs, parameters and dependencies)
🤔 How do we feed this plan to the cluster?!
🤨 Existing schedulers use traditional clusters – there is no serverless job scheduler for workloads like this!
Lifecycle of a Job
A new job gets queued here 👇
A worker picks up the job and executes it
The worker emits the job state (success or failure)
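The three steps above can be sketched with in-memory stand-ins. Here a `deque` plays the role of the job queue and a plain list plays the role of the stream of job states; both, along with the job fields and the `handlers` mapping, are hypothetical and chosen only to show the control flow.

```python
from collections import deque


def run_worker(job_queue, state_stream, handlers):
    """Process every queued job and emit exactly one state event per job."""
    while job_queue:
        job = job_queue.popleft()  # the worker picks up the job...
        try:
            handlers[job["type"]](job["params"])  # ...and executes it
            state = "success"
        except Exception:
            # Any error is reported as a job state instead of crashing the worker.
            state = "failure"
        state_stream.append({"job_id": job["id"], "state": state})
```

Reporting failures as ordinary state events, rather than letting the worker die silently, gives whatever consumes the stream a chance to decide on retries.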
Event-Driven Scheduler
Job states are pulled from a Kinesis Data Stream
Redis stores:
- Job states
- Dependencies
This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
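The dependency check described above can be sketched as follows. Plain dictionaries stand in for the state held in Redis, and the function is called once per incoming job-state event; the state names and data layout are assumptions for illustration, not the production design.

```python
def newly_runnable(job_id, new_state, states, dependencies):
    """Record a job state and return the jobs that just became runnable.

    states:       dict of job id -> "pending" | "queued" | "success" | "failure"
                  (jobs missing from the dict count as "pending")
    dependencies: dict of job id -> set of job ids that must succeed first
    """
    states[job_id] = new_state
    if new_state != "success":
        return []  # a failed job unblocks nothing
    ready = []
    for candidate, deps in dependencies.items():
        if states.get(candidate, "pending") != "pending":
            continue  # already queued or finished
        if job_id in deps and all(states.get(dep) == "success" for dep in deps):
            states[candidate] = "queued"  # mark it so it is scheduled only once
            ready.append(candidate)
    return ready
```

This version scans every job on each event, which is fine for a sketch; with more than a million jobs a reverse index from each job to its dependents would avoid the full scan.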
Dynamic Runtime Handling
We also need to handle system failures!
Outcomes 🙌
Business
● Rollup in 1 hour
● Removed limits on number of runs
● Faster, more consistent deal analytics
● Business spending more time on revenue-generating activities
● Support portfolio growth and deliver new capabilities
Technology
● Brought serverless to HPC financial modeling
● Reduced codebase by ~70%
● Lowered total cost of ownership
● Increased dev team agility
● Reduced carbon footprint
Hitting the limits 😰
S3 Throughput
S3 Partitioning
S3 cleverly detects high-throughput prefixes and creates partitions... normally
If this does not happen…
🚨 Please reduce your request rate; Status Code: 503; Error Code: SlowDown
The Solution
Explicit Partitioning:
○ Figure out how many partitions you need
○ Update code to create keys uniformly distributed over all partitions
/part/0…
/part/1…
/part/2…
/part/3…
…
/part/f…
1. Talk (a lot) to AWS SAs, Support, Account Manager for special requirements like this!
2. Think ahead if you have multiple accounts for different environments!
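One way to get keys spread uniformly over the partitions is to derive the partition from a hash of the object key. The sketch below assumes 16 partitions named with a single hex digit, matching the 0 to f range on the slide; the exact key layout after the partition (`part/<digit>/<original key>`) and the choice of SHA-256 are assumptions, since the slide elides those details.

```python
import hashlib


def partitioned_key(key, partitions=16):
    """Prefix an object key with a partition chosen from a hash of the key.

    The same key always maps to the same partition, so readers can
    recompute the full key without a lookup table.
    """
    if not 1 <= partitions <= 16:
        raise ValueError("this sketch supports 1 to 16 single-hex-digit partitions")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    partition = int.from_bytes(digest[:8], "big") % partitions
    return f"part/{partition:x}/{key}"
```

Hashing rather than using a counter or timestamp avoids hot spots when many workers write sequentially named objects at the same moment.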
Fargate Scaling
● We want to run 3000 containers ASAP
● This took > 1 hour!
● We built a custom Fargate scaler
○ Using the RunTask API (no ECS Service)
○ Hidden quota increases
○ Step Functions + Lambda
● 3000 containers in ~20 minutes
The AWS ECS team has since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes
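For a sense of what launching tasks through RunTask involves, here is a minimal sequential sketch using the boto3 ECS client. A single RunTask call starts at most 10 tasks, so 3,000 containers need at least 300 calls. The cluster name, task definition, and subnets are placeholders supplied by the caller, and this loop is not the custom scaler from the talk, which ran on Step Functions and Lambda and also depended on quota increases.

```python
def plan_run_task_batches(total_tasks, max_per_call=10):
    """Return the `count` to pass to each RunTask call to start `total_tasks`."""
    full, remainder = divmod(total_tasks, max_per_call)
    return [max_per_call] * full + ([remainder] if remainder else [])


def launch(ecs_client, cluster, task_definition, subnets, total_tasks):
    """Start Fargate tasks with repeated RunTask calls; return how many started.

    `ecs_client` is expected to behave like boto3.client("ecs").
    """
    started = 0
    for count in plan_run_task_batches(total_tasks):
        response = ecs_client.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            launchType="FARGATE",
            count=count,
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": subnets,
                    "assignPublicIp": "DISABLED",
                }
            },
        )
        # Tasks that could not be placed are reported under "failures".
        started += len(response.get("tasks", []))
    return started
```

A real scaler would also need to issue calls in parallel, back off when the API throttles, and retry the entries returned under `failures`.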
How high can we go today?
🚀 10,000 concurrent Lambda functions in seconds
🎢 10,000 Fargate containers in 10 minutes
💸 No additional cost
vladionescu.me/posts/scaling-containers-on-aws-in-2022
Wrapping up 🎁
● "Serverless supercomputer" lets you do HPC with commodity AWS compute
● Plenty of challenges, but it's doable!
● Agility and innovation benefits are massive
● Customer is now serverless-first and expert in AWS
Other interesting case studies:
☁ AWS HTC Grid - 🧬 COVID genome research
Special thanks to @eoins and @cmthorne10


