✅ Managing Terabytes of Data with Amazon S3.pdf

Managing Terabytes of Data with Amazon S3
Dhaval Nagar
AWS Hero, AWS SME
TRACK 01 - MODERN APPS - SERVERLESS

Introduction
Dhaval Nagar
● Founder @ APPGAMBIT, AWS Consulting Partner
● 12x AWS Certified
● AWS Hero (since 2020)
● AWS Certification SME
● AWS Surat User Group Lead
● Practicing Barista

Agenda
● Amazon S3 - Storage Powerhouse
● Use Case
● Storage Optimisation
● Storage Cost Breakdown
● Key Learnings

Amazon S3 - Storage Powerhouse
● One of the earliest and oldest AWS services, released in 2006
● Managed Object Storage Service
● Designed to be used over HTTPS APIs to do simple file operations (Simple Storage Service)
● Works as a backbone for many AWS Services
● Dozens of features, including different Storage Tiers, Versioning, Web Hosting, In-Place Data Queries, Event Notifications, etc.
● Stores over 100 trillion objects and serves millions of requests per second
● Famously known for 11 9’s of Durability (99.999999999%)

Use Case
● Application captures and processes Large 3D Video Files
● Requires Multi-tenant Storage with current data of over 150TB
● Amazon S3 is used to save the raw data files before processing
● Raw files are fewer in quantity but quite large in size
● Raw files are then processed to generate output files
● Output files are smaller in size but millions in quantity
● Key Objectives:
○ Achieve optimal storage organization based on the processing workflow
○ Generate Per-Tenant Storage cost breakdown

Use Case - Status
● Job Scheduled (Auto / Manual)
● Video Files Uploaded
● Processed Files Validated (Auto / Manual)
● Original Video is ready for Archival

Storage Optimisation for Raw Files
● Amazon S3 offers many storage classes, such as Standard, Standard Infrequent Access, One Zone Infrequent Access, and several Cold Storage classes in Glacier
● Each tier maps to a real-world access pattern, so data can be kept in the storage type that fits it
● Standard is the most common and the most expensive tier
● Enabling the Intelligent Tier is a common starting point. However, it is not the optimal choice in all conditions.
● Available storage classes:
○ S3 Standard
○ S3 Intelligent-Tiering
○ S3 Standard-IA
○ S3 One Zone-IA
○ S3 Glacier Instant Retrieval, Flexible Retrieval and Deep Archive

Intelligent Tier
● Intelligent Tier is perfect for unpredictable usage patterns
● Internally uses multiple access tiers and automatically moves objects between them based on how long each object has gone without being accessed

Intelligent Tier Internal Transitions
● Frequent Access - same as the Standard tier. This is the default tier when the object storage class is set to Intelligent Tier.
● Infrequent Access - same as the Standard Infrequent Access tier. Applied when an object is not accessed for 30 consecutive days.
● Archive Instant Access - same as Glacier Instant. The object is archived if it is not accessed for 90 consecutive days.
● Archive Access - same as Glacier Flexible. This is an optional transition.
● Deep Archive - same as Glacier Deep. This is an optional transition.
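
The automatic transitions listed above can be sketched as a small lookup. This is only an illustration of the 30-day and 90-day thresholds from the slide; the function name is hypothetical, and the two optional archive tiers are left out because their thresholds are not stated here.

```python
def intelligent_tier_for(days_since_last_access: int) -> str:
    """Return the automatic Intelligent Tier access tier for an object.

    Thresholds (30 and 90 consecutive days without access) are taken from
    the slide. The optional Archive Access and Deep Archive transitions
    are intentionally not modelled.
    """
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access >= 90:
        return "Archive Instant Access"
    if days_since_last_access >= 30:
        return "Infrequent Access"
    return "Frequent Access"


if __name__ == "__main__":
    for days in (0, 29, 30, 89, 90):
        print(days, intelligent_tier_for(days))
```

Accessing an object resets the count, so in this sketch a recently read object would be evaluated with `days_since_last_access=0`.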

Reconsider Intelligent Tier for Predictable Flow
● With Intelligent Tier, an object must go 30 days without access before it moves to the Infrequent Access tier
● With a manual Storage Class update, the object can be switched to Infrequent Access directly, avoiding 30 days of Standard cost. Infrequent Access is about 50% cheaper.
● With an average of 10 TB of monthly uploads, the Standard tier costs around $230 per month for new data
● Intelligent Tier flow: File Uploaded (Day 1) → File Processed (Day 3-5) → File Unused (Day N+30) → Move to IA
● Manual flow: File Uploaded (Day 1) → File Processed (Day 3-5) → Move to IA
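
A rough worked example of the saving, using only the figures from this slide: about $230 per month for 10 TB in Standard (which implies roughly $0.023 per GB-month with 1 TB counted as 1000 GB) and Infrequent Access at about 50% of that. The price constants and function names are assumptions for illustration; retrieval fees, request fees and minimum storage duration charges are ignored.

```python
GB_PER_TB = 1000                 # decimal TB, consistent with $230 for 10 TB
STANDARD_PRICE_PER_GB = 0.023    # assumed USD per GB-month, implied by the slide
IA_DISCOUNT = 0.5                # "about 50% cheaper", per the slide


def monthly_cost(tb: float, price_per_gb: float) -> float:
    """Storage cost in USD for one full month at a flat per-GB price."""
    return tb * GB_PER_TB * price_per_gb


def first_month_cost(tb: float, days_in_standard: int, days_in_month: int = 30) -> float:
    """Cost of the first month when data spends some days in Standard
    and the remaining days in Infrequent Access."""
    standard_days = min(days_in_standard, days_in_month)
    ia_days = days_in_month - standard_days
    ia_price = STANDARD_PRICE_PER_GB * (1 - IA_DISCOUNT)
    standard_part = monthly_cost(tb, STANDARD_PRICE_PER_GB) * standard_days / days_in_month
    ia_part = monthly_cost(tb, ia_price) * ia_days / days_in_month
    return round(standard_part + ia_part, 2)


if __name__ == "__main__":
    # Waiting the full 30 days versus moving to IA after processing on day 5.
    print("wait 30 days:", first_month_cost(10, 30))
    print("move on day 5:", first_month_cost(10, 5))
```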

● Output files are configured to use the Intelligent Tier
● The consumption of the processed images is infrequent and unpredictable
● Intelligent Tier gives the best of both worlds

Storage Tier Decision Tree
(CloudHealth by VMware)

● Standard Infrequent Access is another possible storage option
● It is about 50% cheaper than the Standard tier
● However, it has a 30-day minimum storage duration: if objects in the Infrequent Access tier are moved to another tier or deleted within the first 30 days, AWS still charges a prorated cost for the remaining days
● Again, for us the access pattern was not entirely infrequent
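
The 30-day minimum charge can be sketched as follows. The per-GB price is passed in by the caller, and the function names are hypothetical; this is an approximation of the billing rule described above, not an official formula.

```python
MIN_STORAGE_DAYS = 30  # minimum storage duration for Standard Infrequent Access


def billable_days(days_stored: int) -> int:
    """Days billed for an object, never fewer than the 30-day minimum."""
    return max(days_stored, MIN_STORAGE_DAYS)


def early_removal_charge(size_gb: float, days_stored: int,
                         price_per_gb_month: float, days_in_month: int = 30) -> float:
    """Prorated charge for the unused part of the 30-day minimum when an
    object is deleted or moved early. Returns 0.0 after the minimum."""
    remaining_days = max(MIN_STORAGE_DAYS - days_stored, 0)
    return size_gb * price_per_gb_month * remaining_days / days_in_month


if __name__ == "__main__":
    # Example: 100 GB removed after 10 days, with an assumed price of
    # $0.0125 per GB-month (illustrative value, check current pricing).
    print(billable_days(10))
    print(early_removal_charge(100, 10, 0.0125))
```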

Current State
● By applying multiple transition and calculation strategies, we were able to reduce the monthly storage cost by around 40%
● Current strategies are aligned with the business and operation flow, but they can change from time to time
● Schedule Periodic Optimisation Exercises - Not Too Early, Not Too Late

Per-Tenant Storage Breakdown
● One bucket per customer is not a scalable solution (not practical for a SaaS company)
● S3 has a hard limit of 1,000 buckets per account
● Instead, a single Storage Bucket holds one prefix per tenant:
○ /tenant1/**
○ /tenant2/**
○ /tenantN/**

Cost Analysis for Cost Analysis
● AWS provides tags to organize account resources, and cost allocation tags to track your AWS costs on a detailed level
● S3 allows tags to be set on Buckets as well as on individual objects inside the bucket
● Cost Allocation Tags are the most efficient way to create logical partitioning within the resources
● However, Tags have an additional cost

● S3 allows tags to be set on individual objects
● For millions of objects, setting multiple tags was not a desirable solution
● While the cost is not huge, we needed to run additional automation on objects
● So we decided to build and store Object Metadata in CSV files and DynamoDB
● Tag pricing: $0.01 for 10,000 Tags, and we have millions of objects

● As a large number of objects are located in the Intelligent Tier, using the LIST API is very cheap
● Metadata storage helps to build a Cost Breakdown Per Tenant with Per Storage Category
○ Count of Raw Files, Size and Cost
○ Count of Output Files, Size and Cost
● Helped to create granular cost reports, such as Per Month or Per Project
● Tags: $0.01 for 10,000 Tags Per Month
● LIST Object API: $0.0055 for 1000 requests. Each API call can return a maximum of 1000 object entries.
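
To put the two prices side by side, here is a back-of-the-envelope comparison using only the figures quoted on this slide. The object count and tags-per-object values in the example are made up for illustration, and the function names are hypothetical.

```python
import math

TAG_PRICE_PER_10K = 0.01         # USD per 10,000 tags per month (from the slide)
LIST_PRICE_PER_1K_CALLS = 0.0055 # USD per 1000 LIST requests (from the slide)
MAX_KEYS_PER_CALL = 1000         # entries returned per LIST call (from the slide)


def monthly_tag_cost(object_count: int, tags_per_object: int) -> float:
    """Recurring monthly cost of keeping tags on every object."""
    return object_count * tags_per_object / 10_000 * TAG_PRICE_PER_10K


def full_listing_cost(object_count: int) -> float:
    """One-off cost of listing every object once to rebuild metadata."""
    calls = math.ceil(object_count / MAX_KEYS_PER_CALL)
    return calls / 1000 * LIST_PRICE_PER_1K_CALLS


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical object count
    print("tags per month:", monthly_tag_cost(objects, tags_per_object=2))
    print("one full listing:", full_listing_cost(objects))
```

The tag cost recurs every month, while the listing cost is only paid when the metadata is rebuilt.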

● Listing Objects is not a frequent operation. It is only required when we need to re-build the Metadata
● This is a very tiny optimisation, but the resulting Metadata makes many different analyses easier
● This only includes the Storage cost. No Data Transfer cost is calculated
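
A minimal sketch of how listing results could be rolled up into a per-tenant breakdown. It assumes each page is a dictionary shaped like a ListObjectsV2 response (a `Contents` list whose entries carry `Key`, `Size` and `StorageClass`) and that the first path segment of the key is the tenant, as in the prefix layout shown earlier. The function name and sample data are hypothetical, and this is not necessarily how the talk's own tooling works.

```python
from collections import defaultdict


def aggregate_by_tenant(pages):
    """Sum object count and bytes per (tenant, storage class).

    `pages` is any iterable of ListObjectsV2-style response dictionaries.
    """
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for page in pages:
        # A page for an empty prefix has no "Contents" key at all.
        for obj in page.get("Contents", []):
            # Tolerate a leading slash, then take the first path segment.
            tenant = obj["Key"].lstrip("/").split("/", 1)[0]
            storage_class = obj.get("StorageClass", "STANDARD")
            entry = totals[(tenant, storage_class)]
            entry["count"] += 1
            entry["bytes"] += obj["Size"]
    return dict(totals)


if __name__ == "__main__":
    sample_pages = [
        {"Contents": [
            {"Key": "tenant1/raw/a.bin", "Size": 500, "StorageClass": "STANDARD_IA"},
            {"Key": "tenant1/out/1.png", "Size": 20, "StorageClass": "INTELLIGENT_TIERING"},
        ]},
        {"Contents": [
            {"Key": "tenant2/out/1.png", "Size": 30, "StorageClass": "INTELLIGENT_TIERING"},
        ]},
    ]
    for key, value in aggregate_by_tenant(sample_pages).items():
        print(key, value)
```

With boto3, the pages could come from `boto3.client("s3").get_paginator("list_objects_v2").paginate(Bucket="example-bucket")`, where the bucket name is a placeholder. The totals can then be written to CSV or DynamoDB and multiplied by the per-class price to get a cost per tenant.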

Key Learnings
● Understand the flow and state transition of your data
● Do periodic cost and operational analysis
● At scale, small optimisations result in big savings
● There will always be an opportunity to optimise, in cost or operations
● S3 is by far one of the most reliable AWS services - it was designed for ultimate Performance, Scale and Reliability

Thank You
Request to share feedback and join AWS User Groups
