Using S3 Select to Deliver 100X
Performance Improvements
Versus the Public Cloud
Frank Wessels
CTO, MinIO
S3 Select
▪ Recent addition to S3 API
○ Offload filtering to storage
○ Formats: CSV, JSON, Parquet
▪ Advantages
○ Faster
○ Less network traffic
○ Smaller compute nodes
■ S3 Select for Spark
○ https://github.com/minio/spark-select
[Figure: applications reading from storage, before vs. after S3 SELECT. After: up to 400% faster, up to 80% cheaper]
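To make "offload filtering to storage" concrete, here is a minimal sketch of what an S3 Select request looks like from a client. It only builds the keyword arguments that boto3's `select_object_content` call accepts; the bucket name, the `Make` column and the `build_select_request` helper are assumptions made for illustration and do not come from the slides.

```python
from typing import Any, Dict


def build_select_request(bucket: str, key: str, expression: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's S3 client select_object_content().

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Treat the first CSV line as a header so columns can be used by name.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


# Assumed names: "demo-bucket" and the "Make" column are placeholders.
request = build_select_request(
    bucket="demo-bucket",
    key="parking-citations.csv",
    expression="SELECT s.Make FROM S3Object s WHERE s.Make = 'HOND'",
)
print(request["Expression"])

# Sending it would look roughly like this (needs boto3 and a reachable server):
#   client = boto3.client("s3", endpoint_url="http://localhost:9000")
#   response = client.select_object_content(**request)
#   for event in response["Payload"]:
#       if "Records" in event:
#           print(event["Records"]["Payload"].decode())
```

Only the rows and columns that match the expression come back over the network, which is where the "less network traffic" advantage comes from.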
Introduction to MinIO
MinIO is a high performance, distributed object storage server,
designed for peta-scale data infrastructure.
▪ S3-Compatible
▪ Scalable
▪ Performant
▪ Simple
▪ Optimized for Intel/ARM/Power9 CPUs
Global Scale
Focus on Performance
S3 Select Performance on AWS
Format     Time (s)   Records    Throughput
csv        5.46       733K/s     94 MB/s
json       14.28      280K/s     98 MB/s
parquet    32.25      124K/s     4.3 MB/s
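As a quick cross-check of the table, multiplying each time by its record rate gives about 4.0M records for every format. That suggests the same data set was queried in all three formats, though this is an inference and not something the slide states. The sketch below does that arithmetic, and also multiplies time by throughput to get the implied volume, assuming the throughput column is measured in megabytes of input per second.

```python
# Figures copied from the table: format -> (time in s, records/s, MB/s).
rows = {
    "csv": (5.46, 733_000, 94.0),
    "json": (14.28, 280_000, 98.0),
    "parquet": (32.25, 124_000, 4.3),
}


def implied_totals(time_s: float, records_per_s: float, mb_per_s: float):
    """Return (records, megabytes) implied by one time and two rates."""
    return time_s * records_per_s, time_s * mb_per_s


totals = {fmt: implied_totals(*values) for fmt, values in rows.items()}

for fmt, (records, megabytes) in totals.items():
    print(f"{fmt:8s} ~{records / 1e6:.1f}M records, ~{megabytes:.0f} MB")
```

Read this way, the low MB/s figure for Parquet mostly reflects its more compact encoding; the record rate is the fairer comparison between formats.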
Accelerating S3 Select on minio
[Diagram: stages of an S3 Select query]
▪ Evaluation (“where”)
▪ Processing (“select”)
▪ Input, by format
○ CSV: Parsing
○ JSON: Parsing
○ Parquet: Loading
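The stages named on this slide can be sketched as a small pipeline. This is a toy model in Python that assumes the data flows from parsing to evaluation to processing; it is not minio's code, and the sample data and column names are made up for illustration.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]


def parse_csv(raw: str) -> Iterator[Row]:
    """Parsing: turn raw CSV text into rows keyed by header name."""
    return csv.DictReader(io.StringIO(raw))


def evaluate(rows: Iterable[Row], where: Callable[[Row], bool]) -> Iterator[Row]:
    """Evaluation ("where"): keep only the rows that match the predicate."""
    return (row for row in rows if where(row))


def process(rows: Iterable[Row], columns: List[str]) -> Iterator[List[str]]:
    """Processing ("select"): project the requested columns."""
    return ([row[name] for name in columns] for row in rows)


# Hypothetical sample data.
raw = "ticket,make,fine\n1,HOND,50\n2,TOYT,25\n3,HOND,73\n"

result = list(
    process(evaluate(parse_csv(raw), lambda r: r["make"] == "HOND"),
            ["ticket", "fine"])
)
print(result)  # [['1', '50'], ['3', '73']]
```

Every row passes through parsing, while only matching rows reach the later stages, which is one reason the parsing stage is a natural first target for optimisation.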
First 10X Acceleration: Zero Copy
▪ Manage memory allocations: garbage collected vs. non-garbage collected
Source: https://bitbucket.org/ewanhiggs/csv-game
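The slide does not show how copies are avoided, so the following is only a minimal sketch of the general zero-copy idea, written in Python (minio itself is written in Go). Instead of allocating a new string per field, the parser records (start, end) offsets and hands out views into the original buffer. The `field_spans` helper and the sample line are assumptions, and quoting and embedded newlines are deliberately ignored.

```python
from typing import List, Tuple


def field_spans(buf: bytes, delim: int = ord(",")) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of each field without copying any bytes."""
    spans = []
    start = 0
    for i, byte in enumerate(buf):
        if byte == delim:
            spans.append((start, i))
            start = i + 1
    spans.append((start, len(buf)))  # last field runs to the end of the line
    return spans


line = b"1103341116,2015-12-21,HOND,CA"  # hypothetical record
view = memoryview(line)
spans = field_spans(line)

# Slicing a memoryview yields another view onto the same buffer, not a copy.
start, end = spans[2]
make = view[start:end]
print(make == b"HOND", make.obj is line)
```

A comparison such as `make == b"HOND"` reads straight from the original buffer; a copy is only made if the field is explicitly materialised, for example with `bytes(make)`.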
Second 10X Acceleration: SIMD
▪ SIMD = Single Instruction Multiple Data
○ Intel: AVX2
▪ Process 32 bytes in parallel
○ delimiter / separator detection
○ bitmap handling & parsing
○ string compares
[Chart: performance (single core)]
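The "delimiter detection" and "bitmap handling" bullets can be illustrated with a scalar emulation. With AVX2, one byte-wise compare followed by a movemask turns a 32-byte chunk into a 32-bit bitmap with one bit per delimiter. The Python sketch below builds the same bitmap one byte at a time, so it mirrors the data layout only and not the speed; the function names are hypothetical.

```python
from typing import List

CHUNK = 32  # bytes per AVX2 register


def delimiter_bitmap(chunk: bytes, delim: int) -> int:
    """Bit i is set when chunk[i] equals the delimiter (emulates compare + movemask)."""
    mask = 0
    for i, byte in enumerate(chunk):
        if byte == delim:
            mask |= 1 << i
    return mask


def set_bit_positions(mask: int) -> List[int]:
    """Walk the bitmap from the lowest set bit upwards."""
    positions = []
    while mask:
        lowest = mask & -mask            # isolate the lowest set bit
        positions.append(lowest.bit_length() - 1)
        mask ^= lowest                   # clear it and continue
    return positions


def find_delimiters(data: bytes, delim: int = ord(",")) -> List[int]:
    """Return the offsets of every delimiter, scanning 32 bytes per step."""
    offsets = []
    for base in range(0, len(data), CHUNK):
        mask = delimiter_bitmap(data[base:base + CHUNK], delim)
        offsets.extend(base + p for p in set_bit_positions(mask))
    return offsets


data = b"a,bb,ccc" + b"x" * 30 + b",z"
print(find_delimiters(data))  # [1, 4, 38]
```

In a real SIMD implementation the inner loop of `delimiter_bitmap` collapses into a couple of instructions per chunk, and the bitmap walk only does work for bytes that are actually delimiters.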
Results using select-simd
▪ Same queries as before
○ minio with select-simd vs AWS S3
Demo
■ Source data
○ parking-citations.csv (25M rows / 3.5 GB)
■ AWS region
○ us-east-1
■ minio with select-simd-integration branch running on a single instance: c5.2xlarge (8 vCPUs)
■ mc client running in same region on c5.large instance
Status and what’s next
▪ Works in progress
○ Initial focus on CSV
▪ Next: add support for
○ Parquet
○ JSON: https://github.com/lemire/simdjson
▪ Investigate AVX-512
○ erasure coding
▫ AVX-512 4x speedup over AVX2
○ k-registers are great / 2KB on-core register space
▪ Dynamic code generation (think LLVM)
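The "dynamic code generation" item is only named on the slide. As a toy analogue, and not LLVM or anything minio is confirmed to do, the Python sketch below generates source code for a predicate specialised to one column and one constant, compiles it once, and reuses it for every row. The `compile_equals` helper and the sample rows are assumptions.

```python
from typing import Callable, List


def compile_equals(column_index: int, constant: str) -> Callable[[List[str]], bool]:
    """Generate and compile a predicate specialised for one column and constant."""
    source = (
        "def predicate(row):\n"
        f"    return row[{column_index}] == {constant!r}\n"
    )
    namespace: dict = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["predicate"]


# Hypothetical rows: [ticket, make].
is_honda = compile_equals(1, "HOND")
rows = [["1", "HOND"], ["2", "TOYT"], ["3", "HOND"]]
matches = [row for row in rows if is_honda(row)]
print(matches)
```

The point of the technique is that the query is analysed once up front, so the per-row work shrinks to the specialised comparison.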
High performance object storage
Power9 CPUs
PCIe Gen4
24x NVMe
Dual Mellanox CX5 (4x100 GbE)
S3 Select benefits for Spark
▪ Benefits
○ Faster queries
○ Less network traffic
○ Smaller compute needs
▪ Stay tuned for overall impact
○ S3 “plain” vs S3 Select
○ minio/simd-select vs AWS S3 Select
Questions?
Visit our booth #509
@minio
https://github.com/minio/minio
https://slack.minio.io
https://minio.io


