Modernizing Your Data Platform
for Analytics and AI
Across a single cloud, hybrid cloud or multi-cloud
About Me
ALLUXIO 2
Product Management, Alluxio, Inc.
PMC member, Alluxio Open Source Project
MS from Carnegie Mellon University
BS from Indian Institute of Technology - Delhi
Adit Madan
DATA STACK JOURNEY AND INNOVATION PATHS
Co-located
compute & HDFS
on the same cluster
Disaggregated
compute & HDFS
on separate clusters
MR / Hive
HDFS
Hive
HDFS
Disaggregated
Burst HDFS data in
the cloud,
public or private
Support Presto, Spark,
TensorFlow, PyTorch
across DCs without
app changes
Enable & accelerate
big data on
object stores
Transition to Object store
HDFS for Hybrid Cloud
Support more frameworks
▪ Typically compute-bound
clusters over 100% capacity
▪ Compute & I/O need to be
scaled together even when
not needed
▪ Compute & I/O can be
scaled independently, but I/O
is still needed on HDFS, which
is expensive
Data Orchestration for
Analytics & AI in the Cloud
DATA ACCESSIBILITY
Access any storage using any compute
BRING DATA CLOSER TO COMPUTE ACROSS SILOS
Access-based data movement for compute and storage spread across environments
[Diagram labels: Region A, Region B, private data centers, Datacenter 1, Datacenter 2, Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive]
DATA ORCHESTRATION WITHIN A SINGLE DATACENTER
OR CLOUD REGION
Consistent SLAs, Performance, and Cost
Savings on cloud storage
CASE 01: CLOUD CASE 02: ON PREM
PUBLIC CLOUD
TensorFlow
Alluxio
Speed up analytics on on-prem
object stores
ON PREMISE
Spark
Alluxio
DATA ORCHESTRATION ACROSS DATACENTERS
Burst compute to a public cloud
and gradually migrate
CASE 03: HYBRID
Hive
Alluxio
PUBLIC CLOUD
ON PREMISE
Hybrid Cloud Gateway to utilize
on-prem compute for data in the cloud
CASE 04: HYBRID
Alluxio
PyTorch
PUBLIC CLOUD
ON PREMISE
Cross Datacenter Access without
changing Ingest Pipeline across regions
CASE 05: MULTI-DATACENTER
Presto
Alluxio
DATACENTER 1
DATACENTER 2
INGESTION
Alluxio - Key Innovations
Performance acceleration with
efficient representation and
caching of data close to compute
EFFICIENT ACCESS &
DATA LOCALITY
Orchestrate a data platform with
agility across regions for private,
hybrid or multi-cloud with
policy-based data management
ENVIRONMENT AGNOSTIC
DATA MANAGEMENT
Support multiple APIs for
analytics and AI with storage
abstraction and streamlined data
movement across the pipeline
UNIFY DATA LAKES
Unified
namespace
Mount HDFS and object
storage into a common
Alluxio cluster
1
Object store
analytics
Caching layer to speed up
Presto and Spark Jobs
2
Hybrid-cloud
Burst Compute to a single
public cloud first
Run managed Hadoop,
K8s and cloud-native AI for
model training
3
Multi-cloud
Replicate setup on AWS to
Google Cloud
Choose the right tool for the
job, regardless of the cloud
provider
4
EXAMPLE JOURNEY 01
On-premises HDFS to Object Storage to Hybrid Cloud
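Step 1 of this journey, mounting HDFS and object storage into one namespace, can be pictured as a longest-prefix mount table. The sketch below is only an illustration in plain Python, not Alluxio's implementation or API; the class names, host names, and bucket name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mount:
    alluxio_prefix: str  # path inside the unified namespace
    ufs_uri: str         # backing storage URI (hypothetical values below)


class MountTable:
    """Toy longest-prefix mount table; an assumption, not Alluxio code."""

    def __init__(self):
        self._mounts = []

    def mount(self, alluxio_prefix, ufs_uri):
        self._mounts.append(Mount(alluxio_prefix.rstrip("/"), ufs_uri.rstrip("/")))

    def resolve(self, path):
        # Choose the longest mount prefix that covers the requested path.
        best = None
        for m in self._mounts:
            covers = path == m.alluxio_prefix or path.startswith(m.alluxio_prefix + "/")
            if covers and (best is None or len(m.alluxio_prefix) > len(best.alluxio_prefix)):
                best = m
        if best is None:
            raise KeyError(f"no mount covers {path}")
        return best.ufs_uri + path[len(best.alluxio_prefix):]


table = MountTable()
table.mount("/warehouse", "hdfs://namenode:8020/warehouse")
table.mount("/warehouse/archive", "s3://example-bucket/archive")

print(table.resolve("/warehouse/sales/part-0"))
print(table.resolve("/warehouse/archive/2020/part-0"))
```

Applications see a single path tree, while the nested mount decides which storage system actually holds each file.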
EXAMPLE JOURNEY 01
On-premises object storage as the source of truth
[Diagram labels: Region A, Region B, private data centers, Datacenter 2, Amazon EMR, Cloud Dataproc, Kubernetes Engine, Compute Engine, Hive, Ingestion, ETL]
Burst analytics
in the cloud
Presto and Alluxio in the cloud
accessing on-prem HDFS and
cloud storage
1
EXAMPLE JOURNEY 02
Hybrid Cloud to Multi Cloud
Efficient data caching
and representation
High availability &
modernization
Data replication across HDFS
clusters in different data
centers
2
Seamless data
synchronization
Multi-cloud with
Azure and AWS
Storage abstraction
regardless of cloud provider
3
Multi-cloud fabric
Why data orchestration?
With data pinning capability
for cache control
For always active data across
the data pipeline
Abstraction for infrastructure
spanning multiple data centers
and clouds
EXAMPLE JOURNEY 02
Global data platform for analytics & AI built on data and container orchestration
[Diagram labels: Region A, Region B, private data centers, Datacenter 1, Datacenter 2, Hive, Ingestion, ETL]
Enabling a Hybrid Data Lake
Core Features
DATA LOCALITY
Local performance for remote data with intelligent multi-tiering
Hot Warm Cold
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
On-premises
Public Cloud
Model
Training
Big Data ETL
Big Data Query
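To make the multi-tiering idea concrete, here is a minimal sketch of a hot/warm/cold cache with pinning, promotion on read, demotion on overflow, and TTL expiry. It is a toy model written for this summary: the class, its methods, and the least-recently-used demotion rule are assumptions, not Alluxio's actual policies or API.

```python
import time


class TieredCache:
    """Toy hot/warm/cold cache with pinning, TTL, promotion and demotion."""

    TIERS = ("RAM", "SSD", "HDD")  # ordered hot -> cold, as on the slide

    def __init__(self, capacity_per_tier, clock=time.monotonic):
        self.capacity = capacity_per_tier
        self.clock = clock
        # Each tier maps key -> expiry time (or None); dict order doubles as LRU order.
        self.tiers = {name: {} for name in self.TIERS}
        self.pinned = set()

    def tier_of(self, key):
        for name in self.TIERS:
            if key in self.tiers[name]:
                return name
        return None

    def pin(self, key):
        self.pinned.add(key)  # pinned keys are never chosen for demotion

    def put(self, key, ttl=None):
        for tier in self.tiers.values():
            tier.pop(key, None)
        expiry = None if ttl is None else self.clock() + ttl
        self._insert(0, key, expiry)

    def read(self, key):
        name = self.tier_of(key)
        if name is None:
            return None
        expiry = self.tiers[name].pop(key)
        if expiry is not None and self.clock() >= expiry:
            self.pinned.discard(key)
            return None  # TTL elapsed: treat as a miss
        self._insert(0, key, expiry)  # promote to the hottest tier
        return name  # tier that served the read

    def _insert(self, index, key, expiry):
        tier = self.tiers[self.TIERS[index]]
        tier[key] = expiry
        while len(tier) > self.capacity:
            victim = next((k for k in tier if k not in self.pinned), None)
            if victim is None:
                break  # everything here is pinned; let the tier overflow
            victim_expiry = tier.pop(victim)
            if index + 1 < len(self.TIERS):
                self._insert(index + 1, victim, victim_expiry)  # demote
            # otherwise the victim falls off the coldest tier
```

With a capacity of one entry per tier, writing three keys in a row leaves the newest in RAM and pushes the older two down to SSD and HDD; reading the coldest key promotes it back to RAM.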
METADATA LOCALITY
Synchronization of changes across clusters
Old File at path
/file1 ->
New File at path
/file1 ->
Alluxio Master
Policies for pinning,
promotion/demotion, TTL
Metadata Synchronization
Mutation
On-premises
Public Cloud
Model
Training
Big Data ETL
Big Data Query
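One way to picture the "old file vs. new file at /file1" case is a fingerprint comparison between cached metadata and the current state of the underlying store. The sketch below assumes a fingerprint of (length, modification time) and a hypothetical `sync_metadata` helper; the slide does not describe how Alluxio detects mutations, so treat this purely as an illustration.

```python
def sync_metadata(cached, under_store):
    """Compare cached fingerprints with the under store.

    Both arguments map path -> (length_bytes, mtime_seconds).
    Returns (updated_cache, stale_paths), where stale_paths are entries
    whose cached data should be invalidated because the file was mutated
    or deleted outside the cache.
    """
    stale = set()
    updated = {}
    for path, fingerprint in under_store.items():
        if path in cached and cached[path] != fingerprint:
            stale.add(path)  # same path, different content: mutation
        updated[path] = fingerprint
    for path in cached:
        if path not in under_store:
            stale.add(path)  # removed from the under store
    return updated, stale


cached = {"/file1": (100, 1000), "/file2": (50, 900)}
under_store = {"/file1": (120, 2000), "/file3": (10, 2100)}
new_cache, stale = sync_metadata(cached, under_store)
print(sorted(stale))
```

Here `/file1` is flagged because its fingerprint changed, `/file2` because it disappeared, and `/file3` is simply picked up as new metadata.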
ASYNCHRONOUS DATA OPERATIONS
Data pre-loading and fast durable writes
Distributed Load
Alluxio Data Orchestration and Control
Service
Preload Cache
File A
File B
File C
(3 replicas, 3 blocks) / file
File A
(1 replica, 3 blocks)
Async write
Fast Durable Write
Alluxio Data Orchestration and Control
Service
File D
(3 replicas, 3 blocks) / file
File D
(3 replicas, 3 blocks until HDFS write completed)
(1 replica, 3 blocks) Tmp files not written to HDFS
.staging
.tmp
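The fast durable write flow on this slide can be sketched as a small simulation: a new file is held with three cache replicas until the asynchronous HDFS write completes, then drops to one replica, while temporary files (`.staging`, `.tmp`) keep a single replica and are never persisted. The class and method names below are hypothetical, and the replica counts are taken from the slide labels rather than from any Alluxio configuration.

```python
from collections import deque


class AsyncWriteSim:
    """Toy model of async persistence; not Alluxio's API."""

    def __init__(self, durable_replicas=3, steady_replicas=1,
                 temp_suffixes=(".staging", ".tmp")):
        self.durable = durable_replicas
        self.steady = steady_replicas
        self.temp_suffixes = temp_suffixes
        self.replicas = {}      # path -> number of cache replicas
        self.pending = deque()  # files waiting for the under-store write
        self.under_store = set()

    def _is_temp(self, path):
        # A path is temporary if any component ends with a temp suffix.
        return any(part.endswith(suffix)
                   for part in path.split("/")
                   for suffix in self.temp_suffixes)

    def write(self, path):
        if self._is_temp(path):
            self.replicas[path] = self.steady  # never queued for persistence
            return
        self.replicas[path] = self.durable     # extra replicas cover the gap
        self.pending.append(path)

    def persist_next(self):
        """Simulate completion of one asynchronous under-store write."""
        path = self.pending.popleft()
        self.under_store.add(path)
        self.replicas[path] = self.steady      # durability now comes from HDFS
        return path


sim = AsyncWriteSim()
sim.write("/out/part-0")
sim.write("/out/.staging/job1/part-0")
sim.persist_next()
print(sim.replicas, sim.under_store)
```

The write call returns as soon as the cache replicas exist, which is what makes the write fast; the extra replicas are the durability guarantee until the slower HDFS write lands.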
POLICY DRIVEN DATA MANAGEMENT
Unified namespace for live data migration
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
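The example policy above can be sketched as a simple planning function: pick every file that currently lives on HDFS and has not been modified for more than 7 days, then mark it as moved to S3. The catalog layout, store labels, and function names are assumptions made for illustration; the slide does not show how policies are actually declared.

```python
from datetime import datetime, timedelta, timezone


def plan_migration(files, now, max_age=timedelta(days=7)):
    """Return the paths that should move from HDFS to S3.

    `files` maps path -> (store, last_modified), where store is "hdfs" or "s3".
    """
    return sorted(path for path, (store, modified) in files.items()
                  if store == "hdfs" and now - modified > max_age)


def apply_migration(files, paths):
    """Return a new catalog in which the planned paths are backed by S3."""
    moved = dict(files)
    for path in paths:
        _, modified = moved[path]
        moved[path] = ("s3", modified)
    return moved


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
files = {
    "/reports/2024-01-01.parquet": ("hdfs", now - timedelta(days=14)),
    "/reports/2024-01-14.parquet": ("hdfs", now - timedelta(days=1)),
    "/sales/2023-12-01.parquet": ("s3", now - timedelta(days=45)),
}
plan = plan_migration(files, now)
files_after = apply_migration(files, plan)
print(plan)
```

Because the Alluxio path stays the same before and after the move, readers of `/reports/...` would not need to change anything when the backing store switches.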
SEAMLESS CATALOG DEFINITIONS
No table redefinitions required using “Transparent URI”
Example Scenario
I. Initial state
A. Data in HDFS
B. Hive Metastore table definitions pointing to HDFS
II. Compute cluster with Alluxio
A. Catalog points to Hive Metastore
B. Alluxio intercepts Presto calls to HDFS
III. Query execution
A. Accesses to HDFS are served by Alluxio
B. No manual data copies or application re-writes
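The scenario above can be illustrated with a small read-through shim: table locations keep their original `hdfs://` URIs, and a client-side layer decides by scheme and authority whether to serve the read from a local cache. This is a plain-Python sketch under stated assumptions; `TransparentReader`, the `ns` authority, and the in-memory "HDFS" are hypothetical and do not represent the real Transparent URI mechanism.

```python
from urllib.parse import urlparse


class TransparentReader:
    """Toy read-through cache keyed by the unchanged hdfs:// URI."""

    def __init__(self, read_remote, intercepted_authorities):
        self._read_remote = read_remote            # function: uri -> bytes
        self._intercepted = set(intercepted_authorities)
        self.cache = {}
        self.remote_reads = 0

    def _fetch(self, uri):
        self.remote_reads += 1
        return self._read_remote(uri)

    def read(self, uri):
        parsed = urlparse(uri)
        if parsed.scheme != "hdfs" or parsed.netloc not in self._intercepted:
            return self._fetch(uri)                # pass through untouched
        if uri not in self.cache:
            self.cache[uri] = self._fetch(uri)     # first access goes on-prem
        return self.cache[uri]                     # later accesses stay local


# Stand-in for on-premises HDFS; contents are made up for the example.
fake_hdfs = {"hdfs://ns/table/part-0": b"rows"}
reader = TransparentReader(fake_hdfs.__getitem__, {"ns"})

first = reader.read("hdfs://ns/table/part-0")
second = reader.read("hdfs://ns/table/part-0")
print(reader.remote_reads)
```

The caller passes the same URI that the Hive Metastore stores, which is why no table redefinition is needed in this model.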
[Diagram: in the public cloud, Presto (Presto Catalog, Hive Connector, hdfs://ns/table) with Alluxio; on-premises, Hive Metastore and HDFS; arrows labeled I., II., III. correspond to the scenario steps]
Open Source Started From UC Berkeley AMPLab
1000+ contributors &
growing
4000+ Git Stars
Apache 2.0 Licensed
Million+ Downloads
GitHub’s Top 100 Most
Valuable Repositories
Out of 96 Million
Join the conversation
on Slack
slackin.alluxio.io
#9
Most critical open
source Java projects
(Google OpenSSF)
COMPANIES USING ALLUXIO
INTERNET
PUBLIC CLOUD PROVIDERS
GENERAL
E-COMMERCE
OTHERS
TECHNOLOGY FINANCIAL SERVICES
TELCO & MEDIA
LEARN MORE
Social Media
Twitter.com/alluxio
Linkedin.com/alluxio
Website
www.alluxio.io
Slack
http://slackin.alluxio.io/
