Lessons learned processing 70 billion data points a day using the hybrid cloud
© 2018 NetApp, Inc. All rights reserved. --- NETAPP CONFIDENTIAL ---
Shankar Pasupathy, Technical Director, Active IQ Data Science
Pranoop Erasani, Senior Technical Director, ONTAP NFS
DataWorks Summit, San Jose
June 2018
Agenda
o What is Active IQ?
o 5 data management challenges with Hadoop
o Hybrid cloud analytics architecture
o Why NFS for Hadoop and AI?
o Performance and scale of shared storage
o NetApp’s In-Place Analytics Module
o Summary
What is Active IQ?
Active IQ platform
AutoSupport (ASUP)
• Configuration data
• Performance counters
• System logs
Active IQ: Predictive Analytics for NetApp storage systems
Use cases
• Predict disk drive failures
• Predict outages, performance problems
• Detect misconfigured storage (ARS)
• Automate problem diagnosis
• Use community wisdom to guide best practices
• Guide future product design
The NetApp Active IQ Ecosystem
• Data growth: 2x every 8 months
• 300,000 storage controllers
• 70 billion data points processed daily
• 135 TB of data processed per month
• 3.7 PB data lake
• Large number of users
• 6+ Hadoop clusters
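To put the growth figure in perspective, the following is a minimal Python sketch of what a doubling every 8 months implies. It assumes the doubling period stays constant and applies to the data lake size, neither of which the slide states. The function names and the 24-month horizon are illustrative and not from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def growth_factor(months, doubling_months=8.0):
    """Multiplicative growth after `months` at a constant doubling period."""
    return 2.0 ** (months / doubling_months)


def per_second(daily_count):
    """Average rate per second for a daily total."""
    return daily_count / SECONDS_PER_DAY


# Figures quoted on the slide.
DATA_LAKE_PB = 3.7
DAILY_DATA_POINTS = 70e9

# Assumption: the 8-month doubling applies to the data lake size itself.
print(f"growth per year: {growth_factor(12):.2f}x")
print(f"data lake after 24 months: {DATA_LAKE_PB * growth_factor(24):.1f} PB")
print(f"average ingest: {per_second(DAILY_DATA_POINTS):,.0f} data points/s")
```

Under that assumption, three doubling periods fit into 24 months, so the projected size is eight times the current one.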
5 data management challenges
1. Storage for Hadoop doubling year over year
2. The need to use the cloud in a cost-effective and secure manner
3. Separate storage architectures for AI and Hadoop
4. Multiple sources of data, each with their own access rights
5. The need for data provenance
Traditional on-premises Hadoop architecture
[Diagram labels: IoT Data; Web tier/App server; Users; Stream analytics; Hadoop Data Lake; NoSQL/SQL; AI and ML models]
Challenge 1: Problems caused by storage growth
o Poor utilization of compute
o Disk failures at scale
o Too many copies of the data
o Tiers of storage and QoS
o (HDFS 3.0)
[Diagram labels: Batch processing; Realtime cluster; QA; POC; CPU and Disks per node; Switch; 3x data copies]
Our solution: Separation of compute and storage
• Not a new idea
  o LADDIS 2009, Usenix 2012, IEEE Big Data 2013, IDC 2018
  o Hadoop in the cloud
• Rack space and throughput
  o Modern all-flash shared storage: ~25 GB/s in 4U (4 PB effective space)
  o Need 350 traditional DAS servers for 25 GB/s aggregate bandwidth
• Network latency
  o 40 Gbit Ethernet in 2018: 1 – 5 microseconds with iWARP/RDMA
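The rack-space comparison above rests on simple arithmetic, sketched here in Python. The 25 GB/s and 350-server figures come from the slide. The helper names and the decimal unit convention (1 GB = 1000 MB) are assumptions made for illustration.

```python
import math


def per_server_mb_per_s(aggregate_gb_per_s, servers):
    """Average bandwidth each server must deliver, in MB/s (1 GB = 1000 MB)."""
    return aggregate_gb_per_s * 1000.0 / servers


def servers_needed(aggregate_gb_per_s, per_server_mb):
    """Servers required to reach an aggregate bandwidth, rounded up."""
    return math.ceil(aggregate_gb_per_s * 1000.0 / per_server_mb)


# Slide figures: 350 DAS servers for 25 GB/s aggregate bandwidth.
implied = per_server_mb_per_s(25, 350)
print(f"implied per-server bandwidth: {implied:.1f} MB/s")

# Hypothetical what-if: servers that each sustain 100 MB/s.
print(f"servers needed at 100 MB/s each: {servers_needed(25, 100)}")
```

The slide's numbers imply roughly 71 MB/s of delivered bandwidth per DAS server, which is the per-node assumption behind the 350-server estimate.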
Challenge 2: Hadoop on-premises vs the cloud?
• Cloud
  o Freedom from IT – ease of use
  o Remove operations pain (Hadoop as a service)
  o Provision compute instantly
  o Cost effective?
• Inhibitors
  o Security and fear: “Data is my most valuable asset”
  o Regulations – GDPR, HIPAA, …
  o Prohibitive cost for storage (60 PB of data?)
  o Cloud lock-in and egress costs
Our solution: Cloud connected storage
[Diagram labels: Efficient data copy; NetApp Storage; Hadoop; On-premises; NetApp Cloud Volumes; Google]
Latency: 1-2 ms
Bandwidth: Links x 10 Gbps
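A rough way to read "Links x 10 Gbps" is as an upper bound on bulk copy time between the on-premises storage and the cloud. The sketch below is a back-of-the-envelope Python estimate, not a measurement. The link count of 4 and the `efficiency` parameter are hypothetical, the 135 TB figure is the monthly volume quoted earlier in the deck, and the 1-2 ms latency is ignored because it is negligible for large sequential transfers.

```python
def transfer_seconds(size_bytes, links, gbps_per_link=10.0, efficiency=1.0):
    """Time to move `size_bytes` over `links` parallel links.

    `efficiency` (0 < efficiency <= 1) is a hypothetical knob for protocol
    overhead; 1.0 means the full line rate is usable.
    """
    if links <= 0:
        raise ValueError("links must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_bits_per_s = links * gbps_per_link * 1e9 * efficiency
    return size_bytes * 8 / usable_bits_per_s


# Hypothetical scenario: copy one month of processed data (135 TB) over 4 links.
seconds = transfer_seconds(135e12, links=4)
print(f"{seconds / 3600:.1f} hours at full line rate")
```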
Choosing Hadoop in the cloud vs on-premises
• On-premises: 24x7 real-time processing, high-throughput jobs
• AWS/Azure/GCP (choose your cloud): QA, POCs, AI/ML, bursty workloads
• Edge: IoT data, 24x7
• NetApp Data Fabric: unified data lake on cloud-connected storage (secure, efficient)
Challenge 3: What is the right storage architecture for AI?
• HDFS
  o Sequential I/O
  o Throughput oriented
  o Large files
• AI
  o Needs random I/O
  o IOPS oriented
  o Shared file system for distributed training
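The distinction between throughput-oriented and IOPS-oriented workloads comes down to the size of each I/O, since throughput is the product of IOPS and I/O size. The Python sketch below illustrates that relationship. The 4 KiB and 1 MiB request sizes are hypothetical examples of small random reads and large sequential reads, and are not taken from the talk.

```python
def throughput_gb_per_s(iops, io_size_bytes):
    """Throughput in GB/s (decimal) delivered by `iops` requests of a given size."""
    return iops * io_size_bytes / 1e9


def iops_for_throughput(gb_per_s, io_size_bytes):
    """Requests per second needed to sustain a throughput at a given I/O size."""
    return gb_per_s * 1e9 / io_size_bytes


KIB = 1024
MIB = 1024 * KIB

# Hypothetical small random reads: many operations, little data moved.
print(f"100k IOPS at 4 KiB: {throughput_gb_per_s(100_000, 4 * KIB):.2f} GB/s")

# Hypothetical large sequential reads: few operations, a lot of data moved.
print(f"1 GB/s at 1 MiB: {iops_for_throughput(1.0, MIB):,.0f} IOPS")
```

A system sized only for sequential throughput can therefore fall short on a random-I/O workload, and the reverse, which is why the two workloads have traditionally been given separate storage.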
Our solution: Build a unified, shared data lake
• Tiered storage (SSD, SATA)
• Storage QoS for different workloads
• Ability to rapidly “clone” data for QA
• Built-in compression
• Triple-parity RAID hides disk failures
  o For >4 TB SATA disks
[Diagram labels: Active IQ; Unified Data Lake; NFS]
Active IQ analytics architecture using the hybrid cloud and NFS storage
12x reduction in storage space, 30x improvement in performance, 3x reduction in compute nodes
[Diagram labels: Edge; On Premises; In the Cloud; Active IQ telemetry data; Cluster; HDInsight; Databricks/EMR; In-place analytics module; NFS; Unified Data Lake; Archive Data Lake; Cloud Connected Storage; NetApp Cloud Volumes; NetApp Data Fabric]
Why NFS for Hadoop and AI?
1. Performance
2. Scale
3. Manageability
NFS Performance: High throughput at low latency
• 500 µs latency
• 25 GB/s throughput
• 1M IOPS
• 24-node cluster: 300 GB/s throughput, 11.4M IOPS
NFS Scale: PB-scale data lake with high file count
• Tested: 20 PB size, 400B files, 10 nodes
• Supported: 172 PB size, 47T files, 24 nodes
NFS Manageability: NetApp In-Place Analytics Module
[Diagram: Hadoop stack. Computation frameworks: Batch (MapReduce), Interactive (TEZ), Online (HBase), In-Memory (Spark), Graph (Giraph). YARN (cluster resource management). FileSystem (interfaces to interact with storage systems): HDFS, Amazon S3, GlusterFS, Azure, NFS. In-Place Analytics Module.]
• Available as a drop-in JAR file
• Integrated with Hortonworks Ambari
• NFS filesystem implementation
  o Buffered input and output streams
  o 14 of 22 NFSv3 operations
• Simplified configuration
  o Set fs.defaultFS to NFS path (e.g. IP:/path)
  o Tunables configured via a JSON file
• Integrated with LDAP directory services
• Roadmap
  o Ranger, Kerberos and HCFS
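As an illustration of the "set fs.defaultFS" step, the Python sketch below renders a Hadoop `core-site.xml` fragment from a dictionary of properties. Only the `fs.defaultFS` property name comes from the slide. The `nfs://` URI form, the address `192.0.2.10:2049` (a documentation-range IP), and the helper name `core_site_xml` are assumptions for illustration. The module's other properties and its JSON tunables file are not shown because the slide does not list them.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def core_site_xml(properties: Dict[str, str]) -> str:
    """Render Hadoop-style <configuration> XML from name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical value: the slide only says "NFS path (e.g. IP:/path)".
settings = {"fs.defaultFS": "nfs://192.0.2.10:2049/"}
print(core_site_xml(settings))
```

Consult the module's own documentation for the exact URI format and any additional required properties before using a file like this on a real cluster.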
Additional Benefits of NetApp In-Place Analytics Module
1. No changes to Hadoop applications
  o Analytics jobs run seamlessly over NFS
2. No copy sprawl
  o The primary data copy is the data lake: 1 copy vs 3 HDFS copies
3. Leverage data management
  o Snapshots, data protection copies and clones for point-in-time analytics
4. Optimized for streaming throughput
  o NFS multi-pathing, high concurrency, prefetching, data and metadata caching
5. NFS and HDFS can coexist
  o E.g. HDFS as primary and NFS as secondary, or vice versa
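The "1 copy vs 3 HDFS copies" point can be made concrete with a short Python calculation. This is a simplified sketch: it counts whole copies only and ignores RAID parity overhead, compression, and snapshot space. The 3.7 PB input is the data lake size quoted earlier in the deck, and the function name is illustrative.

```python
def raw_capacity_pb(logical_pb, copies):
    """Raw capacity consumed when `copies` full copies of the data are kept."""
    return logical_pb * copies


LOGICAL_PB = 3.7  # data lake size quoted earlier in the deck

hdfs_raw = raw_capacity_pb(LOGICAL_PB, copies=3)    # HDFS default 3x replication
shared_raw = raw_capacity_pb(LOGICAL_PB, copies=1)  # single copy on shared storage

print(f"3 copies: {hdfs_raw:.1f} PB, 1 copy: {shared_raw:.1f} PB, "
      f"difference: {hdfs_raw - shared_raw:.1f} PB")
```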
5 data management challenges
1. Storage for Hadoop doubling year over year
2. The need to use the cloud in a cost-effective and secure manner
3. Separate storage architectures for AI and Hadoop
4. Multiple sources of data, each with their own access rights
5. The need for data provenance
Summary
1. Disaggregate compute from storage for analytics
2. Unified data lake for ease of management and Lower TCO
3. Hybrid cloud architecture for access to cloud innovation
Thank You
4. NFS TCO: Ease of data lifecycle management at scale
• Automatic tiering
• Zero-touch management
• Preserves file system semantics
• Preserves storage efficiencies
• Data encrypted in-flight
• 1 copy vs 3 HDFS copies
[Diagram labels: Before; After; Active Data; Inactive Data; FabricPool; Performance Tier; Capacity Tier; Object Storage; On-Premises Footprint; 80%]
