© 2020 Snowflake Inc. All Rights Reserved
Data Science and
AI/ML at Scale
Scott Hruby | Snowflake Solutions Engineer
Snowflake 101
“The rapid rise of gathered/analyzed digital data
is often core to the holistic success of the fastest growing &
most successful companies of our time around the world.”
– Mary Meeker, Bond Capital
DATA:
THE WORLD’S MOST VALUABLE RESOURCE
DATA:
THE NEW SUPERPOWER
NEW TECHNOLOGY TRENDS
CHANGE HOW WE USE DATA
Rise of the Cloud: Cloud gives us the ability to scale and centralize data
Explosion of Data: IoT, mobile, and social open up new opportunities for insight
Diversification of Analytics: Analytics is growing in importance, everywhere, and for everyone
JOURNEY TO A CLOUD DATA PLATFORM
[Diagram] Platforms: On Premises EDW, 1st Gen Cloud EDW, Data Lake / Hadoop, Cloud Data Platform
[Diagram] Labels: All Data, All Users, Fast Answers, SQL Database
TRADITIONAL DATA ARCHITECTURE
Complex, Costly & Constrained
DATA SOURCES: OLTP Databases, Enterprise Applications, Third-Party, Web/Log Data, IoT
DATA CONSUMERS: Ad Hoc Analysis, Real-time Analytics, Operational Reporting
[Diagram] Components in between: Data Integration, Data Transformation, Data Analytics, Normalization & Aggregation, ELT, Streaming, CDC, Data Marts, Data Warehouses, Data Lake, Cubes, Backups, File Sharing, Data Science
SNOWFLAKE CLOUD DATA PLATFORM
ONE PLATFORM, ONE COPY OF DATA, MANY WORKLOADS
DATA SOURCES: OLTP DATABASES, ENTERPRISE APPLICATIONS, THIRD-PARTY, WEB/LOG DATA, IoT
DATA CONSUMERS: DATA MONETIZATION, OPERATIONAL REPORTING, AD HOC ANALYSIS, REAL-TIME ANALYTICS
SNOWFLAKE ARCHITECTURE
Data Science & AI
DATA SCIENCE IS PREDICTIVE ANALYTICS
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
[Diagram] Examples shown along the spectrum: reports, self driving car
85% of #bigdata projects fail to move past preliminary stages (Gartner - Nov. 2017)
80% of analytics insights will not deliver business outcomes through 2022 (Gartner - Jan 2019)
87% of data science projects never make it into production (VentureBeat AI - July 2019)
80% of AI projects will “remain alchemy, run by wizards” through 2020 (Gartner - Jan 2019)
KEY REQUIREMENTS FOR DATA SCIENCE
CONSOLIDATED SOURCE FOR ALL DATA
Structured, Semi-structured, and Unstructured data
3rd Party Data Sharing
Streaming & Batch
EFFICIENT DATA PREPARATION
Dedicated compute clusters for each team
No resource contention
EXTENSIVE PARTNER ECOSYSTEM
Integration with the latest ML tools and libraries
Consistent experience for BI and ML users
DATA SCIENCE MATURITY
Axis label: Value of Data

1. IDENTIFYING USE CASES
● Defining business use cases
● Focused on BI
● Exploring tools
● Experimenting with languages
● Training and hiring
● Testing data pipelines
DATA CHALLENGES
● 15% data utilization
● Understanding data
● Preparing data
OVERALL CHALLENGES
● Acquiring talent
● Designing & building consistent processes & pipelines
● Defining business value

2. LIMITED PRODUCTION
● Working on creating efficiencies
● Mapping ML to business value
● Planning growth & new use cases
● Testing various languages & tools
● Training and hiring
● Optimizing data pipelines
DATA CHALLENGES
● 70% data utilization + external
● Data wrangling, ETL + data validation
OVERALL CHALLENGES
● Training and hiring
● Scalability of tools & team
● Cost containment
● Creating ROI consistently

3. COMPETITIVE ADVANTAGE
● Large data science team
● ML built into product/service
● Standardized on tools/languages
● Advanced ML for video, audio, and images
● Automated data pipelines
DATA CHALLENGES
● 130% data utilization
● Obtaining more data
● Streaming data
OVERALL CHALLENGES
● Retaining talent
● Optimize, simulate, & test model efficacy
● Staying ahead of competitors
SNOWFLAKE CLOUD DATA PLATFORM
DATA SOURCES: OLTP DATABASES, ENTERPRISE APPLICATIONS, THIRD-PARTY, WEB/LOG DATA, IoT
DATA CONSUMERS: DATA MONETIZATION, OPERATIONAL REPORTING, AD HOC ANALYSIS, REAL-TIME ANALYTICS
DATA SCIENCE WITH SNOWFLAKE
ML FRAMEWORKS & LIBRARIES
STATISTICAL ALGORITHMS
● Suitable for most Data Science problems involving structured and semi-structured data
● Good performance with small amounts of data
● Mature - have been around for a while
NEURAL NETWORK / DEEP LEARNING
● Mostly used with unstructured data like images, audio, and video
● Performance increases with the size of the training data
ML PARTNER ECOSYSTEM
NOTEBOOK-BASED: Powerful but Complex
AUTOMATION-BASED (AutoML): Fast and Easy but Less Customizable
[Diagram] Partner logos; legible text: MLlib
PROVEN BY OVER 5,400 CUSTOMERS
EVER EXPANDING ECOSYSTEM
SECURE AND
GOVERNED ACCESS
TO ALL DATA
SNOWFLAKE SECURITY AT A GLANCE
Snowflake Operational Controls
• NIST 800-53
• SOC2 Type 2
• HIPAA
• PCI
• FedRAMP
Access
• All communication secured & encrypted
• TLS 1.2 encryption in both trusted and untrusted networks
• IP Whitelisting
Authentication
• Password Policy enforcement
• Multifactor Authentication
• SAML 2.0 support for Federated Authentication
Application
• Flexible user management
• Role-based access control for granular control
• RBAC for data and actions
Data
• Encrypted at rest
• Hierarchical key model rooted in Cloud HSM
• Automatic key rotation
• Time Travel 1-90 days
• Tri-Secret Secure
• Query statement encryption
Infrastructure
• AWS, Azure Physical Security
• AWS, Azure Redundancy
• Regional Data Centers
  ▪ US
  ▪ EU
  ▪ AP
COMPREHENSIVE DATA PROTECTION
Protection against infrastructure failures
All data transparently & synchronously replicated 3+ ways across independent infrastructure
Protection against corruption & user errors
“Time travel” feature enables instant roll-back to any point in time during chosen retention window
Long-term data protection
Zero-copy clones + optional export to cloud object storage enable user-managed data copies
[Diagram] SELECT * FROM T0… shown against table versions T0, T1, T2 (New data, Modified data); labels: Daily, Weekly
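The roll-back idea on this slide can be pictured with a toy model. The sketch below is only an illustration in Python, not how Snowflake implements Time Travel: the `VersionedTable` class, its method names, and the sample rows are all hypothetical, and it naively stores a full snapshot per commit. It shows a read "as of" an earlier time being answered from retained versions, and refused once the requested time falls outside the retention window.

```python
import bisect


class VersionedTable:
    """Toy table that keeps every committed version so reads can target a past time."""

    def __init__(self, retention_days: float):
        self.retention_days = retention_days
        self._commit_times = []  # commit timestamps in days, ascending
        self._versions = []      # full row snapshot stored at each commit

    def commit(self, at_day: float, rows):
        # Assumption for this sketch: commits arrive in increasing time order.
        if self._commit_times and at_day <= self._commit_times[-1]:
            raise ValueError("commits must be in increasing time order")
        self._commit_times.append(at_day)
        self._versions.append(tuple(rows))

    def select_at(self, at_day: float, now_day: float):
        """Return the rows as they stood at at_day, if still inside the retention window."""
        if now_day - at_day > self.retention_days:
            raise LookupError("requested time is outside the retention window")
        # Find the latest commit made at or before the requested time.
        index = bisect.bisect_right(self._commit_times, at_day) - 1
        return self._versions[index] if index >= 0 else ()


table = VersionedTable(retention_days=90)
table.commit(0, ["order-1"])                       # T0
table.commit(1, ["order-1", "order-2"])            # T1: new data
table.commit(2, ["order-1 (edited)", "order-2"])   # T2: modified data

rows_at_t0 = table.select_at(0, now_day=2)
rows_at_t1 = table.select_at(1.5, now_day=2)
```

Reading at day 1.5 returns the T1 version because no commit happened between T1 and the requested time.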
GOVERNANCE AND SECURITY
Dynamic data masking
Dynamically Mask Protected (PII, PHI) Column Data at Query Time
• No change to the stored data
• Mask or partial mask using constant value, hash, and custom functions
• Unmask for authorized users only
Policy Based Control
• Table/View owners and privileged users (such as accountadmin) unauthorized by default
• Centralized policy management
Ease of Management
• Apply single policy to multiple columns
• Prevent secure view explosion

Alice (Unauthorized)
ID   Phone         SSN
101  ***-***-5534  *********
102  ***-***-3564  *********
103  ***-***-9787  *********

Bob (Authorized)
ID   Phone         SSN
101  408-123-5534  *********
102  510-335-3564  *********
103  214-553-9787  *********

[Diagram] Labels: POLICIES, INGEST RAW DATA. Availability labels: PRIVATE, PUBLIC, GA
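The Alice and Bob views above can be reproduced with a small Python sketch of masking at read time. It only illustrates the idea and is not Snowflake code: the function names and the `authorized` flag are assumptions, and the stored SSN values are placeholders because the slide never shows them unmasked. The stored rows are never modified; masking happens only when a view is built.

```python
# Stored data stays as ingested. The SSN values are placeholders (not on the slide).
STORED_ROWS = [
    {"id": 101, "phone": "408-123-5534", "ssn": "000-00-0000"},
    {"id": 102, "phone": "510-335-3564", "ssn": "000-00-0000"},
    {"id": 103, "phone": "214-553-9787", "ssn": "000-00-0000"},
]


def mask_phone(phone: str, authorized: bool) -> str:
    """Partial mask: unauthorized readers see only the last group of digits."""
    if authorized:
        return phone
    *hidden, last = phone.split("-")
    return "-".join(["*" * len(part) for part in hidden] + [last])


def mask_ssn(ssn: str) -> str:
    """Full mask with a constant value; on the slide both readers see the SSN masked."""
    return "*" * 9


def query(rows, authorized: bool):
    """Build the reader's view; the stored rows are left untouched."""
    return [
        {"id": r["id"], "phone": mask_phone(r["phone"], authorized), "ssn": mask_ssn(r["ssn"])}
        for r in rows
    ]


alice_view = query(STORED_ROWS, authorized=False)
bob_view = query(STORED_ROWS, authorized=True)
```

Keeping the masking in the read path is what lets one stored copy serve both readers without maintaining a separate secure view per audience.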
GOVERNANCE AND SECURITY
Dynamic data masking policies
[Diagram] A Policy Admin applies a Masking Policy (<policy condition>, <masking function>) to Resource(s): DB 1 / Table 1 / Column 1; DB 1 / View 1 / Column 1; DB n / Table n / Column n

Masking Policy Example
-- The policy name email_mask is illustrative; the slide shows only the CASE body.
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN invoker_role() IN ('pii_role') THEN val                                  -- Unmask
    WHEN invoker_role() IN ('support') THEN regexp_replace(val, '.+@', '*****@')  -- Partial mask
    ELSE '********'                                                               -- Mask
  END;

Masking Policy
• Policy contains condition(s) and masking function to apply under those conditions
• Policy is applied to one or more table, view, or external table columns in an account
• Nested policy execution for views - policy on table executed before policy on view(s)
Supports
• All data types
• Data sharing
• Streams
• Clone carries over policy associations
Availability labels: PRIVATE, PUBLIC, GA
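To make the three branches of the example policy concrete, here is a rough Python mirror of the same logic. It is a sketch for reasoning about the behavior, not Snowflake code: `mask_email`, `POLICY_ASSOCIATIONS`, `read_column`, and the table and column names are all hypothetical. It also shows one policy attached to more than one column, as the bullets describe.

```python
import re


def mask_email(val: str, role: str) -> str:
    """Mirror of the slide's CASE expression: unmask, partial mask, or full mask by role."""
    if role == "pii_role":
        return val                               # Unmask
    if role == "support":
        return re.sub(r".+@", "*****@", val)     # Partial mask: hide the local part
    return "********"                            # Mask


# One policy attached to several columns (table and column names are made up).
POLICY_ASSOCIATIONS = {
    ("customers", "email"): mask_email,
    ("leads", "contact_email"): mask_email,
}


def read_column(table: str, column: str, values, role: str):
    """Apply the column's policy, if any, when values are read."""
    policy = POLICY_ASSOCIATIONS.get((table, column))
    if policy is None:
        return list(values)
    return [policy(v, role) for v in values]


support_view = read_column("customers", "email", ["alice@example.com"], role="support")
analyst_view = read_column("leads", "contact_email", ["bob@example.com"], role="analyst")
```

Because the policy is looked up per column, changing `mask_email` once changes the behavior everywhere it is attached, which is the point of centralized policy management.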
GOVERNANCE AND SECURITY
External tokenization using third party
Ingest Protected (PII/PHI) Data as Externally Tokenized
• Using Protegrity agents on ETL tools.
De-tokenize for Authorized Users at Query Time
• Protegrity DSG called using external functions to de-tokenize data.
• For unauthorized users, Protegrity DSG is not called.
Policy Based Control
• Table/View owners and privileged users (such as accountadmin) unauthorized by default
• Centralized policy management

Tokenized: Alice (Unauthorized)
ID   Phone         SSN
101  111-222-3333  000-78-9999
102  002-778-9904  779-66-8908
103  100-887-8888  111-00-8888

De-tokenized: Bob (Authorized)
ID   Phone         SSN
101  408-123-5534  387-78-3456
102  510-334-3564  226-44-8908
103  214-553-9787  359-9987-0098

[Diagram] Ingest tokenized data; POLICIES; EXTERNAL FUNCTION calls the Data Security Gateway (DSG) in the Customer VPC / VNet over a REST API. Availability labels: PRIVATE, PUBLIC, GA
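The flow on this slide can be sketched in a few lines of Python. This is a toy stand-in, not the Protegrity or Snowflake API: `FakeGateway`, `read_phone`, and the role names are hypothetical, and an in-memory dictionary plays the part of the gateway's token vault. The token and clear value are row 101 from the two tables above. The call counter shows that the gateway is contacted only for authorized readers.

```python
class FakeGateway:
    """In-memory stand-in for an external de-tokenization service."""

    def __init__(self, vault):
        self._vault = dict(vault)  # token -> clear value
        self.calls = 0             # how many times the gateway was contacted

    def detokenize(self, token: str) -> str:
        self.calls += 1
        return self._vault[token]


def read_phone(token: str, role: str, gateway: FakeGateway,
               authorized_roles=frozenset({"pii_role"})) -> str:
    """Return the clear value for authorized roles; otherwise return the token as stored."""
    if role in authorized_roles:
        return gateway.detokenize(token)
    return token  # the gateway is never called for unauthorized readers


gateway = FakeGateway({"111-222-3333": "408-123-5534"})

alice_sees = read_phone("111-222-3333", role="analyst", gateway=gateway)
calls_after_alice = gateway.calls
bob_sees = read_phone("111-222-3333", role="pii_role", gateway=gateway)
calls_after_bob = gateway.calls
```

Checking the role before calling out keeps clear values from ever leaving the gateway for readers who should not see them.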
“I can only show you the door. You’re the one
that has to walk through it”
THANK YOU
