Are You Killing the Benefits of Your Data Lake?
Speakers
Rick van der Lans
Independent Business Intelligence Analyst, R20 Consultancy
@rick_vanderlans
Lakshmi Randall
Director of Product Marketing, Denodo
@LakshmiLJ
Wikipedia: Data science is an interdisciplinary
field of scientific methods, processes,
algorithms and systems to extract
knowledge or insights from data in various
forms, either structured or unstructured,
similar to data mining.
Data Science Steps and Data Preparation
Defining goals
Data selection
Data understanding
Data enrichment
Data cleansing
Data coding
Creating analytical model
Analytics
Understanding results
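Most of these steps are data preparation rather than analytics. Below is a minimal pandas sketch of what the selection, understanding, cleansing, enrichment, and coding steps often look like in practice; the sample data, column names, and rules are illustrative assumptions, not taken from the slides.

```python
import pandas as pd

# Data selection: pull only the rows and columns relevant to the goal.
# The data below is a hypothetical stand-in for a real extract.
df = pd.DataFrame({
    "order_id":   [1, 1, 2, 3, 4],
    "country":    ["nl", "nl", None, "us", "de"],
    "amount":     [120.0, 120.0, 75.5, -10.0, 300.0],
    "order_date": ["2018-01-03", "2018-01-03", "2018-02-10", "2018-02-11", "2018-03-01"],
})

# Data understanding: profile the data before changing it.
print(df.describe(include="all"))
print(df.isna().sum())

# Data cleansing: remove duplicates and invalid rows, fill missing values.
df = df.drop_duplicates(subset="order_id")
df = df[df["amount"] > 0]                          # negative amounts assumed invalid here
df["country"] = df["country"].str.upper().fillna("UNKNOWN")

# Data enrichment: derive new attributes, e.g. the order month.
df["order_month"] = pd.to_datetime(df["order_date"]).dt.month

# Data coding: encode categorical values for the analytical model.
df = pd.get_dummies(df, columns=["country"], prefix="cty")
print(df)
```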
Data Preparation is Time-Consuming
Source: Gil Press, “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, March 2016
Common Definition of Data Lake
James Serra:
A “data lake” is a storage repository, usually in Hadoop, that holds a
vast amount of raw data in its native format until it is needed. It’s a
great place for investigating, exploring, experimenting, and refining
data, in addition to archiving data.
Source: http://www.jamesserra.com/archive/2015/04/what-is-a-data-lake/
The Logical Data Lake
[Diagram: all data sources feeding a data lake that supports investigative analytics and data science.]
Challenges of a Physical Data Lake
Complex “T” (transformation) moved to the point of data usage
Big data too big to move
• Too slow to copy and bandwidth issues
Uncooperative departments and company politics
Restrictive data privacy and protection regulations
Data in the data lake is stored outside its original security realm
Missing metadata to describe data
Some sources are hard to copy
• For example, mainframe data
Refreshing of data lake
Management of data lake required
…
The Logical (Virtual) Data Lake
[Diagram: data sources connected to the logical data lake, some via ETL and some cached, with data science and investigative users accessing everything through the logical layer.]
Data is too valuable an asset to be used for reporting only.
A Multitude of Data Delivery Systems
The classic data warehouse
architecture
The data lake
The data marketplace
Data services
Managed file transfer
Data streaming
…
Drawback: Replicated Specifications
Data warehouse
Data lake
Data marketplace
Data streaming
Data file transfer
Data services
Drawback: Replicated Specifications
[Diagram: Source Systems 1 and 2 feed the data warehouse, the data lake, and data services, which serve analytics & reporting, data science, and an app; the same transformation specifications are replicated (=) in each delivery system.]
Siloed Data Delivery Systems
A Physical Data Lake With Multiple Zones
[Diagram: data sources flow into the landing zone, then the curated zone, then the production zone, where business users consume the data.]
The Logical Data Warehouse Architecture
[Diagram: a data virtualization layer connects the data source layer, the enterprise data layer, and the data consumption layer.]
The Logical, Multi-Purpose Data Lake
[Diagram: a data virtualization server spans the source systems and the landing, curated, and production zones, giving data scientists and other business users a single access point.]
Key Features Missing in SQL-on-Hadoop Engines
Allowing applications and users to access all the data through interfaces other than SQL
Allowing all types of data sources to be accessed
Detailed lineage and impact analysis capabilities
A searchable data catalog
Advanced query optimization techniques for
federated queries
Advanced query pushdown and parallel processing
capabilities
Centralized data security
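To make the federated-query and single-interface points above concrete, here is a hedged sketch of what one query against a logical data lake can look like from a client. The ODBC DSN, credentials, and table names are assumptions for illustration; the expectation is that the virtualization server pushes the filter and aggregation down to the Hadoop and relational sources before joining.

```python
import pyodbc

# Connect to the (hypothetical) data virtualization server through a plain
# ODBC DSN; the client sees a single SQL interface regardless of where the
# data physically lives.
conn = pyodbc.connect("DSN=logical_data_lake;UID=analyst;PWD=secret")

# One query joins a virtual view backed by Hadoop (clickstream) with a virtual
# view backed by a relational source (customers). Names are illustrative; the
# optimizer is expected to push the WHERE clause and the aggregation down to
# each source and only join the much smaller intermediate results.
sql = """
    SELECT c.region,
           COUNT(*)        AS page_views,
           SUM(k.duration) AS total_seconds
    FROM   clickstream k
    JOIN   customers   c ON c.customer_id = k.customer_id
    WHERE  k.event_date >= '2018-01-01'
    GROUP  BY c.region
"""

for region, page_views, total_seconds in conn.cursor().execute(sql):
    print(region, page_views, total_seconds)

conn.close()
```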
Single-Purpose versus Multi-Purpose Data Lake (1)
The Single-Purpose Data Lake
• Not always practical or feasible
• The data in a data lake is potentially too valuable to be used by data
scientists exclusively
• Other user groups may be interested in the data lake
• Siloed data delivery system operating independently of others
• Maintaining multiple physical layers of lakes is complex
Single-Purpose versus Multi-Purpose Data Lake (2)
The Multi-Purpose Data Lake
• Some data is physically stored centrally (through copying or caching), and
some is accessed remotely
• The data offered can be accessed by any type of business user
• The data in the data sources can be transformed to any form that is required
by other user groups
• A logical, multi-purpose data lake can be the foundation for several data
delivery systems
• Logical layers are easy to manage and maintain
Advantages of Multi-Purpose Data Lakes
Reduction of development costs
• Metadata specifications are defined once and reused many times
• Analytical solutions developed by one data scientist can easily be reused
• Data-related solutions developed by non-data scientists can be reused
Acceleration of development
• Data scientists don’t need to spend time on data selection
• Physically copying data is not mandatory, but optional
• Business users don’t have to learn the technical languages and APIs of the original data sources
Increased reporting and analytical consistency
• Reusing analytical and data-related solutions improves reporting and analytical consistency
• Definitions, descriptions, tags, and categories can be centrally cataloged
• Access to all the data can be centrally secured
Time to Tear the Silos Down!
Data Virtualization
Big Data Hadoop Deployments
Shhh… the ugly little secret is that big data deployment is hard!
Fifty Shades of Data Management
Transactional Systems
Data Warehouses
SAS Applications
Data Hubs
OLAP Sources
Data Catalogs
Micro Services
Data Marts
Streams/Queues/CDC
Data Lakes
Black Hole
The Analytics Environment Must Have a Brain…
Copyright © Intelligent Solutions, Inc. 2018. All Rights Reserved.
▪ For companies to benefit from their analytic efforts, data must be:
▪ Easily located (wherever it resides)
▪ Easily understood (with all its context in place)
▪ Easily accessed (query performance is critical)
▪ Easily audited (its lifecycle is clear to both IT and business users)
▪ Appropriately provisioned for analysis (its management is known)
A Few Simple Rules…
1. Build a business strategy rather than a big data strategy
2. Big data is really about small
3. Users come in all shapes and sizes
• Who are they? What data do they need? What flexibility do
they want?
4. Connect to all of the data (but start with the most important)
• What data is needed by the users? Open access, or pre-aggregated or pre-calculated?
5. Use the language that the business understands
• Don’t force people to change terminology… support multiple models, e.g., to Finance it’s an ‘account’, to Customer Care it’s a ‘customer’ (see the sketch below).
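One lightweight way to apply rule 5 is to publish the same underlying data under each department’s own vocabulary. Below is a minimal sketch using an in-memory SQLite database purely for illustration; in a real deployment these would be virtual views in the data delivery layer, and all names are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One shared base table holding the agreed-upon facts (names are illustrative).
con.execute("CREATE TABLE party (party_id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
con.execute("INSERT INTO party VALUES (1, 'Acme Corp', 1250.0), (2, 'Jane Doe', -40.0)")

# Finance talks about 'accounts'; Customer Care talks about 'customers'.
# Both views read the same rows, so there is only one definition of the data.
con.execute("CREATE VIEW account  AS SELECT party_id AS account_id,  name, balance FROM party")
con.execute("CREATE VIEW customer AS SELECT party_id AS customer_id, name, balance FROM party")

print(con.execute("SELECT account_id, balance FROM account WHERE balance < 0").fetchall())
print(con.execute("SELECT customer_id, name FROM customer").fetchall())
```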
Self-Service With Guardrails
• Don’t build just for the ‘data cowboys’
• Create pre-integrated, pre-calculated data
• Eliminates this burden for the users.
• Ensures consistency of calculations, etc.
• But allow the cowboys to ‘roam and wrangle’
• Even the cowboys can only access ‘approved’ data sources (see the sketch below)
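A minimal sketch of the guardrails idea under stated assumptions: most users consume a pre-calculated result so the calculation is defined once, while the ‘cowboys’ may write their own queries, but only against an approved catalog of sources. The role names, catalog, and SQL strings are hypothetical.

```python
# Approved catalog and pre-calculated definitions are hypothetical examples.
APPROVED_SOURCES = {"sales_curated", "sales_raw"}

PRECALCULATED = {
    # Defined once, so every consumer gets the same numbers.
    "monthly_revenue": "SELECT order_month, SUM(amount) FROM sales_curated GROUP BY order_month",
}

def sql_for(role: str, request: str) -> str:
    """Return the SQL a given role is allowed to run (illustrative guardrail)."""
    if role == "business_user":
        # Business users only pick from pre-integrated, pre-calculated results.
        return PRECALCULATED[request]
    if role == "data_cowboy":
        # Power users may roam and wrangle, but only over approved sources.
        if not any(src in request for src in APPROVED_SOURCES):
            raise PermissionError("query must reference approved data sources only")
        return request
    raise PermissionError(f"unknown role: {role}")

print(sql_for("business_user", "monthly_revenue"))
print(sql_for("data_cowboy", "SELECT * FROM sales_raw WHERE amount > 10000"))
```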
A Single, Logical, Multi-Purpose Data Lake
[Diagram: a data virtualization layer serves use cases such as product traceability, product innovation, risk management, pricing optimization, virtual sandbox, single view, data as a service, operational excellence, M&A, and BI/reporting, for users ranging from data scientists and call center analysts to store managers, compliance analysts, and shop floor supervisors.]
Multi-Purpose Data Lake With Data Virtualization
ENABLING TECHNOLOGIES
Virtualize Data, Don’t Migrate it
• Distributed heterogeneity is a challenge for the MDA
– Plague of data standards, models, quality metrics, interfaces…
• Consolidating diverse data is not a compelling solution
– Migration & consolidation alleviate complexity, but have other problems
– Time consuming, risky, disruptive, distracting
• DV is an effective alternative to consolidation
– Fraction of the time, risk, cost and disruption of
migration and consolidation projects
– Software/hardware advances give DV the
speed/scale required of most SLAs & use cases
Big Data Queries Faster With Denodo Platform
Performance comparison of 5 different queries
1. Data virtualization delivers better performance without the need to replicate data into Hadoop.
2. Data virtualization leverages data source architectures for what they are good at.
Query     Impala Hadoop-only runtime (s)   Denodo runtime (s)   Denodo runtime w/ cache (s)
Query 1   199                              120                  68
Query 2   187                              96                   88
Query 3   120                              212                  115
Query 4   timeout                          328                  69
Query 5   46                               91                   56

Data volumes: Queries 1, 2, 3, 5: Exadata row count ~5M, Impala row count ~500k. Query 4: Exadata row count ~5M, Impala row count ~2M.
COMPANY PROFILE
Anadarko employs approximately 4,500 men and women and invested about $4 billion in 2017 to find and develop the oil and natural gas resources that are essential to modern life.
Changing Commodity Cycle
Better data: honed focus
Faster data: adjusted org
More data: enhanced tech
Self-Service Data Delivery Environment
Purpose: to create and use data services for analytics, reports, and apps
Examples
• Reduced ad valorem taxes for finance
• Improved (production) completion design from multi-variate analysis using virtual views
• More (combined) access to vendor subscription data exploration for competitor intelligence
Results (from the 2017 roll-out/implementation)
• 20 corporate repositories; several non-corporate
• 200+ corporate views; 100+ user-defined views
• 30 developers using/trained
• 150 direct users; ~700 indirect users
Data Architecture at Anadarko
Why Multi-Purpose Data Lake?
• Surface all company data without the need to replicate
all data to the Hadoop lake
• Improve governance and metadata management to avoid
“data swamps”
• Allow for on-demand combination of real-time (from the
original sources) with historical data (in the cluster)
• Leverage the processing power of the existing data lake
clusters using Denodo’s optimizer
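A hedged sketch of the real-time-plus-historical point above: one query unions current records coming straight from the operational source with history already landed in the Hadoop cluster, so consumers never need to know where each slice lives. The DSN and table names are assumptions; in practice this union would typically be wrapped in a single virtual view.

```python
import pyodbc

# Connect through the virtualization layer (DSN and credentials are assumed).
conn = pyodbc.connect("DSN=logical_data_lake;UID=analyst;PWD=secret")

# Today's readings come from the operational source in real time; older
# readings come from the table already stored in the Hadoop cluster. The
# virtualization server routes each branch to the system that holds the data.
sql = """
    SELECT sensor_id, reading, reading_time
    FROM   sensor_readings_live
    WHERE  reading_time >= CURRENT_DATE
    UNION ALL
    SELECT sensor_id, reading, reading_time
    FROM   sensor_readings_history
    WHERE  reading_time < CURRENT_DATE
"""

for sensor_id, reading, reading_time in conn.cursor().execute(sql):
    print(sensor_id, reading, reading_time)

conn.close()
```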
“Denodo’s key strength is delivering a unified and centralized data services fabric with security and real-time integration across multiple traditional and big data sources, including Hadoop, NoSQL, cloud, and software-as-a-service (SaaS).”
Source: “The Forrester Wave™: Big Data Fabric, Q4 2016”
Gartner Gives DV Its Highest Maturity Rating
“Data Virtualization can be deployed with low risk and effort to achieve maximum value.”
Q&A
Thank you!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.
