@joe_Caserta
Balancing
Data Governance
and
Innovation
Presented by:
Joe Caserta
Harrisburg University
Data Analytics Summit II
December 14-16, 2015
About Caserta Concepts
• Consulting Data Innovation and Modern Data Engineering
• Award-winning company
• Internationally recognized work force
• Strategy, Architecture, Implementation, Governance
• Innovation Partner
• Strategic Consulting
• Advanced Architecture
• Build & Deploy
• Leader in Enterprise Data Solutions
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Data Science
• Cloud Computing
• Data Governance
Client Portfolio
Retail/eCommerce
& Manufacturing
Digital Media/AdTech
Education & Services
Finance, Healthcare
& Insurance
The Future of Data is Today
As a Mindful Cyborg, Chris
Dancy utilizes up to
700 sensors, devices,
applications, and services to
track, analyze, and optimize as
many areas of his existence.
Data quantification enables
him to see the connections of
otherwise invisible data,
resulting in dramatic upgrades
to his health, productivity, and
quality of life.
The Evolution of Data Analytics
Descriptive Analytics: What happened? → Reports
Diagnostic Analytics: Why did it happen? → Correlations
Predictive Analytics: What will happen? → Predictions
Prescriptive Analytics: How can we make it happen? → Recommendations
[Chart: Business Value plotted against Data Analytics Sophistication]
Source: Gartner
Cognitive Computing / Cognitive Data Analytics
@joe_Caserta
Traditional Data Analytics Methods
• Design – Top Down, Bottom Up
• Customer Interviews and requirements gathering
• Data Profiling
• Create Data Models
• Facts and Dimensions
• Extract Transform Load (ETL)
• Copy data from sources to data warehouse
• Data Governance
• Stewardship, business rules, data quality
• Put a BI Tool on Top
• Design semantic layer
• Develop reports
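The traditional flow above (model facts and dimensions, run ETL, then report) can be sketched in a few lines. This is a minimal illustration using Python's built-in `sqlite3` module as a stand-in warehouse; the table names `dim_member` and `fact_claim` and the sample rows are hypothetical and do not come from the presentation.

```python
import sqlite3

# Hypothetical source extract: (claim_date, member_name, amount)
source_rows = [
    ("2015-12-14", "Alice", 120.0),
    ("2015-12-14", "Bob", 80.0),
    ("2015-12-15", "Alice", 50.0),
]

conn = sqlite3.connect(":memory:")

# Create data models: one dimension and one fact table, defined before any load.
conn.executescript(
    """
    CREATE TABLE dim_member (
        member_key  INTEGER PRIMARY KEY,
        member_name TEXT UNIQUE
    );
    CREATE TABLE fact_claim (
        claim_date TEXT,
        member_key INTEGER REFERENCES dim_member(member_key),
        amount     REAL
    );
    """
)

# ETL: look up (or create) the dimension row, then load the fact row.
for claim_date, member_name, amount in source_rows:
    conn.execute(
        "INSERT OR IGNORE INTO dim_member (member_name) VALUES (?)",
        (member_name,),
    )
    (member_key,) = conn.execute(
        "SELECT member_key FROM dim_member WHERE member_name = ?",
        (member_name,),
    ).fetchone()
    conn.execute(
        "INSERT INTO fact_claim VALUES (?, ?, ?)",
        (claim_date, member_key, amount),
    )

# Report: the kind of query a BI tool's semantic layer would generate.
report = conn.execute(
    """
    SELECT d.member_name, SUM(f.amount)
    FROM fact_claim AS f
    JOIN dim_member AS d ON d.member_key = f.member_key
    GROUP BY d.member_name
    ORDER BY d.member_name
    """
).fetchall()
```

Note that the structure has to exist before a single row can be loaded, which is the constraint the later slides contrast with the data lake approach.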
A Day in the Life
• Onboarding new data is difficult!
• Rigid Structures and Data Governance
• Disconnected/removed from business requirements:
“Hey – I need to analyze some new data”
→ IT conforms and profiles the data
→ Loads it into dimensional models
→ Builds a semantic layer nobody is going to use
→ Creates a dashboard we hope someone will notice
...and then you can access your data 3-6 months later to see if it has value!
Houston, we have a Problem: Data Sprawl
• There is one application for every 5-10 employees generating copies of
the same files leading to massive amounts of duplicate idle data strewn all
across the enterprise. - Michael Vizard, ITBusinessEdge.com
• Employees spend 35% of their work time searching for information...
finding what they seek 50% of the time or less.
- “The High Cost of Not Finding Information,” IDC
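One way to put a number on the duplicate-file side of data sprawl is to group files by a hash of their content. The sketch below assumes the files are reachable on a local or mounted file system; the function name `find_duplicate_files` is hypothetical and is not part of the presentation.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root):
    """Group files under `root` by the SHA-256 digest of their content.

    Returns a sorted list of groups; each group is a sorted list of paths
    (as strings) for two or more files with identical bytes.
    """
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(str(path))
    return sorted(sorted(paths) for paths in by_digest.values() if len(paths) > 1)
```

Reading each file fully into memory keeps the sketch short; for large files, hashing in chunks would be the safer choice.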
The Paradigm Shift
Big Data is not the problem, it’s the Change Agent
OLD WAY:
• Structure → Ingest → Analyze
• Fixed Capacity
• Monolithic
NEW WAY:
• Ingest → Analyze → Structure
• Dynamic Capacity
• Ecosystem
RECIPE:
• Data Lake
• Cloud
• Polyglot Data Landscape
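The "ingest, analyze, then structure" order is often described as schema-on-read. The following is a minimal sketch of that idea, not the presenter's implementation: raw records are stored untouched, and a schema is applied only when a specific analysis needs it. The sample events and the helper `read_with_schema` are hypothetical.

```python
import json

# Raw events are ingested exactly as received; no schema is enforced up front.
RAW_EVENTS = [
    '{"member_id": "m1", "claim_amount": "120.50", "source": "claims"}',
    '{"member_id": "m2", "clicks": 7, "source": "web"}',
    '{"member_id": "m1", "claim_amount": "30", "source": "claims"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Yields only the records that contain every field named in `schema`,
    casting each value with the callable given for that field.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Structure is chosen after exploring the data, and only for this analysis.
claims_schema = {"member_id": str, "claim_amount": float}
claims = list(read_with_schema(RAW_EVENTS, claims_schema))

total_by_member = {}
for row in claims:
    member = row["member_id"]
    total_by_member[member] = total_by_member.get(member, 0.0) + row["claim_amount"]
```

The web event without a `claim_amount` is skipped by this particular read but stays in the raw store, so a different schema can pick it up later.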
Enrollments
Claims
Finance
ETL
Ad-Hoc Query
Horizontally Scalable Environment - Optimized for Analytics
Data Lake
Canned Reporting
Big Data Analytics
NoSQL Databases
ETL
Ad-Hoc/Canned
Reporting
Traditional BI
Spark MapReduce Pig/Hive
N1 N2 N3 N4 N5
Hadoop Distributed File System (HDFS)
Traditional
EDW
Others…
The Evolution of Data Analytics
Data Science
Innovation is the only sustainable competitive advantage a company can have
Innovations may fail, but companies that don’t innovate will fail
Technology:
• Scalable distributed storage → Hadoop, S3
• Pluggable fit-for-purpose processing → Spark, EMR
Functional Capabilities:
• Remove barriers from data ingestion and analysis
• Storage and processing for all data
• Tunable Governance
The CDO Quandary
• Govern/Secure Data
• Make Accessible/Available
• Ensure Quality
• Instant Delivery
Data Munging Versus Reporting
[Chart axes: Data Governance (Minimum to Maximum) and Availability Requirement (Fast to Slow)]
Does data munging in a data science lab need the same restrictive governance as enterprise reporting?
Data Governance
• Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: Identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
• Business Process Integration: Policies around data frequency, source availability, etc.
• Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents
• Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
Data Governance for Big Data
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home grown, Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations
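Two of the items above, masking sensitive values in unstructured data upon ingest and measuring completeness as a data quality check, can be illustrated with a small sketch. The regular expressions, the placeholder tokens, and the function names `mask_sensitive` and `completeness` are assumptions made for illustration; a production system would need far more robust detection and would likely run distributed.

```python
import re

# Hypothetical detection patterns: an email address and a US SSN-like number.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_sensitive(text):
    """Replace detected emails and SSN-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


def completeness(records, required_fields):
    """Return the share of records with a non-empty value for every required field."""
    if not records:
        return 0.0
    complete = sum(
        1
        for record in records
        if all(record.get(field) not in (None, "") for field in required_fields)
    )
    return complete / len(records)
```

For example, `mask_sensitive` could be applied to call center notes as they land, while `completeness` could feed a monitoring threshold before a data set is certified.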
The Big Data Pyramid
[Pyramid diagram with two columns: Usage Pattern and Data Governance]
Ingest Raw Data
Organize, Define, Complete
Munging, Blending
Machine Learning
Data Quality and Monitoring
Metadata, ILM, Security
Data Catalog
Data Integration
Fully Governed (trusted)
Arbitrary/Ad-hoc Queries and Reporting
Metadata, ILM, Security
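The pyramid suggests that governance tightens as data moves from raw ingestion toward trusted reporting. The sketch below models that idea as a list of zones, each requiring more controls than the one before. The zone names and the exact assignment of controls to zones are assumptions made for illustration; the extracted slide text does not state which governance item pairs with which usage pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    usage_pattern: str
    required_controls: frozenset


# Assumed baseline controls applied to every zone.
BASE = frozenset({"metadata", "ilm", "security"})

# Hypothetical zones, ordered from least to most governed.
ZONES = (
    Zone("landing", "Ingest raw data", BASE),
    Zone("data_lake", "Organize, define, complete", BASE | {"data_catalog"}),
    Zone(
        "workspace",
        "Munging, blending, machine learning",
        BASE | {"data_catalog", "quality_monitoring"},
    ),
    Zone(
        "warehouse",
        "Ad-hoc queries and reporting",
        BASE | {"data_catalog", "quality_monitoring", "data_integration"},
    ),
)


def missing_controls(zone, controls_in_place):
    """List the controls a data set still lacks for the given zone."""
    return sorted(zone.required_controls - set(controls_in_place))


def highest_zone_allowed(controls_in_place):
    """Return the name of the most governed zone the data set qualifies for."""
    allowed = None
    for zone in ZONES:
        if missing_controls(zone, controls_in_place):
            break
        allowed = zone.name
    return allowed
```

Keeping the requirements as data rather than code is one way to make governance "tunable": relaxing or tightening a zone means editing its control set.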
The Data Refinery
• The feedback loop between Data Science, Data Warehouse and Data Lake is critical
• Ephemeral Data Science Workbench
• Successful work products of science must graduate into the appropriate layers of the Data Lake
[Diagram labels: Cool New Data, New Insights, Governance Refinery]
Define and Find Your Data
• Data Classification
• Import/Define business taxonomy
• Capture/Automate relationships between data sets
• Integrate metadata with other systems
• Centralized Auditing
• Security access information for every application with data
• Operational information for execution
• Search & Lineage (Browse)
• Predefined navigation paths to explore data
• Text-based search for data elements across data ecosystem
• Browse visualization of data lineage
• Security & Policy Engine
• Rationalize compliance policy at run-time
• Prevent data derivation based on classification (re-classification)
Key Requirements
• Automatic data discovery
• Metadata tagging
• Classification
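The requirements above (classification, metadata tagging, text search, and lineage browsing) can be pictured with a very small in-memory catalog. This is a sketch under stated assumptions, not a description of any product named in the deck; the class names `DatasetEntry` and `DataCatalog` and their methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    classification: str                               # e.g. "public", "restricted"
    tags: set = field(default_factory=set)            # metadata tags
    derived_from: list = field(default_factory=list)  # direct upstream data sets


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add or replace a data set entry, keyed by name."""
        self._entries[entry.name] = entry

    def search(self, text):
        """Text-based search over data set names and tags (case-insensitive)."""
        text = text.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if text in entry.name.lower()
            or any(text in tag.lower() for tag in entry.tags)
        )

    def lineage(self, name):
        """Return all upstream data set names, nearest ancestors first."""
        seen = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                if parent in self._entries:
                    queue.extend(self._entries[parent].derived_from)
        return seen
```

A policy engine could build on `lineage` to check, for instance, whether a derived data set inherits a restricted classification from any ancestor.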
Caution: Assembly Required
• Some of the most hopeful tools are brand new or in incubation!
• Enterprise big data implementations typically combine products with custom built components
People, processes, and business commitment are still critical!
Tools: Data Integration, Data Catalog & Governance, Emerging Solutions
Move to the Cloud?
Existing On-Premise Solution
• Challenges with operations of data servers in Data Center
• Increasing infrastructure complexity
• Keeping up with data growth
Cloud Advantages
• Reduced upfront capital investment
• Faster speed to value
• Elasticity
“Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on.” - Matt Wood, AWS
Come out and Play
CIL - Caserta
Innovations Lab
Experience
Big Data Warehousing Meetup
• Meet monthly to share data best
practices, experiences
• 3,300+ Members
http://www.meetup.com/Big-Data-Warehousing/
Examples of Previous Topics
• Data Governance, Compliance &
Security in Hadoop w/Cloudera
• Real Time Trade Data Monitoring
with Storm & Cassandra
• Predictive Analytics
• Exploring Big Data Analytics
Techniques w/Datameer
• Using a Graph DB for MDM &
Relationship Mgmt
• Data Science w/Claudia
Perlich & Revolution Analytics
• Processing 1.4 Trillion Events
in Hadoop
• Building a Relevance Engine
using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed
ETL & SQL w/Hadoop
• Intro to NoSQL w/10GEN
Thank You / Q&A
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
@joe_Caserta
The Data Scientist Winning Trifecta
Modern Data
Engineering/Data
Preparation
Domain
Knowledge/Business
Expertise
Advanced
Mathematics/
Statistics
Electronic Medical Records (EMR) Analytics
[Architecture diagram. Labels: Edge Node (100k files, variant 1..n; Forqlift; Sequence Files), HDFS Put, Hadoop Data Lake (Sequence Files; Pig EMR Processor; UDF Library; Python Wrapper; Provider table, Member table, and 15 more entities in Parquet), Sqoop, Netezza DW (Provider table, Member table, more dimensions and facts)]
• Receive Electronic Medical Records from various providers in various formats
• Address Hadoop ‘small file’ problem
• No barrier for onboarding and analysis of new data
• Blend new data with Data Lake and Big Data Warehouse
• Machine Learning
• Text Analytics
• Natural Language Processing
• Reporting
• Ad-hoc queries
• File ingestion
• Information Lifecycle Mgmt
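The Hadoop "small file" problem mentioned above is commonly handled by packing many small files into a few large container files before loading them. The deck names Forqlift and SequenceFiles for this step; the sketch below only illustrates the packing idea with Python's standard `tarfile` module as a stand-in container format, and the function names are hypothetical.

```python
import io
import tarfile


def pack_small_files(files):
    """Pack a mapping of {name: bytes} into one in-memory tar archive.

    Returns the archive as bytes, so thousands of small records can be
    stored and moved as a single large object.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unpack_small_files(archive_bytes):
    """Read the archive back into a {name: bytes} mapping."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
        }
```

A real pipeline would write the container to distributed storage and use a splittable format such as SequenceFile so that processing jobs can read it in parallel.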
