Anarchy Doesn’t Scale:
Your Data Lake Needs Governance
Kiran Kamreddy
Sr. Product Marketing Manager
Hadoop Portfolio, Teradata
One Data Lake: Many Definitions
A centralized repository of raw data into which many data-producing streams flow and from which downstream facilities may draw.
Information Sources -> Data Lake -> Downstream Facilities
Data variety is the driving factor in building a Data Lake
Data Lake Maturity & Risks
Data Lake initiatives usually start small
–  Single business unit
–  Few data sources, applications
–  Few data scientists
–  Exclusive domain
Informal, tribal ways of sharing
–  Handshakes
–  Over-the-desk conversations
–  No processes, mechanisms
–  Everyone is free to use
Anarchy works well
Data Lake Maturity & Risks
Data Lake initiatives grow popular
–  More business units
–  More data sources
–  More applications
–  More users
… and grow into more complex environments
Without proper governance mechanisms, data lakes risk turning into data swamps
Consistency
Quality
Security
What can Governance do for my data lake?
Fundamental capabilities for tracking, organizing and understanding data
-  Where did my data come from? How is it being transformed?
-  Track usage, resolve anomalies, visualize, optimize and clarify data lineage
-  Search and access data (not only browse)
-  Assess data quality and fitness for purpose
Specialized capabilities to meet security and compliance requirements
-  Govern who can and cannot access the data
-  Data life cycle management, archiving and retention policies
-  Auditing, compliance
A governance-first approach prevents data lakes from turning into data swamps
Retrofitting data governance is not feasible
	
  
Governance and Productivity
Governance should support day-to-day use of data
–  Data workers need a strong understanding of the data
–  Roles for data stewards, data owners, data analysts/scientists need to be assigned
Operational metadata is critical to understanding
–  Where did it come from?
–  What is the environment? (landing zone, OS, line of business)
–  What processes touched my data? Was any data lost? (checksums, etc.)
–  When did the data get ingested, transformed?
–  Did it get exported? When, where, and how will it be used (organizational)?
Provision consistent ingest methods that track operational metadata
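One way to picture a consistent ingest method is a single function that every ingest path calls to capture the same operational metadata. The sketch below is a minimal illustration under stated assumptions: the function names, field names, and sample values are hypothetical, and a real deployment would persist the record in a metadata store rather than print it.

```python
import hashlib
import json
from datetime import datetime, timezone


def capture_ingest_metadata(payload: bytes, source: str, landing_zone: str, line_of_business: str) -> dict:
    """Build an operational-metadata record for one ingested payload (hypothetical field names)."""
    return {
        "source": source,                                   # where did it come from?
        "landing_zone": landing_zone,                       # what is the environment?
        "line_of_business": line_of_business,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),      # checksum to detect loss or corruption
        "ingested_at": datetime.now(timezone.utc).isoformat(),  # when was it ingested?
    }


def verify(payload: bytes, record: dict) -> bool:
    """Check that a payload still matches the checksum captured at ingest."""
    return hashlib.sha256(payload).hexdigest() == record["sha256"]


record = capture_ingest_metadata(b"id,amount\n1,9.99\n", "crm_export", "/landing/crm", "retail")
print(json.dumps(record, indent=2))
print(verify(b"id,amount\n1,9.99\n", record))
```

Re-running `verify` after each processing step is one simple way to answer "did you lose any data?".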
What about Security & Compliance?
•  Compliance and Regulatory
–  Capture, store and move data
–  Sarbanes-Oxley, HIPAA, Basel II
•  Security
–  Authorization, Authentication
–  Handling sensitive data
•  Auditing
–  Recording every attempt to access
•  Archive & Retention
–  Data life cycle policies
	
  
Governance Challenges on Hadoop Data Lakes
Hadoop is different from a data warehouse (DW)
• Scale: high volumes of data, multiple user access
• Variety: schema-on-read, multiple formats of data
• Multiple storage layers (HDFS, Hive, HBase)
• Many processing engines (MR, Hive, Pig, Impala, Drill…)
• Many workflow engines/schedulers (Cron, Oozie, Falcon…)
• A holistic view of data with required context is difficult
Hadoop needs less stringent, more flexible mechanisms
Balance agility and self service with processes, rules, regulations
Maintain Governance without losing Hadoop's power
Multiple Approaches… but incomplete
•  Apache Hadoop has built-in support for these capabilities
–  HCatalog, Hive, etc.
•  Hadoop distribution vendors have all made improvements in each of these areas
–  Navigator, Ranger, etc.
•  Vendors provide specialized capabilities in each area that go beyond what a Hadoop distribution provides
One approach for Data Governance on Hadoop
Teradata Loom® – Integrated Data Management for Hadoop
–  Metadata management, lineage, data wrangling
–  Automatic data cataloging, data profiling and statistics generation
Teradata RainStor – Data Archiving
–  Structured data archiving in Hadoop with robust security
–  Compliance and auditing
ThinkBig – Hadoop professional services
–  Hadoop Data Lake: packaged service/product offering to build and deploy high-quality, governed data lakes
Teradata Loom
Find and Understand Your Data
•  ActiveScan
–  Data cataloging
–  Event triggers
–  Job detection and lineage creation
–  Data profiling (statistics)
•  Workbench and Metadata Registry
–  Data exploration and discovery
–  Technical and business metadata
–  Data sampling and previews
–  Lineage relationships
–  Search over metadata
–  REST API – easily integrate third-party apps
Prepare Your Data
•  Data Wrangling
–  Self-service, interactive data wrangling for Hadoop
–  Metadata tracked
•  HiveQL
–  Joins, unions, aggregations, UDFs
–  Metadata tracked in Loom
Teradata RainStor
–  Retain data online that is queryable for an indefinite period
–  Retire data that are no longer required with auto-expiration policies
–  Comply with strict government rules and regulations
–  Retain the metadata as it was originally captured
–  Store tamper-proof, immutable (unchangeable) data
–  Maintain availability to data as RDBMS versions change or expire
–  Compression, MPP SQL query engine, encryption, auditing
Big Data Services from ThinkBig
Focused exclusively on tying Hadoop and big data solutions to measurable business value
Services:
–  Big Data Strategy & Roadmap
–  Data Lake Implementation
–  Analytics & Data Science
–  Training & Support
Challenges addressed:
–  Lack of clear big data strategy
–  Data scattered and not well understood
–  Difficulty turning data into action
–  Missing big data skills
Data Governance for Hadoop: Bank Holding Company
Situation
Large-scale data lake planned with many heterogeneous sources and many individual analyst users.
Problem
Lack of a centralized metadata repository makes data governance impossible. The enterprise must have transparency into data in the cluster and the capability to define extensible metadata.
Solution
Hadoop provides data lake infrastructure. Loom provides centralized metadata management, with an automation framework.
Impact
•  Co-location of data provides more efficient workflow for analysts
•  Hadoop provides scalability at a lower cost than traditional systems
•  Develop new insights to drive business value
Talk Summary
•  Data governance is critical to building a successful data lake
–  Fundamental governance capabilities make data workers more productive
–  Solutions for meeting regulatory requirements are also needed
•  Teradata Loom provides the required data cataloging and lineage capabilities to make Hadoop users more productive
•  RainStor provides an advanced archiving solution
•  ThinkBig Data Lake provides the complete package
Stop by our booth for more details
We Love Feedback
Questions/Comments
Email: kiran.kamreddy@teradata.com
Follow Me
Twitter @kirankamreddy
Rate This Session
with the PARTNERS Mobile App
Remember To Share Your Virtual Passes
Follow Teradata 2015 PARTNERS
www.teradata-partners.com/social

