Hadoop in Validated Environment
Data Governance Initiative
Martin Ryzl
Director, Analytics Platform
Ivo Lasek
Architect, Analytics Platform
Research
Manufacturing
Marketing
Search
Data Integration
Data Analytics
Open Data
90 Days
Laboratory Information Management
SAP
Enterprise Asset Management
Manufacturing Execution Systems
Data Analytics
Data Integration
Who is the dataset owner?
How can I get access?
What does the data mean?
How can I reproduce the results?
Where is the data I need?
6 Months
Where Is the Data I Need?
Data Lake
Merge
Clean
Security and Data Governance
Data Catalog
Source: http://www.data.gov/
Who Is the Dataset Owner?
Entitlements
Dataset Owner
Dataset User
How Can I Get Access?
Entitlements
Dataset Owner
Entitlement Steward
Dataset User
What Does the Data Mean?
Semantic meaning – Metastore
id name ssn birth_no phone
id personal_number employee_number first_name division
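The two column lists above name similar fields differently (for example `name` versus `first_name`). As a minimal sketch of what recording semantic meaning could look like, the Python below keeps a small glossary that maps physical columns to shared business terms. The table names `hr_people` and `payroll_staff`, the glossary terms, and the pairing of `id` with `employee_number` are illustrative assumptions, not taken from the slides.

```python
# Hypothetical glossary: (table, column) -> shared business term.
# Table names and the pairings below are illustrative assumptions.
GLOSSARY = {
    ("hr_people", "id"): "employee identifier",
    ("hr_people", "name"): "first name",
    ("payroll_staff", "employee_number"): "employee identifier",
    ("payroll_staff", "first_name"): "first name",
}


def columns_for_term(term):
    """Return every (table, column) pair annotated with the given term."""
    return sorted(key for key, value in GLOSSARY.items() if value == term)


def describe(table, column):
    """Look up the business meaning of a column, or flag it as undocumented."""
    return GLOSSARY.get((table, column), "undocumented")
```

A lookup such as `describe("hr_people", "ssn")` returns `"undocumented"` here, which is the kind of gap a data steward would be expected to close.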
Metastore
Dataset Owner
Data Steward
Entitlement Steward
Dataset User
How Can I Reproduce the Data?
Reproducibility
Raw Data (delta1, delta2, delta3) -> Aggregate v1.0 -> Reporting Data (delta1..delta3)
Raw Data (delta1, ..., delta7) -> Aggregate v1.0 -> Reporting Data (delta1..delta7)
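One way to read the reproducibility slide: a reporting output is fully identified by the transformation, its version, and the exact range of raw deltas it consumed. The Python sketch below illustrates that idea under stated assumptions. The `RAW` store, the `RunSpec` record, and the use of a plain sum as the "Aggregate v1.0" logic are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical raw store: each delta is an immutable batch of values.
RAW = {f"delta{i}": (i, i * 10) for i in range(1, 8)}


@dataclass(frozen=True)
class RunSpec:
    transform: str   # e.g. "Aggregate"
    version: str     # e.g. "v1.0"
    deltas: tuple    # exact input deltas, in order


def aggregate_v1_0(rows):
    """Stand-in for the real aggregation logic (assumed here: a simple sum)."""
    return sum(rows)


# Registry of versioned transformations, keyed by (name, version).
TRANSFORMS = {("Aggregate", "v1.0"): aggregate_v1_0}


def run(spec):
    """Recompute an output purely from what the spec records."""
    rows = [value for name in spec.deltas for value in RAW[name]]
    return TRANSFORMS[(spec.transform, spec.version)](rows)


first_report = RunSpec("Aggregate", "v1.0", ("delta1", "delta2", "delta3"))
later_report = RunSpec("Aggregate", "v1.0", tuple(f"delta{i}" for i in range(1, 8)))
```

Because `first_report` pins its inputs to delta1..delta3, re-running it after delta4..delta7 arrive still yields the original result, while `later_report` describes the newer output over delta1..delta7.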
Traceability
Raw Data (delta1, delta2, delta3) -> Clean v1.0 -> Cleaned Data (delta1..delta3) -> Aggregate v1.0 -> Reporting Data (delta1..delta3)
Raw Data (delta1, ..., delta7) -> Clean v1.1 -> Cleaned Data (delta1..delta7) -> Aggregate v1.0 -> Reporting Data (delta1..delta7)
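The traceability slide chains two versioned steps (Clean, then Aggregate) between raw and reporting data. A minimal sketch of walking such a chain backwards is shown below. The `LINEAGE` dictionary and the dataset naming scheme (`reporting[...]`, `cleaned[...]`, `raw[...]`) are hypothetical; only the step names, versions, and delta ranges come from the slide.

```python
# Hypothetical lineage records: output dataset -> (transform, version, input dataset).
LINEAGE = {
    "reporting[delta1..delta3]": ("Aggregate", "v1.0", "cleaned[delta1..delta3]"),
    "cleaned[delta1..delta3]": ("Clean", "v1.0", "raw[delta1..delta3]"),
    "reporting[delta1..delta7]": ("Aggregate", "v1.0", "cleaned[delta1..delta7]"),
    "cleaned[delta1..delta7]": ("Clean", "v1.1", "raw[delta1..delta7]"),
}


def trace(dataset):
    """Walk upstream from a dataset and list each step that produced it."""
    steps = []
    while dataset in LINEAGE:
        transform, version, source = LINEAGE[dataset]
        steps.append(f"{transform} {version} <- {source}")
        dataset = source
    return steps
```

Calling `trace("reporting[delta1..delta7]")` shows that the newer report depends on Clean v1.1, even though the Aggregate step itself is still at v1.0.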
Data Lineage
Access Logs
Process Logs
2015-06-04 12:53:31,601 INFO [main] parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1702)) - Get metadata for subqueries
17865102 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for destination tables
2015-06-04 12:53:31,601 INFO [main] parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1726)) - Get metadata for destination tables
17865345 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed getting MetaData in Semantic Analysis
2015-06-04 12:53:31,844 INFO [main] parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(10004)) - Completed getting MetaData in Semantic Analysis
17865347 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Not invoking CBO because the statement has too few joins
2015-06-04 12:53:31,846 INFO [main] parse.SemanticAnalyzer (SemanticAnalyzer.java:canHandleAstForCbo(10258)) - Not invoking CBO because the statement has too few joins
Heart beat
17866695 [main] ERROR org.apache.hadoop.hive.ql.Driver - FAILED: SemanticException [Error 10044]: Line 2:18 Cannot insert into target table because column number/types are different ''2015-06-04-07-50'': Table insclause-0 has 165 columns, but query has 166 columns.
org.apache.hadoop.hive.ql.parse.SemanticException: Line 2:18 Cannot insert into target table because column number/types are different ''2015-06-04-07-50'': Table insclause-0 has 165 columns, but query has 166 columns.
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genConversionSelectOperator(SemanticAnalyzer.java:6535)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFileSinkPlan(SemanticAnalyzer.java:6336)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:8977)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8868)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9713)
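The failure in the log above is a column-count mismatch: the target table has 165 columns while the query produces 166. A minimal sketch of a pre-flight check that would surface this before a job is submitted is shown below. The function name `check_insert_compatible` and the generated column lists are hypothetical, and the check covers only the column count, not the type comparison that Hive also performs.

```python
def check_insert_compatible(target_columns, query_columns):
    """Raise ValueError if the query yields a different number of columns
    than the target table defines. Types are not compared in this sketch."""
    if len(target_columns) != len(query_columns):
        raise ValueError(
            f"target table has {len(target_columns)} columns, "
            f"but query has {len(query_columns)} columns"
        )


# Hypothetical column lists sized to match the counts in the log.
target = [f"c{i}" for i in range(165)]
query = [f"c{i}" for i in range(166)]
```

Running `check_insert_compatible(target, query)` raises a `ValueError` that names both counts, which is easier to act on than locating the same message inside a long process log.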
Who is the dataset owner?
How can I get access?
What does the data mean?
How can I reproduce the results?
Where is the data I need?
HDFS/Hive Metastore Ranger
Metastore Falcon
Contacts
• Martin Ryzl (martin.ryzl@merck.com)
• Ivo Lasek (ivo.lasek@merck.com)
• http://www.merck.com/
• http://www.msdit.cz/
