Apache Atlas:
Tracking dataset lineage
across Hadoop components
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Disclaimer
This document may contain product features and technology directions that are under
development, may be under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache
Software Foundation project websites ("Apache"). Progress of the project capabilities can be
tracked from inception to release through Apache; however, technical feasibility, market
demand, user feedback and the overarching Apache Software Foundation community
development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a
contractual commitment, promise or obligation from Hortonworks to deliver these features in
any generally available product.
Product features and technology directions are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers
should not rely upon it when making purchasing decisions.
Speakers
Andrew Ahn
Governance Director
Product Management
Agenda
• Atlas Overview
• Near term roadmap
• Cross Component Lineage
• Questions
Apache Atlas Overview
STRUCTURED
UNSTRUCTURED
Vision - Enterprise Data Governance Across Platforms
TRADITIONAL
RDBMS
METADATA
MPP
APPLIANCES
Project 1
Project 5
Project 4
Project 3
Metadata
Project 6
DATA
LAKE
GOAL: Provide a common approach to data
governance across all systems and data within the
enterprise
Transparent
Governance standards and protocols must be clearly
defined and available to all
Reproducible
Recreate the relevant data landscape at a point in time
Auditable
All relevant events and assets must be traceable with
appropriate historical lineage
Consistent
Compliance practices must be consistent
Ready for Trusted Governance
OPERATIONS SECURITY
GOVERNANCE
STORAGE
STORAGE
Machine
Learning
Batch
Streaming
Interactive
Search
GOVERNANCE
YARN
DATA OPERATING SYSTEM
Data Management
along the entire data lifecycle with integrated
provenance and lineage capability
Modeling with Metadata
enables comprehensive data lineage through
a hybrid approach with enhanced tagging
and attribute capabilities
Interoperable Solutions
across the Hadoop ecosystem, through a
common metadata store
DGI* Community becomes Apache Atlas
• Dec 2014: DGI group kickoff
• Feb 2015: Prototype built
• May 2015: Apache Atlas incubation
• July 2015: HDP 2.3 Foundation GA release
First kickoff to GA in 7 months
Global Financial
Company
* DGI: Data Governance Initiative
Faster & Safer
Co-Development driven
by customer use cases
Apache Atlas: Metadata Services
• Cross-component dataset
lineage. Centralized location for
all metadata inside HDP
• Single Interface point for
Metadata Exchange with
platforms outside of HDP
• Business Taxonomy based
classification. Conceptual,
Logical and Technical
Apache Atlas
Hive
Ranger
Falcon
Sqoop
Storm
Kafka
Spark
NiFi
Big Data Management Through Metadata
Management Scalability
Many traditional tools and patterns do not scale when applied to multi-tenant data lakes.
Many enterprises have siloed data and metadata stores that collide in the data lake. This is
compounded by the ability to have very large time windows (years). Can traditional EDW tools
manage 100 million entities effectively with room to grow?
Metadata Tools
Scalable, decoupled, decentralized management driven through metadata is the only viable solution.
This allows quick integration with automation and other metamodels.
Tags for Management, Discovery and Security
Proper metadata is the foundation for business taxonomy, stewardship, attribute based
security and self-service.
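To make the idea of tag-driven, attribute-based security concrete, here is a minimal sketch in Python. It is not the Atlas or Ranger API: the function `is_allowed`, the tag names and the group names are all hypothetical, and the rule "a tag with no policy imposes no restriction" is an assumption made for the example.

```python
def is_allowed(user_groups, asset_tags, policies):
    """Return False if any tag on the asset is restricted to other groups.

    `policies` maps a tag name to the set of groups allowed to read assets
    carrying that tag. Tags without a policy impose no restriction
    (an assumption of this sketch).
    """
    for tag in asset_tags:
        allowed_groups = policies.get(tag)
        if allowed_groups is not None and not (set(user_groups) & allowed_groups):
            return False
    return True


# Hypothetical tags attached to columns, and the groups each tag admits.
policies = {"PII": {"compliance"}, "FINANCE": {"finance", "compliance"}}
column_tags = {
    "sales.customers.ssn": {"PII"},
    "sales.orders.total": {"FINANCE"},
}

print(is_allowed({"finance"}, column_tags["sales.customers.ssn"], policies))  # False
print(is_allowed({"finance"}, column_tags["sales.orders.total"], policies))   # True
```

The point of the sketch is that the policy refers only to tags, so tagging a new column as PII restricts it without writing a new per-column rule.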
Apache Atlas High Level Architecture
Type System
Repository
Search DSL
Bridge
Hive Storm
Falcon Others
REST API
Graph DB
Search
Kafka
Sqoop
Connectors
Messaging Framework
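As a rough illustration of how the Type System, Repository and Search boxes in this diagram could relate to each other, here is a small in-memory sketch in Python. It does not use the Atlas API or its graph database; `TypeDef`, `Repository`, `register_type`, `create_entity` and `search` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TypeDef:
    """Hypothetical type definition: a name plus its required attribute names."""
    name: str
    attributes: tuple


@dataclass
class Repository:
    """Toy stand-in for a metadata repository; not the Atlas implementation."""
    types: dict = field(default_factory=dict)
    entities: list = field(default_factory=list)

    def register_type(self, type_def):
        self.types[type_def.name] = type_def

    def create_entity(self, type_name, **attrs):
        # Validate the entity against its registered type before storing it.
        type_def = self.types[type_name]
        missing = [a for a in type_def.attributes if a not in attrs]
        if missing:
            raise ValueError(f"missing attributes: {missing}")
        entity = {"type": type_name, **attrs}
        self.entities.append(entity)
        return entity

    def search(self, type_name, **filters):
        # Return entities of the given type whose attributes match all filters.
        return [
            e for e in self.entities
            if e["type"] == type_name
            and all(e.get(k) == v for k, v in filters.items())
        ]


repo = Repository()
repo.register_type(TypeDef("hive_table", ("name", "db")))
repo.create_entity("hive_table", name="customers", db="sales")
repo.create_entity("hive_table", name="orders", db="sales")
print(repo.search("hive_table", name="orders"))
```

In this sketch the type system only checks required attributes; a bridge for Hive, Storm or Falcon would be whatever code calls `create_entity` with metadata harvested from that component.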
Near Term Roadmap:
Summer 2016
Dynamic Access Policy Driven by metadata
Business Catalog
Breadcrumbs for
taxonomy context path
Contents at
taxonomy context
Hadoop
Cross Component
Data Lineage
Sqoop
Teradata
Connector
Apache
Kafka
Atlas: Tracks Metadata + Lineage in one place
Custom
Activity
Reporter
Metadata
Repository
RDBMS
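One way to picture lineage tracked "in one place" across components is a single graph in which every component records its input and output datasets. The Python sketch below assumes that model; `LineageGraph`, `record` and `upstream` are hypothetical names, and the dataset identifiers are invented for illustration rather than taken from Atlas.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage store: edges point from an output dataset back to its inputs."""

    def __init__(self):
        self._inputs = defaultdict(set)

    def record(self, source, target, process):
        # One edge per (source, process) pair that produced `target`.
        self._inputs[target].add((source, process))

    def upstream(self, dataset):
        """Return every dataset that feeds `dataset`, directly or indirectly."""
        seen = set()
        pending = deque([dataset])
        while pending:
            current = pending.popleft()
            for source, _process in self._inputs.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    pending.append(source)
        return seen


graph = LineageGraph()
# Each component reports into the same graph (all names are made up).
graph.record("rdbms:crm.customers", "hive:raw.customers", "sqoop import")
graph.record("kafka:clicks", "hive:raw.clicks", "storm topology")
graph.record("hive:raw.customers", "hive:mart.customer_activity", "hive query")
graph.record("hive:raw.clicks", "hive:mart.customer_activity", "hive query")

print(sorted(graph.upstream("hive:mart.customer_activity")))
```

Because Sqoop, Storm and Hive all write edges into the same structure, a single traversal from the final table reaches the RDBMS table and the Kafka topic it ultimately depends on.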
Technical and Logical Metadata Exchange
Knowledge
Store
Atlas
REST API
Structured
Unstructured
Files:
XML / JSON
3rd Party
Vendors
Custom
Reporter
Non-Hadoop
Hive Integration: Model for integration
Apache Atlas
Hive Bridge
(Client)
Hive Hook
(Post-execution)
REST API
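The post-execution hook pattern named on this slide can be sketched as a callback that runs after a query finishes and reports what was read and written. The Python below is an illustration under assumptions, not Hive's or Atlas's actual hook interface: `make_post_execution_hook`, the event fields and the `send` callable are hypothetical, and the transport (REST or otherwise) is deliberately left out.

```python
import time


def make_post_execution_hook(send):
    """Build a hook that reports a finished query through `send`.

    `send` is any callable that delivers one metadata event; in a real
    deployment it might wrap an HTTP client, which is not modeled here.
    """
    def hook(query_text, inputs, outputs, user, succeeded):
        if not succeeded:
            return None  # in this sketch only successful queries are reported
        event = {
            "process": "hive_query",
            "query": query_text,
            "inputs": sorted(inputs),
            "outputs": sorted(outputs),
            "user": user,
            "reported_at": time.time(),
        }
        send(event)
        return event
    return hook


received = []  # stand-in for the metadata service
hook = make_post_execution_hook(received.append)
hook(
    "INSERT INTO mart.daily_sales SELECT ... FROM raw.orders",
    inputs={"raw.orders"},
    outputs={"mart.daily_sales"},
    user="etl",
    succeeded=True,
)
print(received[0]["inputs"], "->", received[0]["outputs"])
```

Because the hook fires after execution, it can only record what happened; it cannot block a query the way a pre-execution authorization check could.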
HDF: Dataflow Governance Solution
Dataflow Security Use case Requirements
Accelerated Data Collection: An
integrated, data source agnostic
collection platform
Increased Security and
Unprecedented Chain of Custody:
Secure from source to storage with
high fidelity data provenance
The Internet of Any Thing (IoAT): A
Proven Platform for the Internet of
Things
http://hortonworks.com/hdf/
Enterprise Grade Governance Dataflow Solution
Filtered
Metadata
• HDP Taxonomy
• Centralized
Metadata
Repository
• Downstream HDP
Impacts
• Cross component
lineage
• 3rd Party
integration
• Guaranteed
Delivery
• Data Buffering
• Prioritized
Queueing
• Flow specific QoS
• Visual Command
& Control
Months
Lineage
Years
Lineage
Reference
Taxonomy
(Tags)
Event level
versus Dataset
level
HDF - NiFi
Operation
Control
Maximum
Fidelity
Event Level
HDP – Atlas
Governance
Management
Medium / Low
Fidelity
Dataset Level
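The contrast between event-level and dataset-level lineage can be illustrated by collapsing many fine-grained provenance records into a few dataset-to-dataset edges. The Python sketch below assumes a simple record shape with `source` and `target` fields; that shape, the function name `rollup_to_dataset_level` and the sample identifiers are hypothetical and do not reflect the NiFi or Atlas data models.

```python
from collections import Counter


def rollup_to_dataset_level(events):
    """Collapse event-level provenance records into dataset-level edges.

    Returns a dict mapping (source, target) to the number of events seen,
    so the per-event detail is dropped and only the edge survives.
    """
    edges = Counter()
    for event in events:
        edges[(event["source"], event["target"])] += 1
    return dict(edges)


# Hypothetical event-level records, one per item moved by a dataflow.
events = [
    {"event_id": 1, "source": "sensor-feed", "target": "hdfs:/landing/sensors"},
    {"event_id": 2, "source": "sensor-feed", "target": "hdfs:/landing/sensors"},
    {"event_id": 3, "source": "web-logs", "target": "hdfs:/landing/logs"},
]
print(rollup_to_dataset_level(events))
```

Three events become two edges here; at production volumes the same reduction is what makes it plausible to keep dataset-level lineage for much longer than event-level detail.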
Demo
• Tutorial
• Atlas Tour
• Sqoop Lineage
• Kafka / Storm Lineage
Availability:
- Tech Preview VMs: May 2016
- GA Release: Summer 2016
Questions?
Reference
Online Resources
VM: https://s3.amazonaws.com/demo-drops.hortonworks.com/HDP-Atlas-Ranger-TP.ova (download the public preview VM)
Tutorial: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview
Blog: http://hwxjojo.wpengine.com/blog/the-next-generation-of-hadoop-based-security-data-governance/ (this link is currently returning an error)
Learn More: http://hortonworks.com/solutions/atlas-ranger-integration/
Thank You

Editor's Notes

  • #2: TALK TRACK Data is powering successful clinical care and successful operations. [NEXT SLIDE]
  • #8: TALK TRACK Open Enterprise Hadoop enables trusted governance, with data lifecycle management along the entire lifecycle, modeling with metadata, and interoperable solutions that can access a common metadata store. [NEXT SLIDE]
    SUPPORTING DETAIL: Trusted Governance
    Why this matters to our customers: As data accumulates in an HDP cluster, the enterprise needs governance policies to control how that data is ingested, transformed and eventually retired. This keeps those Big Data assets from turning into big liabilities that you can’t control.
    Proof point: HDP includes 100% open source Apache Atlas and Apache Falcon for centralized data governance coordinated by YARN. These data governance engines provide mature data management and metadata modeling capabilities, and they are constantly strengthened by members of the Data Governance Initiative. The Data Governance Initiative (DGI) is working to develop an extensible foundation that addresses enterprise requirements for comprehensive data governance. The DGI coalition includes Hortonworks partner SAS and customers Merck, Target, Aetna and Schlumberger. Together, we assure that Hadoop snaps into existing frameworks to openly exchange metadata, and addresses enterprise data governance requirements within its own stack of technologies.
    Citation: “As customers are moving Hadoop into corporate data and processing environments, metadata and data governance are much needed capabilities. SAS participation in this initiative strengthens the integration of SAS data management, analytics and visualization into the HDP environment and more broadly it helps advance the Apache Hadoop project. This additional integration will give customers better ability to manage big data governance within the Hadoop framework,” said SAS Vice President of Product Management Randy Guard. | http://hortonworks.com/press-releases/hortonworks-establishes-data-governance-initiative/
  • #9: How fast? 7 months!
  • #12: Apache Atlas is the only open source project created to solve the governance challenge in the open. The founding members of the project include all the members of the Data Governance Initiative and others from the Hadoop community. The core functionality defined by the project includes the following:
    Data Classification: create an understanding of the data within Hadoop and provide a classification of this data to external and internal sources
    Centralized Auditing: provide a framework to capture and report on access to and modifications of data within Hadoop
    Search & Lineage: allow pre-defined and ad hoc exploration of data and metadata while maintaining a history of how a data source or explicit data was constructed
    Security and Policy Engine: implement engines to protect and rationalize data access according to compliance policy
  • #17: Show – clearly identify customer metadata. Change Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into hadoop – keep it together
  • #18: Show – clearly identify customer metadata. Change Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into hadoop – keep it together
  • #19: Specify Metrics – Time / Success /user /etc… Contrast with Ranger plug-in – pre execute
  • #23: Show – clearly identify customer metadata. Change Add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into Hadoop – keep it together
  • #32: The Data Governance Framework will enable Freddie Mac to design Data Index tool from the ground up for scalability, security and reliability