1© Cloudera, Inc. All rights reserved.
Introducing RecordService
Lenni Kuff
RecordService is a distributed,
scalable, data access service for
unified authorization in Hadoop.
Motivation
• As the Hadoop ecosystem expands, new components continue to be added
• Speaks to the overall flexibility of Hadoop
• This is good - more functionality, more workloads, more use cases.
• As use cases for Hadoop mature, user requirements and expectations increase:
• Security
• Performance
• Compatibility
• The flexibility of Hadoop has come at the cost of increased complexity
[Diagrams: Hadoop storage and compute layers]
Example: Security
Challenge: Provide unified fine-grained security across compute frameworks
• Integrating a consistent security layer into every component is not scalable.
• Securing data at the file level precludes fine-grained access control (column/row)
• File ACLs are not enough - a user can view all or nothing.
• Currently, must split files and duplicate data – large operational cost.
Solution: Add a level of abstraction - a secure service to access datasets in “record” format
• Can now apply fine-grained constraints on the projection of a dataset
• Same access control policy can be applied uniformly across compute frameworks; decoupled from the underlying storage layer
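Constraining a projection can be sketched as below. This is a minimal illustration under stated assumptions, not RecordService code: the class `ProjectionPolicy`, the user names, and the column names are all hypothetical, and a real deployment would read grants from Sentry rather than from an in-memory map.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one central policy that records which columns each user may read.
final class ProjectionPolicy {
    private final Map<String, Set<String>> allowedColumns;

    ProjectionPolicy(Map<String, Set<String>> allowedColumns) {
        this.allowedColumns = allowedColumns;
    }

    // Returns true only if every requested column has been granted to the user.
    boolean authorize(String user, List<String> requestedColumns) {
        Set<String> granted = allowedColumns.getOrDefault(user, Set.of());
        return granted.containsAll(requestedColumns);
    }

    public static void main(String[] args) {
        ProjectionPolicy policy = new ProjectionPolicy(Map.of(
            "analyst", Set.of("name", "region"),
            "auditor", Set.of("name", "region", "balance")));
        System.out.println(policy.authorize("analyst", List.of("name", "region"))); // true
        System.out.println(policy.authorize("analyst", List.of("balance")));        // false
        System.out.println(policy.authorize("unknown", List.of("name")));           // false
    }
}
```

Because the check depends only on the user and the requested columns, the same policy object could serve any compute framework that issues the request.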
RecordService - Overview
• Simplifies
• Provides a higher-level, logical abstraction for data (i.e., tables or views)
• Returns schemed objects (instead of paths and bytes). No need for applications
to worry about storage APIs and file formats.
• HCatalog? Similar concept - RecordService is secure, performant. Plan to
support HCatalog as a data model on RecordService.
• Secures
• Central location for all authorization checks using Sentry metadata.
• Secure service that does not execute arbitrary user code
• Accelerates
• Unified data access path allows platform-wide performance improvements.
Architecture
• Runs as a distributed service: Planner Servers & Worker Servers
• Servers do not store any state
• Easy HA, fault tolerance.
• Planner Servers are responsible for request planning
• Retrieve and combine metadata from the NameNode (NN), Hive Metastore (HMS), and Sentry
• Split generation -> creates tasks for workers
• Perform authorization
• Worker Servers read from storage and construct records.
• IO, file parsing, predicate evaluation
• Run as the “source” for a DAG computation
Architecture – Server APIs
• Planner and Worker services expose Thrift APIs
• PlanRequest(), Exec(), Fetch()
• PlanRequest()
• Accepts SQL to specify the request: supports SELECT and PROJECT
• Access to tables and views stored in HMS
• Does not run operators that require data exchange; “map only”
• Generates a list of tasks which contain the request, each with locality information
• Exec()/Fetch()
• Returns records in a canonical, optimized columnar format.
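The plan, exec, fetch flow above can be sketched with an in-memory stand-in. Everything here is hypothetical: `PlanExecFetchSketch`, `Task`, `planRequest`, and `execAndFetch` are illustrative names and not the Thrift-generated classes, and the "table" is a fixed list of strings. The sketch only shows the shape of the interaction: one planning call yields several tasks, each task carries the request and a preferred host, and each task is executed independently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the plan -> exec -> fetch flow.
final class PlanExecFetchSketch {
    // A task carries the request, the slice of rows it covers, and a preferred host.
    static final class Task {
        final String request;
        final int startRow;
        final int endRow;
        final String preferredHost;

        Task(String request, int startRow, int endRow, String preferredHost) {
            this.request = request;
            this.startRow = startRow;
            this.endRow = endRow;
            this.preferredHost = preferredHost;
        }
    }

    // Stand-in for the stored table.
    static final List<String> ROWS = List.of("a", "b", "c", "d", "e");

    // "Planner": split the request into tasks of at most rowsPerTask rows each.
    static List<Task> planRequest(String sql, int rowsPerTask) {
        List<Task> tasks = new ArrayList<>();
        for (int start = 0; start < ROWS.size(); start += rowsPerTask) {
            int end = Math.min(start + rowsPerTask, ROWS.size());
            tasks.add(new Task(sql, start, end, "host-" + tasks.size()));
        }
        return tasks;
    }

    // "Worker": execute one task and return all of its records in a single batch.
    static List<String> execAndFetch(Task task) {
        return ROWS.subList(task.startRow, task.endRow);
    }

    public static void main(String[] args) {
        List<String> all = new ArrayList<>();
        for (Task t : planRequest("SELECT col FROM tbl", 2)) {
            all.addAll(execAndFetch(t));
        }
        System.out.println(all); // [a, b, c, d, e]
    }
}
```

No task depends on another, which mirrors the “map only” restriction: there is no data exchange between workers.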
Architecture – Fault tolerance
• Cluster state persisted in ZooKeeper (ZK)
• Membership, delegation tokens, secret keys
• Servers do not communicate with each other directly => scalability
• Planner services
• Expected to run a few (e.g., 3) for HA
• Fault tolerance handled with clients getting a list of planners and failing over
• Plan requests are short
• Worker services
• Expected to run on each node in the cluster that holds data
• Fault tolerance handled by the framework (e.g. MR) rescheduling the task
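Client-side planner failover could look roughly like the sketch below. It assumes only what the slide states, that the client holds a list of planners and moves on when one fails. The class `PlannerFailover`, the method `callWithFailover`, and the host names are hypothetical, and a plain `Function` stands in for the actual RPC.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of client-side failover across a list of planners.
final class PlannerFailover {
    // Tries each planner in order and returns the first successful result.
    static <T> T callWithFailover(List<String> planners, Function<String, T> call) {
        RuntimeException last = null;
        for (String planner : planners) {
            try {
                return call.apply(planner);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next planner
            }
        }
        throw new IllegalStateException("no planner reachable", last);
    }

    public static void main(String[] args) {
        List<String> planners = List.of("planner-1", "planner-2", "planner-3");
        String result = callWithFailover(planners, host -> {
            if (host.equals("planner-1")) {
                throw new RuntimeException("connection refused"); // simulated outage
            }
            return "planned by " + host;
        });
        System.out.println(result); // planned by planner-2
    }
}
```

Retrying the whole call is reasonable here because plan requests are short and the servers hold no state.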
Architecture – Security
• Authentication using Kerberos and delegation tokens
• Planner authorizes request using metadata in Sentry
• Column level ACLs
• Row level ACLs – create a view with a predicate
• Masking – create a view with the masking function in the select list
• Tasks generated by the planner are signed with a shared key
• Worker runs generated tasks.
• Does not authorize, relies on signed tasks
• Runs as user with full access to data, does not run user code
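The signed-task handoff can be illustrated as follows. The slides say only that tasks are "signed with a shared key"; using HMAC-SHA256 is an assumption made for this sketch, as are the class name `TaskSigner` and the task encoding. The `javax.crypto.Mac` and `java.security.MessageDigest` calls are standard JDK APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: the planner signs a serialized task, the worker verifies it.
final class TaskSigner {
    private final SecretKeySpec key;

    TaskSigner(byte[] sharedSecret) {
        this.key = new SecretKeySpec(sharedSecret, "HmacSHA256");
    }

    // Planner side: compute an HMAC over the task bytes.
    byte[] sign(byte[] task) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            return mac.doFinal(task);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Worker side: recompute the HMAC and compare it in constant time.
    boolean verify(byte[] task, byte[] signature) {
        return MessageDigest.isEqual(sign(task), signature);
    }

    public static void main(String[] args) {
        TaskSigner signer = new TaskSigner("shared-secret".getBytes(StandardCharsets.UTF_8));
        byte[] task = "SELECT name FROM v; split 0".getBytes(StandardCharsets.UTF_8);
        byte[] signature = signer.sign(task);

        byte[] tampered = "SELECT ccn FROM data; split 0".getBytes(StandardCharsets.UTF_8);
        System.out.println(signer.verify(task, signature));     // true
        System.out.println(signer.verify(tampered, signature)); // false
    }
}
```

With this arrangement the worker never consults the policy store: a valid signature is its evidence that the planner already authorized the request.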
Architecture – Security example
CREATE VIEW v AS
SELECT mask(credit_card_number) AS ccn,
       name, balance, region
FROM data WHERE region = 'Europe';
1. Restrict access to the data set: disable access to the ‘data’ table and its underlying files in HDFS.
2. Grant access through the view v
3. Set column-level permissions on v per user if necessary
Write path (ingest) unchanged. Job expected to run as privileged user.
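What a reader of the view v would see can be sketched in plain Java. This is not RecordService code: `ViewSketch` and `Row` are hypothetical, and the masking rule (hide all but the last four characters) is an assumption, since the slide does not define `mask`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the rows and columns exposed by the view v.
final class ViewSketch {
    static final class Row {
        final String ccn;
        final String name;
        final double balance;
        final String region;

        Row(String ccn, String name, double balance, String region) {
            this.ccn = ccn;
            this.name = name;
            this.balance = balance;
            this.region = region;
        }
    }

    // Assumed masking rule: hide everything except the last four characters.
    static String mask(String value) {
        int keep = Math.min(4, value.length());
        int hidden = value.length() - keep;
        return "X".repeat(hidden) + value.substring(hidden);
    }

    // Row-level filter (region = 'Europe') plus column masking, as in the view definition.
    static List<Row> view(List<Row> data) {
        return data.stream()
            .filter(r -> r.region.equals("Europe"))
            .map(r -> new Row(mask(r.ccn), r.name, r.balance, r.region))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> data = List.of(
            new Row("4111111111111111", "Ana", 120.0, "Europe"),
            new Row("5500000000000004", "Bo", 75.5, "Asia"));
        for (Row r : view(data)) {
            System.out.println(r.ccn + " " + r.name + " " + r.region); // XXXXXXXXXXXX1111 Ana Europe
        }
    }
}
```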
Client APIs – Integration with ecosystem
• Similar APIs designed to integrate with MapReduce and Spark
• Client APIs make things simpler. Clients don’t need to:
• Interact with HMS
• Care about the underlying storage format: the worker always returns records in a canonical format.
• Handle storage engine details (e.g. S3)
Client Integration APIs
• Drop-in replacements for common existing InputFormats
• Text, Avro
• Can be used with Spark as well
• Spark SQL: integration with the Data Sources API
• Predicate pushdown, projection
• Migration should be easy
MR Example
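// Before: read Avro files directly from an input path.
// FileInputFormat.setInputPaths(job, new Path(args[0]));
// job.setInputFormatClass(AvroKeyInputFormat.class);

// After: name a table instead of a path and use the RecordService InputFormat.
RecordServiceConfig.setInputTable(configuration, null, args[0]);
job.setInputFormatClass(
    com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);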
Spark Example
// Comment out one or the other
val file = sc.recordServiceTextFile(path)
//val file = sc.textFile(path)
Spark SQL Example
ctx.sql(s"""
|CREATE TEMPORARY TABLE $tbl
|USING com.cloudera.recordservice.spark.DefaultSource
|OPTIONS (
| RecordServiceTable '$db.$tbl',
| RecordServiceTableSize '$size'
|)
""".stripMargin)
Performance
• Shares some core components with Impala
• IO management, optimized C++ code, runtime code generation, use of low-level storage APIs
• Highly efficient implementation of the scan functionality
• Optimized columnar on-wire format
• Inspired by Apache Parquet
• Accelerates performance for many workloads
TeraSort
• Close to a worst-case scenario. Minimal schema: a single STRING column
• Custom RecordServiceTeraInputFormat (similar to TeraInputFormat)
• 78-node cluster (12 cores/24 Hyper-Threaded, 12 disks)
• Ran at 1 billion, 50 billion and 1 trillion (~100TB) scales
• See the GitHub repo for more details and runnable examples.
TeraChecksum
[Chart: normalized job time, without RecordService vs. with RecordService. Categories: 1B (MapReduce), 50B (MapReduce), 1T (MapReduce), 1B (Spark), 50B (Spark), 1T (Spark). Data labels: 1, 0.48, 0.23, 1.03, 0.8, 0.85.]
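To read the TeraChecksum data labels as speedups, the arithmetic below may help. It assumes, and the slide does not confirm, that each label is the job time with RecordService relative to a baseline of 1.0 without it. The class `NormalizedTime` and its method names are illustrative.

```java
import java.util.Locale;

// Hypothetical helper for interpreting normalized job times (baseline = 1.0).
final class NormalizedTime {
    // A normalized time t means the job ran 1 / t times as fast as the baseline.
    static double speedup(double normalizedTime) {
        return 1.0 / normalizedTime;
    }

    // Percentage of the baseline run time that was saved (negative if slower).
    static double percentReduction(double normalizedTime) {
        return (1.0 - normalizedTime) * 100.0;
    }

    public static void main(String[] args) {
        double[] labels = {1, 0.48, 0.23, 1.03, 0.8, 0.85}; // data labels from the chart
        for (double t : labels) {
            System.out.println(String.format(Locale.ROOT,
                "normalized %.2f -> speedup %.2fx, time saved %.0f%%",
                t, speedup(t), percentReduction(t)));
        }
    }
}
```

Under that assumption, a label below 1.0 is a gain and a label above 1.0, such as 1.03, is a small slowdown.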
Spark SQL
• Represents a more typical use case
• Data is fully schemed
• TPC-DS
• 500GB scale factor, on Parquet
• Cluster
• 5-node cluster
Spark SQL
[Chart: TPCDS, SparkSQL vs. SparkSQL with RecordService.]
~15% improvement in query times; queries are not scan bound
Spark SQL
[Chart: SparkSQL vs. SparkSQL with RecordService on two queries, 2% Selective Scan and Sum(col). Data labels: 29.5, 31, 14, 23.5.]
State of the project
• Available in v0.2 beta:
• Integration with Spark, MR, Pig (via HCatalog)
• Planner HA
• Apache 2.0 Licensed
• Sentry Column-Level Privilege Support
• Mini Roadmap:
• Improved multi-tenancy
• Complex types
• More InputFormat support / integration options
• Intend to donate to Apache Software Foundation
Conclusion
• RecordService provides a schemed data access service for Hadoop
• Logical data access instead of physical
• Much more powerful abstraction
• Demonstrated security enforcement, improved performance
• Simpler: clients don’t need to worry about low-level details such as storage APIs and file formats
• Opens the door for future improvements
Contributing!
• Mailing list: recordservice-user@googlegroups.com
• Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta
• Contributions: http://github.com/cloudera/RecordServiceClient/
• Documentation: http://cloudera.github.io/RecordServiceClient/
• Bug Reporting: https://issues.cloudera.org/projects/RS
• Beta Download: http://www.cloudera.com/downloads/beta/record-service/0-2-0.html
Thank you

Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path for Compute Frameworks

Editor's Notes

  • #3: In this talk we will be introducing RecordService … In short, RecordService is a highly scalable, distributed data access service for Hadoop that provides unified authorization while also simplifying the platform.
  • #4: Before digging into the details of RecordService, let’s take a step back and look at the current state of the Hadoop ecosystem. What we have seen is more components being added to the stack at an accelerating rate.
  • #8: RecordService (RS) provides a layer of abstraction over storage, so compute frameworks don’t need to care where data is stored. It provides a platform for uniform, fine-grained security across all compute engines. It helps to simplify Hadoop through a unified data access path.
  • #9: Single place for performance enhancements