MapReduce over Tahoe
Aaron Cordova
Associate, Booz Allen Hamilton Inc.
134 National Business Parkway
Annapolis Junction, MD 20701
cordova_aaron@bah.com
New York, Oct 1, 2009

Hadoop World NYC 2009
MapReduce over Tahoe
  Impact of data security requirements on large scale analysis

  Introduction to Tahoe

  Integrating Tahoe with Hadoop’s MapReduce

  Deployment scenarios, considerations

  Test results




Features of Large Scale Analysis
  As data grows, it becomes harder and more expensive to move
   – “Massive” data

  The more data sets are located together, the more valuable each is
   – Network Effect

  Bring computation to the data




Data Security and Large Scale Analysis
  Each department within an organization has its own data

  Some data need to be shared

  Others are protected
[Diagram: overlapping department data sets: CRM, Product Testing, Sales]

Data Security
  Because of security constraints, departments tend to set up their own data storage and processing systems independently

  This includes support staff

  Highly inefficient

  Analysis across datasets is impossible

[Diagram: four parallel stovepipes, each with its own Support, Storage, Processing, and Apps layers]




“Stovepipe Effect”




Tahoe - A Least Authority File System
  Release 1.5

  AllMyData.com

  Included in Ubuntu Karmic Koala

  Open Source




Tahoe Architecture
  Data originates at the client, which is trusted

  Client encrypts, segments, and erasure-codes data

  Segments are distributed to storage nodes over encrypted (SSL) links

  Storage nodes only see encrypted data, and are not trusted

[Diagram: a trusted Client connecting to untrusted Storage Servers over SSL]




Tahoe Architecture Features
  AES Encryption

  Segmentation

  Erasure-coding

  Distributed

  Flexible Access Control




Erasure Coding Overview

[Diagram: a file's k segments expanded into n distributed shares]
  Only k of n segments are needed to recover the file

  Up to n-k machines can fail, be compromised, or act maliciously without data loss

  n and k are configurable, and can be chosen to achieve desired availability

  Expansion factor of data is n/k (default is 10/3, or ~3.3)
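
The arithmetic above can be checked with a short sketch (the parameter names k, n, and the per-node failure probability p are illustrative; this is not Tahoe's implementation):

```python
from math import comb

def expansion_factor(k, n):
    """Stored bytes per original byte: k data segments become n shares."""
    return n / k

def max_failures(k, n):
    """Any n-k storage nodes can be lost without losing the file."""
    return n - k

def availability(k, n, p):
    """Probability the file is recoverable when each node independently
    fails with probability p: at least k of n shares must survive."""
    return sum(comb(n, i) * (1 - p) ** i * p ** (n - i) for i in range(k, n + 1))

# Tahoe's default parameters
k, n = 3, 10
print(expansion_factor(k, n))   # ~3.33
print(max_failures(k, n))       # 7
print(availability(k, n, 0.10))
```

With the defaults, even a 10% per-node failure rate leaves the file recoverable with probability better than 0.999999, which is why n and k can be tuned to a desired availability target.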



Flexible Access Control
  Each file has a Read Capability and a Write Capability

  These are decryption keys

  Directories have capabilities too

[Diagram: both File and Dir objects carry a ReadCap and a WriteCap]




Flexible Access Control
  Access to a subset of files can be done by:
   – creating a directory
   – attaching files
   – sharing read or write capabilities of the dir

  Any files or directories attached are accessible

  Any outside the directory are not

[Diagram: sharing a Dir's ReadCap exposes only the files and subdirectories attached to it]
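
As a rough illustration of this sharing model (a toy sketch, not Tahoe's actual capability format or protocol; Dir, open_dir, and the token scheme are invented for the example):

```python
import secrets

# Toy model: a capability is an unguessable token. Holding a directory's
# ReadCap grants access to everything attached under it, and nothing else.
class Dir:
    def __init__(self):
        self.read_cap = secrets.token_hex(16)  # stand-in for a real crypto capability
        self.entries = {}

    def attach(self, name, obj):
        self.entries[name] = obj

def open_dir(d, cap):
    """List a directory's contents, but only for holders of its ReadCap."""
    if cap != d.read_cap:
        raise PermissionError("unknown capability")
    return list(d.entries)

shared = Dir()
shared.attach("report.txt", b"...")
private = Dir()
private.attach("secret.txt", b"...")

# Sending `shared.read_cap` to another department exposes only the
# files attached to `shared`; `private` remains unreachable.
print(open_dir(shared, shared.read_cap))  # ['report.txt']
```

The point of the sketch: access is granted by handing over a token, not by editing server-side permission lists, which is what makes sharing a subset of files as cheap as creating a directory.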




Access Control Example

[Diagram: files attached under the /Sales and /Testing directories]

         Each department can access its own files

Access Control Example

[Diagram: a /New Products directory linked between /Sales and /Testing]

   Files that need to be shared can be linked to a new directory, whose read
                     capability is given to both departments

Hadoop Can Use The Following File Systems
  HDFS

  Cloud Store (KFS)

  Amazon S3

  FTP

  Read-only HTTP

  Now, Tahoe!




Hadoop File System Integration HowTo
  Step 1.
   – Locate your favorite file system’s API

  Step 2.
   – subclass FileSystem
   – found in /src/core/org/apache/hadoop/fs/FileSystem.java

  Step 3.
   – Add a property element to core-site.xml:
                          <property>
                            <name>fs.lafs.impl</name>
                            <value>your.class</value>
                          </property>

  Step 4.
   – Test using your favorite Infrastructure Service Provider




Hadoop Integration : MapReduce
  One Tahoe client is run on each machine that serves as a MapReduce worker

  On average, clients communicate with k storage servers

  Jobs are limited by aggregate network bandwidth

  MapReduce workers are trusted, storage nodes are not

[Diagram: Hadoop MapReduce workers, each running a Tahoe client, communicating with Storage Servers]




Hadoop-Tahoe Configuration
  Step 1. Start Tahoe

  Step 2. Create a new directory in Tahoe, note the WriteCap

  Step 3. Configure core-site.xml thus:
   – fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
   – lafs.rootcap: $WRITE_CAP
   – fs.default.name: lafs://localhost

  Step 4. Start MapReduce, but not HDFS
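
Put together, the resulting core-site.xml would look roughly like this ($WRITE_CAP stands for the WriteCap noted in Step 2; a sketch of the configuration, not verified against the hadoop-lafs plugin):

```xml
<configuration>
  <property>
    <name>fs.lafs.impl</name>
    <value>org.apache.hadoop.fs.lafs.LAFS</value>
  </property>
  <property>
    <!-- the WriteCap recorded in Step 2; placeholder value shown -->
    <name>lafs.rootcap</name>
    <value>$WRITE_CAP</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>lafs://localhost</value>
  </property>
</configuration>
```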




Deployment Scenario - Large Organization
  Within a datacenter, departments can run MapReduce jobs on discrete groups of compute nodes

  Each MapReduce job accesses a directory containing a subset of files

  Results are written back to the storage servers, encrypted

[Diagram: Sales and Audit groups of MapReduce workers / Tahoe clients sharing one set of Storage Servers]


Deployment Scenario - Community
  If a community uses a shared data center, different organizations can run discrete MapReduce jobs

  Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability

  Since the data are all co-located already, no data needs to be moved

[Diagram: FBI and Homeland Security groups of MapReduce workers / Tahoe clients sharing Storage Servers]


Deployment Scenario - Public Cloud Services
  Since storage nodes require no trust, they can be located at a remote location, e.g. within a cloud service provider’s datacenter

  MapReduce jobs can be run this way if bandwidth to the datacenter is adequate

[Diagram: local MapReduce workers / Tahoe clients using Storage Servers hosted by a Cloud Service Provider]


Deployment Scenario - Public Cloud Services
  For some users, everything could be run remotely in a service provider’s data center

  There are a few caveats and additional precautions in this scenario:

[Diagram: both MapReduce workers / Tahoe clients and Storage Servers inside the Cloud Service Provider]


Public Cloud Deployment Considerations
  Store configuration files in memory

  Encrypt / disable swap

  Encrypt spillover

  Must trust memory / hypervisor

  Must trust service provider disks




HDFS and Linux Disk Encryption Drawbacks
  At most one key per node - no support for flexible access control

  Decryption done at the storage node rather than at the client - still have to trust storage nodes




Tahoe and HDFS - Comparison

  Feature             HDFS               Tahoe
  Confidentiality     File Permissions   AES Encryption
  Integrity           Checksum           Merkle Hash Tree
  Availability        Replication       Erasure Coding
  Expansion Factor    3x                 3.3x (n/k)
  Self-Healing        Automatic          Automatic
  Load-balancing      Automatic          Planned
  Mutable Files       No                 Yes


Performance
  Tests run on ten nodes

  RandomWrite writes 1 GB per node

  WordCount done over randomly generated text

  Tahoe write speed is 10x slower than HDFS

  Read-intensive jobs are about the same

  Not so bad, since the most common data use case is write-once, read-many

[Chart: HDFS vs. Tahoe throughput for Random Write and Word Count]

Code
  Tahoe available from http://allmydata.org
   – Licensed under GPL 2 or TGPPL

  Integration code available at http://hadoop-lafs.googlecode.com
   – Licensed under Apache 2




