MapReduce over Tahoe
Aaron Cordova
Associate, Booz Allen Hamilton Inc.
134 National Business Parkway
Annapolis Junction, MD 20701
cordova_aaron@bah.com

New York, Oct 1, 2009




MapReduce over Tahoe
• Impact of data security requirements on large-scale analysis

• Introduction to Tahoe

• Integrating Tahoe with Hadoop’s MapReduce

• Deployment scenarios and considerations

• Test results




Features of Large Scale Analysis
• As data grows, it becomes harder and more expensive to move
  – “Massive” data

• The more data sets are located together, the more valuable each becomes
  – the network effect

• Bring computation to the data




Data Security and Large Scale Analysis
• Each department within an organization has its own data

• Some data need to be shared

• Others are protected

[Figure: overlapping departmental data sets – CRM, Product Testing, Sales]
Data Security
• Because of security constraints, departments tend to set up their own data storage and processing systems independently

• This includes support staff

• Highly inefficient

• Analysis across data sets is impossible

[Figure: four parallel stovepipes, each with its own Support, Storage, Processing, and Apps layers]
“Stovepipe Effect”




Tahoe - A Least Authority File System
• Current release: 1.5

• Developed by AllMyData.com

• Included in Ubuntu Karmic Koala

• Open Source




Tahoe Architecture
• Data originates at the client, which is trusted

• The client encrypts, segments, and erasure-codes the data

• Segments are distributed to storage nodes over encrypted (SSL) links

• Storage nodes see only encrypted data, and are not trusted

[Figure: a client connected over SSL to a set of storage servers]
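To make the order of operations concrete, here is a toy Java sketch of the client-side pipeline (illustrative only: Tahoe derives per-file keys and uses Reed–Solomon coding, neither of which is shown, and the key, IV, and segment size below are placeholders):

    // Toy sketch: encrypt first, then segment; the shares would then be
    // erasure-coded and distributed. This is NOT Tahoe's real format.
    import java.security.SecureRandom;
    import java.util.ArrayList;
    import java.util.List;
    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    public class ClientPipelineSketch {
      public static void main(String[] args) throws Exception {
        byte[] plaintext = "some departmental data".getBytes("UTF-8");

        // 1. Encrypt at the client; storage servers never see this key.
        byte[] keyBytes = new byte[16];
        new SecureRandom().nextBytes(keyBytes);
        Cipher aes = Cipher.getInstance("AES/CTR/NoPadding");
        aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(keyBytes, "AES"),
                 new IvParameterSpec(new byte[16]));
        byte[] ciphertext = aes.doFinal(plaintext);

        // 2. Segment the ciphertext into fixed-size pieces.
        int segSize = 8;
        List<byte[]> segments = new ArrayList<byte[]>();
        for (int off = 0; off < ciphertext.length; off += segSize) {
          int len = Math.min(segSize, ciphertext.length - off);
          byte[] seg = new byte[len];
          System.arraycopy(ciphertext, off, seg, 0, len);
          segments.add(seg);
        }

        // 3. Each segment would then be erasure-coded into n shares (any k of
        //    which suffice) and each share sent to a different storage server.
        System.out.println(segments.size() + " encrypted segments ready to code");
      }
    }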
Tahoe Architecture Features
• AES Encryption

• Segmentation

• Erasure-coding

• Distributed

• Flexible Access Control




Erasure Coding Overview

[Figure: a file's k segments expanded into n coded shares]

• Only k of the n shares are needed to recover the file

• Up to n-k machines can fail, be compromised, or act maliciously without data loss

• n and k are configurable, and can be chosen to achieve the desired availability

• The expansion factor of the data is n/k (the default is 10/3, or about 3.3x)



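To make the arithmetic concrete (using the default parameters above): with k = 3 and n = 10, a 3 GB file is encoded into ten 1 GB shares placed on ten different servers. Any 3 of those shares reconstruct the file, so up to 7 servers can fail or be compromised, and the total storage consumed is 10 x 1 GB, i.e. 10/3 ≈ 3.3 times the original size.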
Flexible Access Control
• Each file has a Read Capability (ReadCap) and a Write Capability (WriteCap)

• These capabilities are, in effect, decryption keys

• Directories have capabilities too

[Figure: a File and a Dir, each carrying its own ReadCap and WriteCap]
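For a sense of what these look like in practice (the prefixes below follow Tahoe's URI scheme; the key material is elided and should be treated as placeholder), a capability is a self-describing string that acts as a bearer token:

    URI:CHK:<key>:<hash>:3:10:<size>      (ReadCap for an immutable file)
    URI:DIR2:<writekey>:<fingerprint>     (WriteCap for a directory)
    URI:DIR2-RO:<readkey>:<fingerprint>   (ReadCap for the same directory)

Whoever holds the string has exactly the access it names; there is no separate access control list to administer.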
Flexible Access Control
• Access to a subset of files can be granted by:
  – creating a directory
  – attaching files to it
  – sharing the read or write capability of the directory

• Any files or directories attached are accessible

• Any outside the directory are not

[Figure: sharing a Dir's ReadCap exposes the File and Dir linked beneath it, but nothing outside]
Access Control Example

[Figure: each department's files linked under its own directory, /Sales and /Testing]

Each department can access its own files.
Access Control Example

[Figure: a third directory, /New Products, linked between /Sales and /Testing]

Files that need to be shared can be linked into a new directory (/New Products), whose read capability is given to both departments.

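A hedged sketch of how this linking looks with Tahoe's command-line tool (the command names come from the tahoe CLI; the aliases, paths, and capability string are placeholders, and exact syntax may vary by version):

    tahoe mkdir                                         # prints the new directory's WriteCap
    tahoe add-alias shared URI:DIR2:...                 # bind a local alias to that cap
    tahoe ln sales:q3-report.csv shared:q3-report.csv   # link the file; no copy is made

The new directory's ReadCap can then be handed to the other department out of band.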
Hadoop Can Use The Following File Systems
• HDFS

• CloudStore (KFS)

• Amazon S3

• FTP

• Read-only HTTP

• Now, Tahoe!




Hadoop File System Integration HowTo
• Step 1.
  – Locate your favorite file system's API

• Step 2.
  – subclass FileSystem (a minimal skeleton appears below)
  – found in /src/core/org/apache/hadoop/fs/FileSystem.java

• Step 3.
  – Add lines to core-site.xml:
          <name> fs.lafs.impl </name>
          <value> your.class </value>

• Step 4.
  – Test using your favorite infrastructure service provider




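A minimal skeleton of Step 2, assuming the Hadoop 0.20-era FileSystem API (the class and package names mirror the real integration's org.apache.hadoop.fs.lafs.LAFS, but every method body here is an illustrative stub, not the hadoop-lafs implementation):

    package org.apache.hadoop.fs.lafs;

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.util.Progressable;

    public class LAFS extends FileSystem {
      private URI uri;
      private Path workingDir = new Path("/");

      @Override
      public void initialize(URI name, Configuration conf) throws IOException {
        super.initialize(name, conf);
        uri = name;
        // A real implementation would read lafs.rootcap from conf and
        // connect to the local Tahoe client here.
      }

      @Override public URI getUri() { return uri; }
      @Override public Path getWorkingDirectory() { return workingDir; }
      @Override public void setWorkingDirectory(Path dir) { workingDir = dir; }

      @Override
      public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        throw new IOException("stub: fetch the file through the Tahoe client");
      }

      @Override
      public FSDataOutputStream create(Path f, FsPermission permission,
          boolean overwrite, int bufferSize, short replication, long blockSize,
          Progressable progress) throws IOException {
        throw new IOException("stub: upload through the Tahoe client");
      }

      @Override
      public FSDataOutputStream append(Path f, int bufferSize,
          Progressable progress) throws IOException {
        throw new IOException("append is not supported in this sketch");
      }

      @Override
      public boolean rename(Path src, Path dst) throws IOException {
        throw new IOException("stub: relink the capability under a new name");
      }

      @Override
      public boolean delete(Path f, boolean recursive) throws IOException {
        throw new IOException("stub: unlink the capability from its directory");
      }

      /** @deprecated kept for the 0.20-era abstract method. */
      @Deprecated
      @Override
      public boolean delete(Path f) throws IOException {
        return delete(f, true);
      }

      @Override
      public FileStatus[] listStatus(Path f) throws IOException {
        throw new IOException("stub: list the Tahoe directory");
      }

      @Override
      public boolean mkdirs(Path f, FsPermission permission) throws IOException {
        throw new IOException("stub: create a Tahoe directory");
      }

      @Override
      public FileStatus getFileStatus(Path f) throws IOException {
        throw new IOException("stub: stat the file via the Tahoe client");
      }
    }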
Hadoop Integration : MapReduce
• One Tahoe client runs on each machine that serves as a MapReduce worker

• On average, clients communicate with k storage servers

• Jobs are limited by aggregate network bandwidth

• MapReduce workers are trusted; storage nodes are not

[Figure: Hadoop MapReduce workers, each with a Tahoe client, talking to the storage servers]




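Under the hood, each worker presumably talks to its co-located Tahoe client over Tahoe's standard HTTP web API (this transport detail is an assumption about the integration; the default gateway port 3456 and the /uri endpoint are from Tahoe's web API documentation):

    curl http://127.0.0.1:3456/uri/$READCAP        # fetch a file by capability
    curl -T part-00000 http://127.0.0.1:3456/uri   # upload; the new cap is returned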
Hadoop-Tahoe Configuration
• Step 1. Start Tahoe

• Step 2. Create a new directory in Tahoe, and note its WriteCap

• Step 3. Configure core-site.xml thus (spelled out in full below):
  – fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
  – lafs.rootcap: $WRITE_CAP
  – fs.default.name: lafs://localhost

• Step 4. Start MapReduce, but not HDFS




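The same settings as a complete core-site.xml ($WRITE_CAP stands in for the capability noted in Step 2; the property wrapping follows Hadoop's standard configuration format):

    <configuration>
      <property>
        <name>fs.lafs.impl</name>
        <value>org.apache.hadoop.fs.lafs.LAFS</value>
      </property>
      <property>
        <name>lafs.rootcap</name>
        <value>$WRITE_CAP</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>lafs://localhost</value>
      </property>
    </configuration>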
Deployment Scenario - Large Organization
• Within a datacenter, departments can run MapReduce jobs on discrete groups of compute nodes

• Each MapReduce job accesses a directory containing a subset of files

• Results are written back to the storage servers, encrypted

[Figure: separate groups of MapReduce workers / Tahoe clients (e.g. Sales, Audit) sharing one pool of storage servers]

Deployment Scenario - Community
• If a community uses a shared data center, different organizations can run discrete MapReduce jobs

• Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability

• Since the data are all co-located already, no data needs to be moved

[Figure: worker groups from different organizations (e.g. FBI, Homeland Security) sharing one set of storage servers]


Deployment Scenario - Public Cloud Services
• Since storage nodes require no trust, they can be located remotely, e.g. within a cloud service provider's datacenter

• MapReduce jobs can be run this way if bandwidth to the datacenter is adequate

[Figure: storage servers hosted by the cloud service provider; MapReduce workers / Tahoe clients remain on-premises]


Deployment Scenario - Public Cloud Services
• For some users, everything could be run remotely in a service provider's data center

• There are a few caveats and additional precautions in this scenario, covered on the next slide

[Figure: both the storage servers and the MapReduce workers / Tahoe clients run inside the cloud service provider]


Public Cloud Deployment Considerations
• Store configuration files in memory

• Encrypt or disable swap

• Encrypt spill-over files

• Must trust memory / hypervisor

• Trust service provider disks

[Figure: the cloud service provider hosts both storage and workers]




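As one concrete instance of these precautions (standard Linux tooling, nothing Tahoe- or Hadoop-specific), swap can simply be disabled on each worker so plaintext pages never reach the provider's disks:

    sudo swapoff -a    # disable all swap devices on this node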
HDFS and Linux Disk Encryption Drawbacks
• At most one key per node – no support for flexible access control

• Decryption is done at the storage node rather than at the client – you still have to trust the storage nodes




Tahoe and HDFS - Comparison

Feature            HDFS               Tahoe
-----------------  -----------------  -----------------
Confidentiality    File permissions   AES encryption
Integrity          Checksums          Merkle hash tree
Availability       Replication       Erasure coding
Expansion factor   3x                 3.3x (n/k)
Self-healing       Automatic          Automatic
Load-balancing     Automatic          Planned
Mutable files      No                 Yes

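Since the integrity rows differ, a minimal sketch of the Merkle-tree idea may help (illustrative only; Tahoe's actual trees hash erasure-coded share blocks with tagged SHA-256d hashes, which this toy omits):

    import java.security.MessageDigest;

    public class MerkleSketch {
      // The parent node is the hash of its two children's hashes concatenated.
      static byte[] parent(MessageDigest md, byte[] left, byte[] right) {
        md.reset();
        md.update(left);
        md.update(right);
        return md.digest();
      }

      public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");

        // Leaves: the hash of each data block.
        byte[][] leaves = new byte[4][];
        for (int i = 0; i < 4; i++) {
          leaves[i] = md.digest(("block" + i).getBytes("UTF-8"));
        }

        // Reduce pairwise to a single root. Verifying one block later needs
        // only log2(n) sibling hashes plus the trusted root, so a client can
        // check integrity without downloading the whole file.
        byte[] ab = parent(md, leaves[0], leaves[1]);
        byte[] cd = parent(md, leaves[2], leaves[3]);
        byte[] root = parent(md, ab, cd);
        System.out.println("Merkle root: " + root.length * 8 + " bits");
      }
    }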
Performance

• Tests were run on ten nodes

• RandomWrite writes 1 GB per node

• WordCount was run over randomly generated text

• Tahoe's write speed is about 10x slower than HDFS's

• Read-intensive jobs perform about the same

• Not so bad, since the most common data use case is write-once, read-many

[Chart: Random Write and Word Count results for HDFS vs. Tahoe; y-axis scale 0-200]

Code
• Tahoe available from http://allmydata.org
  – Licensed under GPL 2 or TGPPL

• Integration code available at http://hadoop-lafs.googlecode.com
  – Licensed under Apache 2




