MapReduce over Tahoe
Aaron Cordova
Associate, Booz Allen Hamilton Inc.
134 National Business Parkway
Annapolis Junction, MD 20701
cordova_aaron@bah.com
New York, Oct 1, 2009

Hadoop World NYC 2009
MapReduce over Tahoe
  Impact of data security requirements on large scale analysis

  Introduction to Tahoe

  Integrating Tahoe with Hadoop’s MapReduce

  Deployment scenarios, considerations

  Test results




Features of Large Scale Analysis
  As data grows, it becomes harder and more expensive to move
   – “Massive” data

  The more data sets are located together, the more valuable each is
   – Network Effect

  Bring computation to the data




Data Security and Large Scale Analysis
  Each department within an organization has its own data

  Some data need to be shared

  Others are protected
[Diagram: overlapping department data sets: CRM, Product Testing, Sales]

Data Security
  Because of security constraints, departments tend to set up their own data storage and processing systems independently

  This includes support staff

  Highly inefficient

  Analysis across datasets is impossible

[Diagram: four parallel stovepipes, each with its own Support, Storage, Processing, and Apps layers]




“Stovepipe Effect”




Tahoe - A Least Authority File System
  Release 1.5

  AllMyData.com

  Included in Ubuntu Karmic Koala

  Open Source




Tahoe Architecture
  Data originates at the client, which is trusted

  Client encrypts, segments, and erasure-codes data

  Segments are distributed to storage nodes over encrypted (SSL) links

  Storage nodes only see encrypted data, and are not trusted

[Diagram: a trusted Client connecting to untrusted Storage Servers over SSL]




Tahoe Architecture Features
  AES Encryption

  Segmentation

  Erasure-coding

  Distributed

  Flexible Access Control




Erasure Coding Overview

[Diagram: a file's k segments expanded into n distributed shares]
  Only k of n segments are needed to recover the file

  Up to n-k machines can fail, be compromised, or act maliciously without data loss

  n and k are configurable, and can be chosen to achieve desired availability

  Expansion factor of data is n/k (default is 10/3, or ~3.3)
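
The arithmetic above can be checked with a short sketch (the parameter names k, n, and the per-node failure probability p are illustrative; this is not Tahoe's implementation):

```python
from math import comb

def expansion_factor(k, n):
    """Stored bytes per original byte: k data segments become n shares."""
    return n / k

def max_failures(k, n):
    """Any n-k storage nodes can be lost without losing the file."""
    return n - k

def availability(k, n, p):
    """Probability the file is recoverable when each node independently
    fails with probability p: at least k of n shares must survive."""
    return sum(comb(n, i) * (1 - p) ** i * p ** (n - i) for i in range(k, n + 1))

# Tahoe's default parameters
k, n = 3, 10
print(expansion_factor(k, n))   # ~3.33
print(max_failures(k, n))       # 7
print(availability(k, n, 0.10))
```

With the defaults, even a 10% per-node failure rate leaves the file recoverable with probability better than 0.999999, which is why n and k can be tuned to a desired availability target.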



Flexible Access Control
  Each file has a Read Capability and a Write Capability

  These are decryption keys

  Directories have capabilities too

[Diagram: both File and Dir objects carry a ReadCap and a WriteCap]




Flexible Access Control
  Access to a subset of files can be done by:
   – creating a directory
   – attaching files
   – sharing read or write capabilities of the dir

  Any files or directories attached are accessible

  Any outside the directory are not

[Diagram: sharing a Dir's ReadCap exposes only the files and subdirectories attached to it]
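
As a rough illustration of this sharing model (a toy sketch, not Tahoe's actual capability format or protocol; Dir, open_dir, and the token scheme are invented for the example):

```python
import secrets

# Toy model: a capability is an unguessable token. Holding a directory's
# ReadCap grants access to everything attached under it, and nothing else.
class Dir:
    def __init__(self):
        self.read_cap = secrets.token_hex(16)  # stand-in for a real crypto capability
        self.entries = {}

    def attach(self, name, obj):
        self.entries[name] = obj

def open_dir(d, cap):
    """List a directory's contents, but only for holders of its ReadCap."""
    if cap != d.read_cap:
        raise PermissionError("unknown capability")
    return list(d.entries)

shared = Dir()
shared.attach("report.txt", b"...")
private = Dir()
private.attach("secret.txt", b"...")

# Sending `shared.read_cap` to another department exposes only the
# files attached to `shared`; `private` remains unreachable.
print(open_dir(shared, shared.read_cap))  # ['report.txt']
```

The point of the sketch: access is granted by handing over a token, not by editing server-side permission lists, which is what makes sharing a subset of files as cheap as creating a directory.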




Access Control Example

[Diagram: files attached under the /Sales and /Testing directories]

         Each department can access its own files

Access Control Example

[Diagram: a /New Products directory linked between /Sales and /Testing]

   Files that need to be shared can be linked to a new directory, whose read
                     capability is given to both departments

Hadoop Can Use The Following File Systems
  HDFS

  Cloud Store (KFS)

  Amazon S3

  FTP

  Read-only HTTP

  Now, Tahoe!




Hadoop File System Integration HowTo
  Step 1.
   – Locate your favorite file system’s API

  Step 2.
   – subclass FileSystem
   – found in /src/core/org/apache/hadoop/fs/FileSystem.java

  Step 3.
   – Add a property element to core-site.xml:
                          <property>
                            <name>fs.lafs.impl</name>
                            <value>your.class</value>
                          </property>

  Step 4.
   – Test using your favorite Infrastructure Service Provider




Hadoop Integration : MapReduce
  One Tahoe client is run on each machine that serves as a MapReduce worker

  On average, clients communicate with k storage servers

  Jobs are limited by aggregate network bandwidth

  MapReduce workers are trusted, storage nodes are not

[Diagram: Hadoop MapReduce workers, each running a Tahoe client, communicating with Storage Servers]




Hadoop-Tahoe Configuration
  Step 1. Start Tahoe

  Step 2. Create a new directory in Tahoe, note the WriteCap

  Step 3. Configure core-site.xml thus:
   – fs.lafs.impl: org.apache.hadoop.fs.lafs.LAFS
   – lafs.rootcap: $WRITE_CAP
   – fs.default.name: lafs://localhost

  Step 4. Start MapReduce, but not HDFS
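
Put together, the resulting core-site.xml would look roughly like this ($WRITE_CAP stands for the WriteCap noted in Step 2; a sketch of the configuration, not verified against the hadoop-lafs plugin):

```xml
<configuration>
  <property>
    <name>fs.lafs.impl</name>
    <value>org.apache.hadoop.fs.lafs.LAFS</value>
  </property>
  <property>
    <!-- the WriteCap recorded in Step 2; placeholder value shown -->
    <name>lafs.rootcap</name>
    <value>$WRITE_CAP</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>lafs://localhost</value>
  </property>
</configuration>
```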




Deployment Scenario - Large Organization
  Within a datacenter, departments can run MapReduce jobs on discrete groups of compute nodes

  Each MapReduce job accesses a directory containing a subset of files

  Results are written back to the storage servers, encrypted

[Diagram: Sales and Audit groups of MapReduce workers / Tahoe clients sharing one set of Storage Servers]


Deployment Scenario - Community
  If a community uses a shared data center, different organizations can run discrete MapReduce jobs

  Perhaps most importantly, when results are deemed appropriate to share, access can be granted simply by sending a read or write capability

  Since the data are all co-located already, no data needs to be moved

[Diagram: FBI and Homeland Security groups of MapReduce workers / Tahoe clients sharing Storage Servers]


Deployment Scenario - Public Cloud Services
  Since storage nodes require no trust, they can be located at a remote location, e.g. within a cloud service provider’s datacenter

  MapReduce jobs can be run this way if bandwidth to the datacenter is adequate

[Diagram: local MapReduce workers / Tahoe clients using Storage Servers hosted by a Cloud Service Provider]


Deployment Scenario - Public Cloud Services
  For some users, everything could be run remotely in a service provider’s data center

  There are a few caveats and additional precautions in this scenario:

[Diagram: both MapReduce workers / Tahoe clients and Storage Servers inside the Cloud Service Provider]


Public Cloud Deployment Considerations
  Store configuration files in memory

  Encrypt / disable swap

  Encrypt spillover

  Must trust memory / hypervisor

  Must trust service provider disks




HDFS and Linux Disk Encryption Drawbacks
  At most one key per node - no support for flexible access control

  Decryption done at the storage node rather than at the client - still have to trust storage nodes




Tahoe and HDFS - Comparison

  Feature             HDFS               Tahoe
  Confidentiality     File Permissions   AES Encryption
  Integrity           Checksum           Merkle Hash Tree
  Availability        Replication       Erasure Coding
  Expansion Factor    3x                 3.3x (n/k)
  Self-Healing        Automatic          Automatic
  Load-balancing      Automatic          Planned
  Mutable Files       No                 Yes


Performance
  Tests run on ten nodes

  RandomWrite writes 1 GB per node

  WordCount done over randomly generated text

  Tahoe write speed is 10x slower than HDFS

  Read-intensive jobs are about the same

  Not so bad, since the most common data use case is write-once, read-many

[Chart: HDFS vs. Tahoe throughput for Random Write and Word Count]

Code
  Tahoe available from http://allmydata.org
   – Licensed under GPL 2 or TGPPL

  Integration code available at http://hadoop-lafs.googlecode.com
   – Licensed under Apache 2




