Cluster-based Storage
Antonio Cesarano
Bonaventura Del Monte
Università degli studi di
Salerno
16th May 2014
Advanced Operating Systems
Prof. Giuseppe Cattaneo
Agenda
 Context
 Goals of design
 NASD
 NASD prototype
 Distributed file systems on NASD
 NASD parallel file system
 Conclusions
A Cost-Effective,
High-Bandwidth Storage Architecture
Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang,
Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, Jim Zelenka, 1997-2001
Agenda
The File System
 Motivations
 Architecture
 Benchmarks
 Comparisons and conclusions
[Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung,
2003]
Context - 1998
 New drive attachment technologies: Fibre Channel and other new network standards
 I/O-bound applications: streaming audio/video, data mining
Context - 1998
 Cost-ineffective storage servers
 Excess of on-drive transistors in the drive controller
Context - 1998
 Big files are split across multiple storage devices (Storage1, Storage2, …)
Goal
No traditional storage file server
Cost-effective bandwidth scaling
What is NASD?
Network-Attached Secure Disk
direct transfer to clients
secure interfaces via cryptographic support
asynchronous oversight
variable-size data objects map to blocks
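The bullets above describe the drive's object interface and its cryptographic capabilities; below is a minimal sketch of that flavor of interface, assuming hypothetical names and a simple HMAC check (illustrative only, not the NASD specification):

```python
import hmac, hashlib

class NasdDrive:
    """Toy network-attached secure disk: variable-size objects plus capability checks."""

    def __init__(self, secret: bytes):
        self._secret = secret           # shared with the file manager, never with clients
        self._objects: dict[int, bytearray] = {}

    def _capability_ok(self, object_id: int, rights: str, token: bytes) -> bool:
        # The file manager hands clients an HMAC "capability"; the drive only verifies it.
        expected = hmac.new(self._secret, f"{object_id}:{rights}".encode(),
                            hashlib.sha256).digest()
        return hmac.compare_digest(expected, token)

    def read(self, object_id: int, offset: int, length: int, token: bytes) -> bytes:
        if not self._capability_ok(object_id, "read", token):
            raise PermissionError("invalid capability")
        return bytes(self._objects.get(object_id, bytearray())[offset:offset + length])

    def write(self, object_id: int, offset: int, data: bytes, token: bytes) -> None:
        if not self._capability_ok(object_id, "write", token):
            raise PermissionError("invalid capability")
        obj = self._objects.setdefault(object_id, bytearray())
        obj.extend(b"\x00" * max(0, offset + len(data) - len(obj)))  # grow the object lazily
        obj[offset:offset + len(data)] = data
```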
Network-Attached Secure Disk
Architecture
NASD prototype
 Based on Unix inode interface
 Network with 13 NASD
 Each NASD runs on
•DEC Alpha 3000, 133MHz, 64MB RAM
•2 x Seagate Medallist on 5MB/s SCSI bus
•Connected to 10 clients by ATM (155 Mbit/s)
 Ad hoc handling modules (16K LOC)
NASD prototype
Test results:
It scales!
DFS on NASD
Porting NFS and AFS to the NASD architecture:
o OK, no performance loss
o But there are concurrency limitations
Solution:
A new higher-level parallel file system
must be used…
NASD parallel file system
Scalable I/O low-level interface
Cheops as storage management layer
 Exports the same object interface as the NASD devices
 Maps them to objects on the underlying devices
 Maps striped objects
 Supports concurrency control for multi-disk accesses
(10K LOC)
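A hedged sketch of how Cheops-style striping could map a logical offset in a striped object onto a particular NASD drive; the round-robin layout, the 64 KB stripe unit, and all names here are assumptions for illustration, not the actual Cheops code:

```python
from dataclasses import dataclass

STRIPE_UNIT = 64 * 1024   # assumed 64 KB stripe unit

@dataclass
class StripeLocation:
    drive: int        # which NASD drive in the stripe group
    object_id: int    # object on that drive backing this part of the striped object
    offset: int       # byte offset inside that per-drive object

def locate(logical_offset: int, drives: list[int], base_object: int) -> StripeLocation:
    """Round-robin mapping of a logical offset in a striped object
    onto (drive, per-drive object, offset within that object)."""
    stripe_index = logical_offset // STRIPE_UNIT
    drive = drives[stripe_index % len(drives)]          # rotate across the stripe group
    local_stripe = stripe_index // len(drives)          # how many stripes already landed here
    offset = local_stripe * STRIPE_UNIT + logical_offset % STRIPE_UNIT
    return StripeLocation(drive, base_object, offset)

# Example: with 4 drives, byte 200,000 of the striped object lands on drive index 3.
print(locate(200_000, drives=[0, 1, 2, 3], base_object=42))
```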
NASD parallel file system Test
Clustering data mining application running on the NASD parallel file system*
*Each NASD drive provides 6.2 MB/s
Conclusions
High Scalability
Direct transfer to clients
Working prototype
Usable with existing file systems
But... very high costs:
• Network adapters
• ASIC microcontroller
• Workstation
increasing the total cost by over 80%
Change
From here…
The Google File System
• Started with their Search Engine
• They provided new services like:
 Google Video
 Gmail
 Google Maps, Earth
 Google App Engine
 … and many more
Design overview
Observing common operations in Google applications leads
developers to make several assumptions:
 Multiple clusters distributed worldwide
 Fault tolerance and auto-recovery need to be built into the system, because component failures happen very often
 A modest number of large files (100+ MB or multi-GB)
 Workloads consist of either large streaming reads or small random reads, while writes are mostly sequential appends of large amounts of data to files
 Google applications and GFS should be co-designed
 Producer – consumer pattern
GFS Architecture
[Architecture diagram: a GFS client requests metadata from the single master, which keeps all metadata in RAM; read/write requests and responses then flow directly between the client and the chunkservers, which store chunks as files on their local Unix file systems.]
GFS Architecture: Chunks
 Similar to standard File System blocks but much
larger
 Size: 64 MB (configurable)
 Advantages:
• Reduced need for clients to contact the master
• A client may perform many operations on a single chunk
• Fewer chunks means less metadata on the master
• No internal fragmentation, thanks to lazy space allocation
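As a hedged illustration of why large chunks reduce master traffic, here is a sketch of the offset-to-chunk-index arithmetic a client would perform with 64 MB chunks (the function names are illustrative, not the real GFS client API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

def chunk_index(file_offset: int) -> int:
    """Translate a byte offset within a file into the index of the chunk holding it."""
    return file_offset // CHUNK_SIZE

def chunk_range(file_offset: int, length: int) -> range:
    """Indices of all chunks touched by a read/write of `length` bytes at `file_offset`."""
    first = chunk_index(file_offset)
    last = chunk_index(file_offset + length - 1)
    return range(first, last + 1)

# Example: a 1 MB read at offset 130 MB touches only chunk 2, so the client
# needs a single (file name, chunk index) lookup at the master.
assert list(chunk_range(130 * 1024 * 1024, 1024 * 1024)) == [2]
```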
GFS Architecture: Chunks
 Disadvantages:
• Small files consist of only a few chunks, which may become hot spots when accessed by many clients
• Not a major issue, since Google apps mostly read large multi-chunk files sequentially
• Moreover, this can be mitigated by using a higher replication factor
GFS Architecture: Master
 A single process running on a separate machine
 Stores all metadata in its RAM:
• File and chunk namespace
• Mapping from files to chunks
• Chunk locations
• Access control information and file locking
• Chunk versioning (snapshots handling)
• And so on…
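A minimal sketch of the kind of in-RAM structures such a master might keep; the class and field names are assumptions for illustration, not the actual GFS data structures:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int                                          # chunk version number
    locations: list[str] = field(default_factory=list)    # chunkservers holding a replica

@dataclass
class MasterMetadata:
    # file path -> ordered list of chunk handles
    files: dict[str, list[int]] = field(default_factory=dict)
    # chunk handle -> version plus current replica locations
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)

    def lookup(self, path: str, chunk_index: int) -> tuple[int, list[str]]:
        """What a metadata request returns: the chunk handle plus its replica locations."""
        handle = self.files[path][chunk_index]
        return handle, self.chunks[handle].locations
```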
GFS Architecture: Master
 The master has the following responsibilities:
 Chunk creation, re-replication,
rebalancing and deletion for:
 Balancing space utilization and access speed
 Spreading replicas across racks to reduce
correlated failures, usually 3 copies for each chunk
 Rebalancing data to smooth out storage and
request load
 Persistent and replicated logging of critical metadata updates
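A hedged sketch of the rack-aware replica placement idea mentioned above (usually 3 copies, spread across racks); the policy and names are illustrative assumptions, not the real placement code:

```python
import random

def place_replicas(chunkservers: list[str], rack_of: dict[str, str], copies: int = 3) -> list[str]:
    """Pick `copies` chunkservers, preferring servers on distinct racks
    so a single rack failure cannot take out every replica."""
    chosen: list[str] = []
    used_racks: set[str] = set()
    for server in random.sample(chunkservers, len(chunkservers)):  # shuffled candidates
        if rack_of[server] not in used_racks:
            chosen.append(server)
            used_racks.add(rack_of[server])
        if len(chosen) == copies:
            return chosen
    # Fewer racks than copies: fall back to reusing racks.
    for server in chunkservers:
        if server not in chosen:
            chosen.append(server)
        if len(chosen) == copies:
            break
    return chosen
```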
GFS Architecture: M - CS
Communication
 Master and chunkservers communicate regularly in
order to retrieve their states:
o Is a chunkserver down?
o Are there disk failures on any chunkserver?
o Are any replicas corrupted?
o Which chunk-replicas does a given chunkserver
store?
 Moreover, the master handles garbage collection and deletes "stale" replicas
o The master logs the deletion, then renames the target file to a hidden name
o A lazy GC removes the hidden files after a given amount of time
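A minimal sketch of the lazy deletion scheme described above, assuming hypothetical names such as HIDDEN_PREFIX and GRACE_PERIOD; the real GFS namespace handling is more involved:

```python
import time

HIDDEN_PREFIX = ".deleted."      # assumed marker for hidden (to-be-collected) files
GRACE_PERIOD = 3 * 24 * 3600     # e.g. three days before physical removal

def delete_file(namespace: dict[str, object], log: list[tuple[str, str]], path: str) -> None:
    """Log the deletion, then just rename the file to a hidden, timestamped name."""
    log.append(("delete", path))
    hidden = f"{HIDDEN_PREFIX}{int(time.time())}.{path}"
    namespace[hidden] = namespace.pop(path)

def garbage_collect(namespace: dict[str, object]) -> None:
    """Background scan: physically drop hidden files older than the grace period."""
    now = time.time()
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            stamp = int(name[len(HIDDEN_PREFIX):].split(".", 1)[0])
            if now - stamp > GRACE_PERIOD:
                del namespace[name]   # the file's chunks become orphans, reclaimed later
```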
GFS Architecture: M - CS
Communication
 Server Requests
 The client retrieves metadata from the master for the requested file
 Read/write data flows between client and chunkserver are decoupled from the master's control flow
 The single master is not a bottleneck: its involvement in reads and writes is minimized:
 Clients communicate directly with chunkservers
 The master logs operations as soon as they are completed
 Less than 64 bytes of metadata for each 64 MB chunk
GFS Architecture: Reading
[Diagram: the application, the GFS client, the master (metadata in RAM), and three chunkservers, before any request is issued.]
GFS Architecture: Reading
[Diagram: the application passes the file name and byte range to the GFS client; the client sends the file name and chunk index to the master, which answers with the chunk handle and the replica locations.]
GFS Architecture: Reading
[Diagram: the client sends the chunk handle and byte range to one of the chunkservers, receives the chunk data, and returns the requested file data to the application.]
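A hedged, end-to-end sketch of the read flow shown in the three diagrams above; the two RPCs are injected as callables because their real signatures are not part of the slides:

```python
from typing import Callable

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(path: str, offset: int, length: int,
             ask_master: Callable[[str, int], tuple[int, list[str]]],
             read_chunk: Callable[[str, int, int, int], bytes]) -> bytes:
    """Client-side read: one metadata round trip to the master,
    then the data flows directly from a chunkserver replica."""
    index = offset // CHUNK_SIZE                    # which chunk holds the offset
    handle, replicas = ask_master(path, index)      # master: (file name, index) -> handle + replicas
    server = replicas[0]                            # e.g. pick the closest replica
    return read_chunk(server, handle, offset % CHUNK_SIZE, length)
```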
GFS Architecture: Writing
[Diagram: the application, the GFS client, the master (metadata in RAM), a primary chunkserver and two secondary chunkservers, each chunkserver holding a buffer and the chunk.]
GFS Architecture: Writing
[Diagram: the application hands the file name and data to the GFS client; the client sends the file name and chunk index to the master, which replies with the chunk handle and the locations of the primary and secondary replicas.]
GFS Architecture: Writing
[Diagram: the client pushes the data to the primary and both secondary chunkservers, which hold it in their buffers.]
GFS Architecture: Writing
[Diagram: the client sends the write command to the primary chunkserver, which applies the buffered data and forwards the command to the secondaries.]
GFS Architecture: Writing
[Diagram: the secondary chunkservers send their ACKs to the primary, and the primary sends the final ACK for the whole write back to the client.]
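A hedged sketch of the write sequence shown above: data is pushed to every replica first, then the write command goes to the primary, which answers with the final ACK. The injected callables stand in for network calls whose real signatures are not given in the slides:

```python
from typing import Callable

def gfs_write(handle: int, data: bytes,
              primary: str, secondaries: list[str],
              push_data: Callable[[str, int, bytes], None],
              send_write: Callable[[str, int, list[str]], bool]) -> bool:
    """Client-side write: stage the data on every replica first, then ask the
    primary to commit it (the primary forwards the command to the secondaries
    and collects their ACKs before acknowledging the client)."""
    for server in [primary] + secondaries:
        push_data(server, handle, data)                  # step 1: data lands in each buffer
    return send_write(primary, handle, secondaries)      # step 2: write command -> primary -> ACK
```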
Fault Tolerance
 GFS has its own relaxed consistency
model
 Consistent: all replicas have the same
value
 Defined: each replica reflects the
performed mutations
 GFS is highly available
 Fast recovery (machines reboot quickly)
 Chunks replicated at least 3 times (take that, RAID-6)
Benchmarking: small cluster
GFS tested on a small cluster:
 1 master
 16 chunkservers
 16 clients
 Server machines connected to a 100 Mbit/s central switch
 Same for client machines
 The two switches are connected to a 1 Gbit/s switch
Benchmarking: small cluster
              Read rate   Write rate
1 client      10 MB/s     6.3 MB/s
16 clients     6 MB/s     2.2 MB/s
Network limit: 12.5 MB/s
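The 12.5 MB/s ceiling follows directly from the link speed: each machine sits behind a 100 Mbit/s connection, and 100 Mbit/s ÷ 8 bits per byte = 12.5 MB/s, so no single client can read or write faster than that regardless of how many chunkservers it talks to.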
Benchmarking: real-world cluster
 Cluster A: 342 PCs
 Used for research and development
 Tasks last a few hours: they read TBs of data, process them, and write results back
 Cluster B: 227 PCs
 Continuously generates and processes multi-TB data sets
 Typical tasks run much longer than cluster A's tasks
Benchmarking: real-world cluster
Cluster                     A                      B
Chunkservers                342                    227
Available disk space        72 TB                  180 TB
Used disk space             55 TB                  155 TB
Number of files             735,000                737,000
Number of chunks            992,000                1,550,000
Metadata at chunkservers    13 GB                  21 GB
Metadata at master          48 MB                  60 MB
Read rate                   580 MB/s (750 MB/s)    380 MB/s (1300 MB/s)
Write rate                  30 MB/s                100 MB/s × 3
Master ops                  202~380 ops/s          347~533 ops/s
Benchmarking: recovery time
 One chunkserver killed in cluster B:
o This chunkserver had 15000 chunks
containing 600GB of data
o All chunks were restored in 23.2 mins
with a replication rate of 440 MB/s
 Two chunkservers killed in cluster B:
o Each with 16000 chunks and 660 GB of data; 266 chunks were left with a single replica
o These 266 chunks were re-replicated at a higher priority within 2 mins
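As a sanity check on the first figure: re-replicating roughly 600 GB in 23.2 minutes (about 1,390 seconds) works out to roughly 430-440 MB/s, which matches the reported effective replication rate of 440 MB/s.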
Comparisons to other models
GFS compared with RAIDxFS, GPFS, AFS, and NASD:
 Spreads file data across storage servers
 Simpler: uses only replication for redundancy
 Location-independent namespace
 Centralized rather than distributed management
 Commodity machines instead of network-attached disks
 Lazily allocated fixed-size chunks rather than variable-length objects
Conclusion
 GFS demonstrates how to support large-scale
processing workloads on commodity hardware:
 designed to tolerate frequent component failures
 optimised for huge files that are mostly appended to and then read
 It has met Google's storage needs, so it is good enough for them
 GFS has massively influenced computer science over the last few years
