SlideShare a Scribd company logo
1
1-Introduction to B-Trees and Shadowing
1.1- B-tree-
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches,
sequential access, insertions, and deletions in logarithmic. The B-tree is a generalization of
a binary search tree in that a node can have more than two children (Comer 1979, p. 123). Unlike
self, the B-tree is optimized for systems that read and write large blocks of data. It is commonly
used in databases andfilesystems.
B-Tree is the generalization of the binary search tree. B+-Tree can be consideredas B-Tree
variant, with an exception that in B+-Tree only leafs contain the data. Inbinary search trees, we
have nodes having single search-key and left sub-tree and rightsub-tree containing all nodes with
search-keys that are less than and greater than parentsearch-key respectively. In B+-Tree, we can
have multiple search-keys, and multiple child nodes.
2
In BST, the distance of leaf from the tree root is not fixed. It depends on thesequence of
insertions in BST. But in case of B or B+ trees, the insertion algorithmensures that distance
between leaf and root is same for all cases. The Figure 1.2 shows the B+-Tree. In this example,
the ordering of the words is alphabetical. The size of node 1 is 2 and any more insertions in node
containing 2 search-keys will cause splitting of node and rebalancing operation. In case of B+-
Trees the leafs are chained together. This is because, anyway all search keys in adjacent leafs are
in sorted manner. So chaining can help for efficient sequential access to data associated with the
sorted keys in the leafsinbottom.
In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-
defined range. When data is inserted or removed from a node, its number of child nodes changes.
In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range
of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing
search trees, but may waste some space, since nodes are not entirely full. The lower and upper
bounds on the number of child nodes are typically fixed for a particular implementation
B+- Tree
B-Trees ensures the logarithmic time key-search, insert and remove operations.B-Trees can be
used to represent files or directories in file-system. Files are typicallyrepresented by b-tree that
hold disk-extents i.e. set of free disk blocks in their leafs. In the next section, we will cover the
basic concept of the shadowing.
3
1.2Shadowing-
Shadowing scheme is also known as copy-on-write (COW) scheme. Shadowing technique is
used to ensure atomic update to persistent data-structures in file-system. In this scheme, we look
at the storage in terms of fixed-size pages. There is a page tablewhich has a pointer to all valid
pages. Shadowing means that to update an on-disk page,the entire page is read into memory,
modified, and later written to disk at an alternatelocation. Now all we have to do is to update the
pointer in the page table to point tothis new page in the disk.
Byte-size of pointer is small and it can fit is one sector inthe disk. There are hard drives that offer
atomic sector upgrades and promise you thateither all of the old or new data in the sector. This
means you either have an old page ornewly written page. So atomic persistent updates are
ensured due to this scheme. It is apowerful mechanism to implement the crash recovery,
snapshots.
1.3 Problems with conventional B-Trees-
The entire file-system tree on the disk can be looked as made of fixed-size pages.When a page is
to be modified, it is read into memory, modified, and later written tosome other location in the
disk. Now let us assume that the leaf in b-tree shown below isequivalent to one one-disk page. If
we try to modify the leaf, then page corresponding tothe leaf will be shadowed. Now, the next
immediate ancestor of this node should point tothis node. This means we will have to modify the
ancestor of this node. Again shadowingis involved, and this process continues up to the root
recursively.
So entire path up to the root need to be shadowed. We will call this type of shadowing as strict
shadowing.Now the one additional problem arises due to linking of the leafs in tree.
Sinceadjacent leaf should also point to the modified leaf, it is also needed to be shadowed.This
process leads to shadowing of the entire tree just because of modification in oneleaf. Remember,
this all is going to happen in the hard-disks! This lead to performancedegradation. The root of
the problem is leaf chaining.
4
To solve the issues related to concurrency, we use mutex locks or semaphores.Now, let’s assume
for while that there are no links in leafs. In normal b-tree, suppose weneed to modify a single
node, we take a lock on it, make changes and then release thelock. But if method of updation is
shadowing, then we know that changes propagate tothe root, making it necessary to take locks on
the way up to the tree root. So there israce to take the lock on the tree root. Waiting for lock is
time consuming process,andhence there is need of efficient synchronization.
The regular b-trees shuffle the keys between neighboring nodes for the re-balacingpurpose after
key-insertion or deletion. If any leaf is modified then, then path up to rootwill be shadowed by
default. Suppose that the exchange of the keys happensbetweennodes whose immediate ancestor
is not same, then additional path up to tree root willhave to be shadowed due to modification
because of exchange of keys.Removing a key and effects of re-balancing and shadowing.
Removing a key and effects of re-balancing and shadowing.
So we can say that B-Trees + Shadowing = Expensive choice, if conventionalb-trees are used.
1.4 Modifications in conventional B-tree-
5
OhadRodeh, IBM Researcher, have suggested modifications to conventional b-treeand
algorithms related to it, for integrating b-tree schemes with shadowing technique. Wewill cover
few of them, related to problems discussed above.
1. To solve the problem related to shadowing of whole tree, the links between the leafsare
removed. Due to this, only the path up to the tree root needs to be shadowed.
2. In case of rebalancing operation, it is better to exchange the keys between nodeswhose
immediate ancestor is same, because this will involve shadowing of singlenode, which is better
instead of shadowing the another path up to the root involvingmany nodes.
6
2-Introduction and History of BTRFS
2.1 Introduction-
Btrfs is GPL-licensed copy-on-write file system for Linux. Its development beganat Oracle
Corporation in 2007. Principal BTRFS author is Chris Mason. Following areSome general points
about btrfs:
1. The core data structure of Btrfs is the copy-on-write B-tree which was originallyproposed by
IBM researcher Ohad Rodeh at a presentation at USENIX 2007.
2. Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release,and was
finally accepted into the mainline kernel as of 2.6.29 in 2009.
3. Btrfs is intended to address the lack of pooling, snapshots, checksums in Linux filesystems.
4. Goal of btrfs was "to let Linux scale for the storage that will be available. Scalingis not just
about addressing the storage but also means being able to administerand to manage it with a
clean interface that lets people see what’s being used andmakes it more reliable."
5. Btrfs has a number of the same design ideas that reiser3/4(Chris Mason was
workingonreiserFS before starting his work on btrfs).
The maximum number of files is 18,446,744,073,709,551,616 or 2 to the 64 power of filesThe
maximum file length is 255 characters. The theoretical max file size limit is 16 EB, or 8EB .The
BTRFS file system helps reduce fragmentation. Storage devices usually show a loss of
performance due to fragmentation (usually when fuller). BTRFS does allow for Online.
When disk space should become full, it is possible to add space to the existing BTRFS volume.
The method refers to Online Resize. The BTRFS file system does not need to be unmounted or
taken offline. An existing volume can be added, or removed, from the volume toIf a volume has
an existing ext3 or ext4 file system, it can be converted to BTRFS. The conversion is an in-place
conversion. This means that the existing data does not have to be removed before the file system
is converted. It is good practice to perform a backup in case .
7
2.1History-
The core data structure of Btrfs — the copy-on-write B-tree — was originally proposed by IBM
researcher Ohad Rodeh at a presentation at USENIX2007. Chris Mason, an engineer working
on ReiserFS for SUSE at the time, joined Oracle later that year and began work on a new file
system based on these B-trees.
Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release, and was
finally accepted into the mainline kernel as of 2.6.29 in 2009.Several Linux distributions began
offering Btrfs as an experimental choice of root file system during installation, including Arch
Linux, openSUSE 11.3, SLES 11 SP1, Ubuntu 10.10, Sabayon Linux, Red Hat Enterprise Linux
6, Fedora 15,MeeGo, Debian, and Slackware 13.37. In summer 2012, several Linux distributions
have moved Btrfs from experimental to production / supported status, including SLES 11
SP2 and Oracle Linux 5 and 6, with the Unbreakable Enterprise Kernel Release 2.
In 2011, de-fragmentation features were announced for the Linux 3.0 kernel version. Besides
Mason at Oracle, Miao Xie at Fujitsu contributed performance improvements.In June 2012,
Chris Mason left Oracle for Fusion-io, and in November 2013 he left Fusion-io for Facebook. He
continues to work on Btrfs.
2.3 Why btrfs File-System?
Linux kernel currently supports almost as 140 file-systems. Most of these file-systemsare
generally very good. So why do we need a new file-system even when we have thesemany file
systems? Reasons for the same are:
1. This file-system scales to very large storage. This is evident because maximum sizeof storage
that file-system can address is 16 EiB (264 Bytes).
2. This file-system is feature focused, providing features the other file-systems cannot.
3. Performance is important. This file-system does not intend to race with current
filesystemsbecause they are anyway good. It’s the features that makes btrfs standout.
4. This file-system is administrator focused, so that it is easy to configure, and faulttolerant.
8
3- Specifications and Features of BTRFS
3.1 Features-
1. As of version 3.12 of the Linux kernel mainline, Btrfs implements the following features:
2. Mostly self-healing in some configurations due to the nature of copy on write
3. Online defragmentation
4. Online volume growth and shrinking
5. Online block device addition and removal
6. Online balancing (movement of objects between block devices to balance load)
7. Offline filesystem check
8. Online data scrubbing for finding errors and automatically fixing them for files with
redundant copies
9. RAID 0, RAID 1, RAID 5, RAID 6 and RAID 10
10. Subvolumes (one or more separately mountable filesystem roots within each disk
partition)
11. Transparent compression (zlib and LZO)
12. Snapshots (read-only or copy-on-write clones of subvolumes)
13. File cloning (copy-on-write on individual files, or byte ranges thereof)
14. Checksums on data and metadata (CRC-32C)
15. In-place conversion (with rollback) from ext3/4 to Btrfs
16. File system seeding (Btrfs on read-only storage used as a copy-on-write backing for a
writeable Btrfs)
17. Block discard support (reclaims space on some virtualized setups and improves wear
leveling on SSDs with TRIM)
18. Send/receive (saving diffs between snapshots to a binary stream)
19. Hierarchical per-subvolume quotas
20. Out-of-band data deduplication (requires user space tools)
9
3.2 Planned features include:
1. In-band data deduplication
2. Online filesystem check
3. Very fast offline filesystem check
4. Object-level RAID 0, RAID 1, and RAID 10[citation needed]
5. Incremental backup
6. Ability to handle swap files and swap partitions
7. Encryption
In 2009, Btrfs was expected to offer a feature set comparable to ZFS, developed by Sun
Microsystems.[40]
After Oracle's acquisition of Sun in 2009, Mason and Oracle decided to
continue on with Btrfs development.[41]
Cloning-
Btrfs provides a clone operation which atomically creates a copy-on-write snapshot of a file.
Such cloned files are sometimes referred to as reflinks, in light of the associated Linux
kernel system calls.
By cloning, the file system does not create a new link pointing to an existing inode — it instead
creates a new inode that shares the same disk blocks as the original file. As a result, this
operation only works within the boundaries of the same Btrfs file system, while it can cross the
boundaries of subvolumes since Linux kernel version 3.6. The actual data blocks are not
becoming duplicated but, due to the copy-on-write nature of cloning, modifications to any of the
cloned files are not visible in their parent files and vice-versa.
This should not be confused with hard links, which are directory entries that associate multiple
file names with actual files on a file system. While hard links can be taken as different names for
the same underlying group of disk blocks (known as a file), cloning in Btrfs provides
independent files that are sharing their disk blocks as a form of data deduplication on the disk
block level. Any later changes to the content of such "dependent" files invoke the copy-on-write
mechanism, which creates independent copies of all altered disk blocks.
10
Support for this Btrfs feature was added in version 7.5 of the GNU coreutils, via the --
reflink option to the cp command.
Cloning can be especially effective in case of storing disk images of virtual machines or their
snapshots. Those are large files differing only in small portions, where the cloning provides both
their faster (instantaneous) copying and minimal consumption of storage space due to data
deduplication.
3.3 Snapshots-
Snapshots means the read only copy of data set frozen at particular point in time.Here we will
consider the case of the of writable snapshots of the tree structure.Inbtrfs, the cloning or snapshot
algorithm allowstheoretically large number snapshots.
In the above example, We have initial tree Tp. Here we have shown reference countof the each
block. Initially all the tree block have the reference-count as 1. Now wewill clone the btree using
tree Tq. The root of the Tq refer the same block as that of Tp.Now as there are two tree root
referencing some common blocks B and C, the referencecountof these blocks is increased by
one. So cloning algorithm just sets the pointer to the blocks referred by original tree root and
increase the reference-count of blocks referenced.Hence we can have as many as the snapshots
as we want, because pointer occupies less space than the actual data.
11
Now we consider the case of process of editing of the shared blocks. Figure 4.2shows the
example of the same. In this example, there are two tree rootTp and Tq.Now suppose that we are
editing the snapshot with respect to the Tq, and the leaf beingmodified is H. Node C is the
immediate ancestor of the leaf H. It should point to themodified copy of the leaf H. So block C is
shadowed to C0 which points to same blocks as that of C. The reference count of the C
isdecremented. Then the leaf H is shadowed toH0 and the reference count of the block H is
decremented by one. Hence due to this kind of sharing, the space requirement instead of copying
and modifying entire tree is low.
3.4 Subvolumes-
It is volume within volume which can be mounted separately. The user sees thevolumes as the
directories. There are benefits of doing this. We can, for example, makethe database directory as
subvolume, which will enable you to take snapshots for use withbackup. But like volumes in
other file-system, subvolume can’t be mounted anywhere inthe logical view of the directories. It
has to be mounted under the parent directory itself.
12
A subvolume in Btrfs is quite different from the usual LVM logical volumes. With LVM, a
logical volume is a block device in its own right — while this is not the case with Btrfs. A Btrfs
subvolume is not a separate block device, and it cannot be treated or used that way.
Instead, a Btrfs subvolume can be thought of as a separate POSIX file namespace. This
namespace can be accessed either through the top-level subvolume of the file system, or it can be
mounted on its own and accessed separately by specifying the subvol or subvolid option
to mount. When accessed through the top-level subvolume, subvolumes are visible and accessed
as its subdirectories.
Subvolumes can be created at any place within the file system hierarchy, and they can also be
nested. Nested subvolumes appear as subdirectories within their parent subvolumes, similar to
the way top-level subvolume presents its subvolumes as subdirectories. Deleting a subvolume
deletes all subvolumes below it in the nesting hierarchy, and for this reason the top-level
subvolume cannot be deleted.
Any Btrfs file system always has a default subvolume, which is initially set to be the top-level
subvolume, and it is mounted by default if no subvolume selection option is passed to mount. Of
course, the default subvolume can be changed as required.
3.5 Send/receive-
Given any pair of subvolumes (or snapshots), Btrfs can generate a binary diff between them (by
using the btrfs send command) that can be replayed later (by using btrfs receive), possibly on a
different Btrfs file system. The send/receive feature effectively creates (and applies) a set of data
modifications required for converting one subvolume into another.
The send/receive feature can be used with regularly scheduled snapshots for implementing a
simple form of file system master/slave replication, or for the purpose of performing incremental
backups.
3.6 Quota groups-
13
A quota group (or qgroup) imposes an upper limit to the space a subvolume or snapshot may
consume. A new snapshot initially consumes no quota because its data is shared with its parent,
but thereafter incurs a charge for new files and copy-on-write operations on existing files. When
quotas are active, a quota group is automatically created with each new subvolume or snapshot.
These initial quota groups are building blocks which can be grouped (with the btrfs
qgroup command) into hierarchies to implement quota pools.
Quota groups only apply to subvolumes and snapshots, while having quotas enforced on
individual subdirectories is not possible.In-place ext2/3/4 conversion
As the result of having very little metadata anchored in fixed locations, Btrfs can warp to fit
unusual spatial layouts of the backend storage devices. The btrfs-convert tool exploits this ability
to do an in-place conversion of any ext2/3/4 file system, by nesting the equivalent Btrfs metadata
in its unallocated space — while preserving an unmodified copy of the original file system.
The conversion involves creating a copy of the whole ext2/3/4 metadata, while the Btrfs files
simply point to the same blocks used by the ext2/3/4 files. This makes the bulk of the blocks
shared between the two filesystems before the conversion becomes permanent. Thanks to the
copy-on-write nature of Btrfs, the original versions of the file data blocks are preserved during
all file modifications. Until the conversion becomes permanent, only the blocks that were
marked as free in ext2/3/4 are used to hold new Btrfs modifications, meaning that the conversion
can be undone at any time.
All converted files are available and writable in the default subvolume of the Btrfs. A sparse file
holding all of the references to the original ext2/3/4 filesystem is created in a separate
subvolume, which is mountable on its own as a read-only disk image, allowing both original and
converted file systems to be accessed at the same time. Deleting this sparse file frees up the
space and makes the conversion permanent.
3.7 Seed devices-
When creating a new Btrfs, an existing Btrfs can be used as a read-only "seed" file system. The
new file system will then act as a copy-on-write overlay on the seed. The seed can be later
detached from the Btrfs, at which point the rebalancer will simply copy over any seed data still
referenced by the new file system before detaching. Mason has suggested this may be useful for
14
aLive CD installer, which might boot from a read-only Btrfs seed on optical disc, rebalance itself
to the target partition on the install disk in the background while the user continues to work, then
eject the disc to complete the installation without rebooting.
3.8 Encryption-
Though Chris Mason said in his interview in 2009 that encryption was planned for Btrfs, this is
unlikely to be implemented for some time, if ever, due to the complexity of implementation and
pre-existing tested and peer-reviewed solutions. The current recommendation for encryption with
Btrfs is to use a full-disk encryption mechanism such as dm-crypt/LUKS on the underlying
devices, and to create the Btrfs filesystem on top of that layer (and that if a RAID is to be used
with encryption, encrypting a dm-raid device or a hardware-RAID device gives much faster disk
performance than dm-crypt overlaid by Btrfs' own filesystem-level RAID features).
3.9 Checking and recovery-
Unix systems traditionally rely on "fsck" programs to check and repair filesystems.
The btrfsck program is now available but, as of May 2012, it is described by the authors as
"relatively new code which has "not seen widespread testing on a large range of real-life
breakage", and that "may cause additional damage in the process of repair".
There is another tool, named btrfs-restore, that can be used to recover files from an unmountable
filesystem, without modifying the broken filesystem itself (i.e., non-destructively).
In normal use, Btrfs is mostly self-healing and can recover from broken root trees at mount time,
thanks to making periodic data flushes to permanent storage every 30 seconds (which is the
default period). Thus, isolated errors will cause a maximum of 30 seconds of filesystem changes
to be lost at the next mount. This period can be changed by specifying a desired value (in
seconds) for the commit mount option.
15
3- Design
OhadRodeh's original proposal at USENIX 2007 noted that B+ trees, which are widely used as
on-disk data structures for databases, could not efficiently support copy-on-write-based
snapshots because its leaf nodes were linked together: if a leaf was copy-on-written, its siblings
and parents would have to be as well, as would their siblings and parents and so on until the
entire tree was copied. He suggested instead a modified B-tree (which has no leaf linkage), with
a refcount associated to each tree node but stored in an ad-hoc free map structure and certain
relaxations to the tree's balancing algorithms to make them copy-on-write friendly. The result
would be a data structure suitable for a high-performance object store that could perform copy-
on-write snapshots, while maintaining good concurrency.
At Oracle later that year, Chris Mason began work on a snapshot-capable file system that would
use this data structure almost exclusively—not just for metadata and file data, but also
recursively to track space allocation of the trees themselves. This allowed all traversal and
modifications to be funneled through a single code path, against which features such as copy-on-
write, checksumming and mirroring needed to be implemented only once to benefit the entire file
system.
Btrfs is structured as several layers of such trees, all using the same B-tree implementation. The
trees store generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object
id. The middle 8 bits are an item type field; its use is hardwired into code as an item filter in tree
lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are
used in type-specific ways. Therefore items for the same object end up adjacent to each other in
the tree, ordered by type. By choosing certain right-hand key values, objects can further put
items of the same type in a particular order.
Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block
number of a child node. Leaf nodes contain item keys packed into the front of the node and item
data packed into the end, with the two growing toward each other as the leaf fills up.
16
In this section, we will cover basic data structures that are used in the btrfs. Everytree block is
either a leaf or node. Every leaf and node begins with the header.
//Node :
struct btrfs_node {
struct btrfs_header header ;
struct btrfs_key_ptr ptrs [ ] ;
}
// Leaf :
struct b t r f s_l e a f {
struct bt r f s_heade r header ;
struct bt r f s_i t em i tems [ ] ;
}
// header ( p r e s ent in node and l e a f )
struct bt r f s_heade r {
u8 csum[ 3 2 byt e s ] ;
u8 f s i d [ 1 6 ] ;
__le64 blocknr ;
__le64 g ene r a t i on ;
__le64 owner ;
__le16 nr i t ems ;
__le16 f l a g s ;
u8 l e v e l ;
}
//Key p t r s ( p r e s ent in node )
struct btrfs_key_ptr {
struct btrfs_disk_key ;
6
__le64 bl o ckpt r ;
__le64 g ene r a t i on ;
17
}
// Items ( p r e s ent in l e a f )
struct bt r f s_i t em {
struct btrfs_disk_key key ;
__le32 o f f s e t ;
__le32 s i z e ;
}
// Items ( p r e s ent in i tems in the l e a f )
struct bt r f s_key {
u64 o b j e c t i d ;
u32 f l a g s ;
u64 o f f s e t ;
}
Every tree node carries the header. The block header contains a checksum for theblock contents,
the uuid of the filesystem that owns the block, the level of the block inthe tree, and the block
number where this block is supposed to live. These fields allow the contents of the metadata to
be verified when the data is read.The generation fieldcorresponds to the transaction id that
allocated the block. So nodes have pointer array which points to other leafs or node (i.e. some
blocks on disk) using blockptr field in key.
Now we will look at more details about the leaf structure.Leaf node containsthe header and the
array of items. Now, each logical object in file-system (e.g. files anddirectories) contains various
items. B-tree implementation are used to store these itemssorted on a 136-bit key (struct btrfs-
key) in leaf. The first 64 bits of the key are aobjectidwhich is unique id for each logical object.
This id is reported as the inode number. Typesof items in leaf can be inode, directory entries,
extent and so on, associated with object.
Next field in btrfs-key is type which tells information about type of item associated
withobject.Next field in key is offset which tell about the position of item in leaf.Now,
interesting thing is, as objectid forms MSB in the btrfs-key of items, so allitems related to the
object ends up being adjacent to each other i.e. they are automaticallygrouped together. This
means metadata and optionally data associated with an object isgrouped together. This results in
compact packing of the data and metadata. Suppose thethat there are N items in the leaf, then
18
index data-item associated with item[X] is dataitem[N-X]. This means the items and the data
associated with the items grow towards
each other in leaf. Following Figure 3.1 summarize this paragraph.Now let’s see some
information about the disk layout. The scheme to storeitems in leaf associated with an object is
space and time efficient as well. Normally, filesystems put only one kind of data - bitmaps, or
inodes, or directory entries - in any given
Leaf structure in Btrfs
file system block. This wastes disk space, since unused space in one kind of block can’tbe used
for any other purpose, and it wastes time, since getting to one particular pieceof file data requires
reading several different kinds of metadata, all located in differentblocks in the file system. In
btrfs, items are packed together (or pushed out to leaves) inarrangements that optimize both
access time and disk space. You can see the difference in these (very schematic, very simplified)
diagrams.
Old-school filesystems tend to organize data as shown in Btrfs, instead, creates a
disklayoutwhich looks as shown in As we can see, there is no fixed block for the inodes,
bitmaps, dir entries, file data or block pointer. The blocks associated with these can overlap for
the sake of compaction.The red arrows in Figure shows the disk seeks to locate data or meta-
data.The red portion in the blocks shows the unused or wasted space. As all metadata relatedtoan
19
object is closely packed, the there are less disk seek. Hence the scheme is time andspace
efficient.
There are various b-trees in btrfs. Everything is stored the btrees. There issingle tree
manipulation code. Also trees does not care about the object types in theb-tree. So same code can
be reused for all kinds of trees that are there in the btrfs. Hence scheme are not only space and
time efficient, but code efficient too.
Data organization in old-school filesystems
20
Data organization in Btrfs
4.1 Root tree-
Every tree appears as an object in the root tree (or tree of tree roots). Some trees, such as file
system trees and log trees, have a variable number of instances, each of which is given its own
object id. Trees which are singletons (the data relocation, extent and chunk trees) are assigned
special, fixed object ids ≤256. The root tree appears in itself as a tree with object id 1.
Trees refer to each other by object id. They may also refer to individual nodes in other trees as a
triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such
references are independent of where the tree is actually stored.
4.2 File system tree-
subvolume. Subvolumes can nest, in which case they appear as a directory item (described User-
visible files and directories all live in a file system tree. There is one file system tree per below)
whose data is a reference to the nested subvolume's file system tree.
Within the file system tree, each file and directory objects has an inode item. Extended
attributes and ACL entries are stored alongside in separate items.
21
Within each directory, directory entries appear as directory items, whose right-hand key values
are a CRC32C hash of their filename. Their data is a location key, or the key of the inode item it
points to. Directory items together can thus act as an index for path-to-inode lookups, but are not
used for iteration because they are sorted by their hash, effectively randomly permuting them.
This means user applications iterating over and opening files in a large directory would thus
generate many more disk seeks between non-adjacent files—a notable performance drain in
other file systems with hash-ordered directories such as ReiserFS, ext3 (with Htree-indexes
enabled and ext4, all of which have TEA-hashed filenames. To avoid this, each directory entry
has a directory index item, whose right-hand value of the item is set to a per-directory counter
that increments with each new directory entry. Iteration over these index items thus returns
entries in roughly the same order as they are stored on disk.
Besides inode items, files and directories also have a reference item whose right-hand key value
is the object id of their parent directory. The data part of the reference item is the filename that
inode is known by in that directory. This allows upward traversal through the directory hierarchy
by providing a way to map inodes back to paths.
Files with hard links in other directories have multiple reference items, one for each parent
directory. Files with hard links in the same directory pack all of the links' filenames into the
same reference item. This was a design flaw that limited the number of same-directory hard links
to however many could fit in a single tree block. (On the default block size of 4 KB, an average
filename length of 8 bytes and a per-filename header of 4 bytes, this would be less than 350.)
Applications which made heavy use of same-directory hard links, such
as git, GNUS, GMame and BackupPCwere later observed to fail after hitting this limit.]
The limit
was eventually removed (and as of October 2012 has been merged pending release in Linux by
introducing spilloverextended reference items to hold hard link filenames which could not
otherwise fit.
4.3 Relocation trees-
Defragmentation, shrinking and rebalancing operations require extents to be relocated. However,
doing a simple copy-on-write of the relocating extent will break sharing between snapshots and
22
consume disk space. To preserve sharing, an update-and-swap algorithm is used, with a
special relocation tree serving as scratch space for affected metadata. The extent to be relocated
is first copied to its destination. Then, by following back references upward through the affected
subvolume's file system tree, metadata pointing to the old extent is progressively updated to
point at the new one; any newly updated items are stored in the relocation tree. Once the update
is complete, items in the relocation tree are swapped with their counterparts in the affected
subvolume, and the relocation tree is discarded.
(b)
(a) File system forest (b) the changes that occur after modification
23
4.4 Extents-
File data are kept outside the tree in extents, which are contiguous runs of disk blocks. Extent
blocks default to 4KiB in size, do not have headers and contain only (possibly compressed) file
data. In compressed extents, individual blocks are not compressed separately; rather, the
compression stream spans the entire extent.
Files have extent data items to track the extents which hold their contents. The item's right-hand
key value is the starting byte offset of the extent. This makes for efficient seeks in large files
with many extents, because the correct extent for any given file offset can be computed with just
one tree lookup.
Snapshots and cloned files share extents. When a small part of a large such extent is overwritten,
the resulting copy-on-write may create three new extents: a small one containing the overwritten
data, and two large ones with unmodified data on either side of the overwrite. To avoid having to
re-write unmodified data, the copy-on-write may instead create bookend extents, or extents
which are simply slices of existing extents. Extent data items allow for this by including an offset
into the extent they are tracking: items for bookends are those with non-zero offsets.
If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored
inline in the extent data item. Each tree node is stored in its own tree block—a single
uncompressed block with a header. The tree block is regarded as a free-standing, single-block
extent.
24
5-Performance(Disk)-
25
6- Limitations
Here we will list few of the problems to be addressed.
1. Transactions
(a) Btrfs supports limited transactions without Atomicity-Consistency-Isolation-Durability
semantics.
(b) Only one transaction may run at a time which is not atomic wrt storage.
2. Checking and recovery
(a) fsck tool is available but not recommended as of now.
26
7-Future Development
Some of the planned features are:
1. Encryption
2. Data Deduplication
3. Parity Based RAID(RAID5 and RAID6)
4. Ability to handle swap
5. Incremental dumps
27
7-References
[1] Ohad Rodeh, Research paper on "B-trees, Shadowing, and Clones" oracle in 2008
[2]Kerner, Sean Michael "A Better File System For Linux". InternetNews.com. Archived from
the original on 24 June 2012. Retrieved 2008-10-30.
[4] Valerie Aurora, Article on "A short history of btrfs",IEEE Magazine in 2011
[5]MACEDONIA,M.R.: “B-tree file system” IEEE Commune. Magazine , vol. 47 pp. S30-S38 ,
Mar 2007
[6] Mason, Chris. "Btrfs: a copy on write, snapshotting FS”,(2007-06-12)
[7] Brown, Eric "Linux 3.0 scrubs up Btrfs, gets more Xen". Linux devices (eWeek).Archived
from the original on 2013-01-27.Retrieved 8 November 2011.
28

More Related Content

PDF
Linux : Booting and runlevels
John Ombagi
 
PPT
RT linux
SARITHA REDDY
 
PPT
What Is Power Over Ethernet 102909
Morty Eisen
 
PPTX
Android Device Hardening
anupriti
 
PPT
Directory Services Nma Unit-1
GPAPassedStudents
 
DOC
Tic tac toe game code
Upendra Sengar
 
PPTX
DNS Security, is it enough?
Zscaler
 
PPTX
Sql injection
Nuruzzaman Milon
 
Linux : Booting and runlevels
John Ombagi
 
RT linux
SARITHA REDDY
 
What Is Power Over Ethernet 102909
Morty Eisen
 
Android Device Hardening
anupriti
 
Directory Services Nma Unit-1
GPAPassedStudents
 
Tic tac toe game code
Upendra Sengar
 
DNS Security, is it enough?
Zscaler
 
Sql injection
Nuruzzaman Milon
 

What's hot (20)

PPT
Linux file system
Burhan Abbasi
 
DOCX
A computer shop management system
Ûťţåm Ğűpţä
 
PPTX
System Programming Unit II
Manoj Patil
 
PDF
Blackfin core architecture
Pantech ProLabs India Pvt Ltd
 
PPTX
Dom based xss
Lê Giáp
 
PPTX
DISASSEMBLER-DECOMPILER.pptx
ssuser13dc7d
 
PDF
[Webinar] QtSerialBus: Using Modbus and CAN bus with Qt
ICS
 
PDF
UML- Class Diagrams, State Machine Diagrams
QBI Institute
 
PPT
Registry Forensics
Somesh Sawhney
 
PDF
FAKE Review Detection
Cognizant Technology Solutions
 
PPTX
Grocery app aj
Amita Jain
 
PPTX
File Encryption
brittanyjespersen
 
PPT
Lecture 6 -_presentation_layer
Serious_SamSoul
 
PPTX
ONLINE GROCERY STORE MANAGEMENT SYSTEM PPT
ChetanBhandari14
 
PPT
Courier Management System By Mukesh
Mukesh Kumar
 
PPTX
Linux booting Process
Gaurav Sharma
 
PPTX
Trojans and backdoors
Gaurav Dalvi
 
PPTX
Top Down Parsing, Predictive Parsing
Tanzeela_Hussain
 
PPTX
Access control list [1]
Summit Bisht
 
PDF
File systems virtualization in windows using mini filter drivers
MPNIKHIL
 
Linux file system
Burhan Abbasi
 
A computer shop management system
Ûťţåm Ğűpţä
 
System Programming Unit II
Manoj Patil
 
Blackfin core architecture
Pantech ProLabs India Pvt Ltd
 
Dom based xss
Lê Giáp
 
DISASSEMBLER-DECOMPILER.pptx
ssuser13dc7d
 
[Webinar] QtSerialBus: Using Modbus and CAN bus with Qt
ICS
 
UML- Class Diagrams, State Machine Diagrams
QBI Institute
 
Registry Forensics
Somesh Sawhney
 
FAKE Review Detection
Cognizant Technology Solutions
 
Grocery app aj
Amita Jain
 
File Encryption
brittanyjespersen
 
Lecture 6 -_presentation_layer
Serious_SamSoul
 
ONLINE GROCERY STORE MANAGEMENT SYSTEM PPT
ChetanBhandari14
 
Courier Management System By Mukesh
Mukesh Kumar
 
Linux booting Process
Gaurav Sharma
 
Trojans and backdoors
Gaurav Dalvi
 
Top Down Parsing, Predictive Parsing
Tanzeela_Hussain
 
Access control list [1]
Summit Bisht
 
File systems virtualization in windows using mini filter drivers
MPNIKHIL
 
Ad

Viewers also liked (15)

PPTX
B tree file system
Dinesh Gupta
 
ODP
An Overview of Next-Gen Filesystems
Great Wide Open
 
ODP
Case study of BtrFS: A fault tolerant File system
Kumar Amit Mehta
 
PDF
PostgreSQL on EXT4, XFS, BTRFS and ZFS
Tomas Vondra
 
PDF
Novinky v PostgreSQL 9.4 a JSONB
Tomas Vondra
 
PDF
Btrfs: Design, Implementation and the Current Status
Lukáš Czerner
 
PDF
LUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFS
Marian Marinov
 
PPTX
Multi touch technology
SonuRana20111045
 
PPT
B trees in Data Structure
Anuj Modi
 
PPT
multi touch screen
Piyusha Singh
 
PPT
b+ tree
bitistu
 
PPT
B trees dbms
kuldeep100
 
PPTX
5 pen pc technology
PRADEEP Cheekatla
 
PDF
Multi-Touch Interaction Overview
TNO
 
PPTX
MultiTouch
Rishabha Garg
 
B tree file system
Dinesh Gupta
 
An Overview of Next-Gen Filesystems
Great Wide Open
 
Case study of BtrFS: A fault tolerant File system
Kumar Amit Mehta
 
PostgreSQL on EXT4, XFS, BTRFS and ZFS
Tomas Vondra
 
Novinky v PostgreSQL 9.4 a JSONB
Tomas Vondra
 
Btrfs: Design, Implementation and the Current Status
Lukáš Czerner
 
LUG-BG 2017 - Rangel Ivanov - Spread some butter - BTRFS
Marian Marinov
 
Multi touch technology
SonuRana20111045
 
B trees in Data Structure
Anuj Modi
 
multi touch screen
Piyusha Singh
 
b+ tree
bitistu
 
B trees dbms
kuldeep100
 
5 pen pc technology
PRADEEP Cheekatla
 
Multi-Touch Interaction Overview
TNO
 
MultiTouch
Rishabha Garg
 
Ad

Similar to b tree file system report (20)

PDF
Stratified B-trees - HotStorage11
Acunu
 
PPTX
6 chapter 6 record storage and primary file organization
siragezeynu
 
PPTX
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
PPT
3620121datastructures.ppt
SheejamolMathew
 
PDF
Google File System
vivatechijri
 
PDF
Different Storage Models in Big Data Analytics
darklegendharsha1
 
PPT
File Allocation Methods.ppt
BharathiLakshmiAAssi
 
PDF
4 026
dchathu30
 
PPT
StorageIndexing_CS541.ppt indexes for dtata bae
syedalishahid6
 
PPT
INDEXING METHODS USED IN DATABASE STORAGE
polin38
 
PPT
StorageIndexing_Main memory (RAM) for currently used data. Disk for the main ...
masooda5
 
PDF
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
IRJET Journal
 
PDF
Hadoop Distributed File System in Big data
ramukaka777787
 
PDF
DBMS Notes.pdf
shubhampatel67739
 
PDF
BACKUP STORAGE BLOCK-LEVEL DEDUPLICATION WITH DDUMBFS AND BACULA
ijait
 
PPT
Chapter13
gourab87
 
PDF
TokuDB 高科扩展性 MySQL 和 MariaDB 数据库
YUCHENG HU
 
PDF
Building modern data lakes
Minio
 
PPT
Cache Conscious Indexes
Tata Consultancy Services
 
PPTX
Introduction to Hadoop Distributed File System(HDFS).pptx
SakthiVinoth78
 
Stratified B-trees - HotStorage11
Acunu
 
6 chapter 6 record storage and primary file organization
siragezeynu
 
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
3620121datastructures.ppt
SheejamolMathew
 
Google File System
vivatechijri
 
Different Storage Models in Big Data Analytics
darklegendharsha1
 
File Allocation Methods.ppt
BharathiLakshmiAAssi
 
4 026
dchathu30
 
StorageIndexing_CS541.ppt indexes for dtata bae
syedalishahid6
 
INDEXING METHODS USED IN DATABASE STORAGE
polin38
 
StorageIndexing_Main memory (RAM) for currently used data. Disk for the main ...
masooda5
 
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
IRJET Journal
 
Hadoop Distributed File System in Big data
ramukaka777787
 
DBMS Notes.pdf
shubhampatel67739
 
BACKUP STORAGE BLOCK-LEVEL DEDUPLICATION WITH DDUMBFS AND BACULA
ijait
 
Chapter13
gourab87
 
TokuDB 高科扩展性 MySQL 和 MariaDB 数据库
YUCHENG HU
 
Building modern data lakes
Minio
 
Cache Conscious Indexes
Tata Consultancy Services
 
Introduction to Hadoop Distributed File System(HDFS).pptx
SakthiVinoth78
 

Recently uploaded (20)

PPTX
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
PDF
Software Testing Tools - names and explanation
shruti533256
 
PPTX
business incubation centre aaaaaaaaaaaaaa
hodeeesite4
 
PDF
Zero carbon Building Design Guidelines V4
BassemOsman1
 
PDF
dse_final_merit_2025_26 gtgfffffcjjjuuyy
rushabhjain127
 
PDF
settlement FOR FOUNDATION ENGINEERS.pdf
Endalkazene
 
PDF
2025 Laurence Sigler - Advancing Decision Support. Content Management Ecommer...
Francisco Javier Mora Serrano
 
PPTX
Tunnel Ventilation System in Kanpur Metro
220105053
 
PDF
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
PPT
Ppt for engineering students application on field effect
lakshmi.ec
 
PDF
Introduction to Ship Engine Room Systems.pdf
Mahmoud Moghtaderi
 
PPTX
22PCOAM21 Session 2 Understanding Data Source.pptx
Guru Nanak Technical Institutions
 
PDF
Cryptography and Information :Security Fundamentals
Dr. Madhuri Jawale
 
PPTX
database slide on modern techniques for optimizing database queries.pptx
aky52024
 
PDF
Introduction to Data Science: data science process
ShivarkarSandip
 
PDF
Top 10 read articles In Managing Information Technology.pdf
IJMIT JOURNAL
 
PDF
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
DOCX
SAR - EEEfdfdsdasdsdasdasdasdasdasdasdasda.docx
Kanimozhi676285
 
PDF
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
PPTX
easa module 3 funtamental electronics.pptx
tryanothert7
 
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
Software Testing Tools - names and explanation
shruti533256
 
business incubation centre aaaaaaaaaaaaaa
hodeeesite4
 
Zero carbon Building Design Guidelines V4
BassemOsman1
 
dse_final_merit_2025_26 gtgfffffcjjjuuyy
rushabhjain127
 
settlement FOR FOUNDATION ENGINEERS.pdf
Endalkazene
 
2025 Laurence Sigler - Advancing Decision Support. Content Management Ecommer...
Francisco Javier Mora Serrano
 
Tunnel Ventilation System in Kanpur Metro
220105053
 
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
Ppt for engineering students application on field effect
lakshmi.ec
 
Introduction to Ship Engine Room Systems.pdf
Mahmoud Moghtaderi
 
22PCOAM21 Session 2 Understanding Data Source.pptx
Guru Nanak Technical Institutions
 
Cryptography and Information :Security Fundamentals
Dr. Madhuri Jawale
 
database slide on modern techniques for optimizing database queries.pptx
aky52024
 
Introduction to Data Science: data science process
ShivarkarSandip
 
Top 10 read articles In Managing Information Technology.pdf
IJMIT JOURNAL
 
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
SAR - EEEfdfdsdasdsdasdasdasdasdasdasdasda.docx
Kanimozhi676285
 
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
easa module 3 funtamental electronics.pptx
tryanothert7
 

b tree file system report

  • 1. 1 1-Introduction to B-Trees and Shadowing 1.1- B-tree- In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic. The B-tree is a generalization of a binary search tree in that a node can have more than two children (Comer 1979, p. 123). Unlike self, the B-tree is optimized for systems that read and write large blocks of data. It is commonly used in databases andfilesystems. B-Tree is the generalization of the binary search tree. B+-Tree can be consideredas B-Tree variant, with an exception that in B+-Tree only leafs contain the data. Inbinary search trees, we have nodes having single search-key and left sub-tree and rightsub-tree containing all nodes with search-keys that are less than and greater than parentsearch-key respectively. In B+-Tree, we can have multiple search-keys, and multiple child nodes.
  • 2. 2 In BST, the distance of leaf from the tree root is not fixed. It depends on thesequence of insertions in BST. But in case of B or B+ trees, the insertion algorithmensures that distance between leaf and root is same for all cases. The Figure 1.2 shows the B+-Tree. In this example, the ordering of the words is alphabetical. The size of node 1 is 2 and any more insertions in node containing 2 search-keys will cause splitting of node and rebalancing operation. In case of B+- Trees the leafs are chained together. This is because, anyway all search keys in adjacent leafs are in sorted manner. So chaining can help for efficient sequential access to data associated with the sorted keys in the leafsinbottom. In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre- defined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation B+- Tree B-Trees ensures the logarithmic time key-search, insert and remove operations.B-Trees can be used to represent files or directories in file-system. Files are typicallyrepresented by b-tree that hold disk-extents i.e. set of free disk blocks in their leafs. In the next section, we will cover the basic concept of the shadowing.
  • 3. 3 1.2Shadowing- Shadowing scheme is also known as copy-on-write (COW) scheme. Shadowing technique is used to ensure atomic update to persistent data-structures in file-system. In this scheme, we look at the storage in terms of fixed-size pages. There is a page tablewhich has a pointer to all valid pages. Shadowing means that to update an on-disk page,the entire page is read into memory, modified, and later written to disk at an alternatelocation. Now all we have to do is to update the pointer in the page table to point tothis new page in the disk. Byte-size of pointer is small and it can fit is one sector inthe disk. There are hard drives that offer atomic sector upgrades and promise you thateither all of the old or new data in the sector. This means you either have an old page ornewly written page. So atomic persistent updates are ensured due to this scheme. It is apowerful mechanism to implement the crash recovery, snapshots. 1.3 Problems with conventional B-Trees- The entire file-system tree on the disk can be looked as made of fixed-size pages.When a page is to be modified, it is read into memory, modified, and later written tosome other location in the disk. Now let us assume that the leaf in b-tree shown below isequivalent to one one-disk page. If we try to modify the leaf, then page corresponding tothe leaf will be shadowed. Now, the next immediate ancestor of this node should point tothis node. This means we will have to modify the ancestor of this node. Again shadowingis involved, and this process continues up to the root recursively. So entire path up to the root need to be shadowed. We will call this type of shadowing as strict shadowing.Now the one additional problem arises due to linking of the leafs in tree. Sinceadjacent leaf should also point to the modified leaf, it is also needed to be shadowed.This process leads to shadowing of the entire tree just because of modification in oneleaf. Remember, this all is going to happen in the hard-disks! This lead to performancedegradation. The root of the problem is leaf chaining.
  • 4. 4 To solve the issues related to concurrency, we use mutex locks or semaphores.Now, let’s assume for while that there are no links in leafs. In normal b-tree, suppose weneed to modify a single node, we take a lock on it, make changes and then release thelock. But if method of updation is shadowing, then we know that changes propagate tothe root, making it necessary to take locks on the way up to the tree root. So there israce to take the lock on the tree root. Waiting for lock is time consuming process,andhence there is need of efficient synchronization. The regular b-trees shuffle the keys between neighboring nodes for the re-balacingpurpose after key-insertion or deletion. If any leaf is modified then, then path up to rootwill be shadowed by default. Suppose that the exchange of the keys happensbetweennodes whose immediate ancestor is not same, then additional path up to tree root willhave to be shadowed due to modification because of exchange of keys.Removing a key and effects of re-balancing and shadowing. Removing a key and effects of re-balancing and shadowing. So we can say that B-Trees + Shadowing = Expensive choice, if conventionalb-trees are used. 1.4 Modifications in conventional B-tree-
  • 5. 5 OhadRodeh, IBM Researcher, have suggested modifications to conventional b-treeand algorithms related to it, for integrating b-tree schemes with shadowing technique. Wewill cover few of them, related to problems discussed above. 1. To solve the problem related to shadowing of whole tree, the links between the leafsare removed. Due to this, only the path up to the tree root needs to be shadowed. 2. In case of rebalancing operation, it is better to exchange the keys between nodeswhose immediate ancestor is same, because this will involve shadowing of singlenode, which is better instead of shadowing the another path up to the root involvingmany nodes.
  • 6. 6 2-Introduction and History of BTRFS 2.1 Introduction- Btrfs is GPL-licensed copy-on-write file system for Linux. Its development beganat Oracle Corporation in 2007. Principal BTRFS author is Chris Mason. Following areSome general points about btrfs: 1. The core data structure of Btrfs is the copy-on-write B-tree which was originallyproposed by IBM researcher Ohad Rodeh at a presentation at USENIX 2007. 2. Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release,and was finally accepted into the mainline kernel as of 2.6.29 in 2009. 3. Btrfs is intended to address the lack of pooling, snapshots, checksums in Linux filesystems. 4. Goal of btrfs was "to let Linux scale for the storage that will be available. Scalingis not just about addressing the storage but also means being able to administerand to manage it with a clean interface that lets people see what’s being used andmakes it more reliable." 5. Btrfs has a number of the same design ideas that reiser3/4(Chris Mason was workingonreiserFS before starting his work on btrfs). The maximum number of files is 18,446,744,073,709,551,616 or 2 to the 64 power of filesThe maximum file length is 255 characters. The theoretical max file size limit is 16 EB, or 8EB .The BTRFS file system helps reduce fragmentation. Storage devices usually show a loss of performance due to fragmentation (usually when fuller). BTRFS does allow for Online. When disk space should become full, it is possible to add space to the existing BTRFS volume. The method refers to Online Resize. The BTRFS file system does not need to be unmounted or taken offline. An existing volume can be added, or removed, from the volume toIf a volume has an existing ext3 or ext4 file system, it can be converted to BTRFS. The conversion is an in-place conversion. This means that the existing data does not have to be removed before the file system is converted. It is good practice to perform a backup in case .
  • 7. 7 2.1History- The core data structure of Btrfs — the copy-on-write B-tree — was originally proposed by IBM researcher Ohad Rodeh at a presentation at USENIX2007. Chris Mason, an engineer working on ReiserFS for SUSE at the time, joined Oracle later that year and began work on a new file system based on these B-trees. Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release, and was finally accepted into the mainline kernel as of 2.6.29 in 2009.Several Linux distributions began offering Btrfs as an experimental choice of root file system during installation, including Arch Linux, openSUSE 11.3, SLES 11 SP1, Ubuntu 10.10, Sabayon Linux, Red Hat Enterprise Linux 6, Fedora 15,MeeGo, Debian, and Slackware 13.37. In summer 2012, several Linux distributions have moved Btrfs from experimental to production / supported status, including SLES 11 SP2 and Oracle Linux 5 and 6, with the Unbreakable Enterprise Kernel Release 2. In 2011, de-fragmentation features were announced for the Linux 3.0 kernel version. Besides Mason at Oracle, Miao Xie at Fujitsu contributed performance improvements.In June 2012, Chris Mason left Oracle for Fusion-io, and in November 2013 he left Fusion-io for Facebook. He continues to work on Btrfs. 2.3 Why btrfs File-System? Linux kernel currently supports almost as 140 file-systems. Most of these file-systemsare generally very good. So why do we need a new file-system even when we have thesemany file systems? Reasons for the same are: 1. This file-system scales to very large storage. This is evident because maximum sizeof storage that file-system can address is 16 EiB (264 Bytes). 2. This file-system is feature focused, providing features the other file-systems cannot. 3. Performance is important. This file-system does not intend to race with current filesystemsbecause they are anyway good. It’s the features that makes btrfs standout. 4. This file-system is administrator focused, so that it is easy to configure, and faulttolerant.
  • 8. 8 3- Specifications and Features of BTRFS 3.1 Features- 1. As of version 3.12 of the Linux kernel mainline, Btrfs implements the following features: 2. Mostly self-healing in some configurations due to the nature of copy on write 3. Online defragmentation 4. Online volume growth and shrinking 5. Online block device addition and removal 6. Online balancing (movement of objects between block devices to balance load) 7. Offline filesystem check 8. Online data scrubbing for finding errors and automatically fixing them for files with redundant copies 9. RAID 0, RAID 1, RAID 5, RAID 6 and RAID 10 10. Subvolumes (one or more separately mountable filesystem roots within each disk partition) 11. Transparent compression (zlib and LZO) 12. Snapshots (read-only or copy-on-write clones of subvolumes) 13. File cloning (copy-on-write on individual files, or byte ranges thereof) 14. Checksums on data and metadata (CRC-32C) 15. In-place conversion (with rollback) from ext3/4 to Btrfs 16. File system seeding (Btrfs on read-only storage used as a copy-on-write backing for a writeable Btrfs) 17. Block discard support (reclaims space on some virtualized setups and improves wear leveling on SSDs with TRIM) 18. Send/receive (saving diffs between snapshots to a binary stream) 19. Hierarchical per-subvolume quotas 20. Out-of-band data deduplication (requires user space tools)
  • 9. 9 3.2 Planned features include: 1. In-band data deduplication 2. Online filesystem check 3. Very fast offline filesystem check 4. Object-level RAID 0, RAID 1, and RAID 10[citation needed] 5. Incremental backup 6. Ability to handle swap files and swap partitions 7. Encryption In 2009, Btrfs was expected to offer a feature set comparable to ZFS, developed by Sun Microsystems.[40] After Oracle's acquisition of Sun in 2009, Mason and Oracle decided to continue on with Btrfs development.[41] Cloning- Btrfs provides a clone operation which atomically creates a copy-on-write snapshot of a file. Such cloned files are sometimes referred to as reflinks, in light of the associated Linux kernel system calls. By cloning, the file system does not create a new link pointing to an existing inode — it instead creates a new inode that shares the same disk blocks as the original file. As a result, this operation only works within the boundaries of the same Btrfs file system, while it can cross the boundaries of subvolumes since Linux kernel version 3.6. The actual data blocks are not becoming duplicated but, due to the copy-on-write nature of cloning, modifications to any of the cloned files are not visible in their parent files and vice-versa. This should not be confused with hard links, which are directory entries that associate multiple file names with actual files on a file system. While hard links can be taken as different names for the same underlying group of disk blocks (known as a file), cloning in Btrfs provides independent files that are sharing their disk blocks as a form of data deduplication on the disk block level. Any later changes to the content of such "dependent" files invoke the copy-on-write mechanism, which creates independent copies of all altered disk blocks.
  • 10. 10 Support for this Btrfs feature was added in version 7.5 of the GNU coreutils, via the -- reflink option to the cp command. Cloning can be especially effective in case of storing disk images of virtual machines or their snapshots. Those are large files differing only in small portions, where the cloning provides both their faster (instantaneous) copying and minimal consumption of storage space due to data deduplication. 3.3 Snapshots- Snapshots means the read only copy of data set frozen at particular point in time.Here we will consider the case of the of writable snapshots of the tree structure.Inbtrfs, the cloning or snapshot algorithm allowstheoretically large number snapshots. In the above example, We have initial tree Tp. Here we have shown reference countof the each block. Initially all the tree block have the reference-count as 1. Now wewill clone the btree using tree Tq. The root of the Tq refer the same block as that of Tp.Now as there are two tree root referencing some common blocks B and C, the referencecountof these blocks is increased by one. So cloning algorithm just sets the pointer to the blocks referred by original tree root and increase the reference-count of blocks referenced.Hence we can have as many as the snapshots as we want, because pointer occupies less space than the actual data.
  • 11. 11 Now we consider the case of process of editing of the shared blocks. Figure 4.2shows the example of the same. In this example, there are two tree rootTp and Tq.Now suppose that we are editing the snapshot with respect to the Tq, and the leaf beingmodified is H. Node C is the immediate ancestor of the leaf H. It should point to themodified copy of the leaf H. So block C is shadowed to C0 which points to same blocks as that of C. The reference count of the C isdecremented. Then the leaf H is shadowed toH0 and the reference count of the block H is decremented by one. Hence due to this kind of sharing, the space requirement instead of copying and modifying entire tree is low. 3.4 Subvolumes- It is volume within volume which can be mounted separately. The user sees thevolumes as the directories. There are benefits of doing this. We can, for example, makethe database directory as subvolume, which will enable you to take snapshots for use withbackup. But like volumes in other file-system, subvolume can’t be mounted anywhere inthe logical view of the directories. It has to be mounted under the parent directory itself.
  • 12. 12 A subvolume in Btrfs is quite different from the usual LVM logical volumes. With LVM, a logical volume is a block device in its own right — while this is not the case with Btrfs. A Btrfs subvolume is not a separate block device, and it cannot be treated or used that way. Instead, a Btrfs subvolume can be thought of as a separate POSIX file namespace. This namespace can be accessed either through the top-level subvolume of the file system, or it can be mounted on its own and accessed separately by specifying the subvol or subvolid option to mount. When accessed through the top-level subvolume, subvolumes are visible and accessed as its subdirectories. Subvolumes can be created at any place within the file system hierarchy, and they can also be nested. Nested subvolumes appear as subdirectories within their parent subvolumes, similar to the way top-level subvolume presents its subvolumes as subdirectories. Deleting a subvolume deletes all subvolumes below it in the nesting hierarchy, and for this reason the top-level subvolume cannot be deleted. Any Btrfs file system always has a default subvolume, which is initially set to be the top-level subvolume, and it is mounted by default if no subvolume selection option is passed to mount. Of course, the default subvolume can be changed as required. 3.5 Send/receive- Given any pair of subvolumes (or snapshots), Btrfs can generate a binary diff between them (by using the btrfs send command) that can be replayed later (by using btrfs receive), possibly on a different Btrfs file system. The send/receive feature effectively creates (and applies) a set of data modifications required for converting one subvolume into another. The send/receive feature can be used with regularly scheduled snapshots for implementing a simple form of file system master/slave replication, or for the purpose of performing incremental backups. 3.6 Quota groups-
  • 13. 13 A quota group (or qgroup) imposes an upper limit to the space a subvolume or snapshot may consume. A new snapshot initially consumes no quota because its data is shared with its parent, but thereafter incurs a charge for new files and copy-on-write operations on existing files. When quotas are active, a quota group is automatically created with each new subvolume or snapshot. These initial quota groups are building blocks which can be grouped (with the btrfs qgroup command) into hierarchies to implement quota pools. Quota groups only apply to subvolumes and snapshots, while having quotas enforced on individual subdirectories is not possible.In-place ext2/3/4 conversion As the result of having very little metadata anchored in fixed locations, Btrfs can warp to fit unusual spatial layouts of the backend storage devices. The btrfs-convert tool exploits this ability to do an in-place conversion of any ext2/3/4 file system, by nesting the equivalent Btrfs metadata in its unallocated space — while preserving an unmodified copy of the original file system. The conversion involves creating a copy of the whole ext2/3/4 metadata, while the Btrfs files simply point to the same blocks used by the ext2/3/4 files. This makes the bulk of the blocks shared between the two filesystems before the conversion becomes permanent. Thanks to the copy-on-write nature of Btrfs, the original versions of the file data blocks are preserved during all file modifications. Until the conversion becomes permanent, only the blocks that were marked as free in ext2/3/4 are used to hold new Btrfs modifications, meaning that the conversion can be undone at any time. All converted files are available and writable in the default subvolume of the Btrfs. A sparse file holding all of the references to the original ext2/3/4 filesystem is created in a separate subvolume, which is mountable on its own as a read-only disk image, allowing both original and converted file systems to be accessed at the same time. Deleting this sparse file frees up the space and makes the conversion permanent. 3.7 Seed devices- When creating a new Btrfs, an existing Btrfs can be used as a read-only "seed" file system. The new file system will then act as a copy-on-write overlay on the seed. The seed can be later detached from the Btrfs, at which point the rebalancer will simply copy over any seed data still referenced by the new file system before detaching. Mason has suggested this may be useful for
  • 14. 14 aLive CD installer, which might boot from a read-only Btrfs seed on optical disc, rebalance itself to the target partition on the install disk in the background while the user continues to work, then eject the disc to complete the installation without rebooting. 3.8 Encryption- Though Chris Mason said in his interview in 2009 that encryption was planned for Btrfs, this is unlikely to be implemented for some time, if ever, due to the complexity of implementation and pre-existing tested and peer-reviewed solutions. The current recommendation for encryption with Btrfs is to use a full-disk encryption mechanism such as dm-crypt/LUKS on the underlying devices, and to create the Btrfs filesystem on top of that layer (and that if a RAID is to be used with encryption, encrypting a dm-raid device or a hardware-RAID device gives much faster disk performance than dm-crypt overlaid by Btrfs' own filesystem-level RAID features). 3.9 Checking and recovery- Unix systems traditionally rely on "fsck" programs to check and repair filesystems. The btrfsck program is now available but, as of May 2012, it is described by the authors as "relatively new code which has "not seen widespread testing on a large range of real-life breakage", and that "may cause additional damage in the process of repair". There is another tool, named btrfs-restore, that can be used to recover files from an unmountable filesystem, without modifying the broken filesystem itself (i.e., non-destructively). In normal use, Btrfs is mostly self-healing and can recover from broken root trees at mount time, thanks to making periodic data flushes to permanent storage every 30 seconds (which is the default period). Thus, isolated errors will cause a maximum of 30 seconds of filesystem changes to be lost at the next mount. This period can be changed by specifying a desired value (in seconds) for the commit mount option.
  • 15. 15 3- Design OhadRodeh's original proposal at USENIX 2007 noted that B+ trees, which are widely used as on-disk data structures for databases, could not efficiently support copy-on-write-based snapshots because its leaf nodes were linked together: if a leaf was copy-on-written, its siblings and parents would have to be as well, as would their siblings and parents and so on until the entire tree was copied. He suggested instead a modified B-tree (which has no leaf linkage), with a refcount associated to each tree node but stored in an ad-hoc free map structure and certain relaxations to the tree's balancing algorithms to make them copy-on-write friendly. The result would be a data structure suitable for a high-performance object store that could perform copy- on-write snapshots, while maintaining good concurrency. At Oracle later that year, Chris Mason began work on a snapshot-capable file system that would use this data structure almost exclusively—not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy-on- write, checksumming and mirroring needed to be implemented only once to benefit the entire file system. Btrfs is structured as several layers of such trees, all using the same B-tree implementation. The trees store generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object id. The middle 8 bits are an item type field; its use is hardwired into code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are used in type-specific ways. Therefore items for the same object end up adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values, objects can further put items of the same type in a particular order. Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.
  • 16. 16 In this section, we will cover basic data structures that are used in the btrfs. Everytree block is either a leaf or node. Every leaf and node begins with the header. //Node : struct btrfs_node { struct btrfs_header header ; struct btrfs_key_ptr ptrs [ ] ; } // Leaf : struct b t r f s_l e a f { struct bt r f s_heade r header ; struct bt r f s_i t em i tems [ ] ; } // header ( p r e s ent in node and l e a f ) struct bt r f s_heade r { u8 csum[ 3 2 byt e s ] ; u8 f s i d [ 1 6 ] ; __le64 blocknr ; __le64 g ene r a t i on ; __le64 owner ; __le16 nr i t ems ; __le16 f l a g s ; u8 l e v e l ; } //Key p t r s ( p r e s ent in node ) struct btrfs_key_ptr { struct btrfs_disk_key ; 6 __le64 bl o ckpt r ; __le64 g ene r a t i on ;
  • 17. 17 } // Items ( p r e s ent in l e a f ) struct bt r f s_i t em { struct btrfs_disk_key key ; __le32 o f f s e t ; __le32 s i z e ; } // Items ( p r e s ent in i tems in the l e a f ) struct bt r f s_key { u64 o b j e c t i d ; u32 f l a g s ; u64 o f f s e t ; } Every tree node carries the header. The block header contains a checksum for theblock contents, the uuid of the filesystem that owns the block, the level of the block inthe tree, and the block number where this block is supposed to live. These fields allow the contents of the metadata to be verified when the data is read.The generation fieldcorresponds to the transaction id that allocated the block. So nodes have pointer array which points to other leafs or node (i.e. some blocks on disk) using blockptr field in key. Now we will look at more details about the leaf structure.Leaf node containsthe header and the array of items. Now, each logical object in file-system (e.g. files anddirectories) contains various items. B-tree implementation are used to store these itemssorted on a 136-bit key (struct btrfs- key) in leaf. The first 64 bits of the key are aobjectidwhich is unique id for each logical object. This id is reported as the inode number. Typesof items in leaf can be inode, directory entries, extent and so on, associated with object. Next field in btrfs-key is type which tells information about type of item associated withobject.Next field in key is offset which tell about the position of item in leaf.Now, interesting thing is, as objectid forms MSB in the btrfs-key of items, so allitems related to the object ends up being adjacent to each other i.e. they are automaticallygrouped together. This means metadata and optionally data associated with an object isgrouped together. This results in compact packing of the data and metadata. Suppose thethat there are N items in the leaf, then
  • 18. 18 index data-item associated with item[X] is dataitem[N-X]. This means the items and the data associated with the items grow towards each other in leaf. Following Figure 3.1 summarize this paragraph.Now let’s see some information about the disk layout. The scheme to storeitems in leaf associated with an object is space and time efficient as well. Normally, filesystems put only one kind of data - bitmaps, or inodes, or directory entries - in any given Leaf structure in Btrfs file system block. This wastes disk space, since unused space in one kind of block can’tbe used for any other purpose, and it wastes time, since getting to one particular pieceof file data requires reading several different kinds of metadata, all located in differentblocks in the file system. In btrfs, items are packed together (or pushed out to leaves) inarrangements that optimize both access time and disk space. You can see the difference in these (very schematic, very simplified) diagrams. Old-school filesystems tend to organize data as shown in Btrfs, instead, creates a disklayoutwhich looks as shown in As we can see, there is no fixed block for the inodes, bitmaps, dir entries, file data or block pointer. The blocks associated with these can overlap for the sake of compaction.The red arrows in Figure shows the disk seeks to locate data or meta- data.The red portion in the blocks shows the unused or wasted space. As all metadata relatedtoan
  • 19. 19 object is closely packed, the there are less disk seek. Hence the scheme is time andspace efficient. There are various b-trees in btrfs. Everything is stored the btrees. There issingle tree manipulation code. Also trees does not care about the object types in theb-tree. So same code can be reused for all kinds of trees that are there in the btrfs. Hence scheme are not only space and time efficient, but code efficient too. Data organization in old-school filesystems
  • 20. 20 Data organization in Btrfs 4.1 Root tree- Every tree appears as an object in the root tree (or tree of tree roots). Some trees, such as file system trees and log trees, have a variable number of instances, each of which is given its own object id. Trees which are singletons (the data relocation, extent and chunk trees) are assigned special, fixed object ids ≤256. The root tree appears in itself as a tree with object id 1. Trees refer to each other by object id. They may also refer to individual nodes in other trees as a triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such references are independent of where the tree is actually stored. 4.2 File system tree- subvolume. Subvolumes can nest, in which case they appear as a directory item (described User- visible files and directories all live in a file system tree. There is one file system tree per below) whose data is a reference to the nested subvolume's file system tree. Within the file system tree, each file and directory objects has an inode item. Extended attributes and ACL entries are stored alongside in separate items.
  • 21. 21 Within each directory, directory entries appear as directory items, whose right-hand key values are a CRC32C hash of their filename. Their data is a location key, or the key of the inode item it points to. Directory items together can thus act as an index for path-to-inode lookups, but are not used for iteration because they are sorted by their hash, effectively randomly permuting them. This means user applications iterating over and opening files in a large directory would thus generate many more disk seeks between non-adjacent files—a notable performance drain in other file systems with hash-ordered directories such as ReiserFS, ext3 (with Htree-indexes enabled and ext4, all of which have TEA-hashed filenames. To avoid this, each directory entry has a directory index item, whose right-hand value of the item is set to a per-directory counter that increments with each new directory entry. Iteration over these index items thus returns entries in roughly the same order as they are stored on disk. Besides inode items, files and directories also have a reference item whose right-hand key value is the object id of their parent directory. The data part of the reference item is the filename that inode is known by in that directory. This allows upward traversal through the directory hierarchy by providing a way to map inodes back to paths. Files with hard links in other directories have multiple reference items, one for each parent directory. Files with hard links in the same directory pack all of the links' filenames into the same reference item. This was a design flaw that limited the number of same-directory hard links to however many could fit in a single tree block. (On the default block size of 4 KB, an average filename length of 8 bytes and a per-filename header of 4 bytes, this would be less than 350.) Applications which made heavy use of same-directory hard links, such as git, GNUS, GMame and BackupPCwere later observed to fail after hitting this limit.] The limit was eventually removed (and as of October 2012 has been merged pending release in Linux by introducing spilloverextended reference items to hold hard link filenames which could not otherwise fit. 4.3 Relocation trees- Defragmentation, shrinking and rebalancing operations require extents to be relocated. However, doing a simple copy-on-write of the relocating extent will break sharing between snapshots and
  • 22. 22 consume disk space. To preserve sharing, an update-and-swap algorithm is used, with a special relocation tree serving as scratch space for affected metadata. The extent to be relocated is first copied to its destination. Then, by following back references upward through the affected subvolume's file system tree, metadata pointing to the old extent is progressively updated to point at the new one; any newly updated items are stored in the relocation tree. Once the update is complete, items in the relocation tree are swapped with their counterparts in the affected subvolume, and the relocation tree is discarded. (b) (a) File system forest (b) the changes that occur after modification
  • 23. 23 4.4 Extents- File data are kept outside the tree in extents, which are contiguous runs of disk blocks. Extent blocks default to 4KiB in size, do not have headers and contain only (possibly compressed) file data. In compressed extents, individual blocks are not compressed separately; rather, the compression stream spans the entire extent. Files have extent data items to track the extents which hold their contents. The item's right-hand key value is the starting byte offset of the extent. This makes for efficient seeks in large files with many extents, because the correct extent for any given file offset can be computed with just one tree lookup. Snapshots and cloned files share extents. When a small part of a large such extent is overwritten, the resulting copy-on-write may create three new extents: a small one containing the overwritten data, and two large ones with unmodified data on either side of the overwrite. To avoid having to re-write unmodified data, the copy-on-write may instead create bookend extents, or extents which are simply slices of existing extents. Extent data items allow for this by including an offset into the extent they are tracking: items for bookends are those with non-zero offsets. If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored inline in the extent data item. Each tree node is stored in its own tree block—a single uncompressed block with a header. The tree block is regarded as a free-standing, single-block extent.
  • 25. 25 6- Limitations Here we will list few of the problems to be addressed. 1. Transactions (a) Btrfs supports limited transactions without Atomicity-Consistency-Isolation-Durability semantics. (b) Only one transaction may run at a time which is not atomic wrt storage. 2. Checking and recovery (a) fsck tool is available but not recommended as of now.
  • 26. 26 7-Future Development Some of the planned features are: 1. Encryption 2. Data Deduplication 3. Parity Based RAID(RAID5 and RAID6) 4. Ability to handle swap 5. Incremental dumps
  • 27. 27 7-References [1] Ohad Rodeh, Research paper on "B-trees, Shadowing, and Clones" oracle in 2008 [2]Kerner, Sean Michael "A Better File System For Linux". InternetNews.com. Archived from the original on 24 June 2012. Retrieved 2008-10-30. [4] Valerie Aurora, Article on "A short history of btrfs",IEEE Magazine in 2011 [5]MACEDONIA,M.R.: “B-tree file system” IEEE Commune. Magazine , vol. 47 pp. S30-S38 , Mar 2007 [6] Mason, Chris. "Btrfs: a copy on write, snapshotting FS”,(2007-06-12) [7] Brown, Eric "Linux 3.0 scrubs up Btrfs, gets more Xen". Linux devices (eWeek).Archived from the original on 2013-01-27.Retrieved 8 November 2011.
  • 28. 28