Open Source Data Deduplication




Nick Webb
nickw@redwireservices.com | www.redwireservices.com | @RedWireServices
(206) 829-8621
Last updated 8/10/2011
Introduction
●   What is Deduplication? Different kinds?
●   Why do you want it?
●   How does it work?
●   Advantages / Drawbacks
●   Commercial Implementations
●   Open Source implementations, performance,
    reliability, and stability of each
What is Data Deduplication
Wikipedia:

. . . data deduplication is a specialized data compression
technique for eliminating coarse-grained redundant data,
typically to improve storage utilization. In the deduplication
process, duplicate data is deleted, leaving only one copy of
the data to be stored, along with references to the unique
copy of data. Deduplication is able to reduce the required
storage capacity since only the unique data is stored.

Depending on the type of deduplication, redundant files may
be reduced, or even portions of files or other data that are
similar can also be removed . . .
Why Dedupe?
●   Save disk space and money (fewer disks)
●   Fewer disks = less power, cooling, and space
●   Improve write performance (of duplicate data)
●   Be efficient – don’t re-copy or store previously
    stored data
Where does it Work Well?
●   Secondary Storage
    ●   Backups/Archives
    ●   Online backups with limited
        bandwidth/replication
    ●   Save disk space – additional
        full backups take little space
●   Virtual Machines (Primary &
    Secondary)
●   File Shares
Not a Fit
●   High-entropy (effectively random) data
    ●   Video
    ●   Pictures
    ●   Music
    ●   Encrypted files
         –   many vendors dedupe, then encrypt
Types
●   Source / Target (dedupe at the client vs. at the storage device)
●   Global (dedupe across all clients/volumes, not just within one)
●   Fixed/Sliding Block (fixed-size chunks vs. content-defined
    boundaries; see the toy demo below)
●   File Based (SIS, single-instance storage of whole files)
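A toy illustration of why sliding blocks matter (my own sketch, assuming
GNU coreutils; not from the deck): prepend a single byte to a file and
every fixed-size block shifts, so fixed-block dedupe finds no blocks in
common between the two versions. A sliding/variable-block scheme would
resynchronize and re-find the shared data.

  head -c 65536 /dev/urandom > a
  { printf 'X'; cat a; } > b               # b = a with one byte prepended
  split -b 8192 -d a a. ; split -b 8192 -d b b.
  sha256sum a.* b.* | sort | uniq -dw64    # prints nothing: no shared blocks
  rm -f a b a.* b.*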
Drawbacks
●   Slow writes, slower reads
●   High CPU/memory utilization (dedicated server
    is a must)
●   Increases data loss risk / corruption
    ●   Collision risk of 1.3x10^-49% chance per PB
        (256 bit hash & 8KB blocks; rough math below)
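For context, a back-of-envelope birthday bound (mine; the slide's exact
figure rests on slightly different assumptions): with $n$ blocks hashed
into $2^{b}$ possible values,

$$P_{\text{collision}} \approx \frac{n^2}{2^{b+1}},
\qquad n = \frac{1\,\text{PB}}{8\,\text{KB}} = 2^{37},
\quad b = 256,$$

giving $P \approx 2^{74-257} = 2^{-183}$: vanishingly small next to the
probability of undetected disk or RAM errors.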
How Does it Work?
(Diagrams, not reproduced: the same data stored without dedupe vs. with
dedupe, where duplicate blocks are stored once and referenced)
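A minimal sketch of the idea in shell (illustrative only; not how SDFS or
lessfs actually store data). Save as dedupe.sh and pass it a file: it
splits the file into fixed 8KB chunks, stores each unique chunk once
under its SHA-256 hash, and keeps an ordered hash list as the file's
"recipe".

  #!/bin/sh
  store=/tmp/chunkstore; mkdir -p "$store"
  split -b 8192 -d "$1" /tmp/chunk.
  for c in /tmp/chunk.*; do
      h=$(sha256sum "$c" | cut -d' ' -f1)
      [ -e "$store/$h" ] || cp "$c" "$store/$h"   # store a block only if unseen
      echo "$h"                                   # manifest: ordered block hashes
  done > "$1.manifest"
  rm -f /tmp/chunk.*

Writing the same file twice costs only a second manifest; a read path
would reassemble the file by concatenating the chunks the manifest names.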
Block Reclamation
 ●   In general, blocks are not
     removed/freed when a file is
     removed
 ●   We must periodically check blocks for references; a block with no
     references can be deleted, freeing allocated space (sketch below)
 ●   Process can be expensive,
     scheduled during off-peak
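Continuing the toy chunk store above (again illustrative, not any
product's actual garbage collector), reclamation is a mark-and-sweep:
mark every hash referenced by any manifest, then sweep the rest.

  store=/tmp/chunkstore
  cat /tmp/*.manifest | sort -u > /tmp/live     # mark: every referenced hash
  for f in "$store"/*; do                       # sweep: drop unreferenced blocks
      grep -qx "$(basename "$f")" /tmp/live || rm "$f"
  done

This is also why it is expensive: every manifest must be scanned before
any block can be proven dead, hence the off-peak scheduling.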
Commercial Implementations
●   Just about every backup vendor
    ●   Symantec, CommVault
    ●   Cloud: Asigra, Barracuda, Dropbox (global), JungleDisk,
        Mozy
●   NAS/SAN/Backup Targets
    ●   NEC HydraStor
    ●   DataDomain/EMC Avamar
    ●   Quantum
    ●   NetApp
Open Source Implementations
●   Fuse Based
    ●   Lessfs
    ●   SDFS (OpenDedupe)
●   Others
    ●   ZFS
    ●   btrfs (? Off-line only)
●   Limited (file based / SIS)
    ●   BackupPC (reliable!)
    ●   Rdiff-backup
How Good is it?
●   Many see 10-20x deduplication, meaning 10-20
    times more logical storage than physical
●   Especially true in backup or virtual
    environments
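Hypothetical arithmetic (mine, not from the deck): keep ten nightly 1TB
full backups with ~2% change between runs. Logical data is 10TB; physical
is roughly 1TB + 9 x 20GB ~= 1.2TB, i.e. about 8x dedupe, and the ratio
climbs with every additional full retained.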
SDFS / OpenDedupe
                 www.opendedup.org
●   Java 7 Based / platform agnostic
●   Uses fuse
●   S3 storage support
●   Snapshots
●   Inline or batch mode deduplication
●   Supposedly fast (290MBps+ on great H/W)
●   Support for global/clustered dedupe
●   Probably most mature OSS Dedupe (IMHO)
SDFS
(diagram slide; image not reproduced)
SDFS Install & Go

Install Java, then:

  sudo rpm -Uvh SDFS-1.0.7-2.x86_64.rpm
  sudo mkfs.sdfs --volume-name=sdfs_128k \
       --io-max-file-write-buffers=32 \
       --volume-capacity=550GB \
       --io-chunk-size=128 \
       --chunk-store-data-location=/mnt/data
  sudo modprobe fuse
  sudo mount.sdfs -v sdfs_128k -m /mnt/dedupe
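A quick sanity check I would run next (hypothetical paths; /srv/testdata
is a stand-in): write the same tree twice and watch the chunk store under
/mnt/data barely grow on the second copy.

  sudo cp -r /srv/testdata /mnt/dedupe/copy1
  du -sh /mnt/data                    # note the chunk-store size
  sudo cp -r /srv/testdata /mnt/dedupe/copy2
  du -sh /mnt/data                    # should be nearly unchanged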
SDFS
●   Pro
    ●   Works when configured properly
    ●   Appears to be multithreaded
●   Con
    ●   Slow / resource intensive (CPU/Memory)
    ●   Fragile, easy to mess up options, leading to crashes, little
        user feedback
    ●   Standard POSIX utilities do not show accurate data (e.g. df);
        must use getfattr -d <mount point> and convert bytes →
        GB/TB and % free yourself (example below)
    ●   Slow with 4k blocks, the block size recommended for VMs
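What the getfattr workaround looks like (sketch; the exact attribute
names vary by SDFS version):

  sudo getfattr -d /mnt/dedupe   # dumps the volume's extended attributes
  # pick out the capacity and current-size attributes, then do the
  # bytes -> GB/TB and %-free arithmetic yourself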
LessFS
                          www.lessfs.com

●   Written in C = less CPU overhead
●   Have to build it yourself (configure && make && make install)
●   Has replication, encryption
●   Uses fuse
LessFS Install
wget http://...lessfs-1.4.2.tar.gz
tar zxvf *.tar.gz
wget http://...db-4.8.30.tar.gz
yum install buildstuff…
. . .
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
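The ". . ." elides the usual autotools build that the "build yourself"
bullet refers to; roughly this, as a sketch (exact ./configure options
depend on your distro and where Berkeley DB lands):

  tar zxvf db-4.8.30.tar.gz
  cd db-4.8.30/build_unix && ../dist/configure && make && sudo make install
  cd ../../lessfs-1.4.2
  ./configure && make && sudo make install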
LessFS Go
sudo vi /etc/lessfs.cfg
BLOCKDATA_PATH=/mnt/data/dta/blockdata.dta
META_PATH=/mnt/meta/mta
BLKSIZE=4096 # only 4k supported on centos 5
ENCRYPT_DATA=on
ENCRYPT_META=off


mklessfs -c /etc/lessfs.cfg
lessfs /etc/lessfs.cfg /mnt/dedupe
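Since lessfs is a FUSE filesystem, unmounting works the standard FUSE way
(sketch):

  sudo fusermount -u /mnt/dedupe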
LessFS
●   Pro
    ●   Does inline compression by default as well
    ●   Reasonable VM compression with 128k blocks
●   Con
    ●   Fragile
    ●   Stats/FS info hard to see (per-file accounting only, no totals)
    ●   Kernel >= 2.6.26 required for blocks > 4k (RHEL6 only)
    ●   Running with 4k blocks is not really feasible
LessFS
(chart slide; image not reproduced)
Other OSS
●   ZFS?
    ●   Tried it; empirically it was a drag, though I have no
        hard data (got only ~3x dedupe with identical full
        backups of VMs)
    ●   At least it’s stable…
Kick the Tires
●   Test data set: ~330GB of data
    ●   22GB of documents, pictures, music
    ●   Virtual Machines
        –   220GB Windows 2003 Server with SQL Data
        –   2003 AD DC ~60GB
        –   2003 Server ~8GB
        –   Two OpenSolaris VMs, 1.5 & 2.7GB
        –   3GB Windows 2000 VM
        –   15GB XP Pro VM
Kick the Tires
●   Test Environment
    ●   AWS High CPU Extra Large Instance
    ●   ~7GB of RAM
    ●   ~Eight cores, ~2.5GHz each
    ●   ext4
Compression Performance
●   First round (all “unique” data)
●   If another copy were written (like a second full), we would expect
    that non-unique data to dedupe completely (an additional 1x per run)
      FS               Home Data   % Home Red.   VM Data   % VM Red.   Combined   % Total Red.   MBps
      SDFS 4k          21GB        4.50%         109GB     64%         128GB      61%            16
      lessfs 4k (est.) 24GB        -9%           N/A       51%         N/A        50%            4
      SDFS 128k        21GB        4.50%         255GB     16%         276GB      15%            40
      lessfs 128k      21GB        4.50%         130GB     57%         183GB      44%            24
      tar/gz --fast    21GB        4.50%         178GB     41%         199GB      39%            35
Write Performance
(don't trust this)
(Bar chart, not reproduced: write throughput in MBps, y-axis 0-40, for
raw, SDFS 4k, lessfs 4k, SDFS 128k, lessfs 128k, and tar/gz --fast)
Kick the Tires: Part 2
●   Test data set – two ~204GB full backup
    archives from a popular commercial vendor
●   Test Environment
    ●   VirtualBox VM, 2GB RAM, 2 Cores, 2x7200RPM
        SATA drives (meta & data separated for LessFS)
    ●   Physical CPU: Quad Core Xeon
Write Performance
(Bar chart, not reproduced: write throughput in MBps, y-axis 0-40, for
raw, SDFS 128k write and re-write, and LessFS 128k write and re-write)
Load (SDFS 128k)
(Chart, not reproduced: system load during the SDFS 128k test)
Open Source Dedupe
●   Pro
    ●   Free
    ●   Can be stable, if well managed
●   Con
    ●   Not in repos yet
    ●   Efforts behind them seem very limited (one developer each)
    ●   No/Poor documentation
The Future
●   Eventual Commodity?
●   btrfs
    ●   Dedupe planned (off-line only)
Conclusion/Recommendations
●   Dedupe is great, if it works and it meets your
    performance and storage requirements
●   OSS Dedupe has a way to go
●   SDFS/OpenDedupe is the best OSS option right
    now
●   JungleDisk is good and cheap, but not OSS
About Red Wire Services
If you found this presentation helpful, consider
Red Wire Services for your next
Backup, Archive, or IT Disaster Recovery
Planning project.
Learn more at www.RedWireServices.com
About Nick Webb
Nick Webb is the founder of Red Wire Services, in
Seattle, WA. Nick is available to speak on a variety of IT
Disaster Recovery related topics, including:
●   Preserving Your Digital Legacy
●   Getting Started with your Small Business Disaster
    Recovery Plan
●   Archive Storage for SMBs
If interested in having Nick speak to your group, please
call (206) 829-8621 or email info@redwireservices.com


Editor's Notes

  • #4: Different types of deduplication levels: file level; block level; variable block versus fixed block (Quantum/DD use variable blocks)
  • #6: Pretty much the same as all compression