Yottabytes and Beyond: Demystifying Storage and Building Large Storage Networks, Part I
by Bhavin Turakhia, CEO, Directi (bhavin.t@directi.com)
Why is storage important?
- Web 2.0 applications are an extension of your desktop
- SaaS is here and growing
- Broadband is a reality
- Storage costs are dropping
- Everyone expects near-unlimited storage online: YouTube, Flickr, Facebook et al. are storing your life online* (and let's not forget your personal BitTorrent collection)
* It would take 1400 TB to store your entire life in video, or 5700 TB if you want to know what was happening around you. Another 73 TB covers the audio of everything you heard (MP3 quality). That's about 6000 TB for a copy of your life.
Agenda
- Hard disks: SATA, SAS, FC, solid state
- RAID
- DAS
- SAN
“Large scale storage requires careful planning”
Choosing your Hard Disk (SATA, FC, SAS, SCSI, Solid State)
Introduction to Hard Drives
- Basic physical storage unit (aka physical block device)
- Variables to consider when selecting a drive:
  - Type (SAS, SATA, FC)
  - RPM
  - Capacity
  - MTBF (Mean Time Between Failures)
  - Life expectancy
Hard Disk types

| | SATA (Serial ATA) | SAS (Serial Attached SCSI) | FC (Fibre Channel) |
|---|---|---|---|
| Typical use | Low-cost, high-volume, low-speed, large-storage environments; CDP / backups | Replacement for SCSI; high-performance, transaction-oriented applications with high IOPS requirements | High-performance, transaction-oriented applications with high IOPS requirements |
| Performance | Average; typically 7200 RPM | Good (similar to FC); 10k / 15k RPM | Good (similar to SAS); 10k / 15k RPM |
| Typical capacities | 250 GB, 500 GB, 750 GB, 1 TB | 73 GB, 146 GB, 300 GB, 400 GB | 73 GB, 146 GB, 300 GB, 400 GB |
| Price per GB (retail web price of the largest drive) | $0.33 | $2 | $3 |
| Misc | - | Backward compatible with SATA; allows mixing SATA drives on the same backplane | - |
Hard Disk Conclusions
- For high-IOPS database applications with low storage requirements, you have a choice between FC and SAS
  - SAS currently seems like the better option
  - Future SAS standards promise to be faster than FC (though the two are likely to remain neck and neck)
- For high storage requirements (video servers, file servers, photo storage, archives, mail servers, backup servers), SATA is the way to go
- You can combine SAS and SATA to reduce average cost and still meet your goals, especially since the backplanes are cross-compatible
- Read the spec sheet of the hard drives you plan to use to determine the specifics
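The cost argument for mixing SAS and SATA can be checked with a few lines of arithmetic. The sketch below uses the per-GB prices quoted in the drive comparison (2007 retail figures); the capacity mix in the example is a made-up illustration, not a recommendation.

```python
# Retail price per GB quoted in the drive comparison (2007 figures).
PRICE_PER_GB = {"SATA": 0.33, "SAS": 2.0, "FC": 3.0}


def blended_cost(tiers):
    """Return (total_cost, average_price_per_gb) for a mix of drive types.

    `tiers` maps a drive type to the raw capacity in GB bought of that type.
    """
    total_gb = sum(tiers.values())
    total_cost = sum(PRICE_PER_GB[kind] * gb for kind, gb in tiers.items())
    return total_cost, total_cost / total_gb


# Hypothetical mix: 1 TB of SAS for the database, 10 TB of SATA for media.
cost, per_gb = blended_cost({"SAS": 1000, "SATA": 10000})
print(f"total ${cost:,.0f}, average ${per_gb:.2f}/GB")
```

With this mix the average price lands much closer to the SATA figure than to the SAS one, which is the point of tiering the drives.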
Solid State Drives
- Use solid state memory to store persistent data
- Eliminate mechanical parts
- Useful for creating efficient in-between caches or storing small to mid-sized high-performance databases
Solid State Drives
References:
- Intro: http://en.wikipedia.org/wiki/Solid_state_disk
- RAM vs Flash based: http://www.storagesearch.com/ssd-ram-v-flash.html
- SSD based SAN: http://www.superssd.com/
Advantages:
- Faster startup (no spinning)
- Significantly faster on random IO (from 250x to 1000x+)
- Extremely low latency (25x to 200x better)
- No noise
- Lower power consumption
- Less heat production
Disadvantages:
- Significantly more expensive ($10-30/GB for Flash based, $100-200/GB for DDR RAM based)
- Slightly slower on large sequential reads
- Slower random write speeds in the case of Flash based storage
RAID Primer (0, 1, 2, 3, 4, 5, 6, TP, 0+1, 10, 50, 60)
Introduction to RAID
- Allows multiple disks to appear as a single contiguous physical block device
- Provides redundancy / high availability
- A RAID group appears as a single physical block device
[Diagram: HD1 and HD2 presented to the host as one RAID device]
Comparison of Single RAID Levels

| | RAID 0 | RAID 1 | RAID 5 | RAID 6 |
|---|---|---|---|---|
| Description | Striping | Mirroring | Striping with parity | Striping with dual parity |
| Minimum disks | 2 | 2 | 3 | 4 |
| Maximum disks | Controller dependent | 2 | Controller dependent | Controller dependent |
| Array capacity | No. of drives x drive capacity | Drive capacity | (No. of drives - 1) x drive capacity | (No. of drives - 2) x drive capacity |
| Storage efficiency | 100% | 50% | (No. of drives - 1) / No. of drives | (No. of drives - 2) / No. of drives |
| Fault tolerance | None | 1 drive failure | 1 drive failure | 2 drive failures |
| High availability | None | Good | Good | Very good |
| Degradation during rebuild | NA | Slight degradation; rebuilds very fast | High degradation; slow rebuild (due to write penalty of parity) | Very high degradation; very slow rebuild (due to write penalty of dual parity) |
| Random read performance | Very good | Good | Very good | Very good |
| Random write performance | Very good | Good (slightly worse than single drive) | Fair (parity overhead) | Poor (dual parity overhead) |
| Sequential read performance | Very good | Fair | Good | Good |
| Sequential write performance | Very good | Good | Fair | Fair |
| Cost | Lowest | High | Moderate | Moderate+ |
| Use case | Non-critical data; high speed requirements; data backed up elsewhere | Typically used as RAID 10 in OLTP / OLAP applications | Non-write-intensive OLTP applications, file servers etc. | Non-write-intensive OLTP applications, file servers etc. |
| Misc | - | - | Parity can considerably slow down the system | Not supported on all RAID cards |
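The capacity and efficiency formulas for the single RAID levels are easy to turn into a small calculator. This is a minimal sketch: the function name is hypothetical, and RAID 1 is modelled as a two-drive mirror to match the "maximum disks" figure in the comparison.

```python
def raid_capacity(level, drives, drive_gb):
    """Return (usable GB, storage efficiency) for RAID 0, 1, 5 or 6."""
    minimum = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} drives")
    if level == 1 and drives != 2:
        raise ValueError("RAID 1 is modelled here as a two-drive mirror")
    # Number of drives' worth of space left for data after redundancy.
    data_drives = {0: drives, 1: 1, 5: drives - 1, 6: drives - 2}[level]
    usable = data_drives * drive_gb
    return usable, usable / (drives * drive_gb)


for level, drives in [(0, 4), (1, 2), (5, 4), (6, 8)]:
    usable, efficiency = raid_capacity(level, drives, 500)
    print(f"RAID {level} on {drives} x 500 GB: {usable} GB ({efficiency:.0%})")
```

Note how RAID 5 and RAID 6 efficiency improves as drives are added, while the rebuild cost discussed elsewhere in the deck moves in the opposite direction.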
Understanding the Parity Penalty
- RAID 5 and RAID 6 store parity information alongside the data so that lost blocks can be rebuilt
- Single parity can be calculated using a simple XOR
- Example: “ABCDEFGHIJKL” on a 4-disk RAID 5 array
- If Disk 2 fails, the data “B” can be recalculated as (01000001 XOR 01000011 XOR 01000000) => 01000010 => B

| Disk 1 | Disk 2 | Disk 3 | Disk 4 |
|---|---|---|---|
| A (01000001) | B (01000010) | C (01000011) | Parity {P = 01000000} |
| Parity {P} | D | E | F |
| G | Parity {P} | H | I |
| J | K | Parity {P} | L |
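The XOR arithmetic in this example can be reproduced directly. The sketch below treats each block as a one-byte string; the helper name `xor_blocks` is made up for illustration, and real controllers work on much larger blocks.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


a, b, c = b"A", b"B", b"C"
parity = xor_blocks(a, b, c)          # parity block stored on Disk 4
rebuilt_b = xor_blocks(a, c, parity)  # Disk 2 lost: recover its block
print(format(parity[0], "08b"), rebuilt_b)
```

Because XOR is its own inverse, the same function both computes the parity and rebuilds any single missing block from the survivors.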
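Understanding the Parity Penalty
Steps to change “B” to “X” on Disk 2:
- Read A and C (the other data blocks in the stripe)
- Recalculate {P} as ‘A’ XOR ‘X’ XOR ‘C’
- Write ‘X’ and the new {P}
On this 4-disk array a single update required 2 reads and 2 writes. A controller can instead read the old ‘B’ and the old {P} and compute the new {P} as old {P} XOR ‘B’ XOR ‘X’, which also costs 2 reads and 2 writes. Either way a one-block write turns into four disk operations, so random writes in RAID 5 and RAID 6 are very expensive.

| Disk 1 | Disk 2 | Disk 3 | Disk 4 |
|---|---|---|---|
| A (01000001) | B -> X (01000010 -> 01011000) | C (01000011) | {P: 01000000 -> 01011010} |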
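A small sketch of the parity update for this example follows. It shows two ways a controller might arrive at the new parity: from the other data blocks in the stripe, or from the old parity plus the old and new contents of the changed block. Both function names are hypothetical, and blocks are modelled as single integers.

```python
def parity_by_reconstruct(new_block, other_blocks):
    """New parity from the new block plus every other data block in the stripe."""
    parity = new_block
    for block in other_blocks:
        parity ^= block
    return parity


def parity_by_read_modify_write(old_parity, old_block, new_block):
    """New parity from the old parity and the old and new block contents."""
    return old_parity ^ old_block ^ new_block


A, B, C, X = ord("A"), ord("B"), ord("C"), ord("X")
old_parity = A ^ B ^ C
new_parity = parity_by_reconstruct(X, [A, C])
assert new_parity == parity_by_read_modify_write(old_parity, B, X)
print(format(new_parity, "08b"))
```

The first path needs more reads as the stripe gets wider, while the second always reads two blocks; in both cases the changed block and the parity block must be written back.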
Understanding the Parity Penalty
- Rebuilding in RAID 5 and RAID 6 is expensive
- The cost increases with the number of disks
- As if this isn't enough, there is an additional penalty: all the writes after the computation (i.e. the parity and the changed block) must be simultaneous, which involves a two-phase commit operation
- The impact can be marginally reduced through write-back caching
Comparison of Nested RAID Levels

| | RAID 10 | RAID 50 |
|---|---|---|
| Description | Mirroring, then striping | Striping with parity, then striping without parity |
| Minimum disks | Even number, at least 4 | At least 6 |
| Maximum disks | Controller dependent | Controller dependent |
| Array capacity | (Size of drive) x (No. of drives) / 2 | (Size of drive) x (No. of drives in each RAID 5 set - 1) x (No. of RAID 5 sets) |
| Storage efficiency | 50% | (No. of drives in each RAID 5 set - 1) / (No. of drives in each RAID 5 set) |
| Fault tolerance | Multiple drive failures, as long as 2 drives from the same RAID 1 set do not fail | Multiple drive failures, as long as 2 drives from the same RAID 5 set do not fail |
| High availability | Excellent | Excellent |
| Degradation during rebuild | Minor | Moderate degradation; slow rebuild (due to write penalty of parity) |
| Read performance | Very good | Very good |
| Write performance | Very good | Good |
| Use case | OLTP / OLAP applications | Medium-write-intensive OLTP / OLAP applications |
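The nested-level capacity formulas and the fault tolerance rule (the array survives as long as no underlying set loses two drives) can be sketched as follows. The function names and the drive numbering scheme are assumptions made for this illustration.

```python
from collections import Counter


def raid10_capacity(drives, drive_gb):
    """Usable GB of a RAID 10 array: half the drives hold mirror copies."""
    if drives < 4 or drives % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    return drive_gb * (drives // 2)


def raid50_capacity(sets, drives_per_set, drive_gb):
    """Usable GB of a RAID 50 array: each RAID 5 set gives up one drive to parity."""
    if sets < 2 or drives_per_set < 3:
        raise ValueError("RAID 50 needs at least 2 RAID 5 sets of 3 drives each")
    return drive_gb * (drives_per_set - 1) * sets


def survives(failed_drives, drives_per_set):
    """True if no underlying RAID 1 or RAID 5 set has lost more than one drive.

    Drives are numbered from 0 and assigned to sets in order (an assumption
    of this sketch): drive d belongs to set d // drives_per_set.
    """
    losses = Counter(d // drives_per_set for d in failed_drives)
    return all(count <= 1 for count in losses.values())


print(raid10_capacity(8, 500), raid50_capacity(2, 4, 500))
print(survives([0, 2], drives_per_set=2), survives([0, 1], drives_per_set=2))
```

On the same eight 500 GB drives, the RAID 50 layout yields more usable space than RAID 10, at the price of the parity write penalty.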
Nested RAID Misc Notes
- RAID 10 is faster and better than RAID 0+1 for the same cost
- RAID 60 is similar to RAID 50, except that the striped sets with parity contain dual parity
- Ideally, RAID 10 and RAID 50 will be the only nested RAID levels you will use
RAID Considerations
- Select your stripe size by empirical testing
  - A smaller stripe size increases transfer performance and decreases positioning performance; a larger one does the opposite
  - Ideal stripe sizes depend on your application: the typical amount of data read per request, sequential vs random reads etc.
- Try to select hard drives from separate production batches
- Maintain sufficient spares in a large array (typically 1 per 10-15 disks is sufficient)
- Use global spares across RAID groups if your controller supports it
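To see why stripe size trades transfer performance against positioning performance, it helps to map a request onto the disks it touches. The sketch below assumes a plain RAID 0 layout and takes "stripe size" to mean the amount of data placed on one disk before moving to the next; both function names are hypothetical.

```python
def locate(offset, stripe_unit, disks):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    unit = offset // stripe_unit   # index of the stripe unit holding the byte
    disk = unit % disks            # units rotate across the disks
    row = unit // disks            # complete stripes before this unit
    return disk, row * stripe_unit + offset % stripe_unit


def disks_touched(offset, length, stripe_unit, disks):
    """Number of distinct disks a request of `length` bytes has to visit."""
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    return min(disks, last - first + 1)


KIB = 1024
# The same 256 KiB read on a 4-disk array with two different stripe sizes.
print(disks_touched(0, 256 * KIB, 64 * KIB, 4))   # spread over every disk
print(disks_touched(0, 256 * KIB, 256 * KIB, 4))  # served by a single disk
```

With the small stripe every disk transfers a quarter of the request in parallel, but every disk also has to seek; with the large stripe one disk does all the work and the others stay free for other requests.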
RAID Considerations
- Use hardware RAID unless performance is not a consideration
  - Nested and parity-based RAID levels in particular consume more CPU cycles and take longer to rebuild when implemented in software
- General rule about controller cache: the larger the better
- Ensure the controller has battery backup to retain its cache in case of power failure
- For internal RAID controller cards, use faster PCI buses (PCI-X)
The fun starts: let's build our storage system
Passive Disk Enclosure based Direct Attached Storage (PDE based DAS)
Passive Disk Enclosure based DAS
- DAS: Direct Attached Storage
- RAID controller sits inside the host machine
- External chassis is simply a JBOD (Just a Bunch Of Disks), or what I'd like to call a Passive Disk Enclosure (PDE)
- A PDE lets you string together a larger number of drives than an internal RAID array
- E.g. Dell PowerVault MD1000
Passive Disk Enclosure based DAS
- A Passive Disk Enclosure can consist of SAS, SATA or FC drives
- Passive Disk Enclosure to RAID controller connectivity can be SAS, FC or SCSI (possibly different from the backplane)
- Multiple PDEs can be daisy chained if they support it
- The RAID card is a single point of failure
- Only one host machine is supported
- The array of disks can be divided into multiple RAID groups
Passive Disk Enclosure based DAS
- The array of disks can be divided into multiple heterogeneous RAID groups
- The size and type of a RAID group depend on the RAID card
- A PDE may have multiple paths to the system, with the possibility of multiplexing for increased speed
- Global spares can be defined on the RAID card
- Maximum storage size = maximum number of PDEs that can be daisy chained x drives per PDE x size of drives
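The maximum-size rule of thumb can be written out as a short calculation. This sketch assumes every enclosure holds the same number of drives and sets aside hot spares at one per 15 drives; the enclosure count, slot count and drive size in the example are hypothetical values, not specifications of any product.

```python
import math


def das_capacity_plan(max_enclosures, drives_per_enclosure, drive_gb,
                      drives_per_spare=15):
    """Raw capacity of a daisy chain of passive enclosures, before RAID overhead."""
    drives = max_enclosures * drives_per_enclosure
    spares = math.ceil(drives / drives_per_spare)
    return {
        "drives": drives,
        "spares": spares,
        "raw_gb": drives * drive_gb,
        "raw_gb_after_spares": (drives - spares) * drive_gb,
    }


# Hypothetical chain: 3 enclosures, 15 drive bays each, 750 GB SATA drives.
print(das_capacity_plan(3, 15, 750))
```

The usable figure is lower still once the remaining drives are split into RAID groups with their own parity or mirror overhead.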
Passive Disk Enclosure based DAS: Performance Considerations
- Drives
- RAID configuration
- PDE interconnect
- PDE to RAID card connection
- RAID card config (cache etc.)
- PCI bus
Active Disk Enclosure based Direct Attached Storage (ADE based DAS)
Active Disk Enclosure based DAS
- The difference with an ADE: the RAID card is not in the host machine but in the enclosure
- The host machine has a SAS or FC Host Bus Adapter (HBA), depending on the ADE-to-host connectivity supported
- Some ADEs may support multiple connection protocols
- An ADE may support SAS / FC / SATA drives
- An ADE can support daisy-chaining PDEs
- Examples of ADEs: Dell MD3000, Infortrend EonStor devices, Nexsan SATABeast and SATABoy etc.
Active Disk Enclosure based DAS
- An ADE may support dual RAID controllers
- The RAID controllers can be used as active-active (in the case of multiple RAID groups), otherwise as active-passive
- RAID controller to HBA connectivity can be multiplexed, if supported, for higher throughput
- ADEs are wrongly but commonly referred to as a SAN (“SAN device” would still be all right)
Partitioning and Mounting
Logical Volumes
- A RAID Group is a physical unit of storage
- At the operating system level, a Logical Group (a Volume Group in LVM terms) can be created out of multiple RAID Groups
- Each Logical Group can be further divided into Logical Volumes
- Each Logical Volume represents a mountable block device
- In Linux this is done using LVM
- In LVM, Logical Volumes are resizable
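On Linux the layering described here maps onto three LVM commands: `pvcreate` marks each RAID group's block device as a physical volume, `vgcreate` pools them into a volume group, and `lvcreate` carves out logical volumes. The sketch below only builds and prints the command lines; the device paths, group name and volume names are hypothetical, and nothing is executed.

```python
import shlex


def lvm_plan(raid_devices, group, volumes):
    """Return the LVM commands (as argument lists) for the layout described.

    Nothing is run here; each list could be passed to subprocess.run() as root.
    """
    commands = [["pvcreate", device] for device in raid_devices]
    commands.append(["vgcreate", group, *raid_devices])
    for name, size in volumes.items():
        commands.append(["lvcreate", "-L", size, "-n", name, group])
    return commands


# Hypothetical names: two RAID groups exposed as /dev/sdb and /dev/sdc.
plan = lvm_plan(["/dev/sdb", "/dev/sdc"], "vg_data",
                {"lv_mail": "200G", "lv_db": "100G"})
for command in plan:
    print(shlex.join(command))
```

Resizing later follows the same pattern, for example `lvextend -L +50G /dev/vg_data/lv_mail`, followed by growing the filesystem that lives on the volume.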
SAN (Storage Area Network)
SAN
- Multiple host machines connected to an ADE through a SAN switch
- SAN refers to the interconnect + switch + ADE + PDE
- The switch and HBA can be SAS or FC, depending on the interconnect type supported by the ADE
- The ADE would support creation of volumes
- These can be mounted onto a client and further subdivided
SAN
- Care must be taken to mount each Logical Volume onto a single client (unless you are running a clustered file system)
- This can be achieved by host masking, supported by the ADE and/or the switch
- Without careful host masking and mounting, data corruption can take place
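The single-client rule lends itself to a simple audit. The sketch below checks a list of (host, volume) mappings and reports any volume exposed to more than one host, unless it is explicitly marked as carrying a clustered file system. The data model and names are invented for illustration; real ADEs and switches each have their own masking configuration.

```python
def masking_conflicts(mappings, clustered=()):
    """Return {volume: [hosts]} for volumes visible to more than one host.

    `mappings` is an iterable of (host, volume) pairs; volumes listed in
    `clustered` are allowed to be shared.
    """
    hosts_by_volume = {}
    for host, volume in mappings:
        hosts_by_volume.setdefault(volume, set()).add(host)
    return {
        volume: sorted(hosts)
        for volume, hosts in hosts_by_volume.items()
        if len(hosts) > 1 and volume not in clustered
    }


# Hypothetical masking table: vol2 is accidentally exposed to two hosts.
table = [("web1", "vol1"), ("web2", "vol2"), ("db1", "vol2")]
print(masking_conflicts(table))
```

Running a check like this before mounting catches the misconfiguration that would otherwise show up later as filesystem corruption.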
SAN
- Complex SAN configs include multiple hosts and multiple ADEs connected to active-active switches with multiplexed connections
- Client hosts can run heterogeneous operating systems
- (Funnily, ADE to PDE paths are sometimes not multiplexed)
SAN
- While this looks complex, just think of it as removing hard disks from the machine and hosting them outside in separate enclosures
- Each machine mounts an independent partition from the SAN
SAN Performance Considerations
- All the variables we covered before
- Switch config
- Ensure that the switch / HBA / interconnect does not become the bottleneck, so that the full hard disk throughput can be utilized
Throughput Calculations
- Hard disk performance: type, RPM etc.
- Data distribution and type of data access
- RAID performance: number of drives, RAID type
- RAID card performance: cache, active-active config etc.
- ADE to switch connection speed
- Switch to HBA connection speed
- HBA to PCI bus speed
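One way to use this checklist is to write down an estimated throughput for each stage and find the slowest one, since the end-to-end rate cannot exceed it. The stage names and MB/s figures below are placeholders chosen for illustration, not measurements of any hardware.

```python
def bottleneck(stages):
    """Return (name, MB/s) of the slowest stage in the data path."""
    return min(stages.items(), key=lambda item: item[1])


# Hypothetical estimates, in MB/s, for each stage between disks and host.
stages = {
    "disks (12 drives x 70 MB/s)": 12 * 70,
    "raid_controller": 600,
    "ade_to_switch": 400,
    "switch_to_hba": 800,
    "hba_to_pci_bus": 1000,
}
name, rate = bottleneck(stages)
print(f"bottleneck: {name} at {rate} MB/s")
```

In this made-up configuration the drives could deliver more than twice what the ADE-to-switch link carries, which is exactly the situation the performance considerations warn against.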
That’s all Folks
“Let’s go build out our Yottabyte arrays and fill ‘em up”
[Considerable hyperbole, given that the combined space of all computers in the world today (2007) doesn’t add up to 1 yottabyte (2^80 bytes). In fact the entire world’s storage is projected to hit 988 exabytes (1 exabyte = 2^60 bytes) by 2010.]
[6th Sep 2007 - http://www.networkworld.com/newsletters/stor/2007/0903stor2.html: Nanotech breakthrough could put entire YouTube contents on an iPod-size device]
Part II sneak preview
- Complex SAN configurations
- iSCSI
- NAS
- Clustered Storage
- GFS
- Backups
- Storage Monitoring
- Storage Benchmarking
- Some commercial storage vendors
Shameless HR Propaganda Slide
- Directi builds cool Web products
- Deployed on distributed architecture
- Using terabytes of storage
- Used by millions of users
- Generating billions of pageviews and transactions
- Spanning every possible software engineering technology
For more info visit http://careers.directi.com
Blog: http://bhavin.directi.com
Mail: [email_address]
