1
How to Build a
True no Single Point of Failure
Ceph Cluster
周振倫
Aaron JOUE
Founder & CEO
Agenda
• About Ambedded
• Ceph Architecture
• Why Ceph has no single point of failure
• How to build a true no-single-point-of-failure Ceph cluster
• High Availability
• Scalable
• How does Ceph support OpenStack?
• Build a no-single-point-of-failure Ceph cluster
• Build an OpenStack A-Team in Taiwan
2
About Ambedded Technology
Y2013: Founded in Taipei, Taiwan; office in the National Taiwan University Innovation Incubation Center
Y2014-2015: Delivered 2,000+ Gen 1 microservers to partner Cynny for its cloud storage service (9 petabytes of capacity in service to date); demoed at the ARM Global Partner Meeting in Cambridge, UK
Y2016: Launched the first-ever Ceph storage appliance powered by the Gen 2 ARM microserver; awarded 2016 Best of Interop Las Vegas storage product, beating VMware Virtual SAN
Y2017: Won the Computex 2017 Best Choice Golden Award; Mars 200 successfully deployed at tier-1 telecom companies in France and Taiwan
3
4
Ceph is Unified Storage
• RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CephFS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
• RADOS Gateway: a bucket-based REST gateway, compatible with S3 and Swift
[Diagram: apps, hosts/VMs, and clients consuming object, block, and file storage on top of a RADOS cluster of OSD, MON, and MDS daemons]
Why Ceph Has NO Single Point of Failure
• Data and its replicas are distributed across the disks in the cluster
• CRUSH algorithm: Controlled Replication Under Scalable Hashing
• Objects are distributed across OSDs according to a pre-defined failure domain
• CRUSH rules ensure data copies are never stored in the same failure domain
• Self-healing automatically recovers data when a server or device fails
• No central controller, so no bottleneck limits scalability
• Clients use the cluster map and hash calculations to write/read objects directly to/from OSDs (see the librados sketch after this slide)
• Geo-replication & mirroring
5
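As an illustrative sketch (not part of the original slides), the snippet below uses the python-rados bindings to show the "no controller" data path: the client pulls the cluster map from the MONs at connect time, then writes and reads an object straight to the responsible OSDs. The pool name `test-pool` and the object name are placeholders.

```python
# Sketch: direct client I/O through librados (python-rados bindings).
# Assumes a reachable cluster and an existing pool named "test-pool".
import rados

# Connecting fetches the cluster map (MON/OSD/CRUSH maps) from the monitors.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

try:
    # Object placement is computed client-side with CRUSH, so reads and
    # writes go straight to the responsible OSDs -- no central controller.
    ioctx = cluster.open_ioctx('test-pool')
    try:
        ioctx.write_full('hello-object', b'no single point of failure')
        print(ioctx.read('hello-object'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```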
CRUSH Algorithm & Replication
6
[Diagram: the client obtains the cluster map from the MONs, computes the object location (placement group and primary OSD), and writes directly; the primary OSD replicates to the other OSDs, so copies 1, 2, and 3 land on different nodes]
CRUSH rules ensure that replicated data is located on different server nodes (a CRUSH-rule sketch follows this slide).
The failure domain can be defined as: node, chassis, rack, or data center.
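As a hedged sketch, a replicated CRUSH rule with `host` as the failure domain can be created programmatically through python-rados `mon_command()`. The rule name is a placeholder, and the JSON argument names mirror the `ceph osd crush rule create-replicated` CLI; exact schemas may differ slightly between Ceph releases.

```python
# Sketch: create a replicated CRUSH rule with failure domain = host.
# Equivalent, known CLI:
#   ceph osd crush rule create-replicated rep-by-host default host
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

def mon_cmd(**kwargs):
    """Send a monitor command; raise if the monitor reports an error."""
    ret, out, err = cluster.mon_command(json.dumps(kwargs), b'')
    if ret != 0:
        raise RuntimeError(err)
    return out

# Replicas placed by this rule are spread across different hosts.
mon_cmd(prefix='osd crush rule create-replicated',
        name='rep-by-host', root='default', type='host')

cluster.shutdown()
```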
CRUSH Map of a Large Cluster
7
[Diagram: CRUSH hierarchy with Root at the top, three racks (Rack 11, Rack 12, Rack 13), three nodes per rack (Node 11-13, 21-23, 31-33), and Disk 1 through Disk 8 under each node; replicas 1, 2, and 3 are placed in different branches of the tree]
(The hierarchy can be read back with `ceph osd tree`; see the sketch after this slide.)
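A small sketch of how the CRUSH hierarchy on this slide can be inspected from a script, by issuing the same command as `ceph osd tree -f json` through python-rados:

```python
# Sketch: dump the CRUSH hierarchy (equivalent to `ceph osd tree -f json`).
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ret, out, err = cluster.mon_command(
    json.dumps({'prefix': 'osd tree', 'format': 'json'}), b'')
cluster.shutdown()

tree = json.loads(out)
# Each entry has a type (root, rack, host, osd) and a name,
# mirroring the Root -> Rack -> Node -> Disk layout on this slide.
for node in tree['nodes']:
    print(node['type'], node['name'])
```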
CRUSH Algorithm & Erasure Coding
8
[Diagram: the client computes the object location; the placement group maps to (K+M) OSDs located in different failure domains; the object is split into K data chunks plus M coding chunks]
With K=4 and M=2 (K+M = 4+2), the pool tolerates at most 2 failed OSDs, and the raw capacity consumed is (K+M)/K of the original data, i.e. 1.5x (see the sketch after this slide).
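The overhead arithmetic on this slide can be checked with a few lines; the commented CLI shows how such a profile is typically defined (a sketch, with an assumed profile name).

```python
# Sketch: capacity overhead and fault tolerance of a K+M erasure-coded pool.
# A matching profile is normally created with the known CLI:
#   ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
K, M = 4, 2                       # data chunks, coding chunks
raw_per_byte = (K + M) / K        # raw capacity per byte of user data
print(f'Tolerates up to {M} failed OSDs')           # 2
print(f'Raw capacity used: {raw_per_byte:.2f}x')    # 1.50x, vs. 2x-3x for replication
```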
9
Self Healing
• Auto-detection: the cluster re-generates the missing copies of data
• The re-generated copies are written back to the surviving nodes, following the CRUSH rule configured via UVS
• Autonomic! Self-healing activates as soon as the cluster detects data at risk; no human intervention is needed (see the health-check sketch after this slide)
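A hedged sketch of how self-healing progress can be watched from a script: the monitors report degraded object counts in `ceph status`, and the count falls back to zero as the cluster re-creates the missing copies. The JSON field names shown are assumptions and can vary between Ceph releases.

```python
# Sketch: poll cluster status while self-healing (recovery/backfill) runs.
import json
import time
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    for _ in range(10):
        ret, out, err = cluster.mon_command(
            json.dumps({'prefix': 'status', 'format': 'json'}), b'')
        status = json.loads(out)
        # 'pgmap' carries degraded/misplaced object counts during recovery
        # (field names assumed; check `ceph status -f json` on your release).
        pgmap = status.get('pgmap', {})
        print('degraded objects:', pgmap.get('degraded_objects', 0))
        time.sleep(30)
finally:
    cluster.shutdown()
```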
Auto Balance vs. Auto Scale-Out
[Diagram: data from full OSDs is rebalanced onto newly added empty OSDs until every OSD is evenly loaded]
When new Mars 200/201 units join the cluster, the total capacity scales out automatically, and existing data is auto-balanced across the new OSDs so that performance and capacity scale together.
OSD Self-Heal vs. RAID Re-build
11
Test Condition                    | Microserver Ceph Cluster            | Disk Array
Disk number / capacity            | 16 x 10TB HDD OSDs                  | 16 x 3TB HDDs
Data protection                   | Replica = 2                         | RAID 5
Data stored on the failed disk    | 3TB                                 | Not applicable
Time to re-heal / re-build        | 5 hours 10 min.                     | 41 hours
Administrator involvement         | Re-heal starts automatically        | Re-build starts after replacing the disk
Re-heal / re-build rate           | 169 MB/s (~10 MB/s per OSD)         | 21 MB/s
Recovery time vs. number of disks | More disks -> shorter recovery time | More disks -> longer recovery time

*OSD backfilling configuration is left at its default values (a quick arithmetic check of these figures follows).
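A quick sanity check of the re-heal figures in the table (a sketch; 3 TB is taken as 3x10^12 bytes):

```python
# Sketch: check the re-heal figures from the comparison table.
data_tb = 3                      # data to re-create (TB, decimal)
heal_rate_mb_s = 169             # aggregate re-heal rate (MB/s)
osds = 16

seconds = data_tb * 1_000_000 / heal_rate_mb_s      # TB -> MB
print(f'{seconds / 3600:.1f} hours')                 # ~4.9 h, close to the measured 5 h 10 min
print(f'{heal_rate_mb_s / osds:.1f} MB/s per OSD')   # ~10.6 MB/s, matching ~10 MB/s/OSD
```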
Build a no Single Point of Failure Ceph Cluster
• Hardware will always fail
• Protect data with software intelligence instead of hardware redundancy
• Minimize the failure domain and make it configurable
12
Issues of Using a Single Server Node with Multiple Ceph OSDs
• Large failure domain: one server failure takes many OSDs down.
• CPU utilization is only 30%-40% when the network is saturated; the bottleneck is the network, not computing.
• Power consumption and heat dissipation eat into your budget.
13
1x OSD with 1x Micro Server
[Diagram: three 8-node ARM microserver clusters, one microserver per OSD, each cluster connected to the network through aggregated uplinks]
ARM microserver cluster:
- 1-to-1 (one microserver per OSD) reduces the failure risk
- Aggregated network bandwidth without a bottleneck
[Diagram: three traditional servers, each hosting N OSDs, serving Client #1 and Client #2 over the network through a 20Gb uplink per server]
Traditional server:
- 1-to-many (one server hosting many OSDs) means a single server failure takes down many OSDs
- CPU utilization is low due to the network bottleneck
14
Mars 200: 8-Node ARM Microserver Cluster
8x 1.6 GHz dual-core ARM v7 microservers
- 2 GB DRAM
- 8 GB flash (system disk)
- 5 Gbps LAN
- < 5 W power consumption
Every node can serve as an OSD, MON, MDS, or gateway
Storage devices
- 8x SATA3 HDD/SSD OSDs
- 8x SATA3 journal SSDs
OOB BMC port
Dual uplink switches
- Total 4x 10 Gbps
15
Hot swappable:
- Micro server
- HDD/SSD
- Ethernet switch
- Power supply
The Basic High Availability Cluster
16
Scale it out
The Benefit of Using a 1-Node-to-1-OSD Architecture on Ceph
• Minimizes the failure domain to a single OSD
• The MTBF of a microserver is much higher than that of an all-in-one motherboard (MTBF > 120K hours)
• High availability: 15 nines with 3-way replication (see the sketch after this slide)
• High performance: dedicated hardware resources per OSD
  – CPU, memory, network, SATA interface, SSD journal disk
• High bandwidth: aggregated network bandwidth with failover
• Low power consumption (60 W) and cooling cost savings
• Three 1U chassis form a high-availability cluster
17
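The "15 nines" figure follows from a simple independence model: if each replica lives in its own failure domain with, say, five-nines availability, data becomes unavailable only when all three copies are down at once. This is a hedged back-of-the-envelope sketch, not the vendor's exact reliability model.

```python
# Sketch: availability of 3 independent replicas (simplified independence model).
node_availability = 0.99999                      # assumed five nines per failure domain
unavailability = (1 - node_availability) ** 3    # all three copies down at once
availability = 1 - unavailability
print(f'unavailability: {unavailability:.0e}')   # 1e-15 -> "15 nines" of availability
```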
Ceph Storage Appliance
18
2U, 8 nodes: front-panel disk access
1U, 8 nodes: high density (2017)
RBD Performance Test
19
[Test setup: 40 VM clients on a Xeon server act as load workers, connected to a 10G switch over 2x 10Gbps; the Ceph cluster (21 SSD OSDs with SSD journals, 3 MONs) uplinks over 4x 10Gbps; 40 RBD images]
Run fio from 1 client up to 40 clients, and take the maximum unsaturated bandwidth as the aggregated performance (a test-bed sketch follows this slide).
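A sketch of how such a test bed can be prepared with the python-rbd bindings: create the 40 test images, then drive them with fio's rbd ioengine. The pool name, image names, and sizes are placeholders; the commented fio line uses the standard rbd-engine options.

```python
# Sketch: create 40 RBD images for an fio load test.
# Each image can then be exercised with fio's rbd engine, e.g.:
#   fio --ioengine=rbd --pool=rbd --rbdname=bench-00 --rw=randread \
#       --bs=4k --iodepth=32 --direct=1 --runtime=60 --name=bench
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                        # assumed pool name
try:
    r = rbd.RBD()
    for i in range(40):
        r.create(ioctx, f'bench-{i:02d}', 10 * 1024 ** 3)  # 10 GiB each
finally:
    ioctx.close()
    cluster.shutdown()
```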
Scale Out Test (SSD)
[Chart: 4K random read/write IOPS vs. number of OSDs; per-OSD linearity is checked in the sketch after this slide]
Number of OSDs | 4K Random Read IOPS | 4K Random Write IOPS
7              | 62,546              | 8,955
14             | 125,092             | 17,910
21             | 187,639             | 26,866
20
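The scale-out numbers above are almost perfectly linear; a few lines confirm that the per-OSD contribution stays constant as OSDs are added (a sketch using the figures from the chart):

```python
# Sketch: per-OSD IOPS from the scale-out test -- roughly constant, i.e. linear scaling.
read_iops = {7: 62_546, 14: 125_092, 21: 187_639}
write_iops = {7: 8_955, 14: 17_910, 21: 26_866}

for osds in (7, 14, 21):
    print(f'{osds:2d} OSDs: '
          f'{read_iops[osds] / osds:,.0f} read IOPS/OSD, '
          f'{write_iops[osds] / osds:,.0f} write IOPS/OSD')
# ~8,935 read and ~1,279 write IOPS per OSD at every cluster size.
```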
Network Does Matter
The purpose of this test is to measure the improvement when the uplink bandwidth is increased from 20Gb to 40Gb. The Mars 200 has 4x 10Gb uplink ports. The test results show a 42-57% IOPS improvement.
21
Ceph Management GUI
Demo
22
Build an OpenStack A-Team in Taiwan
23
晨宇創新 數位無限
The Power of Partnership
24
Aaron Joue
aaron@ambedded.com.tw
Ceph and OpenStack
25
OpenStack services: KEYSTONE, SWIFT, CINDER, GLANCE, NOVA, MANILA, CEILOMETER
Ceph access layers: RADOS Gateway (librgw), LIBRBD (used by KVM/QEMU via libvirt), libcephfs, LIBRADOS
RADOS cluster: OSD, MON, and MDS daemons distributed across the nodes
[Diagram: the OpenStack services consume the RADOS cluster through the RADOS Gateway, LIBRBD, libcephfs, and LIBRADOS; a small RBD sketch follows]
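As a hedged illustration of what sits underneath the OpenStack integrations (Cinder, Glance, and Nova ultimately create and attach RBD images), the snippet below creates a volume-like RBD image in a `volumes` pool with the python-rbd bindings. The pool and image names are placeholders, not the exact names OpenStack would generate.

```python
# Sketch: create an RBD image the way a Cinder-style block volume is backed.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('volumes')                       # assumed pool name
try:
    rbd.RBD().create(ioctx, 'volume-demo', 20 * 1024 ** 3)  # 20 GiB image
    with rbd.Image(ioctx, 'volume-demo') as image:
        print('size:', image.size(), 'bytes')
finally:
    ioctx.close()
    cluster.shutdown()
```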