1
How to Build a
True no Single Point of Failure
Ceph Cluster
周振倫
Aaron JOUE
Founder & CEO
Agenda
• About Ambedded
• Ceph Architecture
• Why Ceph has no single point of failure
• How to build a true no-single-point-of-failure Ceph cluster
• High Availability
• Scalable
• How does Ceph support OpenStack?
• Build a no-single-point-of-failure Ceph cluster
• Build an OpenStack A-Team in Taiwan
2
About Ambedded Technology
Y2013: Founded in Taipei, Taiwan; office in the National Taiwan University Innovation Incubation Center
Y2014-2015: Delivered 2,000+ Gen 1 microservers to partner Cynny for its cloud storage service (9 petabytes of capacity in service to date); demoed at the ARM Global Partner Meeting in Cambridge, UK
Y2016: Launched the first-ever Ceph storage appliance powered by the Gen 2 ARM microserver; awarded 2016 Best of Interop Las Vegas storage product, beating VMware Virtual SAN
Y2017: Won the Computex 2017 Best Choice Golden Award; Mars 200 successfully deployed at tier-1 telecom companies in France and Taiwan
3
4
Ceph is Unified Storage
• RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CephFS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
• RADOS Gateway: a bucket-based REST gateway, compatible with S3 and Swift
[Diagram: apps, hosts/VMs, and clients consuming object, block, and file storage on top of a RADOS cluster of OSD, MON, and MDS daemons]
Why Ceph Has NO Single Point of Failure
• Data and its replicas are distributed across the disks in the cluster
• CRUSH algorithm: Controlled Replication Under Scalable Hashing
• Objects are distributed across OSDs according to a pre-defined failure domain
• CRUSH rules ensure data copies are never stored in the same failure domain
• Self-healing automatically recovers data when a server or device fails
• No central controller, so no bottleneck limits scalability
• Clients use the cluster map and hash calculations to write/read objects directly to/from OSDs (see the librados sketch after this slide)
• Geo-replication & mirroring
5
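As an illustrative sketch (not part of the original slides), the snippet below uses the python-rados bindings to show the "no controller" data path: the client pulls the cluster map from the MONs at connect time, then writes and reads an object straight to the responsible OSDs. The pool name `test-pool` and the object name are placeholders.

```python
# Sketch: direct client I/O through librados (python-rados bindings).
# Assumes a reachable cluster and an existing pool named "test-pool".
import rados

# Connecting fetches the cluster map (MON/OSD/CRUSH maps) from the monitors.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

try:
    # Object placement is computed client-side with CRUSH, so reads and
    # writes go straight to the responsible OSDs -- no central controller.
    ioctx = cluster.open_ioctx('test-pool')
    try:
        ioctx.write_full('hello-object', b'no single point of failure')
        print(ioctx.read('hello-object'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```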
CRUSH Algorithm & Replication
6
[Diagram: the client obtains the cluster map from the MONs, computes the object location (placement group and primary OSD), and writes directly; the primary OSD replicates to the other OSDs, so copies 1, 2, and 3 land on different nodes]
CRUSH rules ensure that replicated data is located on different server nodes (a CRUSH-rule sketch follows this slide).
The failure domain can be defined as: node, chassis, rack, or data center.
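As a hedged sketch, a replicated CRUSH rule with `host` as the failure domain can be created programmatically through python-rados `mon_command()`. The rule name is a placeholder, and the JSON argument names mirror the `ceph osd crush rule create-replicated` CLI; exact schemas may differ slightly between Ceph releases.

```python
# Sketch: create a replicated CRUSH rule with failure domain = host.
# Equivalent, known CLI:
#   ceph osd crush rule create-replicated rep-by-host default host
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

def mon_cmd(**kwargs):
    """Send a monitor command; raise if the monitor reports an error."""
    ret, out, err = cluster.mon_command(json.dumps(kwargs), b'')
    if ret != 0:
        raise RuntimeError(err)
    return out

# Replicas placed by this rule are spread across different hosts.
mon_cmd(prefix='osd crush rule create-replicated',
        name='rep-by-host', root='default', type='host')

cluster.shutdown()
```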
CRUSH Map of a Large Cluster
7
[Diagram: CRUSH hierarchy with Root at the top, three racks (Rack 11, Rack 12, Rack 13), three nodes per rack (Node 11-13, 21-23, 31-33), and Disk 1 through Disk 8 under each node; replicas 1, 2, and 3 are placed in different branches of the tree]
(The hierarchy can be read back with `ceph osd tree`; see the sketch after this slide.)
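A small sketch of how the CRUSH hierarchy on this slide can be inspected from a script, by issuing the same command as `ceph osd tree -f json` through python-rados:

```python
# Sketch: dump the CRUSH hierarchy (equivalent to `ceph osd tree -f json`).
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ret, out, err = cluster.mon_command(
    json.dumps({'prefix': 'osd tree', 'format': 'json'}), b'')
cluster.shutdown()

tree = json.loads(out)
# Each entry has a type (root, rack, host, osd) and a name,
# mirroring the Root -> Rack -> Node -> Disk layout on this slide.
for node in tree['nodes']:
    print(node['type'], node['name'])
```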
CRUSH Algorithm & Erasure Coding
8
[Diagram: the client computes the object location; the placement group maps to (K+M) OSDs located in different failure domains; the object is split into K data chunks plus M coding chunks]
With K=4 and M=2 (K+M = 4+2), the pool tolerates at most 2 failed OSDs, and the raw capacity consumed is (K+M)/K of the original data, i.e. 1.5x (see the sketch after this slide).
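The overhead arithmetic on this slide can be checked with a few lines; the commented CLI shows how such a profile is typically defined (a sketch, with an assumed profile name).

```python
# Sketch: capacity overhead and fault tolerance of a K+M erasure-coded pool.
# A matching profile is normally created with the known CLI:
#   ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
K, M = 4, 2                       # data chunks, coding chunks
raw_per_byte = (K + M) / K        # raw capacity per byte of user data
print(f'Tolerates up to {M} failed OSDs')           # 2
print(f'Raw capacity used: {raw_per_byte:.2f}x')    # 1.50x, vs. 2x-3x for replication
```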
9
Self Healing
• Auto-detection: the cluster re-generates the missing copies of data
• The re-generated copies are written back to the surviving nodes, following the CRUSH rule configured via UVS
• Autonomic! Self-healing activates as soon as the cluster detects data at risk; no human intervention is needed (see the health-check sketch after this slide)
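A hedged sketch of how self-healing progress can be watched from a script: the monitors report degraded object counts in `ceph status`, and the count falls back to zero as the cluster re-creates the missing copies. The JSON field names shown are assumptions and can vary between Ceph releases.

```python
# Sketch: poll cluster status while self-healing (recovery/backfill) runs.
import json
import time
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    for _ in range(10):
        ret, out, err = cluster.mon_command(
            json.dumps({'prefix': 'status', 'format': 'json'}), b'')
        status = json.loads(out)
        # 'pgmap' carries degraded/misplaced object counts during recovery
        # (field names assumed; check `ceph status -f json` on your release).
        pgmap = status.get('pgmap', {})
        print('degraded objects:', pgmap.get('degraded_objects', 0))
        time.sleep(30)
finally:
    cluster.shutdown()
```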
Auto Balance vs. Auto Scale-Out
[Diagram: data from full OSDs is rebalanced onto newly added empty OSDs until every OSD is evenly loaded]
When new Mars 200/201 units join the cluster, the total capacity scales out automatically, and existing data is auto-balanced across the new OSDs so that performance and capacity scale together.
OSD Self-Heal vs. RAID Re-build
11
Test Condition                    | Microserver Ceph Cluster            | Disk Array
Disk number / capacity            | 16 x 10TB HDD OSDs                  | 16 x 3TB HDDs
Data protection                   | Replica = 2                         | RAID 5
Data stored on the failed disk    | 3TB                                 | Not applicable
Time to re-heal / re-build        | 5 hours 10 min.                     | 41 hours
Administrator involvement         | Re-heal starts automatically        | Re-build starts after replacing the disk
Re-heal / re-build rate           | 169 MB/s (~10 MB/s per OSD)         | 21 MB/s
Recovery time vs. number of disks | More disks -> shorter recovery time | More disks -> longer recovery time

*OSD backfilling configuration is left at its default values (a quick arithmetic check of these figures follows).
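A quick sanity check of the re-heal figures in the table (a sketch; 3 TB is taken as 3x10^12 bytes):

```python
# Sketch: check the re-heal figures from the comparison table.
data_tb = 3                      # data to re-create (TB, decimal)
heal_rate_mb_s = 169             # aggregate re-heal rate (MB/s)
osds = 16

seconds = data_tb * 1_000_000 / heal_rate_mb_s      # TB -> MB
print(f'{seconds / 3600:.1f} hours')                 # ~4.9 h, close to the measured 5 h 10 min
print(f'{heal_rate_mb_s / osds:.1f} MB/s per OSD')   # ~10.6 MB/s, matching ~10 MB/s/OSD
```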
Build a no Single Point of Failure Ceph Cluster
• Hardware will always fail
• Protect data with software intelligence instead of hardware redundancy
• Minimize the failure domain and make it configurable
12
Issues of Using a Single Server Node with Multiple Ceph OSDs
• Large failure domain: one server failure takes many OSDs down.
• CPU utilization is only 30%-40% when the network is saturated; the bottleneck is the network, not computing.
• Power consumption and heat dissipation eat into your budget.
13
1x OSD with 1x Micro Server
[Diagram: three 8-node ARM microserver clusters, one microserver per OSD, each cluster connected to the network through aggregated uplinks]
ARM microserver cluster:
- 1-to-1 (one microserver per OSD) reduces the failure risk
- Aggregated network bandwidth without a bottleneck
[Diagram: three traditional servers, each hosting N OSDs, serving Client #1 and Client #2 over the network through a 20Gb uplink per server]
Traditional server:
- 1-to-many (one server hosting many OSDs) means a single server failure takes down many OSDs
- CPU utilization is low due to the network bottleneck
14
Mars 200: 8-Node ARM Microserver Cluster
8x 1.6 GHz dual-core ARM v7 microservers
- 2 GB DRAM
- 8 GB flash (system disk)
- 5 Gbps LAN
- < 5 W power consumption
Every node can serve as an OSD, MON, MDS, or gateway
Storage devices
- 8x SATA3 HDD/SSD OSDs
- 8x SATA3 journal SSDs
OOB BMC port
Dual uplink switches
- Total 4x 10 Gbps
15
Hot swappable:
- Micro server
- HDD/SSD
- Ethernet switch
- Power supply
The Basic High Availability Cluster
16
Scale it out
The Benefit of Using a 1-Node-to-1-OSD Architecture on Ceph
• Minimizes the failure domain to a single OSD
• The MTBF of a microserver is much higher than that of an all-in-one motherboard (MTBF > 120K hours)
• High availability: 15 nines with 3-way replication (see the sketch after this slide)
• High performance: dedicated hardware resources per OSD
  – CPU, memory, network, SATA interface, SSD journal disk
• High bandwidth: aggregated network bandwidth with failover
• Low power consumption (60 W) and cooling cost savings
• Three 1U chassis form a high-availability cluster
17
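The "15 nines" figure follows from a simple independence model: if each replica lives in its own failure domain with, say, five-nines availability, data becomes unavailable only when all three copies are down at once. This is a hedged back-of-the-envelope sketch, not the vendor's exact reliability model.

```python
# Sketch: availability of 3 independent replicas (simplified independence model).
node_availability = 0.99999                      # assumed five nines per failure domain
unavailability = (1 - node_availability) ** 3    # all three copies down at once
availability = 1 - unavailability
print(f'unavailability: {unavailability:.0e}')   # 1e-15 -> "15 nines" of availability
```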
Ceph Storage Appliance
18
2U, 8 nodes: front-panel disk access
1U, 8 nodes: high density (2017)
RBD Performance Test
19
[Test setup: 40 VM clients on a Xeon server act as load workers, connected to a 10G switch over 2x 10Gbps; the Ceph cluster (21 SSD OSDs with SSD journals, 3 MONs) uplinks over 4x 10Gbps; 40 RBD images]
Run fio from 1 client up to 40 clients, and take the maximum unsaturated bandwidth as the aggregated performance (a test-bed sketch follows this slide).
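A sketch of how such a test bed can be prepared with the python-rbd bindings: create the 40 test images, then drive them with fio's rbd ioengine. The pool name, image names, and sizes are placeholders; the commented fio line uses the standard rbd-engine options.

```python
# Sketch: create 40 RBD images for an fio load test.
# Each image can then be exercised with fio's rbd engine, e.g.:
#   fio --ioengine=rbd --pool=rbd --rbdname=bench-00 --rw=randread \
#       --bs=4k --iodepth=32 --direct=1 --runtime=60 --name=bench
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                        # assumed pool name
try:
    r = rbd.RBD()
    for i in range(40):
        r.create(ioctx, f'bench-{i:02d}', 10 * 1024 ** 3)  # 10 GiB each
finally:
    ioctx.close()
    cluster.shutdown()
```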
Scale Out Test (SSD)
[Chart: 4K random read/write IOPS vs. number of OSDs; per-OSD linearity is checked in the sketch after this slide]
Number of OSDs | 4K Random Read IOPS | 4K Random Write IOPS
7              | 62,546              | 8,955
14             | 125,092             | 17,910
21             | 187,639             | 26,866
20
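The scale-out numbers above are almost perfectly linear; a few lines confirm that the per-OSD contribution stays constant as OSDs are added (a sketch using the figures from the chart):

```python
# Sketch: per-OSD IOPS from the scale-out test -- roughly constant, i.e. linear scaling.
read_iops = {7: 62_546, 14: 125_092, 21: 187_639}
write_iops = {7: 8_955, 14: 17_910, 21: 26_866}

for osds in (7, 14, 21):
    print(f'{osds:2d} OSDs: '
          f'{read_iops[osds] / osds:,.0f} read IOPS/OSD, '
          f'{write_iops[osds] / osds:,.0f} write IOPS/OSD')
# ~8,935 read and ~1,279 write IOPS per OSD at every cluster size.
```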
Network Does Matter
The purpose of this test is to measure the improvement when the uplink bandwidth is increased from 20Gb to 40Gb. The Mars 200 has 4x 10Gb uplink ports. The test results show a 42-57% IOPS improvement.
21
Ceph Management GUI
Demo
22
Build an OpenStack A-Team in Taiwan
23
晨宇創新 數位無限
The Power of Partnership
24
Aaron Joue
aaron@ambedded.com.tw
Ceph and OpenStack
25
OpenStack services: KEYSTONE, SWIFT, CINDER, GLANCE, NOVA, MANILA, CEILOMETER
Ceph access layers: RADOS Gateway (librgw), LIBRBD (used by KVM/QEMU via libvirt), libcephfs, LIBRADOS
RADOS cluster: OSD, MON, and MDS daemons distributed across the nodes
[Diagram: the OpenStack services consume the RADOS cluster through the RADOS Gateway, LIBRBD, libcephfs, and LIBRADOS; a small RBD sketch follows]
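As a hedged illustration of what sits underneath the OpenStack integrations (Cinder, Glance, and Nova ultimately create and attach RBD images), the snippet below creates a volume-like RBD image in a `volumes` pool with the python-rbd bindings. The pool and image names are placeholders, not the exact names OpenStack would generate.

```python
# Sketch: create an RBD image the way a Cinder-style block volume is backed.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('volumes')                       # assumed pool name
try:
    rbd.RBD().create(ioctx, 'volume-demo', 20 * 1024 ** 3)  # 20 GiB image
    with rbd.Image(ioctx, 'volume-demo') as image:
        print('size:', image.size(), 'bytes')
finally:
    ioctx.close()
    cluster.shutdown()
```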