Building Tomorrow's Ceph
Sage Weil
Research beginnings

UCSC research grant
● “Petascale object storage”
● DOE: LANL, LLNL, Sandia
● Scalability
● Reliability
● Performance
  ● Raw IO bandwidth, metadata ops/sec
● HPC file system workloads
  ● Thousands of clients writing to the same file or directory

Distributed metadata management
● Innovative design
  ● Subtree-based partitioning for locality, efficiency
  ● Dynamically adapt to the current workload
  ● Embedded inodes
● Prototype simulator in Java (2004)
● First line of Ceph code
  ● Summer internship at LLNL
  ● High-security national lab environment
  ● Could write anything, as long as it was OSS

The rest of Ceph
● RADOS – distributed object storage cluster (2005)
● EBOFS – local object storage (2004/2006)
● CRUSH – hashing for the real world (2005) (see the placement sketch below)
● Paxos monitors – cluster consensus (2006)

→ emphasis on consistent, reliable storage
→ scale by pushing intelligence to the edges
→ a different but compelling architecture
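
CRUSH itself maps objects to devices through hierarchical buckets and placement rules, but its core promise — any client can compute where data lives from a small cluster map, with no central lookup table — can be illustrated with a much simpler cousin, weighted rendezvous hashing. The sketch below is purely illustrative (it is not the CRUSH algorithm, and the OSD IDs and weights are invented):

```python
import hashlib
import math

def score(obj_id: str, osd_id: int, weight: float) -> float:
    """Deterministic pseudo-random score for an (object, OSD) pair; higher wins.
    Weighted rendezvous hashing: heavier OSDs win proportionally more objects."""
    digest = hashlib.md5(f"{obj_id}:{osd_id}".encode()).hexdigest()
    u = (int(digest, 16) + 1) / float((1 << 128) + 1)   # uniform in (0, 1)
    return -weight / math.log(u)

def place(obj_id: str, osds: dict, replicas: int = 3) -> list:
    """Pick `replicas` distinct OSDs for an object -- computable by any client
    that knows the map, with no lookup table and no central metadata server."""
    ranked = sorted(osds, key=lambda o: score(obj_id, o, osds[o]), reverse=True)
    return ranked[:replicas]

# Toy "cluster map": OSD id -> weight (e.g., capacity in TB).
osds = {0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0, 4: 1.0}
print(place("rbd_data.1234.0000000000000001", osds))
```

The real CRUSH adds failure-domain awareness (hosts, racks, rooms) and tunable bucket types on top of the same declustered, computable-placement idea.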

Industry black hole
● Many large storage vendors
  ● Proprietary solutions that don't scale well
● Few open source alternatives (2006)
  ● Limited community and architecture (Lustre)
  ● Very limited scale, or
  ● No enterprise feature sets (snapshots, quotas)
● PhD grads all built interesting systems...
  ● ...and then went to work for NetApp, DDN, EMC, Veritas
  ● They want you, not your project

A different path
● Change the world with open source
  ● Do what Linux did to Solaris, Irix, Ultrix, etc.
  ● What could go wrong?
● License
  ● GPL, BSD...
  ● LGPL: share changes, okay to link to proprietary code
● Avoid unsavory practices
  ● Dual licensing
  ● Copyright assignment

Incubation

DreamHost!
● Move back to LA, continue hacking
● Hired a few developers
● Pure development
● No deliverables

Ambitious feature set
● Native Linux kernel client (2007-)
● Per-directory snapshots (2008)
● Recursive accounting (2008)
● Object classes (2009)
● librados (2009)
● radosgw (2009)
● Strong authentication (2009)
● RBD: rados block device (2010)

The kernel client
● ceph-fuse was limited, not very fast
● Build a native Linux kernel implementation
● Began attending Linux file system developer events (LSF)
  ● Early words of encouragement from ex-Lustre devs
  ● Engage the Linux fs developer community as a peer
● Initial merge attempts rejected by Linus
  ● Not sufficient evidence of user demand
  ● A few fans and would-be users chimed in...
● Eventually merged for v2.6.34 (early 2010)

Part of a larger ecosystem
● Ceph need not solve all problems as a monolithic stack
● Replaced the ebofs object file system with btrfs
  ● Avoid reinventing the wheel
  ● Robust, well-supported, well-optimized
  ● Kernel-level cache management
  ● Same design goals
  ● Copy-on-write, checksumming, other goodness
● Contributed some early functionality
  ● Cloning files
  ● Async snapshots

Budding community
● #ceph on irc.oftc.net, ceph-devel@vger.kernel.org
● Many interested users
● A few developers
● Many fans
● Too unstable for any real deployments
● Still mostly focused on the right architecture and technical solutions

Road to product
● DreamHost decides to build an S3-compatible object storage service with Ceph
● Stability
  ● Focus on core RADOS, RBD, radosgw
● Paying back some technical debt
  ● Build testing automation
  ● Code review!
  ● Expand engineering team

The reality
● Growing incoming commercial interest
  ● Early attempts from organizations large and small
  ● Difficult to engage with a web hosting company
  ● No means to support commercial deployments
● Project needed a company to back it
  ● Build and test a product
  ● Fund the engineering effort
  ● Support users
● Bryan built a framework to spin out of DreamHost

Launch

Do it right
● How do we build a strong open source company?
● How do we build a strong open source community?
● Models?
  ● Red Hat, Cloudera, MySQL, Canonical, …
● Initial funding from DreamHost, Mark Shuttleworth

Goals
● A stable Ceph release for production deployment
  ● DreamObjects
● Lay the foundation for widespread adoption
  ● Platform support (Ubuntu, Red Hat, SUSE)
  ● Documentation
  ● Build and test infrastructure
● Build a sales and support organization
● Expand the engineering organization

Branding
● Early decision to engage a professional agency
  ● MetaDesign
● Terms like
  ● “Brand core”
  ● “Design system”
● Project vs Company
  ● Shared / Separate / Shared core
  ● Inktank != Ceph
● Aspirational messaging: The Future of Storage

Slick graphics
● broken PowerPoint template

Today: adoption

Traction
● Too many production deployments to count
  ● We don't know about most of them!
● Too many customers (for me) to count
● Growing partner list
  ● Lots of inbound
● Lots of press and buzz

Quality
● Increased adoption means increased demands on robust testing
  ● Across multiple platforms
  ● Including platforms we don't like
● Upgrades
  ● Rolling upgrades
  ● Inter-version compatibility
● Expanding user community + less noise about bugs = a good sign

Developer community
● Significant external contributors
● First-class feature contributions from contributors
● Non-Inktank participants in daily Inktank stand-ups
● External access to build/test lab infrastructure
● Common toolset
  ● Email (kernel.org)
  ● IRC (oftc.net)
  ● GitHub
  ● Linux distros

CDS: Ceph Developer Summit
● Community process for building the project roadmap
● 100% online
  ● Google Hangouts
  ● Wikis
  ● Etherpad
● First was this spring, second is next week
● Great feedback, growing participation
● Indoctrinating our own developers into an open development model

The Future

Governance
How do we strengthen the project community?

● 2014 is the year
● Might formally acknowledge my role as BDL
● Recognized project leads
  ● RBD, RGW, RADOS, CephFS
● Formalize processes around CDS, community roadmap
● External foundation?

Technical roadmap
● How do we reach new use-cases and users?
● How do we better satisfy existing users?
● How do we ensure Ceph can succeed in enough markets for Inktank to thrive?
  ● Enough breadth to expand and grow the community
  ● Enough focus to do well

Tiering
● Client-side caches are great, but only buy so much
● Can we separate hot and cold data onto different storage devices?
  ● Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool (see the configuration sketch below)
  ● Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding)
● How do you identify what is hot and cold?
● Common in enterprise solutions; not found in open source scale-out systems

→ key topic at CDS next week
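
For illustration only: cache tiering was still a roadmap item when this talk was given, but the feature that later landed in Ceph (Firefly and onward) is driven entirely through monitor commands, which python-rados can issue directly. The pool names ("cold", "hot"), the conffile path, and the exact command arguments below follow the later command set and are assumptions — a minimal sketch of wiring a fast pool in front of a cold one as a writeback cache, not the design discussed at CDS:

```python
import json
import rados

# Assumed names: a slow base pool "cold" and a fast SSD/FusionIO pool "hot",
# both created beforehand; adjust conffile/keyring for your environment.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

def mon(cmd):
    """Send one JSON-formatted command to the monitors, raising on error."""
    ret, out, errs = cluster.mon_command(json.dumps(cmd), b'')
    if ret != 0:
        raise RuntimeError(f"{cmd['prefix']} failed: {errs}")
    return out

mon({'prefix': 'osd tier add', 'pool': 'cold', 'tierpool': 'hot'})             # attach the tier
mon({'prefix': 'osd tier cache-mode', 'pool': 'hot', 'mode': 'writeback'})     # hot pool absorbs writes
mon({'prefix': 'osd tier set-overlay', 'pool': 'cold', 'overlaypool': 'hot'})  # redirect client I/O

cluster.shutdown()
```

Objects are then promoted into the fast pool on access and flushed or evicted back to the cold pool by the OSDs; deciding what counts as "hot" is exactly the open question flagged above.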

Erasure coding
● Replication for redundancy is flexible and fast
● For larger clusters, it can be expensive (worked example below)

                    Storage overhead   Repair traffic   MTTDL (days)
    3x replication        3x                1x            2.3 E10
    RS (10, 4)            1.4x              10x           3.3 E13
    LRC (10, 6, 5)        1.6x              5x            1.2 E15

● Erasure coded data is hard to modify, but ideal for cold or read-only objects
  ● Cold storage tiering
  ● Will be used directly by radosgw
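
To make the storage-overhead column concrete (simple arithmetic implied by the table, not additional slide content): a code with k data chunks and m coding chunks stores k + m chunks for every k chunks of data, so

\[
\text{overhead} = \frac{k+m}{k}, \qquad
\text{RS}(10,4): \frac{10+4}{10} = 1.4\times, \qquad
\text{LRC}(10,6,5): \frac{10+6}{10} = 1.6\times,
\]

compared with 3× for triple replication — the price being higher repair traffic and the difficulty of modifying coded objects in place.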

Multi-datacenter, geo-replication
● Ceph was originally designed for single-DC clusters
  ● Synchronous replication
  ● Strong consistency
● Growing demand
  ● Enterprise: disaster recovery
  ● ISPs: replicating data across sites for locality
● Two strategies:
  ● Use-case specific: radosgw, RBD
  ● Low-level capability in RADOS

RGW: Multi-site and async replication
● Multi-site, multi-cluster
  ● Zones: radosgw sub-cluster(s) within a region
  ● Regions: east coast, west coast, etc.
  ● Can federate across the same or multiple Ceph clusters
● Sync user and bucket metadata across regions
  ● Global bucket/user namespace, like S3
● Synchronize objects across zones
  ● Within the same region
  ● Across regions
  ● Admin control over which zones are master/slave

RBD: simple DR via snapshots
● Simple backup capability
  ● Based on block device snapshots
  ● Efficiently mirror changes between consecutive snapshots across clusters (sketch below)
● Now supported/orchestrated by OpenStack
● Good for coarse synchronization (e.g., hours)
  ● Not real-time
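
A minimal sketch of the snapshot-diff mirroring idea using the python-rbd bindings, assuming a destination image that already contains the previous snapshot; the cluster conffiles, pool, image, and snapshot names are placeholders, and a production tool (or the `rbd export-diff`/`import-diff` CLI pair) would add error handling and bookkeeping:

```python
import rados
import rbd

# Placeholders: two clusters ("site-a" primary, "site-b" DR), one image,
# and two consecutive snapshots. The destination image is assumed to
# already hold the state of PREV_SNAP.
POOL, IMAGE, PREV_SNAP, CUR_SNAP = 'rbd', 'vm-disk-1', 'snap1', 'snap2'

src_cluster = rados.Rados(conffile='/etc/ceph/site-a.conf')
dst_cluster = rados.Rados(conffile='/etc/ceph/site-b.conf')
src_cluster.connect()
dst_cluster.connect()
src_io = src_cluster.open_ioctx(POOL)
dst_io = dst_cluster.open_ioctx(POOL)

src = rbd.Image(src_io, IMAGE, snapshot=CUR_SNAP)   # read-only view at snap2
dst = rbd.Image(dst_io, IMAGE)                      # writable head at the DR site

def copy_extent(offset, length, exists):
    """Replay one changed extent onto the destination image."""
    if exists:
        dst.write(src.read(offset, length), offset)
    else:
        dst.discard(offset, length)   # extent was trimmed/zeroed on the source

# Walk only the blocks that changed between snap1 and snap2...
src.diff_iterate(0, src.size(), PREV_SNAP, copy_extent)
# ...then mark the destination as caught up to snap2.
dst.create_snap(CUR_SNAP)

for handle in (src, dst, src_io, dst_io):
    handle.close()
src_cluster.shutdown()
dst_cluster.shutdown()
```

Because only the extents that changed between the two snapshots cross the WAN, the interval between snapshots (hours, per the slide) bounds both the data transferred and the data at risk.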

Async replication in RADOS
● One implementation to capture multiple use-cases
  ● RBD, CephFS, RGW, … RADOS
● A harder problem
  ● Scalable: 1000s of OSDs → 1000s of OSDs
  ● Point-in-time consistency
● Three challenges
  ● Infer a partial ordering of events in the cluster
  ● Maintain a stable timeline to stream from
    – either checkpoints or an event stream
  ● Coordinated roll-forward at destination (sketch below)
    – do not apply any update until we know we have everything that happened before it
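
The third challenge — coordinated roll-forward — is easy to state in code. The sketch below is purely conceptual and not the RADOS design (a real implementation would track a partial order per placement group rather than a single version counter): updates may arrive out of order, but none is applied until everything that happened before it has arrived.

```python
import heapq

class RollForwardApplier:
    """Conceptual only: buffer replicated updates that arrive out of order and
    apply them strictly in version order, never applying an update until every
    earlier update has been seen (the roll-forward constraint above)."""

    def __init__(self, apply_fn, last_applied=0):
        self.apply_fn = apply_fn          # e.g., writes the update to the local cluster
        self.last_applied = last_applied  # highest version already applied
        self.pending = []                 # min-heap of (version, update)

    def receive(self, version, update):
        heapq.heappush(self.pending, (version, update))
        # Drain only while the next pending version is contiguous with what
        # has already been applied; otherwise hold back and wait.
        while self.pending and self.pending[0][0] == self.last_applied + 1:
            version, update = heapq.heappop(self.pending)
            self.apply_fn(update)
            self.last_applied = version

applied = []
replica = RollForwardApplier(applied.append)
for v, u in [(2, 'b'), (1, 'a'), (4, 'd'), (3, 'c')]:
    replica.receive(v, u)
print(applied)   # ['a', 'b', 'c', 'd'] -- order preserved despite arrival order
```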

CephFS
→ This is where it all started – let's get there

● Today
  ● QA coverage and bug squashing continues
  ● NFS and CIFS now largely complete and robust
● Need
  ● Directory fragmentation
  ● Snapshots
  ● Multi-MDS
  ● QA investment
● Amazing community effort

The larger ecosystem

Big data
When will we stop talking about MapReduce?
Why is “big data” built on such a lame storage model?

● Move computation to the data
● Evangelize RADOS classes
● librados case studies and proof points (example below)
● Build a general-purpose compute and storage platform
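
As a small librados proof point — a sketch using the python-rados bindings, with the conffile path, pool, and object names as placeholders: an application talks to the object store directly, with no gateway or filesystem in between, and server-side object classes are the hook for pushing computation to where the data lives.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # path is a placeholder
cluster.connect()
ioctx = cluster.open_ioctx('data')                      # assumed pool name

# Plain object I/O straight against RADOS: no gateway, no filesystem.
ioctx.write_full('hello-object', b'hello from librados')
print(ioctx.read('hello-object'))
ioctx.set_xattr('hello-object', 'owner', b'analytics-job-42')

# "Move computation to the data": newer python-rados bindings can invoke a
# server-side object class directly, e.g. the sample 'hello' class shipped
# with Ceph (assuming the OSDs are allowed to load it):
#   ret, out = ioctx.execute('hello-object', 'hello', 'say_hello', b'')

ioctx.close()
cluster.shutdown()
```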

The enterprise
How do we pay for all our toys?

● Support legacy and transitional interfaces
  ● iSCSI, NFS, pNFS, CIFS
  ● VMware, Hyper-V
● Identify the beachhead use-cases
  ● Only takes one use-case to get in the door
  ● Earn others later
● Single platform – shared storage resource
● Bottom-up: earn the respect of engineers and admins
● Top-down: strong brand and compelling product

Why we can beat the old guard
● It is hard to compete with free and open source software
  ● Unbeatable value proposition
  ● Ultimately a more efficient development model
  ● It is hard to manufacture community
● Strong foundational architecture
● Native protocols, Linux kernel support
  ● Unencumbered by legacy protocols like NFS
  ● Move beyond the traditional client/server model
● Ongoing paradigm shift
  ● Software-defined infrastructure, data center

Thank you, and Welcome!
