Cisco Cloud Services provides an OpenStack platform to Cisco SaaS applications, backed by a petabyte-scale Ceph cluster. The initial cluster design led to stability problems once usage grew past 50% of capacity. Four improvements stabilized the cluster and shortened recovery after hardware failures: throttling client IO, journaling on NVMe, upgrading Ceph, and moving the monitor (MON) LevelDB store to SSD. Lessons learned included the need for DevOps practices, knowledge sharing, and performance modeling, and the importance of avoiding the technical debt that shortcuts create.
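To make the idea of client IO throttling concrete, here is a minimal token-bucket sketch in Python. It is an illustration of the general technique only; the summary does not say which throttling mechanism was used, and the class name `TokenBucket`, its fields, and the numeric limits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Caps one client's IO rate at `rate_iops`, allowing bursts up to `burst`.

    Illustrative only: names and limits are assumptions, not taken from the source.
    """
    rate_iops: float      # sustained IO operations allowed per second
    burst: float          # maximum number of tokens that can accumulate
    tokens: float = 0.0   # tokens currently available
    last_ts: float = 0.0  # time of the previous call, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if an IO costing `cost` tokens may proceed at time `now`."""
        # Refill in proportion to elapsed time, never exceeding the burst size.
        elapsed = max(0.0, now - self.last_ts)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_iops)
        self.last_ts = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should delay or queue the request


if __name__ == "__main__":
    bucket = TokenBucket(rate_iops=100.0, burst=10.0, tokens=10.0)
    admitted = sum(bucket.allow(now=0.0) for _ in range(25))
    print(f"admitted {admitted} of 25 requests in the initial burst")
```

The caller passes the clock value explicitly, so the behavior is deterministic and easy to test. A limiter of this shape bounds how much load any single client can place on the storage backend, which leaves headroom for recovery traffic after a hardware failure.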