Health monitoring & predictive analytics to lower the TCO in a datacenter

ENGINEERING WORKSHOP
Health monitoring & predictive analytics
To lower the TCO in a datacenter
Christian B. Madsen & Andrei Khurshudov
Engineering Manager & Sr. Director
Seagate Technology
christian.b.madsen@seagate.com

Outline
1. The opportunity
2. Our vision and implementation
3. Use cases
4. Summary

What if…
Seagate offered you a technology that could help you
▪  Optimize system management
▪  Reduce potential cost of operation
▪  Improve datacenter efficiency

The problem
▪  1 billion hard drives will be used in cloud datacenters by 2020, highlighting the need to manage drive health at scale
▪  One total outage per datacenter is statistically expected every year
▪  80% of those outages are not completely explained (or linked to root causes)
▪  $700,000 is the average cost per incident
▪  $8,000 is the average cost per minute of an unplanned outage
▪  Up to 10% of datacenter accidents are related to storage
[Figure: 2020 projection: 56% of 13 ZB (7.3 ZB, >1 billion drives) in cloud. Source: Seagate Strategic Marketing and Research 2013]
Failures in storage lead to costly outages
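The cost figures on this slide can be combined into a rough back-of-the-envelope estimate. The sketch below is illustrative arithmetic only: the way the figures are combined (dividing cost per incident by cost per minute, and applying the 10% storage share to the expected annual cost) is an assumption, not a calculation shown in the slides.

```python
# Figures quoted on the slide.
OUTAGES_PER_YEAR = 1         # one total outage per datacenter per year
COST_PER_INCIDENT = 700_000  # USD, average cost per incident
COST_PER_MINUTE = 8_000      # USD, average cost per minute of unplanned outage
STORAGE_SHARE = 0.10         # up to 10% of incidents related to storage


def implied_outage_minutes(cost_per_incident, cost_per_minute):
    """Outage length implied if the per-minute cost applied to the whole incident."""
    return cost_per_incident / cost_per_minute


def expected_annual_storage_cost(outages_per_year, cost_per_incident, storage_share):
    """Upper-bound yearly outage cost attributable to storage (assumed combination)."""
    return outages_per_year * cost_per_incident * storage_share


minutes = implied_outage_minutes(COST_PER_INCIDENT, COST_PER_MINUTE)
storage_cost = expected_annual_storage_cost(
    OUTAGES_PER_YEAR, COST_PER_INCIDENT, STORAGE_SHARE
)
print(f"Implied outage length: {minutes} minutes")
print(f"Storage-related outage cost per year: up to ${storage_cost:,.0f}")
```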

Top 4 challenges in drive management
Better drive management will lower the TCO
1.  Drive health monitoring
–  Need reliable key performance indicators to track drive health status
2.  Drive failure prediction
–  “Ultimately, we want to know when our drives will fail so we can take actions before that happens”
3.  Drive failure diagnostics and management automation
–  Need to correctly identify and quickly resolve issues
–  Need to prevent false alerts to reduce cost of failure handling
4.  Drive lifespan extension
–  Need to know how to optimize operating environment for better reliability
–  Need to reuse partially good drives (should be possible with in-drive diagnostic, IDD)

Our vision
•  Drive-centric health monitoring
•  Analytics and predictive models
•  Closed-loop automation
[Diagram labels: Data Center; Global Access; Cloud Gazer™; MONITORING, ANALYTICS, PREDICTIONS, CONTROL; Data Aggregation; Early Warning System; Quick Issue Resolution]
Actionable Decisions
• Report storage health
• Run drive self-test
• Shut down systems
• Repair drives
• Run auto-FA
• Point at an issue
• Highlight inefficiency
• Predict reliability
• Detect anomalies
Monitoring, analytics, prediction and control – “The internet of things”

Functional diagram
Closed-loop automation
[Diagram stages: Monitor; Compliance (threshold); Exception (alert); Recommended action; Choose action; Resolution. Branch labels: Yes / No; Passes / Not passing]
Example
•  Monitor: drive health
•  Exception: drive predicted to fail
•  Recommended action: reset or turn off drive
•  Choose action: turn off drive
Automation
Choosing an action from the recommended actions can be automated by tying it to the specific application or by saving choices
Monitoring, intelligent decisions and automation
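The closed loop on this slide can be sketched in a few lines. This is a minimal illustration, assuming a single exception type and a table of saved operator choices. The class, field and action names are hypothetical and are not part of Cloud Gazer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical recommended actions per exception type (illustrative only).
RECOMMENDED_ACTIONS: Dict[str, List[str]] = {
    "predicted_to_fail": ["reset_drive", "turn_off_drive"],
}


@dataclass
class ClosedLoop:
    # Saved operator choices: exception type -> action applied automatically.
    saved_choices: Dict[str, str] = field(default_factory=dict)

    def check_compliance(self, drive: dict) -> Optional[str]:
        """Return an exception type if the drive is out of compliance, else None."""
        if drive.get("predicted_to_fail"):
            return "predicted_to_fail"
        return None

    def choose_action(self, exception: str) -> Optional[str]:
        """Use a saved choice if it is one of the recommended actions."""
        choice = self.saved_choices.get(exception)
        if choice in RECOMMENDED_ACTIONS.get(exception, []):
            return choice
        return None  # no saved choice: leave the decision to the operator

    def step(self, drive: dict) -> str:
        """One pass of the loop: monitor, raise exception, choose action."""
        exception = self.check_compliance(drive)
        if exception is None:
            return "compliant"
        action = self.choose_action(exception)
        if action is None:
            options = ", ".join(RECOMMENDED_ACTIONS[exception])
            return "awaiting_operator: " + options
        return "apply: " + action


loop = ClosedLoop(saved_choices={"predicted_to_fail": "turn_off_drive"})
print(loop.step({"id": "drive-1", "predicted_to_fail": True}))
print(loop.step({"id": "drive-2", "predicted_to_fail": False}))
```

Saving the choice per exception type is one way to read "saving choices" on the slide. Tying the choice to a specific application would add the application as a second lookup key.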

[Diagram: storage servers and their drives run agents that report at cluster, server and drive level. REST API calls feed real-time metric aggregation into a data pool covering 10,000s of drives.]
Cloud Gazer™ elements
•  Drive monitoring and analytics
•  Analytics engine
•  noSQL data store
•  ReSTful API
•  Cloud Gazer™ dashboard
ReST platform (for storage software in the storage ecosystem)
•  Query data
•  Check thresholds
•  Manage drives
Architecture overview
Implementation
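As an illustration of the "query data" path at the cluster, server and drive levels, the sketch below builds a request URL for a ReSTful API. The slides do not document any endpoints, so the host, the path layout and the `metric` parameter are all hypothetical placeholders and not the actual Cloud Gazer API.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real service address is not given in the slides.
BASE_URL = "https://example.com/cloud-gazer/api"

# Levels named on the architecture slide.
LEVELS = ("cluster", "server", "drive")


def build_query_url(level: str, identifier: str, metric: str) -> str:
    """Build a URL asking for one metric of one cluster, server or drive."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}, got {level!r}")
    query = urlencode({"metric": metric})  # hypothetical query parameter
    return f"{BASE_URL}/{level}/{identifier}?{query}"


url = build_query_url("drive", "drive-42", "temperature")
print(url)
```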

Failure detection
Warning about expected drive failures. Relies on the proprietary failure prediction algorithms that use unsupervised machine learning techniques. Expected average failure prediction time window is from 9 days to 12 days.
HDD population failure rate
Measuring stress and estimating failure acceleration of the disk drive population in real time. Relies on the proprietary failure prediction algorithms.
Overload detection
Detecting and reporting when drive load exceeds design limits.
[Diagram labels: Danger zone; Datacenter failure rate; Compliance (thresholds); Recommended action; How to increase drive reliability]
Degradation and performance warnings
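Overload detection can be illustrated with a simple threshold check. The sketch assumes that "load" means the annualized amount of data read and written, and that the design limit is a workload rating in TB per year. Both are assumptions, and the limit used below is a placeholder rather than a figure from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def annualized_workload_tb(bytes_transferred: float, interval_seconds: float) -> float:
    """Scale the bytes read and written in an interval to TB per year."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return bytes_transferred / 1e12 * (SECONDS_PER_YEAR / interval_seconds)


def is_overloaded(bytes_transferred: float, interval_seconds: float,
                  design_limit_tb_per_year: float) -> bool:
    """True when the annualized workload exceeds the drive's design limit."""
    rate = annualized_workload_tb(bytes_transferred, interval_seconds)
    return rate > design_limit_tb_per_year


PLACEHOLDER_LIMIT = 550.0  # TB/year; assumed value, check the drive's datasheet
ONE_DAY = 24 * 3600

rate = annualized_workload_tb(2e12, ONE_DAY)  # 2 TB moved in one day
print(f"{rate:.0f} TB/year, overloaded: {is_overloaded(2e12, ONE_DAY, PLACEHOLDER_LIMIT)}")
```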

Workload optimization
Before: load balancing issues
▪  Workload predominantly hitting one server
▪  Workload peaked on Sunday
After: workload distributed over servers and time
[Charts: workload by week day and by month]
Drive visibility tools to improve workload balancing
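A load balancing issue such as "workload predominantly hitting one server" can be flagged from per-server totals. The sketch below is one simple way to do it, assuming a server is "hot" when its share of the workload exceeds twice the even share. The factor and the server names are illustrative assumptions.

```python
from typing import Dict, List


def server_shares(load_by_server: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the total workload handled by each server."""
    total = sum(load_by_server.values())
    if total == 0:
        return {server: 0.0 for server in load_by_server}
    return {server: load / total for server, load in load_by_server.items()}


def hot_servers(load_by_server: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Servers whose share exceeds `factor` times the even share."""
    if not load_by_server:
        return []
    even_share = 1.0 / len(load_by_server)
    shares = server_shares(load_by_server)
    return sorted(s for s, share in shares.items() if share > factor * even_share)


before = {"srv-a": 900.0, "srv-b": 50.0, "srv-c": 50.0}   # hypothetical loads
after = {"srv-a": 300.0, "srv-b": 350.0, "srv-c": 350.0}  # hypothetical loads
print("before:", hot_servers(before))
print("after:", hot_servers(after))
```

The same check can be applied over time by replacing servers with week days or months as the dictionary keys.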

Unsupervised machine learning and failure prediction
…Drives in field → Multivariate time-series monitoring → Apply failure prediction algorithm in parallel in real time → Real-time status prediction of drive: fine or going to fail
For now, the average failure prediction window is on the order of 9 to 12 days
Failure prediction accuracy ranges from 55% to 90%
No interaction between drive sets, no prior knowledge
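The prediction algorithms in the slides are proprietary and are not described. As a stand-in, the sketch below uses a much simpler unsupervised technique, a per-drive z-score check: each drive is compared only against its own recent history, which mirrors "no interaction between drive sets, no prior knowledge". The metric names, the threshold and the sample values are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List

Sample = Dict[str, float]


def drive_status(history: List[Sample], latest: Sample, threshold: float = 3.0) -> str:
    """Label a drive from how far its latest sample is from its own baseline."""
    for metric, value in latest.items():
        baseline = [sample[metric] for sample in history]
        mu = mean(baseline)
        # Floor the spread so a perfectly constant baseline does not divide by zero.
        sigma = max(pstdev(baseline), 1e-9)
        if abs(value - mu) / sigma > threshold:
            return "going to fail"
    return "fine"


def predict_all(histories: Dict[str, List[Sample]],
                latest: Dict[str, Sample]) -> Dict[str, str]:
    """Evaluate every drive independently of the others."""
    return {drive: drive_status(histories[drive], latest[drive]) for drive in latest}


history = [
    {"temp_c": 40, "seek_errors": 2},
    {"temp_c": 41, "seek_errors": 3},
    {"temp_c": 39, "seek_errors": 2},
    {"temp_c": 40, "seek_errors": 3},
]
statuses = predict_all(
    {"drive-1": history, "drive-2": history},
    {"drive-1": {"temp_c": 41, "seek_errors": 3},
     "drive-2": {"temp_c": 40, "seek_errors": 9}},
)
print(statuses)
```

Because drives are scored independently, the loop in `predict_all` could be distributed across workers without changing the result.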

Prediction and follow up actions
Systematic failure predicted:
3 out of 5 drives predicted to fail sit in the end location of their servers
The heat map indicates drives at risk, and you can issue drive tests (DST, IDD,…) to resolve or corroborate the prediction

Find failure triggers
Systematic failure predicted:
3 out of 5 drives predicted to fail sit in the end location of their servers
A common factor for drives in the end position is a cooler temperature. Therefore, increasing the server temperature may reduce the (dominant) failure mechanism and increase drive reliability
Root cause tools including a temperature heat map can help you triage the cause of your drive issues
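Spotting a systematic failure like "3 out of 5 drives sit in the end location" amounts to counting predicted failures by slot position. The sketch below shows that count, assuming each predicted drive carries a `position` field. Apart from the end position mentioned on the slide, the field, position and drive names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def position_counts(predicted_drives: List[Dict[str, str]]) -> Counter:
    """Count drives predicted to fail per slot position."""
    return Counter(drive["position"] for drive in predicted_drives)


def dominant_position(predicted_drives: List[Dict[str, str]],
                      min_share: float = 0.5) -> Optional[str]:
    """Return the position holding more than `min_share` of the predictions."""
    counts = position_counts(predicted_drives)
    if not counts:
        return None
    position, count = counts.most_common(1)[0]
    if count / len(predicted_drives) > min_share:
        return position
    return None


predicted = [
    {"id": "d1", "position": "end"},
    {"id": "d2", "position": "end"},
    {"id": "d3", "position": "end"},
    {"id": "d4", "position": "front"},
    {"id": "d5", "position": "middle"},
]
print(position_counts(predicted)["end"], "of", len(predicted), "in end position")
print("dominant position:", dominant_position(predicted))
```

A dominant position is only a lead. The slide's next step is to look for a shared factor, such as temperature, among the drives in that position.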

Failure prediction lead time
We currently catch 55-90% of failures ahead of time
Case study 2: we predicted 5 drives to fail 23 days prior to failure, 2 drives 22 days prior to failure,… 2 drives just one day in advance
Case study 1: we predicted most drives (118 drives) to fail 12 days prior to failure
On average, we can predict that a drive will fail 9-10 days before the failure

Why Cloud Gazer™?
•  Truly drive-centric management tool for the cloud
•  Most efficient tool for extracting drive health information using Seagate IP
▪  Nobody knows drives better than us
▪  Freeware utilities are frequently wrong
•  Runs on any Linux system with little overhead (<1%); Windows is next
•  Data can be collected, monitored and analyzed locally or in the Cloud
•  ReSTful API to interact with other software
•  New analytics, prediction, AI, and control capabilities are added continually
•  Drive repair will be possible with in-drive diagnostic
•  Enclosure control will be possible by summer 2015
Simply SMARTer
[Comparison graphic labels: Competition; *Seagate drives]
