March 10–11, 2015
San Jose
ENGINEERING WORKSHOP
Health monitoring & predictive analytics
To lower the TCO in a datacenter
Christian B. Madsen & Andrei Khurshudov
Engineering Manager & Sr. Director
Seagate Technology
christian.b.madsen@seagate.com
1. The opportunity
2. Our vision and implementation
3. Use cases
4. Summary
Outline
The opportunity
Optimize system management
Reduce potential cost of operation
Improve datacenter efficiency
What if…
Seagate offered you a technology that could help you
▪  1 billion hard drives will be used in cloud datacenters by
2020, highlighting the need to manage drive health at scale
▪  One total outage per datacenter is statistically expected
every year
▪  80% of those outages are not completely explained (or
linked to root causes)
▪  $700,000 is the average cost per incident
▪  $8,000 is the average cost per minute of an unplanned outage
▪  Up to 10% of datacenter accidents are related to
storage
[Figure: 2020 projection. 56% of 13 ZB = 7.3 ZB, >1 billion drives in cloud. Source: Seagate Strategic Marketing and Research 2013]
The problem
Failures in storage lead to costly outages
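As a rough cross-check of the two cost figures above, dividing one average by the other suggests an outage of roughly an hour and a half. This is only a sketch: it assumes both averages describe the same typical incident, which the slide does not state, and the function name is illustrative.

```python
# Averages quoted on the slide.
COST_PER_INCIDENT_USD = 700_000
COST_PER_MINUTE_USD = 8_000


def implied_outage_minutes(cost_per_incident: float, cost_per_minute: float) -> float:
    """Outage length implied by dividing the per-incident cost by the per-minute cost."""
    if cost_per_minute <= 0:
        raise ValueError("cost_per_minute must be positive")
    return cost_per_incident / cost_per_minute


minutes = implied_outage_minutes(COST_PER_INCIDENT_USD, COST_PER_MINUTE_USD)
print(f"Implied average outage: {minutes:.1f} minutes")  # prints 87.5 minutes
```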
Top 4 challenges in drive management
Better drive management will lower the TCO
1.  Drive health monitoring
–  Need reliable key performance indicators to track drive health
status
2.  Drive failure prediction
–  “Ultimately, we want to know when our drives will fail so we
can take actions before that happens”
3.  Drive failure diagnostics and management automation
–  Need to correctly identify and quickly resolve issues
–  Need to prevent false alerts to reduce cost of failure handling
4.  Drive lifespan extension
–  Need to know how to optimize operating environment for
better reliability
–  Need to reuse partially good drives (should be possible with
in-drive diagnostic, IDD)
Our vision and implementation
Our vision
•  Drive-centric health monitoring
•  Analytics and predictive models
•  Closed-loop automation
Data
Center
Global Access
• Report storage health
• Run drive self-test
• Shut-down systems
• Repair drives
• Run auto-FA
• Point at an issue
• Highlight inefficiency
• Predict reliability
• Detect anomalies
Actionable Decisions
ANALYTICS
PREDICTIONS
CONTROL
Data Aggregation
Early Warning System
Quick Issue Resolution
Cloud Gazer™
MONITORING
Monitoring, analytics, prediction and control – “The internet of things”²
Functional diagram
Monitor
Exception (alert)
Compliance (threshold)
Recommended action
Resolution
Closed-loop automation
Yes
No
Choose action
Monitor
Drive predicted to fail
Drive health
Reset or turn off drive
Turn off drive
Example
Passes
Not passing
Choose action
Automation
Choosing action from recommended
action can be automated by tying it to the
specific application or saving choices
Monitoring, intelligent decisions and automation
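The closed loop in the functional diagram can be sketched in a few lines. This is a minimal illustration, not the product's logic: the `DriveStatus` fields, the function names and the idea of storing saved choices as an ordered preference list are all assumptions; only the example branch (drive predicted to fail, then a health test that passes or does not pass) comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DriveStatus:
    drive_id: str
    predicted_to_fail: bool  # exception raised by monitoring (hypothetical field)
    self_test_passed: bool   # result of the follow-up drive health test (hypothetical field)


def recommended_actions(status: DriveStatus) -> List[str]:
    """Map a monitored status to recommended actions, following the slide's example."""
    if not status.predicted_to_fail:
        return []  # compliant: nothing to recommend
    if status.self_test_passed:
        return ["reset drive", "turn off drive"]
    return ["turn off drive"]


def choose_action(status: DriveStatus, saved_choices: List[str]) -> str:
    """Automate the choice by picking the first saved preference that is recommended."""
    options = recommended_actions(status)
    if not options:
        return "keep monitoring"
    for choice in saved_choices:
        if choice in options:
            return choice
    return "ask operator"  # no saved choice applies, so the loop is not closed automatically


print(choose_action(DriveStatus("d1", True, True), ["reset drive"]))   # reset drive
print(choose_action(DriveStatus("d2", True, False), ["reset drive"]))  # ask operator
```

Keeping the recommendation step separate from the choice step mirrors the diagram: the recommendation can stay generic while the saved choices are tied to a specific application.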
[Architecture diagram. Labels recovered from the slide:]
•  Cloud Gazer™ Dashboard (real-time metric aggregation)
•  ReST Platform: query data, check thresholds, manage drives
•  REST API calls at cluster, server and drive level
•  Data pool (10,000s of drives)
•  Agents, storage servers, drives
•  Server storage software
•  Cloud Gazer™ elements and the storage ecosystem: drives, storage software, noSQL, ReSTful API, analytics engine (Cloud Gazer™ Drive Monitoring and Analytics), Cloud Gazer™ Dashboard
Architecture overview
Implementation
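One way to picture the cluster, server and drive levels named in the architecture is as three groupings of the same metric records. The sketch below is an assumption-laden illustration: the record layout, the temperature metric and the `aggregate` function are hypothetical and do not describe the actual Cloud Gazer API or data model.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (cluster, server, drive, temperature in deg C).
Record = Tuple[str, str, str, float]


def aggregate(records: Iterable[Record], level: str) -> Dict[Tuple[str, ...], float]:
    """Average the metric at the requested level: 'cluster', 'server' or 'drive'."""
    depth = {"cluster": 1, "server": 2, "drive": 3}[level]
    groups = defaultdict(list)
    for rec in records:
        groups[rec[:depth]].append(rec[3])  # key is the record's path down to that level
    return {key: mean(values) for key, values in groups.items()}


records = [
    ("c1", "s1", "d1", 30.0),
    ("c1", "s1", "d2", 34.0),
    ("c1", "s2", "d3", 40.0),
]
print(aggregate(records, "server"))  # {('c1', 's1'): 32.0, ('c1', 's2'): 40.0}
```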
Use cases
Danger zone
Datacenter failure rate
Failure detection
Warning about expected drive failures.
Relies on the proprietary failure prediction
algorithms that use unsupervised machine
learning techniques. Expected average
failure prediction time window is from 9 days
to 12 days.
HDD population failure rate
Measuring stress and estimating
failure acceleration of the disk
drive population in real time. Relies
on the proprietary failure prediction
algorithms.
Overload detection
Detecting and reporting when
drive load exceeds design limits
Compliance (thresholds)
Recommended action
How to increase drive reliability
Degradation and performance warnings
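Overload detection, as described above, amounts to comparing observed drive load with a design limit. A minimal sketch follows, assuming load is measured as bytes transferred over an observation window and the limit is expressed in TB per year; the limit value used in the example is a placeholder, not a Seagate specification.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def annualized_workload_tb(bytes_transferred: int, window_seconds: float) -> float:
    """Scale the bytes moved during an observation window up to TB per year."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return bytes_transferred / 1e12 * (SECONDS_PER_YEAR / window_seconds)


def is_overloaded(bytes_transferred: int, window_seconds: float,
                  design_limit_tb_per_year: float) -> bool:
    """True when the annualized load exceeds the design limit."""
    return annualized_workload_tb(bytes_transferred, window_seconds) > design_limit_tb_per_year


# 1 TB moved in one day extrapolates to 365 TB per year.
# The 180 TB/year limit is a hypothetical placeholder.
print(annualized_workload_tb(10**12, 86_400))  # 365.0
print(is_overloaded(10**12, 86_400, 180.0))    # True
```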
Workload optimization
Before: load balancing issues
–  Workload predominantly hitting one server
–  Workload peaked on Sunday
After: workload distributed over servers and time
[Charts: workload by server, by week day and by month]
Drive visibility tools to improve workload balancing
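A simple way to quantify the "before" and "after" pictures is the share of total load carried by the busiest server. The sketch below is illustrative only: the metric, the tolerance of 1.5 times the fair share and the load numbers are assumptions, not values from the slides.

```python
from typing import Dict


def busiest_share(load_by_server: Dict[str, float]) -> float:
    """Fraction of the total load handled by the busiest server."""
    total = sum(load_by_server.values())
    if total <= 0:
        raise ValueError("total load must be positive")
    return max(load_by_server.values()) / total


def is_imbalanced(load_by_server: Dict[str, float], tolerance: float = 1.5) -> bool:
    """Flag the pool when the busiest server exceeds `tolerance` times its fair share."""
    fair_share = 1 / len(load_by_server)
    return busiest_share(load_by_server) > tolerance * fair_share


before = {"s1": 90, "s2": 5, "s3": 5}   # workload predominantly hitting one server
after = {"s1": 35, "s2": 33, "s3": 32}  # workload spread across servers
print(is_imbalanced(before), is_imbalanced(after))  # True False
```

The same function can be applied to load grouped by week day instead of by server to detect a peak such as the Sunday one.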
Unsupervised machine learning and failure prediction
…Drives in field
Multivariate time-series monitoring
Apply failure prediction algorithm in parallel in real time
Real-time status prediction of drive – fine or going to fail
For now, an average failure prediction window is on the order of 9 to 12 days
Failure prediction accuracy ranges from 55% to 90%
No interaction between drive set, no prior knowledge
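The prediction algorithms themselves are proprietary and are not described in the slides. As a stand-in, the sketch below shows one common unsupervised approach that shares the stated properties: each drive is judged only against its own history, with no labels and no information from other drives. It uses the modified z-score (median and median absolute deviation); the metric names, the baseline length and the 3.5 cutoff are assumptions for illustration.

```python
from statistics import median
from typing import Dict, List, Sequence


def robust_z(history: Sequence[float], value: float) -> float:
    """Modified z-score of `value` against the history's median and MAD."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad


def drive_status(series: Dict[str, List[float]], baseline_len: int,
                 threshold: float = 3.5) -> str:
    """Return 'going to fail' if the latest sample of any metric is an outlier."""
    for values in series.values():
        if baseline_len < 1 or baseline_len >= len(values):
            raise ValueError("baseline must be non-empty and exclude the latest sample")
        baseline, latest = values[:baseline_len], values[-1]
        if abs(robust_z(baseline, latest)) > threshold:
            return "going to fail"
    return "fine"


# Hypothetical per-drive time series; the last sample is the one being judged.
healthy = {"read_errors": [1, 2, 1, 2, 1, 2, 1], "temperature_c": [35, 36, 35, 36, 35, 36, 35]}
failing = {"read_errors": [1, 2, 1, 2, 1, 2, 40], "temperature_c": [35, 36, 35, 36, 35, 36, 35]}
print(drive_status(healthy, baseline_len=6))  # fine
print(drive_status(failing, baseline_len=6))  # going to fail
```

Because each drive is evaluated independently, the check can run in parallel across the whole population, which matches the "in parallel in real time" step above.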
Prediction and follow up actions
Systematic failure predicted:
3 out of 5 drives predicted to fail sit in
end location of servers
Heat map indicates drives at risk and you can issue drive tests (DST, IDD,…) to resolve or corroborate
Find failure triggers
Systematic failure predicted:
3 out of 5 drives predicted to fail sit in
end location of servers
A common factor for drives in the end
position is a cooler temperature.
Therefore, increasing the server
temperature may reduce the
(dominant) failure mechanism and
increase drive reliability.
Root cause tools including a temperature heat map can help you triage the cause of your drive issues
Failure prediction lead time
Currently catch 55-90% of
failures ahead of time
Case study 2: we predicted 5 drives to fail 23 days prior to
failure, 2 drives 22 days prior to failure, … 2 drives just one
day in advance
Case study 1: we predicted most drives (118 drives) to fail 12
days prior to failure
We can predict drives will fail on average 9-10 days before the failure
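An average lead time like the one quoted above is a drive-weighted mean over the individual lead times, and the catch rate is the fraction of failures that were predicted in advance. The sketch below shows the arithmetic only; the input numbers are invented for illustration and are not the case-study data, which the slides give only in part.

```python
from typing import Dict


def mean_lead_time_days(drives_by_lead_days: Dict[int, int]) -> float:
    """Drive-weighted mean of lead times, given {lead time in days: number of drives}."""
    total = sum(drives_by_lead_days.values())
    if total == 0:
        raise ValueError("no predicted drives")
    return sum(days * count for days, count in drives_by_lead_days.items()) / total


def catch_rate(predicted_failures: int, total_failures: int) -> float:
    """Fraction of actual failures that were predicted ahead of time."""
    if total_failures <= 0:
        raise ValueError("total_failures must be positive")
    return predicted_failures / total_failures


# Made-up example: 3 drives flagged 12 days ahead, 2 drives 9 days, 1 drive 1 day.
example = {12: 3, 9: 2, 1: 1}
print(f"{mean_lead_time_days(example):.1f} days")  # 9.2 days
print(f"{catch_rate(6, 10):.0%}")                  # 60%
```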
Summary
•  Truly drive-centric management tool for the cloud
•  Most efficient tool for extracting drive health information using Seagate IP
▪  Nobody knows drives better than us
▪  Freeware utilities are frequently wrong
•  Runs on any Linux system with little overhead (<1%)
Windows is next
•  Data can be collected, monitored and analyzed locally or in the Cloud
•  ReSTful API to interact with other software
•  New analytics, prediction, AI, and control capabilities are added continually
•  Drive repair will be possible with in-drive diagnostic
•  Enclosure control will be possible by summer 2015
Simply SMARTer
Competition
Why Cloud Gazer™?
*Seagate drives
Questions?
