Database Engineering and Operations at Yahoo

Database Engineering and Operations
By Ashwin Nellore

Yahoo
 Advertising Products
 Publisher Products
 Platforms
 Internal Products

Engineering
 Database as a Service
 Continuous Delivery
 Code Reviews
 Performance Analyzer (Open Sourced)
 Performance Analytics

Database as a Service
 Self Service on Private Cloud
 Multitenant and Dedicated Solutions
 Data Store Guidance
 User Management
 Backups
 Migrations
 Interleaved with dependent systems

Continuous Delivery
 Custom Configuration Management
 GitHub
 Jenkins Pipeline
 Database version control
 Automated Tests for syntax errors
 Code Reviews
 Developer Notifications

Performance Analyzer
 Lightweight and Agentless Java Web Application
 Self contained and easy to deploy anywhere
 Rich User Interface
 Gather and store performance metrics
 Detect anomalies and raise alerts
 Real time performance data access
 New metrics and alerts can be defined and deployed at runtime
 Highly agile and extensible software development
 No license required

Dashboard: Alerts From Past 24 Hours
 After login, the dashboard displays alerts from the past 24 hours and
current database health metrics for all managed database servers.
 Active alerts are colored in red.
 Built-in alerts are summarized and displayed in the list.
 List is sortable on all columns.
 List can be further restricted to a single server group.
 Forensic data gathered when an alert was detected can be viewed or
downloaded from the same page.

Dashboard: Current Health Status
 Displays the most recent performance metrics for all managed servers on a
single screen
 Results can be limited to a single server group.
 List is sortable and color coded to prioritize action and response
 Metrics Included:
› QPS
› CPU, Load Average and IO Waits
› Free Memory
› Slow Query Count
› Active and Total Threads
› Connection Rates and Failures
› Replication lag
› Deadlocks
› Time taken by the last round of metrics scanning

Real Time Top
 Inspired by MyTop/InnoTop
 Display selected OS metrics and MySQL metrics in real time.
 Display MySQL process list in real time.
 OS metrics:
› Uptime, Load Average, CPU, Memory, Swap, TCP Connections.
 MySQL metrics:
› General: Uptime, QPS, Commands, Replication
› Network/Threads: Connections, Threads, Network IO
› InnoDB: Row operations, IO, Buffer Pool

Real Time - Details
 User-friendly and safe tool for accessing various performance-related
information schema tables and SHOW commands.
 For metrics or status information, changes can be calculated
and displayed automatically, or triggered manually.
 Context help and a context menu help to interpret the information or
navigate elsewhere for further research.
 Features supported:
› Process list
› Global status and changes, which can be filtered by partial keyword
› Configuration variables, their history, and comparison with other MySQL servers
› Replication Status
› Parsed InnoDB engine status
› InnoDB status
› User statistics, when available, and their changes, to identify hot users, tables, etc.
› Explain plan, including JSON format, either triggered from the process list or entered
manually

Real Time Details And Process List
 Tabs to access data from various information schema tables and
SHOW commands.
 Context Menu to run EXPLAIN on any SELECT query
 Thread level detailed info from Performance Schema screen

Explain Plan and JSON Output
 Parsed and displayed as a tree so that the rich information is easier to
understand.
 Bonus: comparing the two plan formats gives a better understanding of the
old format.
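
The tree display described above can be approximated with a generic walk over the parsed JSON. This is a minimal sketch, not the analyzer's own code: the function name `render_tree` and the sample plan are hypothetical, and the sample is an abbreviated, hand-written document in the general shape of MySQL's JSON explain output.

```python
import json


def render_tree(node, name="plan", depth=0):
    """Return indented lines describing a parsed JSON explain document."""
    pad = "  " * depth
    if isinstance(node, dict):
        lines = [f"{pad}{name}"]
        for key, value in node.items():
            lines.extend(render_tree(value, key, depth + 1))
        return lines
    if isinstance(node, list):
        lines = [f"{pad}{name}"]
        for index, item in enumerate(node):
            lines.extend(render_tree(item, f"[{index}]", depth + 1))
        return lines
    # Scalars become leaf lines of the form "key: value".
    return [f"{pad}{name}: {node}"]


# Hypothetical, abbreviated sample; real plans carry many more attributes.
SAMPLE_PLAN = json.loads("""
{"query_block": {"select_id": 1,
  "nested_loop": [
    {"table": {"table_name": "orders", "access_type": "ALL"}},
    {"table": {"table_name": "customers", "access_type": "eq_ref",
               "key": "PRIMARY"}}
  ]}}
""")

if __name__ == "__main__":
    print("\n".join(render_tree(SAMPLE_PLAN)))
```

Each nesting level of the plan becomes one indentation level, which makes the join order and the access type of each table visible at a glance.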

Global Status
 Keyword filtering to view only the status variables of interest.
 Auto refresh or manual refresh to see changes and change rates.
 Context help to assist understanding of the status variables.
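
Changes and change rates of this kind can be derived from two snapshots of the global status counters. The sketch below assumes each snapshot is a dictionary of variable names to string values, as a client would typically receive them; the function name `status_changes` and the sample values are hypothetical.

```python
def status_changes(previous, current, elapsed_seconds, keyword=""):
    """Compare two status snapshots.

    Returns {name: (delta, rate_per_second)} for numeric variables whose
    name contains the (case-insensitive) keyword.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    needle = keyword.lower()
    changes = {}
    for name, value in current.items():
        if needle not in name.lower() or name not in previous:
            continue
        try:
            delta = int(value) - int(previous[name])
        except (TypeError, ValueError):
            continue  # skip non-numeric values such as ON/OFF or empty strings
        changes[name] = (delta, delta / elapsed_seconds)
    return changes


# Hypothetical snapshots taken 10 seconds apart.
before = {"Questions": "1000", "Com_select": "700", "Slow_queries": "3", "Ssl_cipher": ""}
after = {"Questions": "1600", "Com_select": "1100", "Slow_queries": "3", "Ssl_cipher": ""}

if __name__ == "__main__":
    for name, (delta, rate) in status_changes(before, after, 10).items():
        print(f"{name}: +{delta} ({rate:.1f}/s)")
```

With these snapshots, `Questions` grows by 600 over 10 seconds, which is the 60 queries per second a QPS display would report.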

Configuration Management
 Check configuration consistency and variances when analyzing
performance issues
 Lookup by partial keyword with links to MySQL references
 Change History Tracking.
 Compare parameters between database servers
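
A parameter comparison between two servers reduces to a dictionary diff. The following is an illustrative sketch only: `compare_variables` is a hypothetical helper, and the variable values are invented for the example.

```python
def compare_variables(server_a, server_b):
    """Return {variable: (value_on_a, value_on_b)} for every variable that differs.

    A variable that exists on only one server is reported with None on the other side.
    """
    differences = {}
    for name in sorted(set(server_a) | set(server_b)):
        a_value = server_a.get(name)
        b_value = server_b.get(name)
        if a_value != b_value:
            differences[name] = (a_value, b_value)
    return differences


# Hypothetical configuration snapshots from two servers.
master = {"innodb_buffer_pool_size": "8589934592", "max_connections": "500", "sync_binlog": "1"}
replica = {"innodb_buffer_pool_size": "4294967296", "max_connections": "500", "read_only": "ON"}

if __name__ == "__main__":
    for name, (a, b) in compare_variables(master, replica).items():
        print(f"{name}: {a} vs {b}")
```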

InnoDB Statistics
 Analyze performance issues, such as locks and mutexes
 Mutex statistics to understand contention

User Statistics
 When available, user statistics provide very useful time metrics,
especially at the per-user level, to identify hot users.
 Table statistics can also help to identify hot tables.

Metrics Gathering And Display
 Metrics are gathered from all managed servers at a configurable
interval.
 Metrics are stored either in embedded Java DB (Derby) for very small
deployments or in a MySQL database for more formal deployments.
 Related metrics are grouped, and the metrics from a single group are
stored in a single table.
 Metrics sources:
› information_schema, especially global status, for MySQL,
› SNMP for OS level data when available
› User defined.
 Predefined metrics:
› MySQL common status, command, InnoDB, replication status
› InnoDB Mutex (optional)
› SNMP: system, disk, network, storage
› Additional metrics can be defined and associated with an individual server group or server,
using global status variables or custom SQL statements.
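
The one-table-per-metric-group layout can be sketched as follows. This is an assumption-laden illustration rather than the analyzer's schema: it uses Python's built-in `sqlite3` as a stand-in for Derby or MySQL, and the group names, column names, and helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical metric groups; each group maps to exactly one table.
METRIC_GROUPS = {
    "mysql_status": ["questions", "slow_queries", "threads_running"],
    "os_cpu": ["load_average", "io_wait_pct"],
}


def create_group_tables(conn, groups):
    """Create one table per metric group, with one column per metric."""
    for group, metrics in groups.items():
        columns = ", ".join(f'"{metric}" REAL' for metric in metrics)
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{group}" '
            f"(server TEXT NOT NULL, sampled_at INTEGER NOT NULL, {columns})"
        )


def store_sample(conn, group, metrics, server, sampled_at, values):
    """Insert one row holding all metrics of a group for one server and time."""
    placeholders = ", ".join("?" for _ in range(len(metrics) + 2))
    conn.execute(
        f'INSERT INTO "{group}" VALUES ({placeholders})',
        [server, sampled_at] + [values.get(metric) for metric in metrics],
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_group_tables(conn, METRIC_GROUPS)
    store_sample(conn, "os_cpu", METRIC_GROUPS["os_cpu"], "db1", 1700000000,
                 {"load_average": 1.5, "io_wait_pct": 2.0})
    print(conn.execute("SELECT * FROM os_cpu").fetchall())
```

Table and column names here come from a trusted, in-code definition; a real deployment that accepts user-defined metric names would need to validate them before building DDL.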

Metrics Charts – Common Global Status
 Periodically poll global status, InnoDB mutex and user defined metrics
 Metrics are stored in built-in embedded Java DB for a small deployment
or in MySQL DB for a large deployment

Metrics Charts – OS using SNMP
 OS level metrics are polled from SNMP
 Metrics include CPU, Load Average, Context switches, Interrupts, IO
Waits, Disk, Memory Usage, network and storage usage, etc.

Metrics Charts: Single Chart Or Comparison
 Display chart for any available metric.
 Compare two metrics of the same server during the same period to
identify correlations, which frequently help to identify root cause during
troubleshooting.
 Auto-play option to cycle through the second metric sequentially
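
The slides describe this comparison as a visual one. A numeric counterpart, offered here only as an illustrative sketch and not as a feature of the analyzer, is the Pearson correlation coefficient between two metric series sampled over the same period. The function name and the sample series are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long series; 0.0 if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat series
    return covariance / sqrt(var_x * var_y)


# Hypothetical samples: slow query counts and IO wait over the same intervals.
slow_queries = [2, 3, 9, 15, 4]
io_wait_pct = [1.0, 1.5, 6.0, 11.0, 2.5]

if __name__ == "__main__":
    print(f"correlation: {pearson(slow_queries, io_wait_pct):.3f}")
```

A value close to 1 or -1 suggests the two metrics move together and is a hint for further investigation, not proof of a root cause.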

Metrics Comparison between A Group Of Servers
 Metrics can be viewed and compared on a pair of servers or multiple
servers of the same group.
 This feature can be used to understand how loads are balanced, or
capacity differences between two servers.
 The sample shown is a master/slave comparison: replication cannot keep up
with the very high update rate on the master.

User Defined Metrics (UDM)
 Custom metrics can be added using either global status variables that are not included in
the built-in metrics or custom SQL statements.
 Manual setup is required to associate the relevant servers or server groups with a UDM.
 The current implementation stores all metrics defined within one UDM in a single table.

Anomaly Detections and Alerts
 A set of predefined metrics is checked against thresholds to detect
anomalies. Thresholds can be adjusted at the server group or host
level.
 When anomalies are detected, forensic data is gathered and
logged, such as process lists, InnoDB engine status, InnoDB locks, etc.
 Alert detail reports can be viewed and downloaded from the dashboard and
the Alert page.
 Alerts are logged, and notifications can be sent by email and
web notifications.
 Predefined alerts:
› CPU, Load Average, IO Waits, Running Threads, Replication Status and lag, Slow
Query Count, Connection Failure, Deadlocks and Disk Usage
 Additional custom alerts can be defined and attached to the relevant
database servers, using a SQL statement, an already defined metric,
or any global status variable.
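
Threshold checks with group-level and host-level adjustments can be sketched as a layered lookup. The precedence used here (host override, then group override, then a default) is an assumption, as are all names and threshold values; the slides only state that thresholds are adjustable at both levels.

```python
# Hypothetical thresholds; values and metric names are illustrative only.
DEFAULT_THRESHOLDS = {"cpu_pct": 90.0, "replication_lag_s": 300.0}
GROUP_THRESHOLDS = {"reporting": {"replication_lag_s": 1800.0}}
HOST_THRESHOLDS = {"db42": {"cpu_pct": 75.0}}


def effective_threshold(metric, group, host):
    """Resolve a threshold: host override first, then group, then default."""
    for overrides in (HOST_THRESHOLDS.get(host, {}), GROUP_THRESHOLDS.get(group, {})):
        if metric in overrides:
            return overrides[metric]
    return DEFAULT_THRESHOLDS[metric]


def find_anomalies(sample, group, host):
    """Return the names of known metrics whose value exceeds the effective threshold."""
    return [
        metric
        for metric, value in sorted(sample.items())
        if metric in DEFAULT_THRESHOLDS
        and value > effective_threshold(metric, group, host)
    ]


if __name__ == "__main__":
    sample = {"cpu_pct": 80.0, "replication_lag_s": 600.0}
    print(find_anomalies(sample, "reporting", "db42"))
```

In this example the host `db42` is flagged for CPU because of its stricter host-level limit, while its replication lag stays within the relaxed limit of the `reporting` group.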

Alerts and Settings
 All alerts from the past 24 hours are displayed on the dashboard after login.
 Alerts for all servers, an individual server group, or a single host can be
accessed from the Alert page.
 Thresholds can be configured at server group or host level.

Alert Notifications
 Alert notifications are sent by email, if configured, with minimal
information.
 Web notifications are also supported in modern browsers while the
application is open.

Alert Reports
 For most alerts, an alert report is generated with some forensic
information.
 The information includes aggregated and original data from the process list
and InnoDB engine status, etc.
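
One plausible form of the aggregated process list data is a count of sessions per user, command, and state. The sketch below is an assumption about what such an aggregation could look like, not the report format of the analyzer. Rows are dictionaries keyed by the column names that `SHOW PROCESSLIST` returns; the sample rows are invented.

```python
from collections import Counter


def summarize_process_list(rows):
    """Count process-list rows by (User, Command, State), busiest combination first."""
    counts = Counter((row["User"], row["Command"], row["State"]) for row in rows)
    return counts.most_common()


# Hypothetical process list captured when an alert fired.
PROCESS_LIST = [
    {"Id": 11, "User": "app", "Command": "Query", "State": "Sending data", "Time": 42},
    {"Id": 12, "User": "app", "Command": "Query", "State": "Sending data", "Time": 40},
    {"Id": 13, "User": "app", "Command": "Sleep", "State": "", "Time": 3},
    {"Id": 14, "User": "etl", "Command": "Query", "State": "updating", "Time": 120},
    {"Id": 15, "User": "app", "Command": "Query", "State": "Sending data", "Time": 39},
]

if __name__ == "__main__":
    for (user, command, state), count in summarize_process_list(PROCESS_LIST):
        print(f"{count:3d}  {user:6s} {command:6s} {state}")
```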

Deadlock Detection
 Deadlocks are detected by comparing successive values of the INNODB_DEADLOCKS status
variable (available in Percona Server).
 When a deadlock is detected, an alert is raised and logged. Details can be found
either in the InnoDB engine status or in the associated alert report.
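
Detecting deadlocks from a cumulative counter amounts to remembering the last value seen per server and reporting any increase. This is a minimal sketch under that assumption; the class name `DeadlockWatcher` is hypothetical, and treating a decreasing counter as a server restart is a design choice made for this example.

```python
class DeadlockWatcher:
    """Track a cumulative deadlock counter per server and report new deadlocks."""

    def __init__(self):
        self._last_seen = {}

    def observe(self, server, counter_value):
        """Return the number of deadlocks since the previous poll of this server."""
        previous = self._last_seen.get(server)
        self._last_seen[server] = counter_value
        if previous is None or counter_value < previous:
            # First sample, or the counter was reset (for example by a restart):
            # establish a new baseline without raising an alert.
            return 0
        return counter_value - previous


if __name__ == "__main__":
    watcher = DeadlockWatcher()
    for value in (7, 7, 9, 0, 1):
        new_deadlocks = watcher.observe("db1", value)
        if new_deadlocks:
            print(f"ALERT: {new_deadlocks} new deadlock(s) on db1")
```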

User Defined Alerts
 Custom alerts can be defined using SQL statements, global status
variables, or metrics gathered by the analyzer.
 Custom alerts are not applied to all servers automatically. Manual setup
is required to associate them with the relevant servers or server groups.

Profiling and Tuning
 A simple and safe interface to run explain plans and MySQL profiling, and to
execute MySQL SELECT statements.

Performance Schema – Top Queries
 Top queries by various criteria
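
A ranking of this kind can be expressed as a query against the Performance Schema statement digest summary. The sketch below only builds the SQL text; whether the analyzer uses this exact table is an assumption, and the criterion names and the `top_queries_sql` helper are hypothetical. The sort column is taken from a fixed whitelist so that no user input reaches the statement.

```python
# Hypothetical criterion names mapped to columns of
# performance_schema.events_statements_summary_by_digest.
SORTABLE_COLUMNS = {
    "total_time": "SUM_TIMER_WAIT",
    "executions": "COUNT_STAR",
    "rows_examined": "SUM_ROWS_EXAMINED",
    "no_index_used": "SUM_NO_INDEX_USED",
}


def top_queries_sql(criterion, limit=10):
    """Build a SELECT that ranks statement digests by the chosen criterion."""
    if criterion not in SORTABLE_COLUMNS:
        raise ValueError(f"unknown criterion: {criterion}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    column = SORTABLE_COLUMNS[criterion]
    return (
        "SELECT SCHEMA_NAME, DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT, "
        "SUM_ROWS_EXAMINED, SUM_NO_INDEX_USED "
        "FROM performance_schema.events_statements_summary_by_digest "
        f"ORDER BY {column} DESC LIMIT {limit}"
    )


if __name__ == "__main__":
    print(top_queries_sql("rows_examined", limit=5))
```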

Performance Schema – Hot Tables
 Table performance metrics are powerful tools for identifying IO
bottlenecks, lock contention, and SQL inefficiency.

Internal Analytics
 Metrics logged over time into Cassandra
 Capex Planning
 Proactive Performance Diagnosis
