Our articles are published simultaneously on the WeChat official account "IT民工的龙马人生" and on the blog site ( www.htz.pw ). You are welcome to follow, bookmark, and repost, but please credit the source at the top of the article. Thanks!
Since the post contains a lot of code, it reads better on the web page.
This article is a real-world case shared by a friend.
Case: Troubleshooting Large Waits With the Wait Event "resmgr:cpu quantum"
This case comes from a customer in western China running database version 11.2.0.4. The customer reported that the application had become extremely slow, to the point of being unusable.
Historical active session data shows that active sessions started to grow abnormally from 2022-05-01 09:43 (columns: sample time, instance number, sampled active sessions):
2022-05-01 09:40 1 39
2022-05-01 09:40 2 83
2022-05-01 09:41 1 38
2022-05-01 09:41 2 75
2022-05-01 09:42 1 50
2022-05-01 09:42 2 199
2022-05-01 09:43 1 22
2022-05-01 09:43 2 1130
2022-05-01 09:44 1 20
2022-05-01 09:44 2 1766
2022-05-01 09:45 1 26
2022-05-01 09:45 2 2027
2022-05-01 09:46 1 88
2022-05-01 09:46 2 3462
2022-05-01 09:47 1 97
2022-05-01 09:47 2 9765
2022-05-01 09:48 1 103
2022-05-01 09:48 2 12402
2022-05-01 09:49 1 78
2022-05-01 09:49 2 17489
2022-05-01 09:50 1 96
2022-05-01 09:50 2 16880
2022-05-01 09:51 1 101
2022-05-01 09:51 2 19659
2022-05-01 09:52 1 126
2022-05-01 09:52 2 18511
2022-05-01 09:53 1 124
2022-05-01 09:53 2 15975
2022-05-01 09:54 1 100
2022-05-01 09:54 2 17142
2022-05-01 09:55 1 94
2022-05-01 09:55 2 20933
2022-05-01 09:56 1 115
2022-05-01 09:56 2 17159
2022-05-01 09:57 1 72
2022-05-01 09:57 2 19454
2022-05-01 09:58 1 33
2022-05-01 09:58 2 15774
2022-05-01 09:59 1 28
2022-05-01 09:59 2 20397
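A listing like this can be reproduced from DBA_HIST_ACTIVE_SESS_HISTORY. The following is a minimal sketch (the original script is not shown, so the time window and column choices are assumptions); adding EVENT or SQL_ID to the GROUP BY yields the per-event and per-SQL breakdowns shown later:

-- Sampled active sessions per minute and per instance (AWR keeps one
-- ASH sample every 10 seconds, so the counts are an approximation).
SELECT TO_CHAR(sample_time, 'YYYY-MM-DD HH24:MI') AS sample_minute,
       instance_number,
       COUNT(*) AS active_sessions
  FROM dba_hist_active_sess_history
 WHERE sample_time BETWEEN TIMESTAMP '2022-05-01 09:30:00'
                       AND TIMESTAMP '2022-05-01 10:00:00'
 GROUP BY TO_CHAR(sample_time, 'YYYY-MM-DD HH24:MI'), instance_number
 ORDER BY 1, 2;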
The flood of abnormal active sessions was almost entirely waiting on resmgr:cpu quantum (columns: sample time, wait event, instance number, session count; rows with a blank event are presumably sessions on CPU):
2022-05-01 09:38 2 214
2022-05-01 09:39 2 84
2022-05-01 09:40 2 80
2022-05-01 09:41 2 69
2022-05-01 09:42 2 192
2022-05-01 09:43 resmgr:cpu quantum 2 424
2022-05-01 09:43 2 689
2022-05-01 09:44 resmgr:cpu quantum 2 810
2022-05-01 09:44 2 922
2022-05-01 09:45 resmgr:cpu quantum 2 1114
2022-05-01 09:45 2 880
2022-05-01 09:46 resmgr:cpu quantum 2 2679
2022-05-01 09:46 2 757
2022-05-01 09:47 resmgr:cpu quantum 2 8831
2022-05-01 09:47 2 901
2022-05-01 09:48 resmgr:cpu quantum 2 11566
2022-05-01 09:48 2 802
2022-05-01 09:49 resmgr:cpu quantum 2 16565
2022-05-01 09:49 2 887
2022-05-01 09:50 resmgr:cpu quantum 2 16118
2022-05-01 09:50 2 737
2022-05-01 09:51 resmgr:cpu quantum 2 18701
2022-05-01 09:51 2 926
2022-05-01 09:52 resmgr:cpu quantum 2 17589
2022-05-01 09:52 2 895
2022-05-01 09:53 resmgr:cpu quantum 2 15183
2022-05-01 09:53 2 770
2022-05-01 09:54 resmgr:cpu quantum 2 16357
2022-05-01 09:54 2 752
2022-05-01 09:55 control file sequential read 2 11
2022-05-01 09:55 resmgr:cpu quantum 2 19959
2022-05-01 09:55 2 926
2022-05-01 09:56 resmgr:cpu quantum 2 16361
2022-05-01 09:56 2 765
2022-05-01 09:57 control file sequential read 2 12
2022-05-01 09:57 latch: cache buffers chains 2 2734
2022-05-01 09:57 resmgr:cpu quantum 2 15686
2022-05-01 09:57 2 985
2022-05-01 09:58 control file sequential read 2 11
2022-05-01 09:58 resmgr:cpu quantum 2 14989
2022-05-01 09:58 2 755
2022-05-01 09:59 control file sequential read 2 12
2022-05-01 09:59 resmgr:cpu quantum 2 19440
2022-05-01 09:59 2 910
The meaning of the resmgr:cpu quantum wait event is documented in WAITEVENT: "resmgr:cpu quantum" Reference Note (Doc ID 2097889.1):
Event 'resmgr: cpu quantum' is a standard event used by resource manager to control the allocation of CPU to processes. When a session waits for 'resmgr: cpu quantum' that session is waiting to be allocated a quantum of CPU time.
This wait occurs when the resource manager is enabled and is throttling CPU consumption. To reduce the occurrence of this wait event, increase the CPU allocation for the session's current consumer group.
In short: when Resource Manager is actively limiting CPU and the consumer group a process belongs to has used up its CPU allocation, the process is parked on the resmgr:cpu quantum wait, so that the group's CPU consumption stays within its limit.
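While throttling is in progress, the per-group Resource Manager statistics can be watched live in V$RSRC_CONSUMER_GROUP. A minimal sketch (the column list is my choice, not from the original case):

-- Growing CPU_WAITS / CPU_WAIT_TIME for a group means its sessions are
-- being held on resmgr:cpu quantum.
SELECT name,
       active_sessions,
       consumed_cpu_time,  -- cumulative CPU consumed (ms)
       cpu_waits,          -- times sessions waited for a CPU quantum
       cpu_wait_time       -- cumulative throttle wait (ms)
  FROM v$rsrc_consumer_group
 ORDER BY cpu_wait_time DESC;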
The OSWatcher vmstat output shows that CPU utilization was only around 50% during the very period when the resmgr:cpu quantum waits surged:
zzz ***Sun May 1 09:40:59 CST 2022
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
39 0 0 315695840 72296 21418404 0 0 624 164 0 0 2 1 96 0 0
12 0 0 315666016 72304 21419448 0 0 67 14298 96858 95762 8 2 91 0 0
7 0 0 315674624 72312 21418280 0 0 83 16846 89062 98593 6 1 93 0 0
zzz ***Sun May 1 09:41:29 CST 2022
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
17 0 0 315677088 72696 21420468 0 0 624 164 0 0 2 1 96 0 0
12 0 0 315655424 72704 21421572 0 0 8436 213 78322 83804 4 1 95 0 0
32 0 0 315662432 72736 21421272 0 0 91 8558 110771 99331 17 1 82 0 0
zzz ***Sun May 1 09:41:59 CST 2022
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
23 1 0 315660384 73020 21422720 0 0 624 164 0 0 2 1 96 0 0
12 1 0 315610208 73028 21419544 0 0 76604 4974 171824 162508 9 3 87 1 0
15 2 0 315620288 73036 21423504 0 0 5859 17278 126332 131118 8 2 90 0 0
zzz ***Sun May 1 09:42:29 CST 2022
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
100 2 0 315660832 73400 21424484 0 0 624 164 0 0 2 1 96 0 0
70 2 0 315639616 73408 21425508 0 0 41699 9383 198482 143324 48 2 49 1 0
86 3 0 315636544 73424 21425092 0 0 43388 17133 193937 144246 47 2 50 1 0
Checking the resource_manager_plan parameter shows that DEFAULT_MAINTENANCE_PLAN was in effect:
SQL> show parameter resource_manager_plan
NAME TYPE VALUE
------------------------------------ ---------- ------------------------------
resource_manager_plan string SCHEDULER[0x32DF]:DEFAULT_MAIN
TENANCE_PLAN
DEFAULT_MAINTENANCE_PLAN should look familiar: it is the resource manager plan that the 11g automated maintenance task windows enable by default. The maintenance windows are defined as follows:
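The window definitions below presumably come from DBA_SCHEDULER_WINDOWS; a query of roughly this shape produces them:

SELECT window_name, repeat_interval, duration, enabled, active
  FROM dba_scheduler_windows;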
WINDOW_NAME REPEAT_INTERVAL DURATION ENABL ACTIV
-------------------- ---------------------------------------------------------------------- -------------------- ----- -----
MONDAY_WINDOW freq=daily;byday=MON;byhour=22;byminute=0; bysecond=0 +000 04:00:00 TRUE FALSE
TUESDAY_WINDOW freq=daily;byday=TUE;byhour=22;byminute=0; bysecond=0 +000 04:00:00 TRUE FALSE
WEDNESDAY_WINDOW freq=daily;byday=WED;byhour=22;byminute=0; bysecond=0 +000 04:00:00 TRUE FALSE
THURSDAY_WINDOW freq=daily;byday=THU;byhour=22;byminute=0; bysecond=0 +000 04:00:00 TRUE FALSE
FRIDAY_WINDOW freq=daily;byday=FRI;byhour=22;byminute=0; bysecond=0 +000 04:00:00 TRUE FALSE
SATURDAY_WINDOW freq=daily;byday=SAT;byhour=6;byminute=0; bysecond=0 +000 20:00:00 TRUE FALSE
SUNDAY_WINDOW freq=daily;byday=SUN;byhour=6;byminute=0; bysecond=0 +000 20:00:00 TRUE TRUE
WEEKNIGHT_WINDOW freq=daily;byday=MON,TUE,WED,THU,FRI;byhour=22;byminute=0; bysecond=0 +000 08:00:00 FALSE FALSE
WEEKEND_WINDOW freq=daily;byday=SAT;byhour=0;byminute=0;bysecond=0 +002 00:00:00 FALSE FALSE
May 1st (Labor Day) happened to fall on a Sunday this year, and SUNDAY_WINDOW on this database opens at 06:00. The alert log confirms that the window's resource manager plan was set at that time:
2022-05-01T06:00:00.111668+08:00
Setting Resource Manager plan SCHEDULER[0x4D52]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
The consumer groups and limits under DEFAULT_MAINTENANCE_PLAN can be viewed in dba_rsrc_plan_directives:
GROUP_OR_SUBPLAN TYPE MGMT_P1 MGMT_P2 MGMT_P3 COMMENTS MAX_UTILIZATION_LIMIT
------------------------------ ------------------------------ ---------- ---------- ---------- -------------------------------------------------- ---------------------
ORA$AUTOTASK_SUB_PLAN PLAN 0 25 0 Directive for automated maintenance tasks 90
ORA$DIAGNOSTICS CONSUMER_GROUP 0 5 0 Directive for automated diagnostic tasks 90
SYS_GROUP CONSUMER_GROUP 75 0 0 Directive for system operations
OTHER_GROUPS CONSUMER_GROUP 0 70 0 Directive for all other operations
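For reference, a query of this shape against dba_rsrc_plan_directives produces the listing above:

SELECT group_or_subplan, type, mgmt_p1, mgmt_p2, mgmt_p3,
       comments, max_utilization_limit
  FROM dba_rsrc_plan_directives
 WHERE plan = 'DEFAULT_MAINTENANCE_PLAN';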
Automated maintenance tasks run under ORA$AUTOTASK_SUB_PLAN, while SYS_GROUP and OTHER_GROUPS represent SYS/SYSTEM sessions and ordinary business sessions, respectively. CPU allocations are percentage shares distributed according to the MGMT_P* priority levels, and the effective shares are adjusted dynamically. With every group busy, OTHER_GROUPS is guaranteed only about 70% of whatever is left after SYS_GROUP's 75% at level 1, i.e. roughly 25% x 70% = 17.5% of total CPU. If at some moment essentially only ORA$AUTOTASK_SUB_PLAN has work to do, its share can be raised, up to the MAX_UTILIZATION_LIMIT of 90% by default, while the share of OTHER_GROUPS, which carries the real business workload, is lowered accordingly.
When the maintenance window opens at 06:00 on a Sunday morning, very likely almost the only active sessions belong to the automated maintenance tasks, so the share of ORA$AUTOTASK_SUB_PLAN may be raised while the share of OTHER_GROUPS is lowered. If business load then starts climbing around 09:00, resmgr:cpu quantum waits follow easily.
In addition, the sessions waiting on resmgr:cpu quantum were almost all running the same SQL (columns: sample time, sql_id, instance number, session count):
2022-05-01 09:40 b81avsvs9ccpp 2 76
2022-05-01 09:41 b81avsvs9ccpp 2 66
2022-05-01 09:42 b81avsvs9ccpp 2 181
2022-05-01 09:43 3t74kg8jmw4dg 2 14
2022-05-01 09:43 b81avsvs9ccpp 2 1087
2022-05-01 09:44 3t74kg8jmw4dg 2 14
2022-05-01 09:44 b81avsvs9ccpp 2 1692
2022-05-01 09:44 fm8d2gv7gz1bq 2 12
2022-05-01 09:44 2 17
2022-05-01 09:45 3t74kg8jmw4dg 2 22
2022-05-01 09:45 b81avsvs9ccpp 2 1945
2022-05-01 09:45 fm8d2gv7gz1bq 2 12
2022-05-01 09:45 2 18
2022-05-01 09:46 3t74kg8jmw4dg 2 40
2022-05-01 09:46 b81avsvs9ccpp 2 3351
2022-05-01 09:46 2 23
2022-05-01 09:47 3t74kg8jmw4dg 2 48
2022-05-01 09:47 aunqkadqdumxj 2 16
2022-05-01 09:47 b81avsvs9ccpp 2 9598
2022-05-01 09:47 by3xu361c3mfy 2 12
2022-05-01 09:47 2 31
2022-05-01 09:48 3t74kg8jmw4dg 2 48
2022-05-01 09:48 4z322qbnykw3q 2 13
2022-05-01 09:48 b81avsvs9ccpp 2 12238
2022-05-01 09:48 by3xu361c3mfy 2 13
2022-05-01 09:48 2 33
2022-05-01 09:49 3t74kg8jmw4dg 2 77
2022-05-01 09:49 4z322qbnykw3q 2 13
2022-05-01 09:49 6zd1tmy35rf7b 2 11
2022-05-01 09:49 b81avsvs9ccpp 2 17280
2022-05-01 09:49 by3xu361c3mfy 2 23
2022-05-01 09:49 2 34
2022-05-01 09:50 3t74kg8jmw4dg 2 47
2022-05-01 09:50 4z322qbnykw3q 2 16
2022-05-01 09:50 aunqkadqdumxj 2 16
2022-05-01 09:50 b81avsvs9ccpp 2 16688
2022-05-01 09:50 by3xu361c3mfy 2 28
2022-05-01 09:50 2 16
2022-05-01 09:51 3t74kg8jmw4dg 2 59
2022-05-01 09:51 4z322qbnykw3q 2 46
2022-05-01 09:51 6zd1tmy35rf7b 2 13
2022-05-01 09:51 b81avsvs9ccpp 2 19415
2022-05-01 09:51 by3xu361c3mfy 2 28
2022-05-01 09:51 2 30
2022-05-01 09:52 3t74kg8jmw4dg 2 62
2022-05-01 09:52 4z322qbnykw3q 2 43
2022-05-01 09:52 b81avsvs9ccpp 2 18304
2022-05-01 09:52 by3xu361c3mfy 2 22
2022-05-01 09:52 fm8d2gv7gz1bq 2 12
2022-05-01 09:52 2 13
2022-05-01 09:53 3t74kg8jmw4dg 2 57
2022-05-01 09:53 4z322qbnykw3q 2 21
2022-05-01 09:53 b81avsvs9ccpp 2 15802
2022-05-01 09:53 by3xu361c3mfy 2 21
2022-05-01 09:54 3t74kg8jmw4dg 2 102
2022-05-01 09:54 4z322qbnykw3q 2 16
2022-05-01 09:54 b81avsvs9ccpp 2 16922
2022-05-01 09:54 by3xu361c3mfy 2 26
2022-05-01 09:54 fm8d2gv7gz1bq 2 14
2022-05-01 09:55 3t74kg8jmw4dg 2 106
2022-05-01 09:55 4z322qbnykw3q 2 31
2022-05-01 09:55 6zd1tmy35rf7b 2 15
2022-05-01 09:55 b81avsvs9ccpp 2 20637
2022-05-01 09:55 by3xu361c3mfy 2 38
2022-05-01 09:55 fm8d2gv7gz1bq 2 18
2022-05-01 09:55 2 31
2022-05-01 09:56 3t74kg8jmw4dg 2 92
2022-05-01 09:56 4z322qbnykw3q 2 16
2022-05-01 09:56 b81avsvs9ccpp 2 16934
2022-05-01 09:56 by3xu361c3mfy 2 32
2022-05-01 09:56 fm8d2gv7gz1bq 2 15
2022-05-01 09:56 2 20
2022-05-01 09:57 3t74kg8jmw4dg 2 111
2022-05-01 09:57 4z322qbnykw3q 2 34
2022-05-01 09:57 b81avsvs9ccpp 2 19171
2022-05-01 09:57 by3xu361c3mfy 2 30
2022-05-01 09:57 fm8d2gv7gz1bq 2 18
2022-05-01 09:57 2 34
2022-05-01 09:58 3t74kg8jmw4dg 2 104
2022-05-01 09:58 4z322qbnykw3q 2 21
2022-05-01 09:58 b81avsvs9ccpp 2 15548
This suggests that unusually high concurrency of this SQL was also a contributing factor.
****************************************************************************************
PLAN STAT FROM ASH
****************************************************************************************
SQL_ID b81avsvs9ccpp, child number 4
-------------------------------------
select a.offer_code,T.FIELD_VALUE FIELD_VALUE from pm_ext_cha t,
pm_offer a where t.from_table_name = 'TD_B_DISCNT' and t.field_name =
'DISCNT_LABLE' and t.object_id = a.offer_id and a.offer_type = 'D' and
a.OFFER_CODE =:1 order by A.EXPIRE_DATE desc
Plan hash value: 14354385
---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 1126 (100)| |latch: cache buffers chai(1)(0%) |
| |CPU(68)(.01%) |
| |SQL*Net message to client(1)(0%) |
| |CPU(1)(0%) |
| 1 | SORT ORDER BY | | 1 | 70 | 1126 (1)| 00:00:01 |CPU(10)(0%) |
| 2 | NESTED LOOPS | | 1 | 70 | 1125 (1)| 00:00:01 |CPU(39)(0%) |
| 3 | NESTED LOOPS | | 485 | 70 | 1125 (1)| 00:00:01 |CPU(125)(.02%) |
|* 4 | MAT_VIEW ACCESS FULL | MV_PM_EXT_CHA | 485 | 22310 | 349 (1)| 00:00:01 |CPU(5816)(.72%) |
| |resmgr:cpu quantum(656886)(81.55%) |
| |resmgr:internal state cha(5)(0%) |
| |latch: cache buffers chai(332)(.04%) |
|* 5 | INDEX UNIQUE SCAN | SYS_C_SNAP$_202PK_PM_OFFER | 1 | | 1 (0)| 00:00:01 |latch: cache buffers chai(57227)(7.1%|
| |wait list latch free(25)(0%) |
| |CPU(79731)(9.9%) |
|* 6 | MAT_VIEW ACCESS BY INDEX ROWID| MV_PM_OFFER | 1 | 24 | 2 (0)| 00:00:01 |latch: cache buffers chai(485)(.06%) |
| |CPU(4787)(.59%) |
---------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter(("T"."FIELD_NAME"='DISCNT_LABLE' AND "T"."FROM_TABLE_NAME"='TD_B_DISCNT'))
5 - access("T"."OBJECT_ID"="A"."OFFER_ID")
6 - filter(("A"."OFFER_CODE"=TO_NUMBER(:1) AND "A"."OFFER_TYPE"='D'))
PL/SQL procedure successfully completed.
+------------------------------------------------------------------------+
| information from awr sysdate-7 |
+------------------------------------------------------------------------+
PLAN GET DISK WRITE ROWS ROWS USER_IO(MS) ELA(MS) CPU(MS) CLUSTER(MS) PLSQL
END_TI I NAME HASH VALUE EXEC PRE EXEC PRE EXEC PER EXEC ROW_P PRE EXEC PRE FETCH PER EXEC PRE EXEC PRE EXEC PER EXEC PER EXEC
------ - --------------- ------------- ---------- ------------ -------- -------- ----- ----------- --------- ----------- -------- -------- ----------- --------
01 00 2 UOP_PARAM 14354385 88.W 1,225 0 0 22.W 0 1 0 ######## 172,251 0 0
01 00 2 UOP_PARAM 14354385 59.W 0 0 0 20.W 0 0 0 191,574 23,961 0 0
01 01 2 UOP_PARAM 14354385 19.W 10,877 0 0 4.8W 0 0 0 8,776 8,277 0 0
01 01 2 UOP_PARAM 14354385 31.W 0 0 0 8.0W 0 0 0 9,409 8,907 0 0
01 02 2 UOP_PARAM 14354385 10.W 10,893 0 0 2.6W 0 0 0 8,394 7,901 0 0
01 02 2 UOP_PARAM 14354385 13.W 0 0 0 3.3W 0 0 0 8,466 7,968 0 0
01 03 2 UOP_PARAM 14354385 7.5W 10,899 0 0 1.9W 0 0 0 8,314 7,824 0 0
01 03 2 UOP_PARAM 14354385 6.6W 10,910 0 0 1.6W 0 0 0 8,188 7,697 0 0
01 04 2 UOP_PARAM 14354385 6.7W 10,898 0 0 2.0W 0 0 0 8,156 7,661 0 0
01 04 2 UOP_PARAM 14354385 6.4W 0 0 0 1.7W 0 0 0 8,183 7,693 0 0
01 05 2 UOP_PARAM 14354385 9.3W 10,894 0 0 2.5W 0 0 0 8,342 7,844 0 0
01 05 2 UOP_PARAM 14354385 8.3W 10,900 0 0 2.4W 0 0 0 8,271 7,778 0 0
01 06 2 UOP_PARAM 14354385 10.W 10,897 0 0 2.8W 0 0 0 10,480 9,976 0 0
01 06 2 UOP_PARAM 14354385 12.W 0 0 0 3.1W 0 0 0 8,416 7,911 0 0
01 07 2 UOP_PARAM 14354385 28.W 10,881 0 0 7.6W 0 0 0 11,449 10,956 0 0
01 07 2 UOP_PARAM 14354385 35.W 0 0 0 9.9W 0 0 0 9,500 9,010 0 0
01 08 2 UOP_PARAM 14354385 41.W 441 0 0 11.W 0 0 0 9,562 9,071 0 0
01 08 2 UOP_PARAM 14354385 54.W 0 0 0 15.W 0 0 0 10,766 10,276 0 0
01 09 2 UOP_PARAM 14354385 101W 0 0 0 27.W 0 0 0 91,885 52,828 0 0
01 09 2 UOP_PARAM 14354385 59.W 3,687 0 0 16.W 0 0 0 11,327 10,832 0 0
01 10 2 UOP_PARAM 14354385 141W 1,843 0 0 38.W 0 1 0 ######## 209,143 40 0
01 10 2 UOP_PARAM 14354385 109W 0 0 0 29.W 0 0 0 ######## 136,456 0 0
01 11 2 UOP_PARAM 14354385 72.W 0 0 0 20.W 0 0 0 278,573 113,357 86 0
01 11 2 UOP_PARAM 14354385 118W 4 0 0 34.W 0 0 0 194,557 85,951 1 0
01 12 2 UOP_PARAM 14354385 98.W 0 0 0 30.W 0 0 0 44,274 32,320 0 0
01 12 2 UOP_PARAM 14354385 111W 3,182 0 0 33.W 0 0 0 110,326 57,733 0 0
01 13 2 UOP_PARAM 14354385 101W 0 0 0 30.W 0 0 0 17,081 16,574 0 0
29 08 2 UOP_PARAM 14354385 28.W 10,850 0 0 9.5W 0 0 0 9,262 8,763 0 0
29 08 2 UOP_PARAM 14354385 40.W 208 0 0 13.W 0 0 0 9,965 9,464 0 0
29 09 2 UOP_PARAM 14354385 42.W 833 0 0 13.W 0 0 0 9,988 9,487 0 0
29 09 2 UOP_PARAM 14354385 55.W 0 0 0 17.W 0 0 0 10,919 10,412 0 0
29 10 2 UOP_PARAM 14354385 64.W 0 0 0 21.W 0 0 0 11,627 11,119 1 0
29 10 2 UOP_PARAM 14354385 58.W 3,417 0 0 18.W 0 0 0 11,142 10,639 0 0
29 11 2 UOP_PARAM 14354385 73.W 0 0 0 23.W 0 0 0 12,627 12,118 1 0
29 11 2 UOP_PARAM 14354385 70.W 0 0 0 22.W 0 0 0 12,308 11,795 0 0
29 12 2 UOP_PARAM 14354385 70.W 0 0 0 22.W 0 0 0 12,261 11,749 0 0
29 12 2 UOP_PARAM 14354385 70.W 4,754 0 0 22.W 0 0 0 12,485 11,970 0 0
29 13 2 UOP_PARAM 14354385 65.W 0 0 0 22.W 0 0 0 11,713 11,202 0 0
29 13 2 UOP_PARAM 14354385 70.W 0 0 0 22.W 0 0 0 12,102 11,591
The comparison shows that the SQL's concurrency during the incident was indeed about double its earlier level (in the listing, "W" apparently abbreviates 万, i.e. units of 10,000, and "########" marks column overflow). Moreover, for a SQL executed this frequently, the plan is not efficient: a non-aggregate query that returns very few rows on average yet burns thousands, sometimes over ten thousand, buffer gets per execution is unreasonable. Had this SQL been well tuned, the incident might not have occurred at all.
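The gets-per-execution ratio that flags this SQL can be pulled from AWR directly; a minimal sketch against DBA_HIST_SQLSTAT (the formatting script behind the listing above is not shown, so this is an assumption):

-- Per-snapshot averages for the suspect SQL: thousands of buffer gets
-- for a handful of rows per execution points at a tuning opportunity.
SELECT snap_id,
       plan_hash_value,
       executions_delta AS execs,
       ROUND(buffer_gets_delta    / NULLIF(executions_delta, 0))    AS gets_per_exec,
       ROUND(rows_processed_delta / NULLIF(executions_delta, 0), 1) AS rows_per_exec
  FROM dba_hist_sqlstat
 WHERE sql_id = 'b81avsvs9ccpp'
 ORDER BY snap_id;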
Recall that Resource Manager computes its CPU percentage allocations against the cpu_count parameter. On this database cpu_count was 136 while the host actually has 160 logical CPUs, and the host runs only this one instance. That is why host CPU utilization never reached 100% and hovered around 85-90% (136/160 = 85%). The gap between cpu_count and the logical CPU count is small, but it still counts as another contributing factor in this incident.
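A quick way to cross-check the two figures (a minimal sketch):

-- cpu_count as seen by the instance vs. logical CPUs reported by the OS.
SELECT p.value AS cpu_count,
       o.value AS os_num_cpus
  FROM v$parameter p, v$osstat o
 WHERE p.name = 'cpu_count'
   AND o.stat_name = 'NUM_CPUS';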
Final recommendations:
- If the system also carries business load during weekend daytime, move the automated maintenance windows, for example to 22:00 every day (see the sketch after this list).
- Disable the resource manager plan that the maintenance windows enable by default (also sketched below).
- Tune the SQL that executes very frequently but inefficiently.
- After confirming that the host runs only this one instance, keep cpu_count consistent with the actual logical CPU count; this parameter feeds the sizing of many resources and background process counts, such as LMS, LMD, and the ADG pr* recovery processes.
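Minimal sketches for the first two recommendations (the window name, new start time, and choice of windows to change are illustrative; adapt per site and test first):

-- Move SUNDAY_WINDOW so it opens at 22:00 instead of 06:00.
BEGIN
  DBMS_SCHEDULER.SET_ATTRIBUTE(
    name      => 'SYS.SUNDAY_WINDOW',
    attribute => 'REPEAT_INTERVAL',
    value     => 'freq=daily;byday=SUN;byhour=22;byminute=0;bysecond=0');
END;
/

-- Keep the window from switching the instance to DEFAULT_MAINTENANCE_PLAN
-- by clearing its resource plan (an empty value means no plan change).
BEGIN
  DBMS_SCHEDULER.SET_ATTRIBUTE(
    name      => 'SYS.SUNDAY_WINDOW',
    attribute => 'RESOURCE_PLAN',
    value     => '');
END;
/

Alternatively, pinning the parameter with the FORCE: prefix (for example ALTER SYSTEM SET resource_manager_plan = 'FORCE:'; to force an empty plan) stops scheduler windows from switching plans at all.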
------------------ About the Author -----------------------
Name: 黄廷忠 (Huang Tingzhong)
Current position: Oracle China Advanced Services team
Previous employers: OceanBase, 云和恩墨 (Enmotech), 东方龙马, among others
Phone / WeChat / QQ: 18081072613
Personal blog: (http://www.htz.pw)
CSDN: (https://blog.csdn.net/wwwhtzpw)
cnblogs: (https://www.cnblogs.com/www-htz-pw)