我们在IO问题发生时间段:
1,可以查dba_hist_system_event 来看该IO相关等待时间的平均响应时间的变化趋势,
2,直接查dba_hist_event_histogram或者sys.WRH$_EVENT_HISTOGRAM , 来看该IO相关等待事件的等待时间直方图分布。
3, 我们也可以从整体数据库的IO变化量上(dba_hist_sysstat), 看出问题时段的总体IO量, 比正常时候的IO量到底是否有增加。
4, 我们也可以从某个IO function 即某个进程上面(dba_hist_iostat_function),看问题时间段某个进程的IO request次数、IO 的量。
比如说LGWR进程的写request次数、 写的量。
说明:
1, 如上视图均可关联dba_hist_snapshot
2, 所有dba_hist 相关视图都是由对应的v$ 动态性能视图汇聚而来, 当时的数据可以查对应的v$视图。
福利: 后附sql。
例2:
直方图分布可能更加精确, 因为查看平均响应时间可能会因为粒度太粗、时间跨度长而被平均了。
正常情况下,应该<8ms的占据了90%甚至更多, > 32ms 的应该占比接近为0。
可以看到, 昨天smscp异常时候,在16 – 32 ms 之间的已经占比很多了。
END_INTERVAL_TIME SNAP_ID TIME_HISTG TOTAL_WAIT PER_OF_WAITS
1 22-10月-14 03.00.49.801 下午 20536 <1ms 89464 0.71
2 22-10月-14 03.00.49.801 下午 20536 <2ms 89464 11.32
3 22-10月-14 03.00.49.801 下午 20536 <4ms 89464 26.39
4 22-10月-14 03.00.49.801 下午 20536 <8ms 89464 18.28
5 22-10月-14 03.00.49.801 下午 20536 <16ms 89464 13.16
6 22-10月-14 03.00.49.801 下午 20536 <32ms 89464 8.21
7 22-10月-14 03.00.49.801 下午 20536 <64ms 89464 7.56
8 22-10月-14 03.00.49.801 下午 20536 <128ms 89464 6.06
9 22-10月-14 03.00.49.801 下午 20536 <256ms 89464 4.72
10 22-10月-14 03.00.49.801 下午 20536 <512ms 89464 2.72
11 22-10月-14 03.00.49.801 下午 20536 <1s 89464 0.83
12 22-10月-14 03.00.49.801 下午 20536 <2s 89464 0.03
13 22-10月-14 03.00.49.801 下午 20536 <4s 89464 0
14 22-10月-14 03.00.49.801 下午 20536 >=4s 89464 0
正常情况下是这样的,8ms以上的占比很少:
END_INTERVAL_TIME SNAP_ID TIME_HISTG TOTAL_WAIT PER_OF_WAITS
1 22-10月-14 07.00.49.701 下午 20540 <1ms 1091640 38.28
2 22-10月-14 07.00.49.701 下午 20540 <2ms 1091640 51.09
3 22-10月-14 07.00.49.701 下午 20540 <4ms 1091640 9.03
4 22-10月-14 07.00.49.701 下午 20540 <8ms 1091640 1.37
5 22-10月-14 07.00.49.701 下午 20540 <16ms 1091640 0.17
6 22-10月-14 07.00.49.701 下午 20540 <32ms 1091640 0.04
7 22-10月-14 07.00.49.701 下午 20540 <64ms 1091640 0.01
8 22-10月-14 07.00.49.701 下午 20540 <128ms 1091640 0
9 22-10月-14 07.00.49.701 下午 20540 <256ms 1091640 0
10 22-10月-14 07.00.49.701 下午 20540 <512ms 1091640 0
11 22-10月-14 07.00.49.701 下午 20540 <1s 1091640 0
12 22-10月-14 07.00.49.701 下午 20540 <2s 1091640 0
13 22-10月-14 07.00.49.701 下午 20540 <4s 1091640 0
14 22-10月-14 07.00.49.701 下午 20540 >=4s 1091640 0
例3:
可以看到昨天下午的整库的physical read total IO requests 并没有明显增加, 但是physical read total bytes 增加到了3000GB 一个小时。
其它时候也有这么高的,但是都是在晚上, 只有昨天是在下午15点这个生产时间的snapshot,开发有大量direct path read 所以对应用产生了较大影响。
例4:可以看到问题发生时段,10.22 14点-15点, LGWR的large_write_megabytes 和requests有明显增加, 而 平时有如此高峰的是有晚上和凌晨。
Smll read 的bytes和request 都没有增加。
附sql:
— 查等待事件的等待时间直方图分布:
select b.end_interval_time,a.*
from
(select snap_id,
decode(wait_time_milli,
1,
'<1ms',
2,
'<2ms',
4,
'<4ms',
8,
'<8ms',
16,
'<16ms',
32,
'<32ms',
64,
'<64ms',
128,
'<128ms',
256,
'<256ms',
512,
'<512ms',
1024,
'<1s',
2048,
'<2s',
4096,
'<4s',
8192,
'>=4s',
wait_time_milli) time_histg,
total_wait,
round(delta_wait_count / total_wait * 100, 2) as per_of_waits
from (select snap_id,
wait_time_milli,
delta_wait_count,
sum(delta_wait_count) over(partition by snap_id order by snap_id) as total_wait
from (select xx.snap_id,
xx.wait_time_milli,
xx.wait_count - yy.wait_count as delta_wait_count
from dba_hist_event_histogram xx,
dba_hist_event_histogram yy
where xx.event_name = 'log file parallel write'
and xx.event_id = yy.event_id
and xx.snap_id = yy.snap_id + 1
and xx.instance_number = yy.instance_number
and xx.wait_time_milli = yy.wait_time_milli
and xx.snap_id in (20540)))
order by snap_id, wait_time_milli) a , dba_hist_snapshot b
where a.snap_id=b.snap_id;
— 查等待事件的平均等待时间的变化趋势:
with t as (
select s.snap_id, s.instance_number, s.end_interval_time ,
total_waits - lag(total_waits, 1) over (partition by s.instance_number order by s.snap_id) waits,
(time_waited_micro - lag(time_waited_micro, 1) over (partition by s.instance_number order by s.snap_id))/1000 twt
from dba_hist_system_event ev, dba_hist_snapshot s
where ev.instance_number = s.instance_number and ev.snap_id = s.snap_id
and event_name = 'log file parallel write'
and s.end_interval_time BETWEEN TO_DATE('20140401 10:00:00', 'yyyymmdd hh24:mi:ss') AND TO_DATE('20140414 11:00:00', 'yyyymmdd hh24:mi:ss') )
select to_char(end_interval_time,’YYYYMMDD HH24:MI’), instance_number, sum(waits), sum(twt), round(sum(twt)/sum(waits), 2) wt
from t
where (to_char(end_interval_time, ‘hh24’) between 9 and 12 or to_char(end_interval_time, ‘hh24’) between 14 and 17)
group by to_char(end_interval_time,’YYYYMMDD HH24:MI’), instance_number
order by 1, instance_number;
— 查整库IO相关指标:
–get_stat_10g.sql
with s as (
select sn.instance_number, sn.snap_id, sn.end_interval_time stime, (s.value – lag(s.value, 1) over (partition by sn.instance_number order by sn.snap_id)) value
–(sn.end_interval_time – lag(sn.end_interval_time, 1) over (order by sn.snap_id))
24
60*60 elap_time,
–round((s.value – lag(s.value, 1) over (order by sn.snap_id))/((sn.end_interval_time – lag(sn.end_interval_time, 1) over (order by sn.snap_id))
24
60*60)) value_s
from dba_hist_snapshot sn, dba_hist_sysstat s
where s.snap_id = sn.snap_id and s.stat_name = ‘&1’ and s.instance_number = sn.instance_number
)
select instance_number , snap_id, stime, value ,
rank() over (partition by instance_number order by value desc) ranking, min(value) over (partition by instance_number) min_phy,
max(value) over (partition by instance_number) max_phy, round(avg(value) over (partition by instance_number)) avg_phy
from s
where (to_char(stime, ‘hh24’) between 9 and 12 or to_char(stime, ‘hh24’) between 14 and 17)
and to_char(stime, ‘D’) not in (‘1’, ‘7’)
and stime between to_date(&2, ‘yyyymmdd’) and to_date(&3, ‘yyyymmdd’)
order by instance_number, snap_id
/
—eg: @get_stat_10g.sql ‘parse count (hard)’ 20100716 20100726
—physical read total IO requests
—physical read total bytes
—physical write total IO requests
—physical write total bytes
—以上四个指标即IOPS(requests)、IO吞吐量(bytes)
— 查整库IO相关指标2:
select e.snap_id,
to_char(e.startup_time, 'yyyy-mm-dd hh24:mi:ss') instance_startup_time,
to_char(e.end_interval_time, 'yyyy-mm-dd hh24:mi:ss') snapshot_end_time,
b.value - a.value IO_requests,
round((d.value - c.value)/1024/1024/1024) IO_GB
from dba_hist_sysstat a,
dba_hist_sysstat b,
dba_hist_sysstat c,
dba_hist_sysstat d,
dba_hist_snapshot e
where a.stat_name = ‘physical read total IO requests’
and b.stat_name = ‘physical read total IO requests’
and c.stat_name = ‘physical read total bytes’
and d.stat_name = ‘physical read total bytes’
and a.snap_id = e.snap_id – 1
and b.snap_id = e.snap_id
and c.snap_id = e.snap_id – 1
and d.snap_id = e.snap_id
and e.end_interval_time BETWEEN
TO_DATE('20141021 09:00:00', 'yyyymmdd hh24:mi:ss') AND
TO_DATE('20141023 18:00:00', 'yyyymmdd hh24:mi:ss')
order by e.begin_interval_time;
— 查某个IO 进程的读写指标, 例如LGWR:
SQL> select * from dba_hist_iostat_function_name;
DBID FUNCTION_ID FUNCTION_NAME
———- ———– ——————————
1154030563 0 RMAN
1154030563 1 DBWR
1154030563 2 LGWR
1154030563 3 ARCH
1154030563 4 XDB
1154030563 5 Streams AQ
1154030563 6 Data Pump
1154030563 7 Recovery
1154030563 8 Buffer Cache Reads
1154030563 9 Direct Reads
1154030563 10 Direct Writes
1154030563 11 Smart Scan
1154030563 12 Archive Manager
1154030563 13 Others
14 rows selected
SQL>
— 查某个IO进程例如LGWR的IO requset和IO bytes dba_hist_iostat_function:
select e.snap_id,
to_char(e.end_interval_time, 'yyyy-mm-dd hh24:mi:ss') snapshot_end_time,
b.SMALL_READ_MEGABYTES - a.SMALL_READ_MEGABYTES SMALL_READ_MEGABYTES ,
b.SMALL_WRITE_MEGABYTES - a.SMALL_WRITE_MEGABYTES SMALL_WRITE_MEGABYTES ,
b.LARGE_READ_MEGABYTES - a.LARGE_READ_MEGABYTES LARGE_READ_MEGABYTES ,
b.LARGE_WRITE_MEGABYTES - a.LARGE_WRITE_MEGABYTES LARGE_WRITE_MEGABYTES ,
b.SMALL_READ_REQS - a.SMALL_READ_REQS SMALL_READ_REQS ,
b.SMALL_WRITE_REQS - a.SMALL_WRITE_REQS SMALL_WRITE_REQS ,
b.LARGE_READ_REQS - a.LARGE_READ_REQS LARGE_READ_REQS ,
b.LARGE_WRITE_REQS - a.LARGE_WRITE_REQS LARGE_WRITE_REQS ,
b.NUMBER_OF_WAITS - a.NUMBER_OF_WAITS NUMBER_OF_WAITS ,
b.WAIT_TIME - a.WAIT_TIME WAIT_TIME
from dba_hist_iostat_function a,
dba_hist_iostat_function b,
dba_hist_snapshot e
where a.FUNCTION_NAME = ‘LGWR’
and b.FUNCTION_NAME = ‘LGWR’
and a.snap_id = e.snap_id – 1
and b.snap_id = e.snap_id
and e.end_interval_time BETWEEN
TO_DATE('20141021 09:00:00', 'yyyymmdd hh24:mi:ss') AND
TO_DATE('20141023 18:00:00', 'yyyymmdd hh24:mi:ss')
order by e.begin_interval_time;