oracle IO性能分析

  • Post author:
  • Post category:其他


我们在IO问题发生时间段:

1,可以查dba_hist_system_event 来看该IO相关等待时间的平均响应时间的变化趋势,

2,直接查dba_hist_event_histogram或者sys.WRH$_EVENT_HISTOGRAM , 来看该IO相关等待事件的等待时间直方图分布。

3, 我们也可以从整体数据库的IO变化量上(dba_hist_sysstat), 看出问题时段的总体IO量, 比正常时候的IO量到底是否有增加。

4, 我们也可以从某个IO function 即某个进程上面(dba_hist_iostat_function),看问题时间段某个进程的IO request次数、IO 的量。

比如说LGWR进程的写request次数、 写的量。

说明:

1, 如上视图均可关联dba_hist_snapshot

2, 所有dba_hist 相关视图都是由对应的v$ 动态性能视图汇聚而来, 当时的数据可以查对应的v$视图。

福利: 后附sql。

例2:

直方图分布可能更加精确, 因为查看平均响应时间可能会因为粒度太粗、时间跨度长而被平均了。

正常情况下,应该<8ms的占据了90%甚至更多, > 32ms 的应该占比接近为0。

可以看到, 昨天smscp异常时候,在16 – 32 ms 之间的已经占比很多了。

   END_INTERVAL_TIME    SNAP_ID    TIME_HISTG    TOTAL_WAIT    PER_OF_WAITS

1 22-10月-14 03.00.49.801 下午 20536 <1ms 89464 0.71

2 22-10月-14 03.00.49.801 下午 20536 <2ms 89464 11.32

3 22-10月-14 03.00.49.801 下午 20536 <4ms 89464 26.39

4 22-10月-14 03.00.49.801 下午 20536 <8ms 89464 18.28

5 22-10月-14 03.00.49.801 下午 20536 <16ms 89464 13.16

6 22-10月-14 03.00.49.801 下午 20536 <32ms 89464 8.21

7 22-10月-14 03.00.49.801 下午 20536 <64ms 89464 7.56

8 22-10月-14 03.00.49.801 下午 20536 <128ms 89464 6.06

9 22-10月-14 03.00.49.801 下午 20536 <256ms 89464 4.72

10 22-10月-14 03.00.49.801 下午 20536 <512ms 89464 2.72

11 22-10月-14 03.00.49.801 下午 20536 <1s 89464 0.83

12 22-10月-14 03.00.49.801 下午 20536 <2s 89464 0.03

13 22-10月-14 03.00.49.801 下午 20536 <4s 89464 0

14 22-10月-14 03.00.49.801 下午 20536 >=4s 89464 0

正常情况下是这样的,8ms以上的占比很少:

   END_INTERVAL_TIME    SNAP_ID    TIME_HISTG    TOTAL_WAIT    PER_OF_WAITS

1 22-10月-14 07.00.49.701 下午 20540 <1ms 1091640 38.28

2 22-10月-14 07.00.49.701 下午 20540 <2ms 1091640 51.09

3 22-10月-14 07.00.49.701 下午 20540 <4ms 1091640 9.03

4 22-10月-14 07.00.49.701 下午 20540 <8ms 1091640 1.37

5 22-10月-14 07.00.49.701 下午 20540 <16ms 1091640 0.17

6 22-10月-14 07.00.49.701 下午 20540 <32ms 1091640 0.04

7 22-10月-14 07.00.49.701 下午 20540 <64ms 1091640 0.01

8 22-10月-14 07.00.49.701 下午 20540 <128ms 1091640 0

9 22-10月-14 07.00.49.701 下午 20540 <256ms 1091640 0

10 22-10月-14 07.00.49.701 下午 20540 <512ms 1091640 0

11 22-10月-14 07.00.49.701 下午 20540 <1s 1091640 0

12 22-10月-14 07.00.49.701 下午 20540 <2s 1091640 0

13 22-10月-14 07.00.49.701 下午 20540 <4s 1091640 0

14 22-10月-14 07.00.49.701 下午 20540 >=4s 1091640 0

例3:

可以看到昨天下午的整库的physical read total IO requests 并没有明显增加, 但是physical read total bytes 增加到了3000GB 一个小时。

其它时候也有这么高的,但是都是在晚上, 只有昨天是在下午15点这个生产时间的snapshot,开发有大量direct path read 所以对应用产生了较大影响。

例4:可以看到问题发生时段,10.22 14点-15点, LGWR的large_write_megabytes 和requests有明显增加, 而 平时有如此高峰的是有晚上和凌晨。

Smll read 的bytes和request 都没有增加。

附sql:

— 查等待事件的等待时间直方图分布:

select b.end_interval_time,a.*

from

(select snap_id,

   decode(wait_time_milli,
          1,
          '<1ms',
          2,
          '<2ms',
          4,
          '<4ms',
          8,
         '<8ms',
         16,
         '<16ms',
         32,
         '<32ms',
         64,
         '<64ms',
         128,
         '<128ms',
         256,
         '<256ms',
         512,
         '<512ms',
         1024,
         '<1s',
         2048,
         '<2s',
         4096,
         '<4s',
         8192,
         '>=4s',
         wait_time_milli) time_histg,
  total_wait,
  round(delta_wait_count / total_wait * 100, 2) as per_of_waits

from (select snap_id,

          wait_time_milli,
          delta_wait_count,
          sum(delta_wait_count) over(partition by snap_id order by snap_id) as total_wait
     from (select xx.snap_id,
                  xx.wait_time_milli,
                  xx.wait_count - yy.wait_count as delta_wait_count
             from dba_hist_event_histogram xx,
                  dba_hist_event_histogram yy
            where xx.event_name = 'log file parallel write'
              and xx.event_id = yy.event_id
              and xx.snap_id = yy.snap_id + 1
              and xx.instance_number = yy.instance_number
              and xx.wait_time_milli = yy.wait_time_milli
              and xx.snap_id in (20540)))

order by snap_id, wait_time_milli) a , dba_hist_snapshot b

where a.snap_id=b.snap_id;

— 查等待事件的平均等待时间的变化趋势:

with t as (

select s.snap_id, s.instance_number, s.end_interval_time ,

     total_waits - lag(total_waits, 1) over (partition by s.instance_number order by s.snap_id) waits,
     (time_waited_micro - lag(time_waited_micro, 1) over (partition by s.instance_number order by s.snap_id))/1000 twt
from dba_hist_system_event ev, dba_hist_snapshot s

where ev.instance_number = s.instance_number and ev.snap_id = s.snap_id

 and event_name = 'log file parallel write'
 and s.end_interval_time BETWEEN TO_DATE('20140401 10:00:00', 'yyyymmdd hh24:mi:ss') AND  TO_DATE('20140414 11:00:00', 'yyyymmdd hh24:mi:ss') )

select to_char(end_interval_time,’YYYYMMDD HH24:MI’), instance_number, sum(waits), sum(twt), round(sum(twt)/sum(waits), 2) wt

from t

where (to_char(end_interval_time, ‘hh24’) between 9 and 12 or to_char(end_interval_time, ‘hh24’) between 14 and 17)

group by to_char(end_interval_time,’YYYYMMDD HH24:MI’), instance_number

order by 1, instance_number;

— 查整库IO相关指标:

–get_stat_10g.sql

with s as (

select sn.instance_number, sn.snap_id, sn.end_interval_time stime, (s.value – lag(s.value, 1) over (partition by sn.instance_number order by sn.snap_id)) value

–(sn.end_interval_time – lag(sn.end_interval_time, 1) over (order by sn.snap_id))

24

60*60 elap_time,

–round((s.value – lag(s.value, 1) over (order by sn.snap_id))/((sn.end_interval_time – lag(sn.end_interval_time, 1) over (order by sn.snap_id))

24

60*60)) value_s

from dba_hist_snapshot sn, dba_hist_sysstat s

where s.snap_id = sn.snap_id and s.stat_name = ‘&1’ and s.instance_number = sn.instance_number

)

select instance_number , snap_id, stime, value ,

rank() over (partition by instance_number order by value desc) ranking, min(value) over (partition by instance_number) min_phy,

max(value) over (partition by instance_number) max_phy, round(avg(value) over (partition by instance_number)) avg_phy

from s

where (to_char(stime, ‘hh24’) between 9 and 12 or to_char(stime, ‘hh24’) between 14 and 17)

and to_char(stime, ‘D’) not in (‘1’, ‘7’)

and stime between to_date(&2, ‘yyyymmdd’) and to_date(&3, ‘yyyymmdd’)

order by instance_number, snap_id

/

—eg: @get_stat_10g.sql ‘parse count (hard)’ 20100716 20100726

—physical read total IO requests

—physical read total bytes

—physical write total IO requests

—physical write total bytes

—以上四个指标即IOPS(requests)、IO吞吐量(bytes)

— 查整库IO相关指标2:

select e.snap_id,

   to_char(e.startup_time, 'yyyy-mm-dd hh24:mi:ss') instance_startup_time,
   to_char(e.end_interval_time, 'yyyy-mm-dd hh24:mi:ss') snapshot_end_time,
   b.value - a.value IO_requests,
   round((d.value - c.value)/1024/1024/1024) IO_GB

from dba_hist_sysstat a,

   dba_hist_sysstat  b,
   dba_hist_sysstat  c,
   dba_hist_sysstat  d,
   dba_hist_snapshot e

where a.stat_name = ‘physical read total IO requests’

and b.stat_name = ‘physical read total IO requests’

and c.stat_name = ‘physical read total bytes’

and d.stat_name = ‘physical read total bytes’

and a.snap_id = e.snap_id – 1

and b.snap_id = e.snap_id

and c.snap_id = e.snap_id – 1

and d.snap_id = e.snap_id

and e.end_interval_time BETWEEN

   TO_DATE('20141021 09:00:00', 'yyyymmdd hh24:mi:ss') AND
   TO_DATE('20141023 18:00:00', 'yyyymmdd hh24:mi:ss')

order by e.begin_interval_time;

— 查某个IO 进程的读写指标, 例如LGWR:

SQL> select * from dba_hist_iostat_function_name;

  DBID FUNCTION_ID FUNCTION_NAME

———- ———– ——————————

1154030563 0 RMAN

1154030563 1 DBWR

1154030563 2 LGWR

1154030563 3 ARCH

1154030563 4 XDB

1154030563 5 Streams AQ

1154030563 6 Data Pump

1154030563 7 Recovery

1154030563 8 Buffer Cache Reads

1154030563 9 Direct Reads

1154030563 10 Direct Writes

1154030563 11 Smart Scan

1154030563 12 Archive Manager

1154030563 13 Others

14 rows selected

SQL>

— 查某个IO进程例如LGWR的IO requset和IO bytes dba_hist_iostat_function:

select e.snap_id,

   to_char(e.end_interval_time, 'yyyy-mm-dd hh24:mi:ss') snapshot_end_time,
   b.SMALL_READ_MEGABYTES  - a.SMALL_READ_MEGABYTES     SMALL_READ_MEGABYTES   ,                                                                                                
   b.SMALL_WRITE_MEGABYTES - a.SMALL_WRITE_MEGABYTES    SMALL_WRITE_MEGABYTES  ,                                                                                              
   b.LARGE_READ_MEGABYTES  - a.LARGE_READ_MEGABYTES     LARGE_READ_MEGABYTES   ,                                                                                             
   b.LARGE_WRITE_MEGABYTES - a.LARGE_WRITE_MEGABYTES    LARGE_WRITE_MEGABYTES  ,                                                                                            
   b.SMALL_READ_REQS       - a.SMALL_READ_REQS          SMALL_READ_REQS        ,                                                                                           
   b.SMALL_WRITE_REQS      - a.SMALL_WRITE_REQS         SMALL_WRITE_REQS       ,                                                                                          
   b.LARGE_READ_REQS       - a.LARGE_READ_REQS          LARGE_READ_REQS        ,                                                                                         
   b.LARGE_WRITE_REQS      - a.LARGE_WRITE_REQS         LARGE_WRITE_REQS       ,                                                                                        
   b.NUMBER_OF_WAITS       - a.NUMBER_OF_WAITS          NUMBER_OF_WAITS        ,                                                                                       
   b.WAIT_TIME             - a.WAIT_TIME                WAIT_TIME                                                                        

from dba_hist_iostat_function a,

   dba_hist_iostat_function  b,
   dba_hist_snapshot e

where a.FUNCTION_NAME = ‘LGWR’

and b.FUNCTION_NAME = ‘LGWR’

and a.snap_id = e.snap_id – 1

and b.snap_id = e.snap_id

and e.end_interval_time BETWEEN

   TO_DATE('20141021 09:00:00', 'yyyymmdd hh24:mi:ss') AND
   TO_DATE('20141023 18:00:00', 'yyyymmdd hh24:mi:ss')

order by e.begin_interval_time;