Zombie processes such as [sh] making the system very slow (ORA-00445)



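During today's lunch break I received a request: the system was very slow, and top showed that most of the processes were zombie processes.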

Oracle version: 11.2.3.x

Linux 5.6 x86-64


On seeing zombie processes, my first suspicion was a problem in cron.

I checked with crontab -l and crontab -u oracle -l

and found that there were no jobs at all.




For cron-related problems of this kind, you can append > /dev/null 2>&1 to the end of each line in the crontab.
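To make the redirection concrete, here is a minimal sketch. The function names noisy_job and run_quietly, and the schedule and script path in the final comment, are hypothetical placeholders and not taken from the affected system.

```shell
# Hypothetical stand-in for a cron job that writes to both stdout and stderr.
noisy_job() {
    echo "normal output"
    echo "error output" >&2
}

# "> /dev/null" discards stdout; "2>&1" then points stderr at the same place,
# so the command produces no output for cron to capture and mail.
run_quietly() {
    "$@" > /dev/null 2>&1
}

# The equivalent crontab entry (schedule and path are placeholders):
# */5 * * * * /path/to/job.sh > /dev/null 2>&1
```

The order matters: 2>&1 has to come after > /dev/null, otherwise stderr is duplicated onto the original stdout before stdout is redirected.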




Reference: http://blog.sina.com.cn/s/blog_6226824c01014ee8.html

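Next, using pstree -ap,

I saw that these zombie processes were children of the psp0, cjq0, mmon and smco processes.

I searched Metalink for "defunct processes" and compared the results with the messages in the alert log, which led me to the following document: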
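As a complement to reading the pstree output by eye, the following sketch counts zombie children per parent PID. The function name zombies_by_parent is made up for this example, and it assumes a procps-style ps that accepts repeated -o options with empty headers.

```shell
# Read "STAT PPID PID COMMAND" lines on stdin and print "<ppid> <count>"
# for every parent that has children in state Z (zombie / defunct).
zombies_by_parent() {
    awk '$1 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}

# Usage on a live system (each "-o name=" suppresses that column's header):
# ps -e -o stat= -o ppid= -o pid= -o comm= | zombies_by_parent
```

The parent PIDs it prints can then be looked up with ps -p to see which background process owns the zombies.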









2014-2-9






PROBLEM



In this Document


Symptoms

Changes

Cause

Solution

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 to 12.1.0.1 [Release 11.2 to 12.1]

CRM On Demand - Version N/A to N/A

IBM: Linux on System z

Linux x86-64

Linux x86



SYMPTOMS

Errors are seen in the alert log relating to spawning of processes such as:

@ Checked for relevance on 17th Jan 2012

ORA-00445: background process "m001" did not start after 120 seconds

Incident details in: /opt/u01/app/oracle/diag/rdbms/incident/incdir_3721/db1_mmon_7417_i3721.trc

ERROR: Unable to normalize symbol name for the following short stack (at offset 2):

Tue Jun 21 03:03:06 2011

ORA-00445: background process "J003" did not start after 120 seconds


or

Waited for process W002 to initialize for 60 seconds

The system appears to be running very slowly and defunct processes can appear.
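To see quickly which background processes are affected and how often, the alert log can be summarized with a short filter. This is only a sketch: the function name ora445_summary is invented here, the alert log path in the usage comment is a placeholder, and it assumes the message format shown above with the process name in straight double quotes.

```shell
# Read alert-log text on stdin and print "<process> <count>" for each
# background process named in an ORA-00445 message.
ora445_summary() {
    grep 'ORA-00445' |
        sed 's/.*background process "\([^"]*\)".*/\1/' |
        LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (the path is a placeholder for the instance's alert log):
# ora445_summary < /path/to/alert_SID.log
```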



CHANGES

REDHAT 5 kernel 2.6.18-194.el5 #1 SMP Tue Mar 16

Oracle 11.2.0.2 Single Instance

IBM: Linux on System z



CAUSE

Recent Linux kernels have a feature called Address Space Layout Randomization (ASLR).

ASLR is activated by default on some of the newer Linux distributions.

It is designed to load shared memory objects at random addresses.

In Oracle, multiple processes map a shared memory object at the same address across the processes.

With ASLR turned on Oracle cannot guarantee the availability of this shared memory address.

This conflict in the address space means that a process trying to attach a shared memory object to a specific address may not be able to do so, resulting in a failure in the shmat subroutine.

However, on subsequent retry (using a new process) the shared memory attachment may work.

The result is a “random” set of failures in the alert log.
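One visible effect of ASLR can be observed from the shell. The sketch below does not reproduce the shmat failure itself; it only shows, under the assumption of a Linux system with /proc mounted, that the stack mapping of a freshly started process moves between runs. The function name stack_range is made up for this example.

```shell
# Print the address range of the [stack] mapping from
# /proc/<pid>/maps-style text on stdin.
stack_range() {
    awk '$NF == "[stack]" { print $1 }'
}

# Each "cat" is a new process reading its own maps. With ASLR enabled the two
# ranges usually differ; with kernel.randomize_va_space=0 they should match.
# cat /proc/self/maps | stack_range
# cat /proc/self/maps | stack_range
```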



SOLUTION

It should be noted that this problem has only been positively diagnosed in Redhat 5 and Oracle 11.2.0.2.

It is also likely, as per unpublished BUG:8527473, that this issue will reproduce on Generic Linux platforms running any Oracle 11.2.0.x or 12.1.0.x release on Redhat/OEL kernels which have ASLR.

This issue has been seen in both Single Instance and RAC environments.

ASLR also exists in SLES 10 and SLES 11 kernels and by default ASLR is turned on. To date no problem has been seen on SuSE servers running Oracle, but Novell confirm ASLR may cause problems. Please refer to:

http://www.novell.com/support/kb/doc.php?id=7004855 (mmap occasionally infringes on stack)

You can verify whether ASLR is being used as follows:

# /sbin/sysctl -a | grep randomize

kernel.randomize_va_space = 1

If the parameter is set to any value other than 0 then ASLR is in use.
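The same check can be wrapped in a small function that reads the value straight from /proc. This is a sketch only: the function name aslr_status is hypothetical, and the optional file argument exists purely so the logic can be exercised without touching the live kernel setting.

```shell
# Print "disabled" if the stored value is 0, otherwise report ASLR as enabled.
# Defaults to the live kernel setting; pass a file path to test the logic.
aslr_status() {
    file=${1:-/proc/sys/kernel/randomize_va_space}
    value=$(cat "$file") || return 1
    if [ "$value" = "0" ]; then
        echo "disabled"
    else
        echo "enabled (randomize_va_space=$value)"
    fi
}

# Usage:
# aslr_status
```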

On Redhat 5, to permanently disable ASLR, add or modify the following parameters in /etc/sysctl.conf:



kernel.randomize_va_space=0

kernel.exec-shield=0


You need to reboot for the kernel.exec-shield parameter to take effect.

Note that both kernel parameters are required for ASLR to be switched off.
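Editing /etc/sysctl.conf by hand is fine, but the change can also be scripted so that it is safe to run more than once. The following is a minimal sketch, not part of the Oracle note: the function name set_sysctl_conf is hypothetical, and it assumes simple "key = value" lines as used in sysctl.conf.

```shell
# Set key=value in a sysctl.conf-style file: rewrite the line if the key is
# already present, otherwise append it. The file path is passed explicitly.
set_sysctl_conf() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        BEGIN { FS = "=" }
        {
            name = $1
            gsub(/[ \t]/, "", name)      # tolerate "key = value" spacing
            if (name == k) { if (!done) print k "=" v; done = 1; next }
            print
        }
        END { if (!done) print k "=" v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # keep the original inode and mode
    rm -f "$tmp"
}

# Usage as root, following the note above:
# set_sysctl_conf /etc/sysctl.conf kernel.randomize_va_space 0
# set_sysctl_conf /etc/sysctl.conf kernel.exec-shield 0
```

Back up the file before running something like this on a production server. As the note says, a reboot is still needed afterwards for kernel.exec-shield.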

There may be other reasons for a process failing to start. However, by switching ASLR off, you can quickly discount ASLR being the problem. More and more issues are being identified when ASLR is in operation.






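By setting these two kernel parameters I turned off the Linux feature called ASLR and then rebooted the server. The system is now back to normal, though it still needs to be watched.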



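Copyright notice: This is an original article by Hank_dai, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.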