生产分布式事务的处理

  • Post author:
  • Post category:其他


刚刚在生产的主数据库报了错误

DISTRIB TRAN 41544f4d.3137322E31362E33302E3233332E746D32333333393030343535
  is local tran 4645.21.804978 (hex=1225.15.c4872)
  insert pending prepared tran, scn=6069352057191 (hex=585.2192c967)
Thu Mar 05 14:21:02 2015
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist
Errors in file /u02/app/oracle/diag/rdbms/billdb/billdb/trace/billdb_reco_4653788.trc:
ORA-24756: transaction does not exist

怎么处理分布式事务的问题

分布式事务,简单来说,是指一个事务在本地和远程执行,本地需要等待确认远程的事务结束后,进行下一步本地的操作。如通过dblink update远程数据库的一行记录,如果在执行过程中网络异常,或者其他事件导致本地数据库无法得知远程数据库的执行情况,此时就会发生in doublt的报错。此时需要dba介入,且需要分多种情况进行处理。
分布式事务在commit提交时候,会经历3个阶段:
1.PREPARE PHASE:
1.1 决定哪个数据库为commit point site。(注,参数文件中commit_point_strength值高的那个数据库为commit point site)
1.2 全局协调者(Global Coordinator)要求所有的点(除commit point site外)做好commit或者rollback的准备。此时,对分布式事务的表加锁。
1.3 所有分布式事务的节点将它的scn告知全局协调者。
1.4 全局协调者取各个点的最大的scn作为分布式事务的scn。(eygle在这篇文章中也测试过)
至此,所有的点都完成了准备工作,我们开始进入COMMIT PHASE阶段,此时除commit point site点外所有点的事务均为in doubt状态,直到COMMIT PHASE阶段结束。
如果数据库在此阶段出现问题,我们查询(假设远程数据库为commit point site,且本地数据库为Global Coordinator):
本地:
select local_tran_id,state from dba_2pc_pending;
LOCAL_TRAN_ID          STATE
---------------------- ----------------
2.12.64845              collecting
远程:
select local_tran_id,state from dba_2pc_pending;
no rows selected
即表示本地数据库要求其他点做好commit或者rollback准备,现在正在“收集”其他点的数据库的返回信息,但是远程数据库未知状态(in doubt)。我们需要将本地的Global Coordinator的状态清除掉:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
或者我们在查询的时候发现是如下的状态:
本地:
select local_tran_id,state from dba_2pc_pending;
LOCAL_TRAN_ID          STATE
---------------------- ----------------
2.12.64845             prepared
远程:
select local_tran_id,state from dba_2pc_pending;
no rows selected
即表示本地Global Coordinator已经做好准备,已经将分布式锁放到各个事务的表上,但是远程数据库的状态再次未知(in doubt),我们需要手工的将本地的transaction rollback掉,并且清除分布式事务信息:
本地:
rollback force 'local_tran_id';
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
2.COMMIT PHASE:
2.1 Global Coordinator将最大scn传到commit point site,要求其commit。
2.2 commit point尝试commit或者rollback。分布式事务锁释放。
2.3 commit point通知Global Coordinator已经commit。
2.4 Global Coordinator通知分布式事务的所有点进行commit。
如果数据库在此阶段出现问题,我们查询
本地:
select local_tran_id,state from dba_2pc_pending;
LOCAL_TRAN_ID          STATE
---------------------- ----------------
2.12.64845             prepared
远程:
select local_tran_id,state from dba_2pc_pending;
LOCAL_TRAN_ID          STATE
---------------------- ----------------
1.92.66874             commited
即远程数据库可能已经commit,但是本地Global Coordinator未知远程数据库的状态,还是处于prepare的状态。我们需要在如下处理:
本地:
commit force 'local_tran_id';
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
远程:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
或者我们在查询的时候发现是如下的状态:
本地:
select local_tran_id,state from dba_2pc_pending;
LOCAL_TRAN_ID          STATE
---------------------- ----------------
2.12.64845            commited
远程:
select local_tran_id,state from dba_2pc_pending;
LOCAL_TRAN_ID          STATE
---------------------- ----------------
1.92.66874             commited
即远程数据库和本地数据库均已经完成commit,但是分布式事务的信息尚未清除,我们需要在本地和远程运行:
本地:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
远程:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
3.FORGET PHASE:
3.1 参与的点通知commit point site他们已经完成commit,commit point site就能忘记(forget)这个事务。
3.2 commit point site在远程数据库上清除分布式事务信息。
3.3 commit point site通知Global Coordinator可以清除本地的分布式事务信息。
3.4 Global Coordinator清除分布式事务信息。
此时如果出现问题,我们查询:
本地:
select local_tran_id,state from dba_2pc_pending;
LOCAL_TRAN_ID          STATE
---------------------- ----------------
2.12.64845            commited
远程:
select local_tran_id,state from dba_2pc_pending;
no rows selected
即远程commit point site已经完成commit,通知Global Coordinator清除本地的分布式事务信息,但是Global Coordinator没有收到该信息。我们需要这样处理:
本地:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
综上,分布式事务的依次状态为:
综上,分布式事务的依次状态为:
phase       local_state    remote_state       action
----------- ---------- ------------------  -------------------
prepare     collecting       /              本地DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY
            prepared         /              本地rollback force后PURGE_LOST_DB_ENTRY
 
commit      prepared        commited        本地commit force后本地和远程均PURGE
            commited        commited        本地和远程均PURGE_LOST_DB_ENTRY
            
forget      commited         /              本地PURGE_LOST_DB_ENTRY
另,当我们遇到使用rollback/commit force的时候,无法正常的清除分布式事务的信息,会报错ORA-02058: no prepared transaction found with ID X.XX.XXXXX时,我们需要通过手工方式来清除该信息。(注,以下方式修改数据字典,存在风险,使用前请备份好你的数据库)
情况1,在dba_2pc表中还有事务记录,但是实际已经不存在该事务了:
select local_tran_id, state from dba_2pc_pending;
 
       LOCAL_TRAN_ID          STATE
       ---------------------- ----------------
       1.92.66874             prepared
 
(注:'1.92.66874' 的结构为rbs#, slot#, wrap#,此事务在rollback segment #1)
我们再用如下语句找出使用rollback segment #1且状态是active的transaction:
  SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */
         KTUXESTA Status,
         KTUXECFL Flags
  FROM x$ktuxe
  WHERE ktuxesta!='INACTIVE'
        AND ktuxeusn= 1; <== 这是rollback segment#,即rbs#
 
  no rows selected
因此我们在rollback force的时候会报错:
ORA-02058: no prepared transaction found with ID 1.92.66874
我们需要如下处理:
set transaction use rollback segment SYSTEM;
 
delete from sys.pending_trans$
  where local_tran_id = '1.92.66874';
 
delete from sys.pending_sessions$ where local_tran_id = '1.92.66874';
 
delete from sys.pending_sub_sessions$ where local_tran_id = '1.92.66874';
 
commit;
情况2,这种情况比较少见,在dba_2pc表中无法查到分布式事务信息,但是实际上却是存在该分布式事务的:
我们在alertlog中可以看到:
ORA-1591: lock held by in-doubt distributed transaction 1.92.66874
我们查询dba_2pc的表,发现没有分布式事务信息:
select local_tran_id, state from dba_2pc_pending 
where local_tran_id='1.92.66874';
 
no rows selected
但是去查实际的rollback segment信息,却发现有prepared状态的分布式事务存在:
SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */
       KTUXESTA Status,
       KTUXECFL Flags
FROM x$ktuxe
WHERE ktuxesta!='INACTIVE'
      AND ktuxeusn= 1;
 
  KTUXEUSN   KTUXESLT   KTUXESQN STATUS           FLAGS
---------- ---------- ---------- ---------------- ------------------------
         1         92      66874 PREPARED         SCO|COL|REV|DEAD
我们无法做commit force或者rollback force:
rollback force '1.92.66874';
 
ORA-02058: no prepared transaction found with ID 1.92.66874
我们用如下的方式手工清理:
alter system disable distributed recovery;
 
    insert into pending_trans$ (
        LOCAL_TRAN_ID,
        GLOBAL_TRAN_FMT,
        GLOBAL_ORACLE_ID,
        STATE,
        STATUS,
        SESSION_VECTOR,
        RECO_VECTOR,
        TYPE#,
        FAIL_TIME,
        RECO_TIME)
    values( '1.92.66874', /* <== 此处为你的local tran id */
        306206,                  /*                                         */
        'XXXXXXX.12345.1.2.3',   /*  这些值不必更改,   */
        'prepared','P',          /*  是静态参数,可以直接   */
        hextoraw( '00000001' ),  /*  在这个sql中使用                             */
        hextoraw( '00000000' ),  /*                                         */
        0, sysdate, sysdate );
 
    insert into pending_sessions$ 
    values( '1.92.66874',/* <==此处为你的local tran id  */
        1, hextoraw('05004F003A1500000104'), 
        'C', 0, 30258592, '', 
        146
      );
 
    commit;
 
    commit force '1.92.66874';
    
    此时如果commit force还是出现报错,需要继续执行:
 
      delete from pending_trans$ where local_tran_id='1.92.66874';
      delete from pending_sessions$ where local_tran_id='1.92.66874';
      commit;
      alter system enable distributed recovery;
 
    此时如果没有报错,则执行以下语句:
 
      alter system enable distributed recovery;
 
      and purge the dummy entry from the dictionary, using
 
      connect / as sysdba
 
      alter session set "_smu_debug_mode" = 4; 
      (注:如果使用auto的undo管理方式,需要执行此步骤,此步骤能避免在后续执行purge_lost_db_entry出现ORA-01453 的报错,详细信息可见Bug 2191458)
 
      commit;  
 
      exec dbms_transaction.purge_lost_db_entry( '1.92.66874' )
 
 
Symptoms
While trying to commit or rollback a pending transaction getting error ORA-2058... 
Subsequently when trying to purge the pending transactions using the
procedure "dbms_transaction.purge_lost_db_entry" gives the following errors..

SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); 

BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); END;
*
ERROR at line 1:
ORA-30019: Illegal rollback Segment operation in Automatic Undo mode
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65

SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108');
BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108'); END;
*
ERROR at line 1:
ORA-06510: PL/SQL: unhandled user-defined exception
ORA-06512: at "SYS.DBMS_TRANSACTION", line 94


Cause
If the remote database no longer exists then the transaction will have to be 
purged from the list of pending distributed transactions.
The transaction to be deleted is in the Prepared State.
Solution
If the command causes an ORA-2058 error to occur, it means that the remote 
database cannot be accessed. In this case, check whether the database link to 
the remote database exists and whether the remote database is shutdown. 

If the remote database no longer exists then the transaction will have to be 
purged from the list of pending distributed transactions.

Follow the instructions on how to purge a 
distributed transaction from the database.
=================================
If the remote database cannot be accessed, a failed distributed transaction 
cannot be committed or rolled back and so must be purged from the list of 
pending transactions.

1. Identify the id of the transaction:

SQL> SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID FROM DBA_2PC_PENDING;

2. Purge the transaction:

SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('');
SQL> COMMIT;

3. Confirm that the transaction has been purged:

SQL> SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID FROM DBA_2PC_PENDING;


Step 2:
=====
If you get the following errors while purging transactions using "dbms_transaction.purge_lost_db_entry" 


SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); 

BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); END;
*
ERROR at line 1:
ORA-30019: Illegal rollback Segment operation in Automatic Undo mode
ORA-06512: at "SYS.DBMS_TRANSACTION", line 65
ORA-06512: at "SYS.DBMS_TRANSACTION", line 85


Fix:
===
This problem is logged as 
Bug.2191458 (unpublished) UNABLE TO EXEC DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY WITH AUTO UNDO MANAGEMENT 
and is worked by development.

Use the following Workaround:

You have to use local_tran_id.....

Issue commit before alter system set "_smu_debug_mode" = 4;

Follow the steps,

SQL> commit;
SQL> alter session set "_smu_debug_mode" = 4;
SQL> commit;
SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
SQL> commit;

Step 3:
=====

When executing the following procedure(dbms_transaction.purge_lost_db_entry)
to delete entries from
dba_2pc_pending one encounters the following error:

SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108'); ==>For example..
BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108'); END;
*
ERROR at line 1:
ORA-06510: PL/SQL: unhandled user-defined exception
ORA-06512: at "SYS.DBMS_TRANSACTION", line 94


Fix:
===

The transaction to be deleted is in the prepared state and has to be either
force committed or force rollback

SQL> select LOCAL_TRAN_ID,STATE from dba_2pc_pending;

LOCAL_TRAN_ID STATE
---------------------- ----------------
37.16.108 prepared

SQL> rollback force '37.16.108'; ==>For example

Rollback complete.

SQL> select LOCAL_TRAN_ID,STATE from dba_2pc_pending;

LOCAL_TRAN_ID STATE
---------------------- ----------------
37.16.108 forced rollback

SQL> COMMIT;

SQL>alter system set "_smu_debug_mode" = 4;
Rollback complete.

SQL> exec dbms_transaction.purge_lost_db_entry('37.16.108'); ==>For example
SQL> COMMIT;

wo phase commit和分布式事务,two phase commit是为保证分布式事务的数据一致性而设计的,在Oracle事务中对两个以上的Oracle站点执行了操作就形成了分布式事务,一般情况下,Oracle会自动处理分布式事务,为了保证分布式事务的一致性,所有站点的修改操作要么全部提交,要么全部回滚,否则就会造成数据不一致,在分布式事务中,不同站点的角色不同,有global coordinator和commit point site,commit point site会首先执行提交,本地数据库作为global coordinator,Oracle初始化参数commit_point_strength决定了哪个站点成为commit point site,commit_point_strength值最高的站点会成为commit point site,two phase commit有如下3个阶段:

1.准备阶段(PREPARE PHASE),在这个阶段会处理如下这些事情:

·本地数据库Global Coordinator向其它数据库发出COMMIT通知

·比较所有数据库的SCN号,将最高的SCN号作为分布式事务的全局SCN号

·所有数据库写在线日志

·对分布式事务修改的表加分布锁,防止被读写

·各数据库向Global Coordinator发出已经准备好的通知

所有参与分布式事务的数据库必须经过上述准备,才能进入下一阶段

2.提交阶段(COMMIT PHASE),在这个阶段会处理如下这些事情:

·本地数据库Global Coordinator通知commit point site首先提交,commit point site提交后,释放其占有的资源,通知Global Coordinator完成提交

·本地数据库Global Coordinator通知其它数据库提交

·提交节点在日志中追加一条信息,表示分布式事务已经完成提交,并通知Global Coordinator,此时所有数据库的数据保持了一致性

3.注销阶段(FORGET PHASE)

·本地数据库Global Coordinator通知commit point site所有数据库已经完成提交

·commit point site清除分布式事务的记录和状态信息,并通知Global Coordinator

·Global Coordinator清除本地分布式事务的记录和状态信息

此时分布式事务的two phase commit全部完成

以上3个阶段,在任何一个阶段的前、中、后发生失败都会导致two phase commit失败,一般来说,Oracle后台进程REC0会自动恢复事务,只有在分布式事务锁住的对象急需被访问,锁住的回滚段阻止了其他事务的使用,网络故障或CRASH的数据库的恢复需要很长的时间等情况出现时,才需要使用人工操作的方式来维护分布式事务,而分布式事务失败的情况也比较复杂,需要针对不同的情形采取相应的措施,Oracle中提供了视图dba_2pc_pending和dba_2pc_neighbors供用户查看分布式事务的相关信息,通常two phase commit失败有如下5种情况:

1.prepare阶段没准备好就失败了,global coordinator正在等待各个站点返回已准备好的通知,处于collecting状态,dba_2pc_pending试图中的state列的值是collecting,各个站点什么都没发生,无需执行任何操作,在本地数据库执行exec dbms_transaction.purge_lost_db_entry(’transaction_id’)清除in_doubt状态的分布式事务记录就可以了

2.prepare阶段完成时发生失败,global coordinator处于prepared状态,已经加上分布锁,等待提交,远程commit point site已经自动回滚,此时需要在global coordinator上执行rollback force ‘transaction_id’,然后清除in_doubt状态的分布式事务记录

3.commit point site执行commit完成后其他站点尚未commit时发生失败,这时需要在其他站点执行commit force ‘transaction_id’,’commit#’,注意后面的commit#使用较高的commit#,可以查询各个站点的dba_2pc_pending试图的commit#列得到,然后清除in_doubt状态的分布式事务记录

4.所有站点commit成功,进入forget阶段时发生失败,数据已经一致,只需在各站点清除in_doubt状态的分布式事务记录就可以了

5.commit point site完成forget阶段,其他站点没完成forget,只需在其他站点清除in_doubt状态的分布式事务记录就可以了



版权声明:本文为u012422541原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。