
1.

Restored the datafiles and controlfile from backup tape, then tried to start up an instance and got these errors:

SVRMGR> startup nomount pfile=xxoo

PMON started with pid=2
LMON started with pid=3
LMD0 started with pid=4
LMD0: terminating instance due to error 1
Instance terminated by LMD0, pid = 790678

2. What is LMD?

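LMS (Global Cache Service Process) mainly manages access to data blocks within the cluster and transfers block images between the buffer caches of different instances. LMS handles database requests across the cluster and ensures that the image of a data block appears only once across the buffer caches of all instances. LMS coordinates data block access by passing messages between instances. When an instance requests a data block, that instance's LMD process issues a request for the block resource. The request goes only to the LMD process of the instance that masters the block, and the master instance's LMD process then has the LMD process of the instance currently using the block release the resource. At that point the LMS process of the instance holding the resource creates a consistent-read image of the block and ships it to the buffer cache of the requesting instance. LMS guarantees that only one instance can update a data block at any moment, and it keeps a record of the block image (including the status flag of the updated block).

When the failure occurred, the process state diagram showed the LMS processes in latch free waits (latch: object queue header operation and latch: cache buffers chains) and in a dead state (except LMS0). This means block transfer between nodes 1 and 2 had broken down. Because PMON and LMS were contending for the same latch (latch: object queue header operation), PMON could not get the latch, timed out after waiting 10 minutes, and so could not recover the failed processes.

LMD (Global Enqueue Service Daemon) mainly manages access to global enqueues and resources, updates the state of the corresponding enqueues, and handles resource requests from other instances. When the LMS process on node 2 ran into trouble, the enqueue state could not be updated; the log shows node 2 sitting in an idle wait event (ges remote message).

LMON (Global Enqueue Service Monitor) mainly monitors the global enqueues and global resources in the cluster, manages instances, handles exceptions, and performs recovery on the affected cluster enqueues. When LMON on node 1 detected that node 2 was abnormal, it sent a killmember command, and the LMD process on node 2 carried out the abort. We also noticed that at the time of the failure both CKPT and DBWR were in a dead state, held in pending I/O to guarantee consistency. The DBWR wait event was gcs drm freeze in enter server mode, which Oracle explains as follows (metalink doc:4755405.8):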
Under heavy load, DRM sync can take very long as LMON is waiting for LMS
processes to empty their send queues. This wait is unnecessary as
correctness is already guaranteed by the sync channel flush mechanism.
This shows up as sessions having large "gcs drm freeze in enter server mode" wait times. DRM takes a long time (more than 5 minutes).
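In other words, under heavy load, especially when consistent-read blocks have to be shipped between nodes on a large scale, LMON waits during DRM sync for the LMS processes to empty their send queues. While LMS is emptying its send queues, latch free waits can occur, which in turn cause GES problems, and when Oracle GES (GES comprises the LMON and LMD processes) detects a problem it aborts the instance (metalink:ID 9920699.8). Oracle also notes that DRM can take more than 5 minutes. Oracle identifies this behavior as bug:6960699
Workaround:
Disable DRM by setting the following hidden parameters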
_gc_affinity_time=0
_gc_undo_affinity=FALSE

3. I set up an Oracle 8.1.7.4 DB on a single host (not a cluster).
 It does not have _gc_undo_affinity=FALSE (that is only for Oracle 10g and up), and I found some documents that also mention a full disk, a memory leak, or just plain bad luck....

4. I checked the filesystem permissions again and found some wrong settings, e.g. /db1 had read permission but no write permission. So I ran chmod -R 774 /db1, and after that startup nomount succeeded.
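
Before changing modes, it can help to see exactly which entries are missing write permission. A minimal sketch, assuming a POSIX `find`; the function name `list_unwritable` is made up for this example, and /db1 is the path from step 4:

```shell
# Print every file or directory under $1 whose owner write bit (0200) is not set.
# list_unwritable is a hypothetical helper name, not an Oracle or AIX tool.
list_unwritable() {
    find "$1" ! -perm -200 -print
}
```

Running `list_unwritable /db1` before and after the chmod shows whether anything is still read-only. It only looks at the owner bit, so it assumes the Oracle software owner also owns the files.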

5. Then I hit another problem.
SVRMGR> startup nomount pfile=/oracle817/dbs/db1.ora
ORACLE instance started.
Total System Global Area                        574689172 bytes
Fixed Size                                          73620 bytes
Variable Size                                   246665216 bytes
Database Buffers                                327680000 bytes
Redo Buffers                                       270336 bytes
SVRMGR> alter database mount standby database;
alter database mount standby database
*
ORA-29702: error occurred in Cluster Group Service operation
SVRMGR> exit

6. (only for Oracle 10g)
ORA-29702:	error occurred in Cluster Group Service operation

Cause:	An unexpected error occurred while performing a CGS operation.

Action:	Verify that the LMON process is still active. Also, check the Oracle LMON trace files for errors. 
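
The "verify that LMON is still active" step can be done from the shell. A rough sketch, assuming the usual background process naming `ora_lmon_<SID>`; the helper name `has_lmon` is hypothetical:

```shell
# Read `ps -ef` output on stdin and succeed if an LMON background process
# (named ora_lmon_<SID>) is listed. has_lmon is a hypothetical helper name.
# The bracket expression keeps the grep command line itself from matching.
has_lmon() {
    grep -q 'ora_lmon_[A-Za-z0-9_]'
}
```

For example, `ps -ef | has_lmon && echo "LMON is running"`. If nothing is printed, the LMON trace files are the next place to look.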


--------------------another solution----------------------
# cause - cloned a RAC ORACLE_HOME to a server with no CRS

# solution

cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk rac_off
make -f ins_rdbms.mk ioracle
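
To see whether a home is linked with the RAC option before or after the relink, one commonly cited check is to list the members of libknlopt.a. The sketch below only classifies that listing. It assumes the widely reported convention that `kcsm.o` means RAC is linked in and `ksnkcs.o` means it is not; the helper name `rac_link_state` is hypothetical:

```shell
# Read archive member names (one per line) on stdin and report the RAC link option.
# Assumption: kcsm.o means RAC is linked in, ksnkcs.o means it is not.
# rac_link_state is a hypothetical helper name.
rac_link_state() {
    members=$(cat)
    case "$members" in
        *kcsm.o*)   echo on ;;
        *ksnkcs.o*) echo off ;;
        *)          echo unknown ;;
    esac
}
```

Typical use would be `ar -t $ORACLE_HOME/rdbms/lib/libknlopt.a | rac_link_state`, run once before `rac_off` and once after.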

7. The way to solve this problem in Oracle 8.1.7.4 is:
Just add the Oracle installation user to the hagsuser group.
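
A quick way to confirm the membership took effect is to check the output of `id -Gn`. This is only a sketch: the helper name `in_group` is hypothetical, and the user name `oracle` in the usage line is an assumption about how the installation user is named:

```shell
# Succeed if user $1 is a member of group $2, according to `id -Gn`.
# in_group is a hypothetical helper name.
in_group() {
    for g in $(id -Gn "$1"); do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}
```

For example, `in_group oracle hagsuser || echo "oracle is not in hagsuser"`. A process that was already running keeps its old group list, so the instance has to be started from a fresh login for the new group to apply.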

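8. Conclusion
Generally speaking, if you install AIX, set up the VG, LV, and FS, then install Oracle, and only add HACMP at the end, you will not run into this problem.
If you configure HACMP first and then install Oracle, you end up like this.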

    Posted by 哈哈小熊 on 痞客邦 (PIXNET)