Hello everyone,
I hope you are all doing well!
A few weeks ago, a client's DBA team informed me about a 2-node cluster where the second node was failing to start.
I began troubleshooting by trying to bring the cluster up:
[root@DBLN02 ~]# crsctl start crs -wait
Let's analyze the output of the cluster start:
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'dbln02'
CRS-2672: Attempting to start 'ora.evmd' on 'dbln02'
CRS-2676: Start of 'ora.evmd' on 'dbln02' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'dbln02'
CRS-2676: Start of 'ora.gpnpd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'dbln02'
CRS-2676: Start of 'ora.gipcd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'dbln02'
CRS-2672: Attempting to start 'ora.crf' on 'dbln02'
CRS-2676: Start of 'ora.cssdmonitor' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'dbln02'
CRS-2672: Attempting to start 'ora.diskmon' on 'dbln02'
CRS-2676: Start of 'ora.diskmon' on 'dbln02' succeeded
CRS-2676: Start of 'ora.crf' on 'dbln02' succeeded
CRS-2676: Start of 'ora.cssd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'dbln02'
CRS-2672: Attempting to start 'ora.ctssd' on 'dbln02'
CRS-2676: Start of 'ora.ctssd' on 'dbln02' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'dbln02'
CRS-2676: Start of 'ora.asm' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'dbln02'
CRS-2676: Start of 'ora.storage' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'dbln02'
CRS-2676: Start of 'ora.crsd' on 'dbln02' succeeded
CRS-2883: Resource 'ora.crsd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
CRS-4000: Command Start failed, or completed with errors.
As we can see, GI triggered a file permission check (CRS-41053), which usually indicates incorrect permissions somewhere under the Grid home.
The first thing I tried was unlocking and then locking the GRID_HOME again, as root:
[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -unlock
[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -lock
When I tried to start the cluster again, the error remained the same:
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
The next step was to check the GI alert.log on that node:
2023-08-03 01:04:04.340 [OHASD(21431)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 21431
2023-08-03 01:04:04.386 [OHASD(21431)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2023-08-03 01:04:04.400 [OHASD(21431)]CRS-2112: The OLR service started on node dbln02.
2023-08-03 01:04:04.638 [OHASD(21431)]CRS-8011: reboot advisory message from host: dbln02, component: cssmonit, with time stamp: L-2023-08-03-00:05:26.790
2023-08-03 01:04:04.638 [OHASD(21431)]CRS-1301: Oracle High Availability Service started on node dbln02.
2023-08-03 01:04:04.639 [OHASD(21431)]CRS-8013: reboot advisory message text: Rebooting node due to connection problems with CSS
2023-08-03 01:04:04.641 [OHASD(21431)]CRS-8011: reboot advisory message from host: dbln02, component: cssagent, with time stamp: L-2023-08-03-00:05:26.791
2023-08-03 01:04:04.642 [OHASD(21431)]CRS-8013: reboot advisory message text: Rebooting node due to connection problems with CSS
2023-08-03 01:04:04.642 [OHASD(21431)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
2023-08-03 01:04:05.187 [CSSDMONITOR(21570)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 21570
2023-08-03 01:04:05.199 [CSSDAGENT(21564)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 21564
2023-08-03 01:04:05.199 [ORAROOTAGENT(21549)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 21549
2023-08-03 01:04:05.282 [ORAAGENT(21559)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 21559
2023-08-03 01:04:06.017 [ORAAGENT(21661)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 21661
2023-08-03 01:04:06.156 [MDNSD(21683)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 21683
2023-08-03 01:04:06.164 [EVMD(21685)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 21685
2023-08-03 01:04:07.197 [GPNPD(21719)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 21719
2023-08-03 01:04:08.268 [GPNPD(21719)]CRS-2328: Grid Plug and Play Daemon(GPNPD) started on node dbln02.
2023-08-03 01:04:08.247 [GIPCD(21790)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 21790
2023-08-03 01:04:09.197 [GIPCD(21790)]CRS-7517: The Oracle Grid Interprocess Communication (GIPC) failed to identify the Fast Node Death Detection (FNDD).
2023-08-03 01:04:12.449 [OSYSMOND(21826)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 21826
2023-08-03 01:04:12.433 [CSSDMONITOR(21824)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 21824
2023-08-03 01:04:12.685 [CSSDAGENT(21861)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 21861
2023-08-03 01:04:12.935 [OCSSD(21882)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 21882
2023-08-03 01:04:13.999 [OCSSD(21882)]CRS-1713: CSSD daemon is started in hub mode
2023-08-03 01:04:15.327 [OCSSD(21882)]CRS-1707: Lease acquisition for node dbln02 number 2 completed
2023-08-03 01:04:16.464 [OCSSD(21882)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc
2023-08-03 01:04:16.465 [OCSSD(21882)]CRS-1617: The information required to do node kill for node dbln02 is incomplete; details at (:CSSNM00004:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc
2023-08-03 01:04:16.470 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_003; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:16.477 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_002; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:16.484 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_001; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:18.023 [OCSSD(21882)]CRS-1601: CSSD Reconfiguration complete. Active nodes are dbln01 dbln02 .
2023-08-03 01:04:19.947 [OCTSSD(22102)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 22102
2023-08-03 01:04:19.978 [OCSSD(21882)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2023-08-03 01:04:20.924 [OCTSSD(22102)]CRS-2403: The Cluster Time Synchronization Service on host dbln02 is in observer mode.
2023-08-03 01:04:22.216 [OCTSSD(22102)]CRS-2407: The new Cluster Time Synchronization Service reference node is host dbln01.
2023-08-03 01:04:22.218 [OCTSSD(22102)]CRS-2401: The Cluster Time Synchronization Service started on host dbln02.
2023-08-03 01:04:29.836 [OLOGGERD(22190)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 22190
2023-08-03 01:04:29.885 [CRSD(22192)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22192
2023-08-03T01:04:32.764516-04:00 Errors in file /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc (incident=1505): CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2023-08-03 01:04:32.743 [CRSD(22192)]CRS-1019: The Oracle Cluster Registry (OCR) Service exited on host dbln02. Details at (:PROCR00016:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc. Incident details in: /oracle/app/grid/diag/crs/dbln02/crs/incident/incdir_1505/crsd_i1505.trc
Pay attention to the last messages:
2023-08-03 01:04:29.885 [CRSD(22192)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22192
2023-08-03T01:04:32.764516-04:00 Errors in file /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc (incident=1505): CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2023-08-03 01:04:32.743 [CRSD(22192)]CRS-1019: The Oracle Cluster Registry (OCR) Service exited on host dbln02. Details at (:PROCR00016:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc. Incident details in: /oracle/app/grid/diag/crs/dbln02/crs/incident/incdir_1505/crsd_i1505.trc
The alert.log gave no clear reason for the error, so let's analyze crsd.trc (output truncated for readability):
2023-08-03 01:04:32.743 : OCRMAS:2992596736: th_hub_verify_master_pubdata: Shutdown CacheLocal. Patch Levels don't match. Local Patch Level [3360129916] != Cache Writer Patch Level [2696908285]
Now we have something clearer:
Patch Levels don't match. Local Patch Level [3360129916] != Cache Writer Patch Level [2696908285]
Let's compare the patch level between the two cluster nodes.
On the node where the cluster is running normally:
[root@DBLN01 ~]$ /oracle/app/193/grid/bin/kfod op=patches
---------------
List of Patches
===============
29517242
29517247
29585399
33575402
35042068
35050325
35050331
35107512
[root@DBLN01 ~]$ /oracle/app/193/grid/bin/kfod op=patchlvl
-------------------
Current Patch level
===================
2696908285
Now let's check the second node, where the cluster does not start:
[root@DBLN02 ~]# /oracle/app/193/grid/bin/kfod op=patches
---------------
List of Patches
===============
29517242
29517247
29585399
33575402
34419443
34428761
34444834
34580338
[root@DBLN02 ~]# /oracle/app/193/grid/bin/kfod op=patchlvl
-------------------
Current Patch level
===================
3360129916
Confirmed!
Patch level on node 1: 2696908285
Patch level on node 2: 3360129916
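The two kfod outputs can also be diffed mechanically. A minimal sketch, assuming bash (the patch IDs below are copied from the outputs above; in a real check you would capture them with kfod op=patches on each node):

```shell
#!/usr/bin/env bash
# Patch lists as reported by "kfod op=patches" on each node (copied from above)
node1="29517242 29517247 29585399 33575402 35042068 35050325 35050331 35107512"
node2="29517242 29517247 29585399 33575402 34419443 34428761 34444834 34580338"

# comm expects sorted input, one item per line; -23 keeps lines unique to node1
echo "Patches present on node 1 but missing on node 2:"
comm -23 <(tr ' ' '\n' <<<"$node1" | sort) \
         <(tr ' ' '\n' <<<"$node2" | sort)
```

On this cluster the diff comes out to 35042068, 35050325, 35050331 and 35107512: exactly the 19.19 sub-patches missing from node 2.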
When I asked the DBA team, they said they had applied Release Update 19.19 on the first node and had not noticed that the second node was down (a storage error had occurred earlier). So the very act of applying RU 19.19 created the patch-level difference between the nodes.
The first step is to align the patches between the servers. First, let's unlock the cluster:
[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -unlock
Now let's apply all the patches that are part of the GI Release Update 19.19, as the grid user:
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35042068
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35050331
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35050325
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35107512
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/33575402
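The five opatch calls above differ only in the sub-patch ID, so they can be driven from a loop. A hedged dry-run sketch (paths match the ones used above; the echo stands in for the real opatch binary and the failure branch only matters once it is swapped in):

```shell
#!/usr/bin/env bash
# Dry-run sketch of the patch loop; replace "echo opatch" with the real
# $GRID_HOME/OPatch/opatch once the patch list has been validated.
GRID_HOME=/oracle/app/193/grid
RU_DIR=/sapcd/19C/AprilDBPatches/35037840

for patch in 35042068 35050331 35050325 35107512 33575402; do
  echo opatch apply -silent -oh "$GRID_HOME" "$RU_DIR/$patch" || {
    echo "patch $patch failed, stopping" >&2
    exit 1
  }
done
```

Stopping on the first failure matters here: applying later sub-patches on top of a failed one would leave the home in an inconsistent state.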
The next step is to apply the RU 19.19 patches to the DB Home, as the oracle user:
[oracle@DBLN02 ~]$ opatch apply -silent -oh /oracle/DB1/193 /sapcd/19C/AprilDBPatches/35037840/35042068
[oracle@DBLN02 ~]$ opatch apply -silent -oh /oracle/DB1/193 /sapcd/19C/AprilDBPatches/35037840/35050331
Then, as root, recreate the OLR keys so the patch level recorded for the local node is updated:
[root@DBLN02 install]# clscfg -localpatch
After that, we were able to start the cluster normally on node 2!
I hope this is useful!
Best regards,
Vinicius
Disclaimer
My posts reflect my own opinions and do not necessarily represent the opinions of my employer, Accenture.