Hi everyone,

I hope you are all doing well!

A few weeks ago, a client's DBA team told me about a two-node cluster whose second node was not starting successfully.

I began troubleshooting by trying to bring the cluster up:

[root@DBLN02 ~]# crsctl start crs -wait

Let's analyze the output of the cluster start:

CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'dbln02'
CRS-2672: Attempting to start 'ora.evmd' on 'dbln02'
CRS-2676: Start of 'ora.evmd' on 'dbln02' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'dbln02'
CRS-2676: Start of 'ora.gpnpd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'dbln02'
CRS-2676: Start of 'ora.gipcd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'dbln02'
CRS-2672: Attempting to start 'ora.crf' on 'dbln02'
CRS-2676: Start of 'ora.cssdmonitor' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'dbln02'
CRS-2672: Attempting to start 'ora.diskmon' on 'dbln02'
CRS-2676: Start of 'ora.diskmon' on 'dbln02' succeeded
CRS-2676: Start of 'ora.crf' on 'dbln02' succeeded
CRS-2676: Start of 'ora.cssd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'dbln02'
CRS-2672: Attempting to start 'ora.ctssd' on 'dbln02'
CRS-2676: Start of 'ora.ctssd' on 'dbln02' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'dbln02'
CRS-2676: Start of 'ora.asm' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'dbln02'
CRS-2676: Start of 'ora.storage' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'dbln02'
CRS-2676: Start of 'ora.crsd' on 'dbln02' succeeded
CRS-2883: Resource 'ora.crsd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
CRS-4000: Command Start failed, or completed with errors.

As we can see, GI reported a file permission problem (CRS-41053), which usually means the ownership or permissions of files under the Grid home are incorrect.
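
When CRS-41053 shows up, it can be worth validating the software file attributes directly before changing anything. One way to do that (a sketch using this environment's paths) is cluvfy, which checks owner, group, and permissions of the files in the Grid home:

[grid@DBLN02 ~]$ /oracle/app/193/grid/bin/cluvfy comp software -n dbln02 -verbose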

The first thing I tried was to unlock and then lock the GRID_HOME again, as root:

[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -unlock

[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -lock

When I tried to start the cluster again, the error remained the same:

CRS-41053: checking Oracle Grid Infrastructure for file permission issues

The next step was to check the GI alert.log on this node:

2023-08-03 01:04:04.340 [OHASD(21431)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 21431
2023-08-03 01:04:04.386 [OHASD(21431)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2023-08-03 01:04:04.400 [OHASD(21431)]CRS-2112: The OLR service started on node dbln02.
2023-08-03 01:04:04.638 [OHASD(21431)]CRS-8011: reboot advisory message from host: dbln02, component: cssmonit, with time stamp: L-2023-08-03-00:05:26.790
2023-08-03 01:04:04.638 [OHASD(21431)]CRS-1301: Oracle High Availability Service started on node dbln02.
2023-08-03 01:04:04.639 [OHASD(21431)]CRS-8013: reboot advisory message text: Rebooting node due to connection problems with CSS
2023-08-03 01:04:04.641 [OHASD(21431)]CRS-8011: reboot advisory message from host: dbln02, component: cssagent, with time stamp: L-2023-08-03-00:05:26.791
2023-08-03 01:04:04.642 [OHASD(21431)]CRS-8013: reboot advisory message text: Rebooting node due to connection problems with CSS
2023-08-03 01:04:04.642 [OHASD(21431)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
2023-08-03 01:04:05.187 [CSSDMONITOR(21570)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 21570
2023-08-03 01:04:05.199 [CSSDAGENT(21564)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 21564
2023-08-03 01:04:05.199 [ORAROOTAGENT(21549)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 21549
2023-08-03 01:04:05.282 [ORAAGENT(21559)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 21559
2023-08-03 01:04:06.017 [ORAAGENT(21661)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 21661
2023-08-03 01:04:06.156 [MDNSD(21683)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 21683
2023-08-03 01:04:06.164 [EVMD(21685)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 21685
2023-08-03 01:04:07.197 [GPNPD(21719)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 21719
2023-08-03 01:04:08.268 [GPNPD(21719)]CRS-2328: Grid Plug and Play Daemon(GPNPD) started on node dbln02.
2023-08-03 01:04:08.247 [GIPCD(21790)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 21790
2023-08-03 01:04:09.197 [GIPCD(21790)]CRS-7517: The Oracle Grid Interprocess Communication (GIPC) failed to identify the Fast Node Death Detection (FNDD).
2023-08-03 01:04:12.449 [OSYSMOND(21826)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 21826
2023-08-03 01:04:12.433 [CSSDMONITOR(21824)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 21824
2023-08-03 01:04:12.685 [CSSDAGENT(21861)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 21861
2023-08-03 01:04:12.935 [OCSSD(21882)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 21882
2023-08-03 01:04:13.999 [OCSSD(21882)]CRS-1713: CSSD daemon is started in hub mode
2023-08-03 01:04:15.327 [OCSSD(21882)]CRS-1707: Lease acquisition for node dbln02 number 2 completed
2023-08-03 01:04:16.464 [OCSSD(21882)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc
2023-08-03 01:04:16.465 [OCSSD(21882)]CRS-1617: The information required to do node kill for node dbln02 is incomplete; details at (:CSSNM00004:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc
2023-08-03 01:04:16.470 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_003; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:16.477 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_002; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:16.484 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_001; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:18.023 [OCSSD(21882)]CRS-1601: CSSD Reconfiguration complete. Active nodes are dbln01 dbln02 .
2023-08-03 01:04:19.947 [OCTSSD(22102)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 22102
2023-08-03 01:04:19.978 [OCSSD(21882)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2023-08-03 01:04:20.924 [OCTSSD(22102)]CRS-2403: The Cluster Time Synchronization Service on host dbln02 is in observer mode.
2023-08-03 01:04:22.216 [OCTSSD(22102)]CRS-2407: The new Cluster Time Synchronization Service reference node is host dbln01.
2023-08-03 01:04:22.218 [OCTSSD(22102)]CRS-2401: The Cluster Time Synchronization Service started on host dbln02.
2023-08-03 01:04:29.836 [OLOGGERD(22190)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 22190
2023-08-03 01:04:29.885 [CRSD(22192)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22192
2023-08-03T01:04:32.764516-04:00
Errors in file /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc  (incident=1505):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2023-08-03 01:04:32.743 [CRSD(22192)]CRS-1019: The Oracle Cluster Registry (OCR) Service exited on host dbln02. Details at (:PROCR00016:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc.
Incident details in: /oracle/app/grid/diag/crs/dbln02/crs/incident/incdir_1505/crsd_i1505.trc

Note the last messages:

2023-08-03 01:04:29.885 [CRSD(22192)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22192
2023-08-03T01:04:32.764516-04:00
Errors in file /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc  (incident=1505):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2023-08-03 01:04:32.743 [CRSD(22192)]CRS-1019: The Oracle Cluster Registry (OCR) Service exited on host dbln02. Details at (:PROCR00016:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc.
Incident details in: /oracle/app/grid/diag/crs/dbln02/crs/incident/incdir_1505/crsd_i1505.trc

The alert.log gave no clear reason for the error, so let's analyze crsd.trc (output truncated for readability):

2023-08-03 01:04:32.743 :  OCRMAS:2992596736: th_hub_verify_master_pubdata: Shutdown CacheLocal. Patch Levels don't match. Local Patch Level [3360129916] != Cache Writer Patch Level [2696908285]

Now we have something much clearer:

Patch Levels don't match. Local Patch Level [3360129916] != Cache Writer Patch Level [2696908285]

Let's compare the patch level between the two cluster nodes.

On the node where the cluster is running normally:

[root@DBLN01 ~]# /oracle/app/193/grid/bin/kfod op=patches

---------------
List of Patches
===============
29517242
29517247
29585399
33575402
35042068
35050325
35050331
35107512

[root@DBLN01 ~]# /oracle/app/193/grid/bin/kfod op=patchlvl

-------------------
Current Patch level
===================
2696908285

Now let's check the second node, where the cluster does not start:

[root@DBLN02 ~]# /oracle/app/193/grid/bin/kfod op=patches

---------------
List of Patches
===============
29517242
29517247
29585399
33575402
34419443
34428761
34444834
34580338

[root@DBLN02 ~]# /oracle/app/193/grid/bin/kfod op=patchlvl

-------------------
Current Patch level
===================
3360129916

Confirmed!

Patch level on node 1: 2696908285

Patch level on node 2: 3360129916

When I asked the DBA team about it, they said they had applied Release Update 19.19 on the first node and had not noticed that the second node was down (a storage error had occurred earlier). So applying RU 19.19 on only one node was exactly what created the patch level difference between the nodes.
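
Besides kfod, the same comparison can be made with crsctl, which reports the software patch level per node; a quick sketch (run from either node):

[root@DBLN02 ~]# /oracle/app/193/grid/bin/crsctl query crs softwarepatch dbln01

[root@DBLN02 ~]# /oracle/app/193/grid/bin/crsctl query crs softwarepatch dbln02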

The first step is to align the patches between the servers. First, let's unlock the cluster:

[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -unlock

Now let's apply all the patches that are part of the GI Release Update 19.19, as the grid user:

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35042068

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35050331

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35050325

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35107512

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/33575402
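
Before moving on, it is worth confirming that the Grid home inventory on node 2 now matches node 1; opatch can list the installed patches, and the output should match the kfod patch list we saw on node 1:

[grid@DBLN02 ~]$ opatch lspatches -oh /oracle/app/193/grid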

The next step is to apply the RU 19.19 patches to the DB home, as the oracle user:

[oracle@DBLN02 ~]$ opatch apply -silent -oh /oracle/DB1/193 /sapcd/19C/AprilDBPatches/35037840/35042068

[oracle@DBLN02 ~]$ opatch apply -silent -oh /oracle/DB1/193 /sapcd/19C/AprilDBPatches/35037840/35050331

Then, recreate the OLR keys:

[root@DBLN02 install]# clscfg -localpatch
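
Since the Grid home was unlocked at the beginning of the fix, it needs to be locked again before the stack is restarted; the closing steps (not shown in the session above) would look like this:

[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -lock

[root@DBLN02 ~]# crsctl start crs -wait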

After that, the cluster started normally on node 2!
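
To confirm that both nodes are healthy, a final check such as this one can be run from either node:

[root@DBLN02 ~]# /oracle/app/193/grid/bin/crsctl check cluster -all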

I hope you find this useful!

Best regards,

Vinicius