Hi everyone,

Hope you’re doing well!

A few weeks ago, the DBA team of a client reached out to me and told me they had a 2-node cluster where the second node was not running as expected.

I started troubleshooting the cluster:

[root@DBLN02 ~]# crsctl start crs -wait

Let’s check the output of the above command:

CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'dbln02'
CRS-2672: Attempting to start 'ora.evmd' on 'dbln02'
CRS-2676: Start of 'ora.evmd' on 'dbln02' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'dbln02'
CRS-2676: Start of 'ora.gpnpd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'dbln02'
CRS-2676: Start of 'ora.gipcd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'dbln02'
CRS-2672: Attempting to start 'ora.crf' on 'dbln02'
CRS-2676: Start of 'ora.cssdmonitor' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'dbln02'
CRS-2672: Attempting to start 'ora.diskmon' on 'dbln02'
CRS-2676: Start of 'ora.diskmon' on 'dbln02' succeeded
CRS-2676: Start of 'ora.crf' on 'dbln02' succeeded
CRS-2676: Start of 'ora.cssd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'dbln02'
CRS-2672: Attempting to start 'ora.ctssd' on 'dbln02'
CRS-2676: Start of 'ora.ctssd' on 'dbln02' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'dbln02'
CRS-2676: Start of 'ora.asm' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'dbln02'
CRS-2676: Start of 'ora.storage' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'dbln02'
CRS-2676: Start of 'ora.crsd' on 'dbln02' succeeded
CRS-2883: Resource 'ora.crsd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
CRS-4000: Command Start failed, or completed with errors.

As we can see, GI is complaining about file permission issues (CRS-41053), which usually means the ownership or permissions of some files under the Grid home are wrong.
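If you want to cross-check the ownership and permissions of the GI software, cluvfy can do that for you. This is just a sketch of one possible check (run as the grid user against the failing node of this environment), not something taken from the original session:

[grid@DBLN02 ~]$ /oracle/app/193/grid/bin/cluvfy comp software -n dbln02 -verbose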

So, the first thing to do is to unlock and then lock the GRID_HOME again, as the root user (this re-applies the expected ownership and permissions on the files under the Grid home):

[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -unlock

[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -lock
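With the Grid home locked again, we retried the start with the same command as before:

[root@DBLN02 ~]# crsctl start crs -wait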

Unfortunately, the error persisted with the same message:

CRS-41053: checking Oracle Grid Infrastructure for file permission issues

The next step was to check the GI alert.log on the node that was failing.
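In 19c, the GI alert log sits under the ADR trace directory; the path below is inferred from the trace file locations reported in the log entries that follow (a sketch of how to tail it):

[grid@DBLN02 ~]$ tail -100 /oracle/app/grid/diag/crs/dbln02/crs/trace/alert.log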

2023-08-03 01:04:04.340 [OHASD(21431)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 21431
2023-08-03 01:04:04.386 [OHASD(21431)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2023-08-03 01:04:04.400 [OHASD(21431)]CRS-2112: The OLR service started on node dbln02.
2023-08-03 01:04:04.638 [OHASD(21431)]CRS-8011: reboot advisory message from host: dbln02, component: cssmonit, with time stamp: L-2023-08-03-00:05:26.790
2023-08-03 01:04:04.638 [OHASD(21431)]CRS-1301: Oracle High Availability Service started on node dbln02.
2023-08-03 01:04:04.639 [OHASD(21431)]CRS-8013: reboot advisory message text: Rebooting node due to connection problems with CSS
2023-08-03 01:04:04.641 [OHASD(21431)]CRS-8011: reboot advisory message from host: dbln02, component: cssagent, with time stamp: L-2023-08-03-00:05:26.791
2023-08-03 01:04:04.642 [OHASD(21431)]CRS-8013: reboot advisory message text: Rebooting node due to connection problems with CSS
2023-08-03 01:04:04.642 [OHASD(21431)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
2023-08-03 01:04:05.187 [CSSDMONITOR(21570)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 21570
2023-08-03 01:04:05.199 [CSSDAGENT(21564)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 21564
2023-08-03 01:04:05.199 [ORAROOTAGENT(21549)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 21549
2023-08-03 01:04:05.282 [ORAAGENT(21559)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 21559
2023-08-03 01:04:06.017 [ORAAGENT(21661)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 21661
2023-08-03 01:04:06.156 [MDNSD(21683)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 21683
2023-08-03 01:04:06.164 [EVMD(21685)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 21685
2023-08-03 01:04:07.197 [GPNPD(21719)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 21719
2023-08-03 01:04:08.268 [GPNPD(21719)]CRS-2328: Grid Plug and Play Daemon(GPNPD) started on node dbln02.
2023-08-03 01:04:08.247 [GIPCD(21790)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 21790
2023-08-03 01:04:09.197 [GIPCD(21790)]CRS-7517: The Oracle Grid Interprocess Communication (GIPC) failed to identify the Fast Node Death Detection (FNDD).
2023-08-03 01:04:12.449 [OSYSMOND(21826)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 21826
2023-08-03 01:04:12.433 [CSSDMONITOR(21824)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 21824
2023-08-03 01:04:12.685 [CSSDAGENT(21861)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 21861
2023-08-03 01:04:12.935 [OCSSD(21882)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 21882
2023-08-03 01:04:13.999 [OCSSD(21882)]CRS-1713: CSSD daemon is started in hub mode
2023-08-03 01:04:15.327 [OCSSD(21882)]CRS-1707: Lease acquisition for node dbln02 number 2 completed
2023-08-03 01:04:16.464 [OCSSD(21882)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc
2023-08-03 01:04:16.465 [OCSSD(21882)]CRS-1617: The information required to do node kill for node dbln02 is incomplete; details at (:CSSNM00004:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc
2023-08-03 01:04:16.470 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_003; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:16.477 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_002; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:16.484 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_001; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:18.023 [OCSSD(21882)]CRS-1601: CSSD Reconfiguration complete. Active nodes are dbln01 dbln02 .
2023-08-03 01:04:19.947 [OCTSSD(22102)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 22102
2023-08-03 01:04:19.978 [OCSSD(21882)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2023-08-03 01:04:20.924 [OCTSSD(22102)]CRS-2403: The Cluster Time Synchronization Service on host dbln02 is in observer mode.
2023-08-03 01:04:22.216 [OCTSSD(22102)]CRS-2407: The new Cluster Time Synchronization Service reference node is host dbln01.
2023-08-03 01:04:22.218 [OCTSSD(22102)]CRS-2401: The Cluster Time Synchronization Service started on host dbln02.
2023-08-03 01:04:29.836 [OLOGGERD(22190)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 22190
2023-08-03 01:04:29.885 [CRSD(22192)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22192
2023-08-03T01:04:32.764516-04:00
Errors in file /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc  (incident=1505):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2023-08-03 01:04:32.743 [CRSD(22192)]CRS-1019: The Oracle Cluster Registry (OCR) Service exited on host dbln02. Details at (:PROCR00016:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc.
Incident details in: /oracle/app/grid/diag/crs/dbln02/crs/incident/incdir_1505/crsd_i1505.trc

The last messages are the ones to pay attention to:

2023-08-03 01:04:29.885 [CRSD(22192)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22192
2023-08-03T01:04:32.764516-04:00
Errors in file /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc  (incident=1505):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2023-08-03 01:04:32.743 [CRSD(22192)]CRS-1019: The Oracle Cluster Registry (OCR) Service exited on host dbln02. Details at (:PROCR00016:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc.
Incident details in: /oracle/app/grid/diag/crs/dbln02/crs/incident/incdir_1505/crsd_i1505.trc

So, nothing here clearly explains why the cluster was not starting. Let’s check the crsd.trc file (output truncated for readability).
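One way to pull out the relevant entry is a simple grep; the trace path comes from the alert log above, and the keyword is just an educated guess (a sketch):

[grid@DBLN02 ~]$ grep -i "patch level" /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc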

2023-08-03 01:04:32.743 :  OCRMAS:2992596736: th_hub_verify_master_pubdata: Shutdown CacheLocal. Patch Levels don't match. Local Patch Level [3360129916] != Cache Writer Patch Level [2696908285]

OK, now it is clear:

Patch Levels don’t match. Local Patch Level [3360129916] != Cache Writer Patch Level [2696908285]

Let’s compare the patch level between the two nodes of the cluster:

On the node where the cluster is running fine:

[root@DBLN01 ~]$ /oracle/app/193/grid/bin/kfod op=patches

---------------
List of Patches
===============
29517242
29517247
29585399
33575402
35042068
35050325
35050331
35107512

[root@DBLN01 ~]$ /oracle/app/193/grid/bin/kfod op=patchlvl

-------------------
Current Patch level
===================
2696908285

Now let’s check the second node, where the cluster is not starting:

[root@DBLN02 ~]# /oracle/app/193/grid/bin/kfod op=patches

---------------
List of Patches
===============
29517242
29517247
29585399
33575402
34419443
34428761
34444834
34580338

[root@DBLN02 ~]# /oracle/app/193/grid/bin/kfod op=patchlvl

-------------------
Current Patch level
===================
3360129916

Confirmed!

Patch level node 1: 2696908285

Patch level node 2: 3360129916
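As a cross-check, the patch level can also be queried through crsctl. A sketch: the first command, run on node 1 where the stack is up, reports the cluster-wide active patch level, and the second reports the local software patch level on node 2 (it should work even with the stack down, but your mileage may vary):

[root@DBLN01 ~]# /oracle/app/193/grid/bin/crsctl query crs activeversion -f

[root@DBLN02 ~]# /oracle/app/193/grid/bin/crsctl query crs softwarepatch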

Then I asked the DBA team, and they told me they had applied Release Update 19.19 on the first node without noticing that the second node was not running (a previous storage error had brought the cluster down on that node). So, applying RU 19.19 on only one node is what caused the patch level mismatch.

The first step is to “align” the patches between the nodes. So, let’s start by unlocking the Grid home on node 2, as root:

[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -unlock
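Comparing the two kfod listings above already shows which 19.19 GI RU components are missing on node 2; the same comparison can be made from the OPatch inventory of each Grid home (a sketch, using the paths of this environment):

[grid@DBLN01 ~]$ opatch lspatches -oh /oracle/app/193/grid

[grid@DBLN02 ~]$ opatch lspatches -oh /oracle/app/193/grid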

Now, let’s apply all the required patches of GI Release Update 19.19 to the Grid home, as the grid user:

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35042068

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35050331

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35050325

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35107512

[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/33575402
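Before moving on to the database home, it is worth confirming that the Grid home on node 2 now reports the same patches and patch level as node 1; a sketch, reusing the commands from earlier:

[grid@DBLN02 ~]$ opatch lspatches -oh /oracle/app/193/grid

[grid@DBLN02 ~]$ /oracle/app/193/grid/bin/kfod op=patchlvl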

The next step is to apply the required RU 19.19 patches to the database home, as the oracle user:

[oracle@DBLN02 ~]$ opatch apply -silent -oh /oracle/DB1/193 /sapcd/19C/AprilDBPatches/35037840/35042068

[oracle@DBLN02 ~]$ opatch apply -silent -oh /oracle/DB1/193 /sapcd/19C/AprilDBPatches/35037840/35050331
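A quick inventory check on the database home confirms the patches landed there as well (a sketch):

[oracle@DBLN02 ~]$ opatch lspatches -oh /oracle/DB1/193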

Next, recreate the OLR keys, as root:

[root@DBLN02 install]# clscfg -localpatch
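After that, the Grid home typically has to be locked again before the stack is started. A sketch of these final steps, reusing the same commands shown earlier in this post:

[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -lock

[root@DBLN02 ~]# crsctl start crs -wait

[root@DBLN02 ~]# crsctl check cluster -all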

After this, we were able to bring up the cluster on node 2 without issues!

Hope this helps!

Peace,

Vinicius