Hi everyone,
Hope you’re doing good!
A few weeks ago, a client’s DBA team reached out to me about a 2-node cluster where the second node was not running as expected.
I started troubleshooting the cluster by trying to start the stack on the failing node:
[root@DBLN02 ~]# crsctl start crs -wait
Let’s check the output of the above command:
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'dbln02'
CRS-2672: Attempting to start 'ora.evmd' on 'dbln02'
CRS-2676: Start of 'ora.evmd' on 'dbln02' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'dbln02'
CRS-2676: Start of 'ora.gpnpd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'dbln02'
CRS-2676: Start of 'ora.gipcd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'dbln02'
CRS-2672: Attempting to start 'ora.crf' on 'dbln02'
CRS-2676: Start of 'ora.cssdmonitor' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'dbln02'
CRS-2672: Attempting to start 'ora.diskmon' on 'dbln02'
CRS-2676: Start of 'ora.diskmon' on 'dbln02' succeeded
CRS-2676: Start of 'ora.crf' on 'dbln02' succeeded
CRS-2676: Start of 'ora.cssd' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'dbln02'
CRS-2672: Attempting to start 'ora.ctssd' on 'dbln02'
CRS-2676: Start of 'ora.ctssd' on 'dbln02' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'dbln02'
CRS-2676: Start of 'ora.asm' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'dbln02'
CRS-2676: Start of 'ora.storage' on 'dbln02' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'dbln02'
CRS-2676: Start of 'ora.crsd' on 'dbln02' succeeded
CRS-2883: Resource 'ora.crsd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
CRS-4000: Command Start failed, or completed with errors.
As we can see, GI is complaining about file permission issues (CRS-41053), which usually means that the ownership or permissions inside the Grid Home are wrong.
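Before touching anything, a quick sanity check is to compare the ownership and permissions of the Grid Home (and a couple of key binaries) against the healthy node. This is just a minimal sketch, not part of the original session:
[root@DBLN02 ~]# ls -ld /oracle/app/193/grid
[root@DBLN02 ~]# ls -l /oracle/app/193/grid/bin/oracle /oracle/app/193/grid/bin/crsctl
The same commands on DBLN01 serve as the reference for what “correct” should look like.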
So, the first thing to do is to unlock and then lock the GRID_HOME again, as the root user:
[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -unlock
[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -lock
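The -unlock/-lock cycle re-applies the ownership and permissions the Grid Home is supposed to have. If you want an independent verification afterwards, the Cluster Verification Utility can check the installed software files; a hedged example using standard cluvfy syntax (not taken from the original session):
[grid@DBLN02 ~]$ /oracle/app/193/grid/bin/cluvfy comp software -n dbln02 -verbose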
When we tried to start the cluster again, the error persisted with the same message:
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
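Since the unlock/lock cycle did not change anything, the next stop was the clusterware logs on the failing node. In 19c the GI alert log normally lives under the Grid Infrastructure ADR home; a hedged example of tailing it (path assumed from the diag locations referenced in the traces below):
[grid@DBLN02 ~]$ tail -100 /oracle/app/grid/diag/crs/dbln02/crs/trace/alert.log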
Here is what the GI alert.log showed on node 2:
2023-08-03 01:04:04.340 [OHASD(21431)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 21431
2023-08-03 01:04:04.386 [OHASD(21431)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2023-08-03 01:04:04.400 [OHASD(21431)]CRS-2112: The OLR service started on node dbln02.
2023-08-03 01:04:04.638 [OHASD(21431)]CRS-8011: reboot advisory message from host: dbln02, component: cssmonit, with time stamp: L-2023-08-03-00:05:26.790
2023-08-03 01:04:04.638 [OHASD(21431)]CRS-1301: Oracle High Availability Service started on node dbln02.
2023-08-03 01:04:04.639 [OHASD(21431)]CRS-8013: reboot advisory message text: Rebooting node due to connection problems with CSS
2023-08-03 01:04:04.641 [OHASD(21431)]CRS-8011: reboot advisory message from host: dbln02, component: cssagent, with time stamp: L-2023-08-03-00:05:26.791
2023-08-03 01:04:04.642 [OHASD(21431)]CRS-8013: reboot advisory message text: Rebooting node due to connection problems with CSS
2023-08-03 01:04:04.642 [OHASD(21431)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
2023-08-03 01:04:05.187 [CSSDMONITOR(21570)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 21570
2023-08-03 01:04:05.199 [CSSDAGENT(21564)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 21564
2023-08-03 01:04:05.199 [ORAROOTAGENT(21549)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 21549
2023-08-03 01:04:05.282 [ORAAGENT(21559)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 21559
2023-08-03 01:04:06.017 [ORAAGENT(21661)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 21661
2023-08-03 01:04:06.156 [MDNSD(21683)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 21683
2023-08-03 01:04:06.164 [EVMD(21685)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 21685
2023-08-03 01:04:07.197 [GPNPD(21719)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 21719
2023-08-03 01:04:08.268 [GPNPD(21719)]CRS-2328: Grid Plug and Play Daemon(GPNPD) started on node dbln02.
2023-08-03 01:04:08.247 [GIPCD(21790)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 21790
2023-08-03 01:04:09.197 [GIPCD(21790)]CRS-7517: The Oracle Grid Interprocess Communication (GIPC) failed to identify the Fast Node Death Detection (FNDD).
2023-08-03 01:04:12.449 [OSYSMOND(21826)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 21826
2023-08-03 01:04:12.433 [CSSDMONITOR(21824)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 21824
2023-08-03 01:04:12.685 [CSSDAGENT(21861)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 21861
2023-08-03 01:04:12.935 [OCSSD(21882)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 21882
2023-08-03 01:04:13.999 [OCSSD(21882)]CRS-1713: CSSD daemon is started in hub mode
2023-08-03 01:04:15.327 [OCSSD(21882)]CRS-1707: Lease acquisition for node dbln02 number 2 completed
2023-08-03 01:04:16.464 [OCSSD(21882)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc
2023-08-03 01:04:16.465 [OCSSD(21882)]CRS-1617: The information required to do node kill for node dbln02 is incomplete; details at (:CSSNM00004:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc
2023-08-03 01:04:16.470 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_003; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:16.477 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_002; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:16.484 [OCSSD(21882)]CRS-1605: CSSD voting file is online: ORCL:GRID1_001; details in /oracle/app/grid/diag/crs/dbln02/crs/trace/ocssd.trc.
2023-08-03 01:04:18.023 [OCSSD(21882)]CRS-1601: CSSD Reconfiguration complete. Active nodes are dbln01 dbln02 .
2023-08-03 01:04:19.947 [OCTSSD(22102)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 22102
2023-08-03 01:04:19.978 [OCSSD(21882)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2023-08-03 01:04:20.924 [OCTSSD(22102)]CRS-2403: The Cluster Time Synchronization Service on host dbln02 is in observer mode.
2023-08-03 01:04:22.216 [OCTSSD(22102)]CRS-2407: The new Cluster Time Synchronization Service reference node is host dbln01.
2023-08-03 01:04:22.218 [OCTSSD(22102)]CRS-2401: The Cluster Time Synchronization Service started on host dbln02.
2023-08-03 01:04:29.836 [OLOGGERD(22190)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 22190
2023-08-03 01:04:29.885 [CRSD(22192)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22192
2023-08-03T01:04:32.764516-04:00 Errors in file /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc (incident=1505): CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2023-08-03 01:04:32.743 [CRSD(22192)]CRS-1019: The Oracle Cluster Registry (OCR) Service exited on host dbln02. Details at (:PROCR00016:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc. Incident details in: /oracle/app/grid/diag/crs/dbln02/crs/incident/incdir_1505/crsd_i1505.trc
The last messages are the ones worth paying attention to:
2023-08-03 01:04:29.885 [CRSD(22192)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22192
2023-08-03T01:04:32.764516-04:00 Errors in file /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc (incident=1505): CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
2023-08-03 01:04:32.743 [CRSD(22192)]CRS-1019: The Oracle Cluster Registry (OCR) Service exited on host dbln02. Details at (:PROCR00016:) in /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc. Incident details in: /oracle/app/grid/diag/crs/dbln02/crs/incident/incdir_1505/crsd_i1505.trc
So, nothing here clearly explains why the cluster was not starting up. Let’s check the crsd.trc file (output truncated for readability):
2023-08-03 01:04:32.743 : OCRMAS:2992596736: th_hub_verify_master_pubdata: Shutdown CacheLocal. Patch Levels don't match. Local Patch Level [3360129916] != Cache Writer Patch Level [2696908285]
OK, now it is clear:
Patch Levels don’t match. Local Patch Level [3360129916] != Cache Writer Patch Level [2696908285]
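For reference, a hedged one-liner to pull that message out of the (rather large) trace file, using the path reported in the alert log:
[root@DBLN02 ~]# grep -i "patch level" /oracle/app/grid/diag/crs/dbln02/crs/trace/crsd.trc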
Let’s compare the patch level between the two nodes of the cluster.
On the node where the cluster is running fine:
[root@DBLN01 ~]$ /oracle/app/193/grid/bin/kfod op=patches
---------------
List of Patches
===============
29517242
29517247
29585399
33575402
35042068
35050325
35050331
35107512
[root@DBLN01 ~]$ /oracle/app/193/grid/bin/kfod op=patchlvl
-------------------
Current Patch level
===================
2696908285
Now let’s check the second node, where the cluster is not starting up:
[root@DBLN02 ~]# /oracle/app/193/grid/bin/kfod op=patches
---------------
List of Patches
===============
29517242
29517247
29585399
33575402
34419443
34428761
34444834
34580338
[root@DBLN02 ~]# /oracle/app/193/grid/bin/kfod op=patchlvl
-------------------
Current Patch level
===================
3360129916
Confirmed!
Patch level node 1: 2696908285
Patch level node 2: 3360129916
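Besides kfod, the same comparison can be made against the OPatch inventory of each Grid Home; a hedged example using standard OPatch syntax (not from the original session):
[grid@DBLN01 ~]$ /oracle/app/193/grid/OPatch/opatch lspatches -oh /oracle/app/193/grid
[grid@DBLN02 ~]$ /oracle/app/193/grid/OPatch/opatch lspatches -oh /oracle/app/193/grid
Node 1 should list the 19.19 patches, while node 2 still shows the older patch set.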
Then I asked the DBA team, and they told me they had applied Release Update 19.19 on the first node without noticing that the second node was down (a previous storage error had brought the cluster down on that node). Applying RU 19.19 on only one node is what caused the patch level mismatch.
The fix is to “align” the patches between the nodes. First, let’s unlock the Grid Home on node 2:
[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -unlock
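With the Grid Home unlocked, it is worth letting OPatch check each sub-patch for conflicts before applying it; a hedged example of the standard prereq check, using the first sub-patch location from the apply commands below:
[grid@DBLN02 ~]$ opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /sapcd/19C/AprilDBPatches/35037840/35042068 -oh /oracle/app/193/grid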
Now, let’s apply all the patches required for GI Release Update 19.19, as the grid user:
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35042068
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35050331
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35050325
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/35107512
[grid@DBLN02 ~]$ opatch apply -silent -oh /oracle/app/193/grid /sapcd/19C/AprilDBPatches/35037840/33575402
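With all five sub-patches applied, the patch level on node 2 should now match node 1; a quick re-check with the same kfod commands used earlier (it should now report 2696908285, matching node 1):
[grid@DBLN02 ~]$ /oracle/app/193/grid/bin/kfod op=patches
[grid@DBLN02 ~]$ /oracle/app/193/grid/bin/kfod op=patchlvl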
The next step is to apply the required RU 19.19 patches to the DB home, as the oracle user:
[oracle@DBLN02 ~]$ opatch apply -silent -oh /oracle/DB1/193 /sapcd/19C/AprilDBPatches/35037840/35042068
[oracle@DBLN02 ~]$ opatch apply -silent -oh /oracle/DB1/193 /sapcd/19C/AprilDBPatches/35037840/35050331
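To confirm what landed in the DB home, a hedged check of its inventory (not captured in the original session):
[oracle@DBLN02 ~]$ opatch lspatches -oh /oracle/DB1/193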
Then, as root, recreate the OLR keys:
[root@DBLN02 install]# clscfg -localpatch
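With the OLR keys recreated, the Grid Home presumably still has to be locked again and the stack started, mirroring the commands used earlier; a short sketch:
[root@DBLN02 ~]# /oracle/app/193/grid/crs/install/rootcrs.sh -lock
[root@DBLN02 ~]# crsctl start crs -wait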
After this, we were able to bring up the cluster on node 2 without issues!
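To double-check that both nodes were healthy afterwards, the usual crsctl checks can be run from either node (not captured in the original session):
[root@DBLN02 ~]# crsctl check cluster -all
[root@DBLN02 ~]# crsctl stat res -t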
Hope this helps!
Peace,
Vinicius
Disclaimer
My postings reflect my own views and do not necessarily represent the views of my employer, Accenture.