Troubleshooting Oracle Clusterware (GI/CRS) startup issues: CRS-4535 Cannot communicate with Cluster Ready Services

Purpose

After rebooting a node, the CRS stack fails to start. This bulletin offers a list of things to check while troubleshooting the cause.

CRSCTL is generally of little help in analyzing the problem, because it only reports that CRSD cannot be reached, as in the following output:

$ crsctl check crs

CRS-4638: Oracle High Availability Services is online

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4533: Event Manager is online

# crsctl start resource crsd

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4000: Command Start failed, or completed with errors.

# crsctl stop cluster -all

CRS-2673: Attempting to stop 'ora.crsd' on 'racnode1'

CRS-2673: Attempting to stop 'ora.crsd' on 'racnode2'

CRS-4548: Unable to connect to CRSD

CRS-2675: Stop of 'ora.crsd' on 'racnode1' failed

CRS-2679: Attempting to clean 'ora.crsd' on 'racnode1'

CRS-4548: Unable to connect to CRSD

CRS-2678: 'ora.crsd' on 'racnode1' has experienced an unrecoverable failure

CRS-0267: Human intervention required to resume its availability

CRS-4548: Unable to connect to CRSD

CRS-2675: Stop of 'ora.crsd' on 'racnode2' failed

CRS-2679: Attempting to clean 'ora.crsd' on 'racnode2'

CRS-4548: Unable to connect to CRSD

CRS-2678: 'ora.crsd' on 'racnode2' has experienced an unrecoverable failure

CRS-0267: Human intervention required to resume its availability

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

1. Check the processes currently running on the node from the Grid home. For example, if you use /u01/grid for the Grid home, you might use the following:

ps -ef | grep grid

Determine which processes did not start. The startup dependencies are as follows, where each process requires the processes named after the arrow:

CRSD --> EVMD and CTSSD

CTSSD --> CSSD

CSSD --> CSSDMONITOR, DISKMON, and GPNPD

GPNPD --> MDNSD and GIPCD
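
As a rough sketch of step 1, the check can be scripted. The function below reads ps -ef output and prints the daemons that are absent, lowest dependency level first. The binary names (ocssd for CSSD, octssd for CTSSD, and the optional .bin suffix) are assumptions based on common Grid Infrastructure installs and may differ by release; DISKMON in particular may legitimately be absent on some configurations.

```shell
#!/bin/sh
# Daemons in bottom-up startup order, following the dependency list above.
# The binary names are assumptions; adjust them to match your release.
DAEMONS="mdnsd gipcd gpnpd cssdmonitor diskmon ocssd octssd evmd crsd"

# Reads `ps -ef` output on stdin and prints each daemon that is not running.
missing_daemons() {
    ps_out=$(cat)
    for d in $DAEMONS; do
        # Match the binary name at the end of a path, e.g. /u01/grid/bin/ocssd.bin
        if ! printf '%s\n' "$ps_out" | grep -Eq "/${d}(\.bin)?( |\$)"; then
            echo "$d"
        fi
    done
}
```

Run it as: ps -ef | missing_daemons. The first name printed is the lowest-level process that is missing, which is usually the one to investigate first.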

2. Review the clusterware alert log and the trace files for the processes that did not start. For example, if CRSD, CSSD, and DISKMON did not start, check the trace files at the lowest process level first, which would be DISKMON.
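
For step 2, a minimal sketch for locating the clusterware alert log follows. Both path layouts are assumptions that depend on the release: older installs typically keep the log under the Grid home, while later ones use the ADR directory under the Oracle base. The function names are hypothetical helpers, not Oracle tools.

```shell
#!/bin/sh
# Prints the two usual alert log locations for a Grid home, Oracle base,
# and short host name. Both layouts are assumptions; verify on your system.
alert_log_candidates() {
    grid_home=$1; oracle_base=$2; host=$3
    echo "$grid_home/log/$host/alert$host.log"              # Grid home layout
    echo "$oracle_base/diag/crs/$host/crs/trace/alert.log"  # ADR layout
}

# Shows the last 50 lines of whichever candidate file exists.
show_alert_tail() {
    alert_log_candidates "$@" | while IFS= read -r f; do
        [ -f "$f" ] && { echo "== $f"; tail -n 50 "$f"; }
    done
    return 0
}
```

For example: show_alert_tail /u01/grid /u01/app/grid "$(hostname -s)", where /u01/app/grid is an assumed Oracle base.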

3. If you use ASM for storing the Oracle Clusterware files (voting disks/OCR), make sure the ASM instance is started. Use SQL*Plus (from the Grid home) to start the ASM instance if it is not started, and resolve any errors that occur.
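
The check in step 3 might be sketched as below. It assumes the ASM background processes follow the usual asm_pmon_+ASM<n> naming, and that the environment (ORACLE_HOME pointing at the Grid home, ORACLE_SID set to the local ASM instance such as +ASM1) is already set for the Grid owner.

```shell
#!/bin/sh
# Succeeds if `ps -ef` output on stdin contains an ASM PMON background process.
# The asm_pmon_+ASM<n> process name is an assumption about the instance name.
asm_running() {
    grep -Eq 'asm_pmon_\+ASM[0-9]*'
}

# Starts the ASM instance through SQL*Plus; any startup errors are printed
# by SQL*Plus and must be resolved manually.
start_asm() {
    printf 'startup\nexit\n' | sqlplus -S "/ as sysasm"
}
```

A typical use is: ps -ef | asm_running || start_asm.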

4. Make sure the ASM instance or Oracle Clusterware user has access to the disks used to store the Oracle Clusterware files. Configure UDEV if required.
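
A small sketch for step 4 is shown below. It checks that a device has the expected owner, group, and mode. The owner grid, group asmadmin, mode 660, and the /dev/oracleasm/disks path in the usage note are assumptions about a typical setup, and the stat options used are those of GNU coreutils on Linux.

```shell
#!/bin/sh
# Succeeds if the device (symlinks followed) is owned by the expected
# user and group and has mode 660.
disk_perms_ok() {
    dev=$1; owner=$2; group=$3
    actual=$(stat -L -c '%U:%G:%a' "$dev") || return 1
    [ "$actual" = "$owner:$group:660" ]
}
```

For example: for d in /dev/oracleasm/disks/*; do disk_perms_ok "$d" grid asmadmin || echo "check $d"; done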

5. If the disks were stamped with ASMLIB, make sure all the ASMLIB RPMs are installed. If the oracleasmlib RPM is missing, the disks will appear to be marked for use with ASM, but there will be no library for ASM to interface with. The packages look like the following:

oracleasm-2.6 debug

oracleasm-support-2.1.2

oracleasmlib-2.0.4

oracleasm-2.6.18
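
The package check in step 5 could be scripted roughly as follows. The function reads the output of rpm -qa and reports which ASMLIB components are absent. It assumes the kernel driver package is named oracleasm-<kernel version>; on kernels that ship the driver built in, that package may not be needed, so treat its absence as a prompt to verify rather than as a definite fault.

```shell
#!/bin/sh
# Reads package names (one per line, as printed by `rpm -qa`) on stdin and
# prints each ASMLIB component that is absent.
missing_asmlib_rpms() {
    pkgs=$(cat)
    printf '%s\n' "$pkgs" | grep -q '^oracleasm-support-' || echo "oracleasm-support"
    printf '%s\n' "$pkgs" | grep -q '^oracleasmlib-' || echo "oracleasmlib"
    # The kernel driver package name starts with oracleasm-<digit>.
    printf '%s\n' "$pkgs" | grep -q '^oracleasm-[0-9]' || echo "oracleasm (kernel driver)"
    return 0
}
```

Run it as: rpm -qa | missing_asmlib_rpms. No output means all three components were found.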
