Background:
HP ProLiant DL380 G5 servers connected to an HP EVA 4400 SAN.
Windows Server 2008 R2 and SQL Server 2008.
SQL Server failover clustering on two nodes.
Problem summary:
The SQL Server failover cluster has been failing for the past few days. I found cluster errors in the event logs on both servers. I ran cluster validation 10 times; it failed 2 out of 10.
Messages seen in cluster validation wizard:
Failed to validate file data on cluster disk 4 partition 1, failure reason: The system cannot find the file specified.
An error occurred while executing the test.
There was an error getting information about the running processes on the nodes.
There was an error retrieving information about the Processes from node 'SQL02'.
Not found
================================
Ran chkdsk on the C: drive of both servers: no errors
Ran sfc /scannow on both servers: no errors
Increased the size of the SQL FILESTREAM drive yesterday: still had errors early this morning
================================
Highlights of Windows events seen:
SQL01 event log highlights:
11-14 1:34PM:
event 1055
Health check for file share resource 'SQL Server FILESTREAM share (MSSQLSERVER)' failed. Retrieving information for share 'FSData' (scoped to network name ...SQL) indicated that the share does not exist (error code '53'). Please ensure the share exists and is accessible.
event 1069
Cluster resource 'SQL Server FILESTREAM share (MSSQLSERVER)' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
event 1077
Health check for IP interface 'IP Address ....30' (address '....30') failed (status is '1168'). Run the Validate a Configuration wizard to ensure that the network adapter is functioning properly.
event 1069
Cluster resource 'IP Address ....30' in clustered service or application '...SQDtc' failed.
event 7034
The Distributed Transaction Coordinator (6589ecf4-6303-422b-9de3-f90653f68a14) service terminated unexpectedly. It has done this 1 time(s).
more related events, then
event 1135
Cluster node '...SQL02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
11-17 4:53AM
event 1215
Cluster network name resource 'SQL Network Name (...SQL)' failed a health check. Network name '...SQL' is no longer registered on this node. The error code was '1453'. Check for hardware or software errors related to the network adapter. Also, you can run the Validate a Configuration wizard to check your network configuration.
11-18 1:14AM
event 1215
Cluster network name resource '...SQDtc' failed a health check. Network name '...SQDTC' is no longer registered on this node. The error code was '1453'. Check for hardware or software errors related to the network adapter. Also, you can run the Validate a Configuration wizard to check your network configuration.
1:46AM
event 1055
Health check for file share resource 'SQL Server FILESTREAM share (MSSQLSERVER)' failed. Retrieving information for share 'FSData' (scoped to network name ...SQL) indicated that the share does not exist (error code '1726'). Please ensure the share exists and is accessible.
3:16am
event 6
An I/O operation initiated by the Registry failed unrecoverably.The Registry could not flush hive (file): '\SystemRoot\System32\Config\SOFTWARE'.
4:14am
event 137
The default transaction resource manager on volume G: encountered a non-retryable error and could not start. The data contains the error code.
event 7024
The Cluster Service service terminated with service-specific error Insufficient quota to complete the requested service..
SQL02 event log highlights:
11-14 1:34pm
event 1135
Cluster node '...sql01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
11-18 4:14am
event 1135
Cluster node '...sql01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
4:15am
event 1069
Cluster resource 'SQL Server Agent' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
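For reference, the numeric codes quoted in the events above (53, 1168, 1453, 1726) look like standard Win32 system error codes. The following is a minimal Python sketch that maps them to their usual names and descriptions. The lookup table is hand-written for these four values only, and `describe` is a hypothetical helper name, not part of any Windows tooling. Note that the text for 1453 matches the message in event 7024.

```python
# Win32 system error codes quoted in the cluster events above.
# Hand-written table covering only these four values; names and texts
# follow the standard Windows system error code list.
WIN32_ERRORS = {
    53: ("ERROR_BAD_NETPATH", "The network path was not found."),
    1168: ("ERROR_NOT_FOUND", "Element not found."),
    1453: ("ERROR_WORKING_SET_QUOTA",
           "Insufficient quota to complete the requested service."),
    1726: ("RPC_S_CALL_FAILED", "The remote procedure call failed."),
}


def describe(code):
    """Return a one-line description for a code (hypothetical helper)."""
    name, text = WIN32_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code}: {name} - {text}"


if __name__ == "__main__":
    for code in (53, 1168, 1453, 1726):
        print(describe(code))
```

Read this way, the events mix network-path failures (53), an RPC failure (1726), and a quota error (1453), which may help narrow down whether the nodes are running out of a resource rather than only losing connectivity.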
================================
HP EVA 4400 log excerpts:
01:17:34
29-Oct-2012 Yes 31101 SCell:SAM
SC Event Code: 06324e13 - An HSV300 controller has detected only one port of all Fibre Channel devices on a loop.
01:17:34
29-Oct-2012 Yes 31260 SCell:SAM
SC Event Code: 09cdc305 - A Fibre Channel port has transitioned to the FAILED state.
01:17:34
29-Oct-2012 Yes 3028 SCell:SAM
Cannot find description for SC Event Code: 066a0028
01:17:34
29-Oct-2012 Yes 31101 SCell:SAM
SC Event Code: 06324e13 - An HSV300 controller has detected only one port of all Fibre Channel devices on a loop.
01:17:34
29-Oct-2012 Yes 31260 SCell:SAM
SC Event Code: 09cdc305 - A Fibre Channel port has transitioned to the FAILED state.
02:08:03:545
29-Oct-2012
Controller 2 066a0028 #12857
Corrective action code: 00 More details
02:08:03:531
29-Oct-2012
Controller 2 0319000a #12856
An HSV300 controller has begun discovering devices on the backend loops.
Corrective action code: 00 More details
02:08:03:531
29-Oct-2012
Controller 2 06324e13 #12855
An HSV300 controller has detected only one port of all Fibre Channel devices on a loop.
Corrective action code: 4e More details
02:08:03:531
29-Oct-2012
Controller 2 09cdc305 #12854
A Fibre Channel port has transitioned to the FAILED state.
Corrective action code: c3 More details
02:08:27:643
29-Oct-2012
Controller 1 031a000a #12853
An HSV300 controller has completed discovering devices on the backend loops.
Corrective action code: 00 More details
02:08:25:329
29-Oct-2012
Controller 1 066a0028 #12852
Corrective action code: 00
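One pattern in the controller log excerpts: in every entry that lists both values, the third byte of the 8-digit SC event code equals the "Corrective action code" printed under it (for example 06324e13 and 4e). The Python sketch below checks that against the pairs copied from the excerpts. It assumes the event code is four hex bytes and makes no claim about what the other three bytes mean; `corrective_action` is a hypothetical helper name.

```python
def corrective_action(event_code):
    """Return the third byte of an 8-hex-digit SC event code as two hex digits.

    Assumption: the code is four bytes and the third byte carries the
    corrective action code, as the log excerpts above suggest.
    """
    raw = bytes.fromhex(event_code)
    if len(raw) != 4:
        raise ValueError(f"expected 8 hex digits, got {event_code!r}")
    return f"{raw[2]:02x}"


# (event code, corrective action code) pairs copied from the excerpts above.
OBSERVED = [
    ("066a0028", "00"),
    ("0319000a", "00"),
    ("06324e13", "4e"),
    ("09cdc305", "c3"),
    ("031a000a", "00"),
]

if __name__ == "__main__":
    for code, expected in OBSERVED:
        status = "match" if corrective_action(code) == expected else "MISMATCH"
        print(code, corrective_action(code), expected, status)
```

If the pattern holds, the two entries with a non-zero corrective action (4e and c3) are the ones worth looking up in the HP EVA event documentation first.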