Channel: High Availability (Clustering) forum

feedback on cluster errors & related windows events - sql failover cluster

Background:

HP ProLiant DL380 G5 servers connected to an HP EVA 4400 SAN.
Windows Server 2008 R2 and SQL Server 2008.
SQL Server failover clustering on two nodes.

Problem summary:

The SQL Server failover cluster has been failing for the past few days. I found cluster errors in the event logs on both servers. I ran cluster validation 10 times; it failed on 2 of the 10 runs.

Messages seen in cluster validation wizard:

Failed to validate file data on cluster disk 4 partition 1, failure reason: The system cannot find the file specified.
An error occurred while executing the test.
There was an error getting information about the running processes on the nodes.
There was an error retrieving information about the Processes from node 'SQL02'.
Not found

================================

Ran chkdsk on the C: drive of both servers: no errors.

Ran sfc /scannow on both servers: no errors.

Increased the size of the SQL FILESTREAM drive yesterday, but still had errors early this morning.

================================

highlights of windows events seen:

sql01 event log highlights:

11-14 1:34PM:

event 1055
Health check for file share resource 'SQL Server FILESTREAM share (MSSQLSERVER)' failed. Retrieving information for share 'FSData' (scoped to network name ...SQL) indicated that the share does not exist (error code '53'). Please ensure the share exists and is accessible.

event 1069
Cluster resource 'SQL Server FILESTREAM share (MSSQLSERVER)' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.

event 1077
Health check for IP interface 'IP Address ....30' (address '....30') failed (status is '1168'). Run the Validate a Configuration wizard to ensure that the network adapter is functioning properly.

event 1069
Cluster resource 'IP Address ....30' in clustered service or application '...SQDtc' failed.

event 7034
The Distributed Transaction Coordinator (6589ecf4-6303-422b-9de3-f90653f68a14) service terminated unexpectedly. It has done this 1 time(s).

more related events, then

event 1135
Cluster node '...SQL02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

11-17 4:53AM

event 1215
Cluster network name resource 'SQL Network Name (...SQL)' failed a health check. Network name '...SQL' is no longer registered on this node. The error code was '1453'. Check for hardware or software errors related to the network adapter. Also, you can run the Validate a Configuration wizard to check your network configuration.

11-18 1:14AM

event 1215
Cluster network name resource '...SQDtc' failed a health check. Network name '...SQDTC' is no longer registered on this node. The error code was '1453'. Check for hardware or software errors related to the network adapter. Also, you can run the Validate a Configuration wizard to check your network configuration.

1:46AM

event 1055
Health check for file share resource 'SQL Server FILESTREAM share (MSSQLSERVER)' failed. Retrieving information for share 'FSData' (scoped to network name ...SQL) indicated that the share does not exist (error code '1726'). Please ensure the share exists and is accessible.

3:16am

event 6
An I/O operation initiated by the Registry failed unrecoverably. The Registry could not flush hive (file): '\SystemRoot\System32\Config\SOFTWARE'.

4:14am

event 137
The default transaction resource manager on volume G: encountered a non-retryable error and could not start. The data contains the error code.

event 7024
The Cluster Service service terminated with service-specific error Insufficient quota to complete the requested service..

sql02 event log highlights:

11-14 1:34pm

event 1135
Cluster node '...sql01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

11-18 4:14am

event 1135
Cluster node '...sql01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

4:15am

event 1069
Cluster resource 'SQL Server Agent' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.
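
For reference while reading the events above: the numeric codes they quote look like standard Win32 system error codes. Below is a minimal Python sketch of a lookup table for the four codes that appear, assuming that interpretation holds (the status '1168' in event 1077 in particular is assumed to be a Win32 code, since the event does not say so). The function names and the `SEEN` list are made up for illustration. One thing it makes visible is that the text for 1453 is the same "Insufficient quota" message that event 7024 reports.

```python
# Win32 system error codes quoted in the sql01 event highlights.
# Assumption: every code in these events is a plain Win32 error code.
WIN32_ERRORS = {
    53: ("ERROR_BAD_NETPATH", "The network path was not found."),
    1168: ("ERROR_NOT_FOUND", "Element not found."),
    1453: ("ERROR_WORKING_SET_QUOTA",
           "Insufficient quota to complete the requested service."),
    1726: ("RPC_S_CALL_FAILED", "The remote procedure call failed."),
}

# (event ID, error code) pairs in the order they appear in the post.
SEEN = [(1055, 53), (1077, 1168), (1215, 1453), (1215, 1453), (1055, 1726)]


def describe(code):
    """Return 'code NAME: message' for a code, or mark it as unknown."""
    name, text = WIN32_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code} {name}: {text}"


def summarize(pairs):
    """Build one readable line per (event ID, error code) pair."""
    return [f"event {event_id} -> {describe(code)}" for event_id, code in pairs]


if __name__ == "__main__":
    print("\n".join(summarize(SEEN)))
```

Read this way, the events mix "path or element not found" errors (53, 1168), an RPC failure (1726), and a quota error (1453). The table only translates the codes; it does not say which of them is the root cause.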

================================

HP EVA 4400 log excerpts:

01:17:34
29-Oct-2012 Yes 31101 SCell:SAM
SC Event Code: 06324e13 - An HSV300 controller has detected only one port of all Fibre Channel devices on a loop.
01:17:34
29-Oct-2012 Yes 31260 SCell:SAM
SC Event Code: 09cdc305 - A Fibre Channel port has transitioned to the FAILED state.
01:17:34
29-Oct-2012 Yes 3028 SCell:SAM
Cannot find description for SC Event Code: 066a0028
01:17:34
29-Oct-2012 Yes 31101 SCell:SAM
SC Event Code: 06324e13 - An HSV300 controller has detected only one port of all Fibre Channel devices on a loop.
01:17:34
29-Oct-2012 Yes 31260 SCell:SAM
SC Event Code: 09cdc305 - A Fibre Channel port has transitioned to the FAILED state.

02:08:03:545
29-Oct-2012
Controller 2 066a0028 #12857
Corrective action code: 00

02:08:03:531
29-Oct-2012
Controller 2 0319000a #12856
An HSV300 controller has begun discovering devices on the backend loops.
Corrective action code: 00

02:08:03:531
29-Oct-2012
Controller 2 06324e13 #12855
An HSV300 controller has detected only one port of all Fibre Channel devices on a loop.
Corrective action code: 4e

02:08:03:531
29-Oct-2012
Controller 2 09cdc305 #12854
A Fibre Channel port has transitioned to the FAILED state.
Corrective action code: c3

02:08:27:643
29-Oct-2012
Controller 1 031a000a #12853
An HSV300 controller has completed discovering devices on the backend loops.
Corrective action code: 00

02:08:25:329
29-Oct-2012
Controller 1 066a0028 #12852
Corrective action code: 00
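
A pattern in the controller entries above: in every entry, the logged corrective action code equals the third byte of the 8-digit SC event code (for example 06324e13 goes with 4e, and 09cdc305 with c3). The short Python sketch below checks that against the entries copied from this excerpt. It is only an observation from these few lines, not a documented description of the HP event code layout, and the function names are hypothetical.

```python
# (SC event code, corrective action code) pairs copied from the excerpt.
LOGGED = [
    ("066a0028", "00"),  # #12857
    ("0319000a", "00"),  # #12856
    ("06324e13", "4e"),  # #12855
    ("09cdc305", "c3"),  # #12854
    ("031a000a", "00"),  # #12853
    ("066a0028", "00"),  # #12852
]


def split_event_code(code):
    """Split an 8-hex-digit event code into four 2-digit byte strings."""
    if len(code) != 8:
        raise ValueError(f"expected 8 hex digits, got {code!r}")
    int(code, 16)  # raises ValueError if the string is not valid hex
    return [code[i:i + 2].lower() for i in range(0, 8, 2)]


def third_byte(code):
    """Return the third byte, assumed here to carry the corrective action."""
    return split_event_code(code)[2]


def mismatches(pairs):
    """List the pairs whose third byte differs from the logged action code."""
    return [(code, action) for code, action in pairs
            if third_byte(code) != action.lower()]


if __name__ == "__main__":
    for code, action in LOGGED:
        print(code, split_event_code(code), "logged action:", action)
    print("mismatches:", mismatches(LOGGED))
```

If the pattern holds in general, the two entries with a non-zero action code (4e and c3) are the ones to look up in the HP documentation first. Both belong to the Fibre Channel port and loop messages.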

