Hi everyone!
We have a 5-node SQL Server 2012 failover cluster running on Windows Server 2012 Datacenter, built on IBM
BladeCenter HS23 (type 7875) blades. The cluster nodes boot from SAN on an IBM Storwize v3700 and use LUNs
from an IBM Storwize v7000.
Periodically, on different nodes of the cluster, the following errors appear:
Event ID 1073: The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was '668'.
Event ID 7031: The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
Event ID 7024: The Cluster Service service terminated with the following service-specific error: An assertion failure has occurred.
After these errors appear, the affected node hangs in the "joining" state, and the same happens to every node that is subsequently rebooted or turned off. All operations I try to perform on the cluster (stopping the cluster service, pause, evict, etc.) fail. The cluster returns to a normal state only after
all of its nodes are rebooted. Here is the piece of the cluster log from the time the error occurred:
00000b4c.00000c7c::2014/04/21-03:32:25.939 INFO [VSS] Backing up part of the system state [VSS]
OnPrepareBackup: starting new session dfb4fbf0-db28-40d2-af3a-82e66a271267
00000b4c.00000c7c::2014/04/21-03:32:25.939 INFO [VSS] OnPrepareBackup returning - true
00000b4c.00001194::2014/04/21-03:32:26.704 INFO [GUM] Node 7: Processing RequestLock 4:4744
00000b4c.00001198::2014/04/21-03:32:26.704 INFO [GUM] Node 7: Processing GrantLock to 4 (sent by 3
gumid: 11271)
00000b4c.00000e2c::2014/04/21-03:32:26.704 ERR mscs::GumAgent::ExecuteQueuedUpdate: TransactionInProgress(5918)'
because of 'Cannot restart an in-progress transaction'
00000b4c.00001194::2014/04/21-03:32:26.719 ERR Failed type check .?AUBoxedNodeSet@mscs@@
00000b4c.00001194::2014/04/21-03:32:26.719 ERR [CORE] mscs::ClusterCore::DeliverMessage: TypeMismatch(1629)'
because of 'failed type check'
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [VSS] HandleBackupGum - Initiating the backup
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [VSS] HandleOnFreezeGum - Stopping the Death Timer
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [VSS] HandleBackupGum - Completed the backup Request
00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR [GUM] Node 7: sequenceNumber + 1 == payload->GumId
(5129, 11272)
00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR mscs::GumAgent::ExecuteQueuedUpdate: AssertionFailed(668)'
because of 'failed assertion'(sequenceNumber + 1 == payload->GumId is false)
00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR GumHandler failed (status = 668)
00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR GumHandler failed (status = 668), executing OnStop
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [DM]: Shutting down, so unloading the cluster database.
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [DM] Shutting down, so unloading the cluster database
(waitForLock: false).
00000b4c.00000e2c::2014/04/21-03:32:26.813 ERR FatalError is Calling Exit Process.
00000b4c.00000b50::2014/04/21-03:32:26.813 INFO [CS] About to exit process...
000015d0.000015d4::2014/04/21-03:32:26.828 WARN [RHS] Cluster service has terminated.
00001618.0000161c::2014/04/21-03:32:26.828 WARN [RHS] Cluster service has terminated.
00001588.0000158c::2014/04/21-03:32:26.828 WARN [RHS] Cluster service has terminated.
000015f4.000015f8::2014/04/21-03:32:26.828 WARN [RHS] Cluster service has terminated.
All of the recommended failover cluster updates and hotfixes are installed, and the cluster passes validation.
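
In case it helps anyone comparing logs from several nodes, here is a rough Python sketch for pulling only the ERR entries out of a cluster log. It assumes every entry follows the layout seen in the excerpt above (process ID, thread ID, timestamp, level, message) and that a line not matching that layout is a wrapped continuation of the previous entry. The names parse_entries and errors_only are my own, not part of any cluster tooling.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumed entry layout, taken from the excerpt above:
# <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> <message>
ENTRY = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


class Entry(NamedTuple):
    pid: str
    tid: str
    ts: str
    level: str
    msg: str


def parse_entries(lines: Iterable[str]) -> List[Entry]:
    """Parse log lines, folding wrapped continuation lines into the previous entry."""
    entries: List[Entry] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            entries.append(Entry(**match.groupdict()))
        elif entries and line.strip():
            # Assumption: a non-matching, non-empty line continues the last entry.
            last = entries[-1]
            entries[-1] = last._replace(msg=last.msg + " " + line.strip())
    return entries


def errors_only(entries: Iterable[Entry]) -> List[Entry]:
    """Keep only the entries logged at ERR level."""
    return [e for e in entries if e.level == "ERR"]


if __name__ == "__main__":
    sample = [
        "00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [VSS] HandleBackupGum - Initiating the backup",
        "00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR [GUM] Node 7: sequenceNumber + 1 == payload->GumId",
        "(5129, 11272)",
    ]
    for entry in errors_only(parse_entries(sample)):
        print(entry.ts, entry.tid, entry.msg)
```

Running the same filter over the log from each node and sorting by timestamp should make it easier to see which node first reported the GumId mismatch.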