Last night my two-node cluster went down for no apparent reason. All four VMs were down even though Failover Cluster Manager reported them as running. The cluster shared volume on my SAN was not accessible through Windows Explorer, but the Dell MPIO software showed it as connected, and the SAN itself showed an active connection and reported no problems.
It took me five hours to get the cluster running again. I had to restart each node several times remotely, from the command line on another server, because the RDP session kept hanging when Explorer locked up. Out of desperation I also removed the antivirus software from each node; I don't know whether that was the problem. The cluster finally started working again after I manually brought the cluster IP back online, manually moved all resources to node1, paused and drained node2, and then restarted node2.
This error shows up twice in the Application log of both nodes:
Possible Memory Leak. Application (C:\Windows\Cluster\rhs.exe -key SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters\Rhs\0428d6b3-5c3b-4757-bc31-70379129ad89 -parentPid 3060 -initEvent 1dbde958-779b-4cd7-8daa-7c9299d0303c -replyEndpoint OLEAA17D0EF8BDFFAD1F4F33871C878) (PID: 4520) has passed a non-NULL pointer to RPC for an [out] parameter marked [allocate(all_nodes)]. [allocate(all_nodes)] parameters are always reallocated; if the original pointer contained the address of valid memory, that memory will be leaked. The call originated on the interface with UUID ({4b324fc8-1670-01d3-1278-5a47bf6ee188}), Method number (64). User Action: Contact your application vendor for an updated version of the application.
There are also two critical stops logged in the Dell OpenManage logs on each node.
The symptoms are very similar to those described in this Microsoft KB article for Server 2008 R2:
http://support.microsoft.com/kb/2798093
Both nodes are fully updated and have hotfix 2870270 installed.
Can anyone shed some light on this? What went wrong and how do I prevent it from happening again?