Hi!
We're facing an issue when the cluster service starts: it won't accept any group for a while (anywhere from 10 to 90 minutes after startup).
We went through cluster.log and found a lot of records like this:
ERR [RES] Physical Disk: Failed to get snapshot info for disk, status 170. Ignoring...
WARN [RES] Physical Disk <%1>: Open: invalid device number!
where %1 is one of the disk resources in the cluster.
After that, RHS treats it as a deadlock and restarts:
ERR [RHS] RhsCall::DeadlockMonitor: Call OPENRESOURCE timed out for resource '%1'.
ERR [RHS] Resource Cluster Disk 3 handling deadlock. Cleaning current operation and terminating RHS process.
and submits a WER report.
When this occurs, RCM logs a warning:
WARN [RCM] HandleMonitorReply: FAILURENOTIFICATION for '%1', gen(0) result 4.
And if it fails 4 consecutive times, it marks the resource as poisoned:
WARN [RCM] rcm::RcmResource::HandleMonitorReply: Resource '%1' consecutive failure count 4. Moving resource to the poisoned state.
INFO [RCM] TransitionToState(%1) Offline-->Poisoned.
Once every disk either returns a good status (gen(0) result 0) or is marked as poisoned, the issue disappears.
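For anyone who wants to check the same pattern in their own log, here is a rough Python sketch that tallies these records per resource. The regular expressions are modeled only on the records quoted above. The exact layout of real cluster.log lines (timestamps, thread IDs, encoding) is an assumption, and the name summarize is made up for illustration.

```python
import re
from collections import Counter

# Patterns modeled on the records quoted in this post (assumption: the
# message text appears verbatim somewhere in each log line).
TIMEOUT_RE = re.compile(r"Call OPENRESOURCE timed out for resource '([^']+)'")
FAILURE_RE = re.compile(r"FAILURENOTIFICATION for '([^']+)'")
POISONED_RE = re.compile(r"TransitionToState\((.+?)\) Offline-->Poisoned")


def summarize(lines):
    """Count OPENRESOURCE timeouts and failure notifications per resource,
    and collect the resources that were moved to the Poisoned state."""
    timeouts = Counter()
    failures = Counter()
    poisoned = set()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
        match = FAILURE_RE.search(line)
        if match:
            failures[match.group(1)] += 1
        match = POISONED_RE.search(line)
        if match:
            poisoned.add(match.group(1))
    return timeouts, failures, poisoned
```

Passing the lines of a cluster.log file to summarize(...) should show which disks keep timing out and which ones ended up poisoned. The file may need an explicit encoding when opened, depending on how the log was generated.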
Meanwhile, in Event Viewer we see:
Log Name: Microsoft-Windows-FailoverClustering/Operational
Source: Microsoft-Windows-FailoverClustering
Date: DATE
Event ID: 1209
Task Category: Physical Disk Resource
Level: Information
Keywords:
User: SYSTEM
Computer: SERVERNAME
Description: Cluster service is requesting a bus reset for <resource>
and BFAD errors 7 (Logical unit reset was performed upon request. Dump data contains additional details.) and 118 (The driver for device <resource> performed a bus reset upon request.)
Any hint on this issue?