We have a 6-node production cluster running Windows Server 2008 R2 and SQL Server 2008 R2. At seemingly random times, a node loses communication with the rest of the cluster, causing every instance on that node to fail over to other nodes. The event logs are very generic: event IDs 1006 and 1335. We have disabled TCP offloading, updated the NIC drivers, and installed various patches (KB2524478, 2552040, 2685891, 2687741, 2754804), but it's still happening. If anyone has any information that could help, please let me know. Here is what the cluster log shows at the time of the disconnect:
00000950.00000b14::2013/02/20-12:37:09.511 WARN [CHANNEL ~] failure, status WSAETIMEDOUT(10060)
00000950.00000ae4::2013/02/20-12:37:09.511 WARN [CHANNEL ~] failure, status WSAECONNRESET(10054)
00000950.000009cc::2013/02/20-12:37:09.518 INFO [ACCEPT] :::~3343~: Accepted inbound connection from remote endpoint:~51451~.
00000950.0000133c::2013/02/20-12:37:09.518 INFO [SV] Route local (~) to remote (:~51451~) exists. Forwarding to alternate path.
00000950.0000133c::2013/02/20-12:37:09.518 INFO [SV] Securing route from (~) to remote (:~51451~).
00000950.0000133c::2013/02/20-12:37:09.518 INFO [SV] Got a new incoming stream from:~51451~
00000950.00000b14::2013/02/20-12:37:09.519 INFO [PULLER evproddb13] Parent stream has been closed.
00000950.00000b14::2013/02/20-12:37:09.519 ERR [NODE] Node 4: Connection to Node 7 is broken. Reason Closed(1236)' because of 'channel to remote endpoint 3343~ has failed with status WSAETIMEDOUT(10060)'
00000950.00000b14::2013/02/20-12:37:09.519 WARN [NODE] Node 4: Initiating reconnect with n7.
00000950.00000b14::2013/02/20-12:37:09.519 INFO [MQ-evproddb13] Pausing
00000950.00001988::2013/02/20-12:37:09.519 INFO [Reconnector-evproddb13] Reconnector from epoch 1 to epoch 2 waited 00.000 so far.
00000950.00001988::2013/02/20-12:37:09.519 INFO [CONNECT]:~3343~ from local ~: Established connection to remote endpoint:~3343~.
00000950.00001988::2013/02/20-12:37:09.519 INFO [Reconnector-evproddb13] Successfully established a new connection.
00000950.00001988::2013/02/20-12:37:09.520 INFO [SV] Route local (:~52834~) to remote evproddb13 (~) exists. Forwarding to alternate path.
00000950.00001988::2013/02/20-12:37:09.520 INFO [SV] Securing route from (:~52834~) to remote evproddb13 (3343~).
00000950.00001988::2013/02/20-12:37:09.520 INFO [SV] Got a new outgoing stream to evproddb13 at 3343~
00000950.00000ae4::2013/02/20-12:37:09.525 ERR [NODE] Node 4: channel (write) to node 7 is broken. Reason Closed(1236)' because of 'channel to remote endpoint:~3343~ has failed with status WSAECONNRESET(10054)'
00000950.00000ae4::2013/02/20-12:37:09.525 WARN [NODE] Node 4: Initiating reconnect with n7.
00000950.00000ae4::2013/02/20-12:37:09.525 INFO [MQ-evproddb13] Pausing
00000950.00000b14::2013/02/20-12:37:09.525 INFO [NODE] Node 4: Cancelling reconnector...
00000950.00002318::2013/02/20-12:37:09.525 INFO [Reconnector-evproddb13] Reconnector from epoch 1 to epoch 2 waited 00.000 so far.
00000950.00000b14::2013/02/20-12:37:09.525 INFO [CONNECT] 3343~ from local 14:~0~: Established connection to remote endpoint 3343~.
00000950.00000b14::2013/02/20-12:37:09.525 INFO [Reconnector-evproddb13] Successfully established a new connection.
00000950.00000b14::2013/02/20-12:37:09.525 INFO [SV] Route local (:~52836~) to remote evproddb13 (:~3343~) exists. Forwarding to alternate path.
00000950.00000b14::2013/02/20-12:37:09.526 INFO [SV] Securing route from (:~52836~) to remote evproddb13 (:~3343~).
00000950.00000b14::2013/02/20-12:37:09.526 INFO [SV] Got a new outgoing stream to evproddb13 at:~3343~
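
For anyone who wants to sift through a longer excerpt of a log like this, here is a minimal Python sketch that splits the text into individual entries (even when several are run together on one line) and counts the WARN/ERR entries by Winsock status. The function names and regular expressions are my own illustration, not part of any cluster tooling, and they assume entries follow the `process.thread::timestamp LEVEL` layout seen above.

```python
import re
from collections import Counter

# Header of one cluster log entry: <process id>.<thread id>::<timestamp> <LEVEL>
# (layout assumed from the excerpt above; real logs may pad the level with spaces).
ENTRY_RE = re.compile(
    r"(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+"
)

# Winsock status as it appears in the entries, e.g. "status WSAETIMEDOUT(10060)".
STATUS_RE = re.compile(r"status (WSA[A-Z]+)\((\d+)\)")


def split_entries(text):
    """Split raw log text into entries, one per header match."""
    starts = [m.start() for m in ENTRY_RE.finditer(text)]
    entries = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        entries.append(text[start:end].strip())
    return entries


def summarize_failures(text):
    """Count WARN/ERR entries keyed by (level, status name, status code)."""
    counts = Counter()
    for entry in split_entries(text):
        header = ENTRY_RE.match(entry)
        if header.group("level") == "INFO":
            continue
        status = STATUS_RE.search(entry)
        if status:
            key = (header.group("level"), status.group(1), int(status.group(2)))
            counts[key] += 1
    return counts
```

Calling `summarize_failures(open("cluster.log").read())` (the file name is just a placeholder) returns a `Counter` that makes it easier to see whether timeouts or connection resets dominate around each disconnect.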