An all-hardware Exchange 2010 SP3 UR4 DAG cluster is having an issue when the Microsoft Loopback adapter is installed (from Device Manager...Add Legacy Hardware) to support DSR (Direct Server Return) operations with a hardware load balancer (HLB).
- The HLB provides the HA endpoint for RPC Client Access, SMTP, etc. DSR is required to preserve the client source IP, which the Exchange receive connectors that filter on source IP for security depend on.
- It is a five-server DAG, with 3 x production servers at the datacenter and 2 x DR servers located in a DR site.
- Only the 3 x production servers at the main site have the loopback adapter installed.
- The loopback/DSR-specific settings (weakhostreceive, weakhostsend, etc.) are in effect; see the netsh lines just below for reference.
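For reference, this is one common pattern for those weak-host settings, set with netsh from an elevated prompt. The adapter names ("Loopback" for the Microsoft Loopback adapter, "NIC1" for the production NIC) are assumptions; substitute the names on your servers, and check your HLB vendor's DSR guide for which interfaces need which setting:

netsh interface ipv4 set interface "Loopback" weakhostreceive=enabled
netsh interface ipv4 set interface "Loopback" weakhostsend=enabled
netsh interface ipv4 set interface "NIC1" weakhostreceive=enabled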
The problem involves only the three servers in the DAG with loopback adapters.
The issue is that when a DAG member restarts, it sometimes causes the online production cluster node that is not the Cluster Host Server to fail. Consider:
- DAGNode1, Loopback enabled, Healthy, Is Cluster Host Server
- DAGNode2, Loopback enabled, Healthy
- DAGNode3, Loopback enabled, is Restarted
In this scenario, the Cluster Service on DAGNode2 will experience a loss of network connectivity when DAGNode3 rejoins the cluster (DAGNode2 reports cluster failure on all other nodes), and shortly afterwards the Cluster Service on DAGNode2 will terminate. FailoverClustering event 1572 is logged on DAGNode2:
Node 'DAGNode2' failed to join the cluster because it could not send and receive failure detection network messages with other cluster nodes. Please run the Validate a Configuration wizard to ensure network settings. Also verify the Windows Firewall 'Failover Clusters' rules.
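To confirm you are seeing the same event, you can pull it from the System log with PowerShell (nothing in this command is specific to my environment):

Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Microsoft-Windows-FailoverClustering'; Id=1572} | Format-List TimeCreated, Message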
Interestingly, if you disable the Loopback on DAGNode3, DAGNode2 will immediately rejoin the cluster! Re-enable the Loopback on DAGNode3 and DAGNode2 immediately fails again! After possibly a few more server restarts, you get a stable cluster again with Loopback enabled on all production nodes. The status of the Loopback (enabled or not) on the Cluster Host Server does not impact this issue. The enable/disable toggle can be done from the command line; see below.
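For anyone trying to reproduce or clear this, the toggle from an elevated prompt ("Loopback" being whatever you named the legacy adapter):

netsh interface set interface name="Loopback" admin=disabled
netsh interface set interface name="Loopback" admin=enabled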
As I mentioned, this occurs on only some restarts; usually there is no problem. Also note the Loopback network/adapters do not appear in Cluster Manager and are not listed as cluster networks by cluster.exe (commands to verify are below). The Cluster Validation Wizard passes everything except noting that every node has a duplicate IP on an installed adapter.
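To check what the cluster itself can see, either of these works (the second assumes the FailoverClusters PowerShell module on the node):

cluster <clustername> network
Import-Module FailoverClusters
Get-ClusterNetwork | Format-Table Name, Role, Address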
I'm looking for others who have combined a DSR-based HLB with a CAS/Hub/MBX DAG cluster on the same Exchange computers and were able to run it reliably.
There is an unanswered thread from 2010 on this topic:
Some questions / any answers are very welcome!
- Can I add the Loopback adapter to the cluster configuration so that I can use Cluster.exe to ignore the loopback adapter?
- Can I prevent other cluster nodes from seeing the loopback adapters in the other nodes? Is there an ‘ignore partner adapter’ setting?
Thank you!
John Joyner MVP-SC-CDM
P.S. Adding this information 3/1/2014:
This link suggests that if you allow the cluster network to partition, it will discover the loopback adapters and they will appear in Cluster Manager. (I did this by enabling IPv6 on the Loopback; once done, the Loopback network appeared in Cluster Manager. Then I used Cluster.exe to set IgnoreNetwork=$true on the Loopback network.) Result: no change; it still caused a cluster communication outage when the Loopback was enabled on a third production node that is not the Cluster Group host.
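For anyone repeating this experiment, a rough sketch of the equivalent using the FailoverClusters PowerShell module instead of Cluster.exe: the network name "Loopback Network" is an assumption (use whatever name the partitioned network shows), and setting Role to 0 tells the cluster not to use that network for cluster communication.

Import-Module FailoverClusters
# Assumed name; check Get-ClusterNetwork output first
(Get-ClusterNetwork "Loopback Network").Role = 0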
Developed: A workaround! (A consolidated command sketch follows the steps below.)
1. Just before restarting a node, after the drain stop in the NS, and after running StartDAGServerMaintenance.PS1 (which pauses the node in Cluster Manager), disable the Loopback adapter so that when the computer restarts, Loopback is disabled.
2. After the node restarts and rejoins the cluster in Paused status, and after running StopDAGServerMaintenance.PS1, issue this command to move the Cluster Group to the computer that was restarted and has the Loopback disabled:
cluster <clustername> group "Cluster Group" /moveto:<nodename>
3. Then safely enable the Loopback on the computer that was restarted and is now the Cluster Group host.
4. Then take the computer out of drain stop in the NS.
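Here is the consolidated sketch promised above, as run from the Exchange Management Shell on the node being restarted. DAGNode3 as the restarting node and DAG1 as the cluster name are assumptions; the NS drain stop itself happens on the load balancer and is not shown.

# 1. Drain stop in the NS first, then pause the node and disable Loopback:
cd $exscripts
.\StartDAGServerMaintenance.PS1 -serverName DAGNode3
netsh interface set interface name="Loopback" admin=disabled
shutdown /r /t 0

# 2. After the restart, once the node has rejoined the cluster in Paused status:
cd $exscripts
.\StopDAGServerMaintenance.PS1 -serverName DAGNode3
cluster DAG1 group "Cluster Group" /moveto:DAGNode3

# 3. Now it is safe to re-enable the Loopback on the restarted node:
netsh interface set interface name="Loopback" admin=enabled

# 4. Finally, take the node out of drain stop in the NS.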
This of course only applies to controlled restarts.
In the event of unexpected server crashes and recoveries, there is nothing stopping this from happening when the crashed server restarts. Still need a real fix! With knowledge of how to defuse the situation when it happens (disable the Loopback on the production node that is not the Cluster Group host), the condition clears immediately. You can then fix it by steps 2 and 3 in the workaround.