I am testing the ability to create a 3-node Hyper-V cluster using Windows Server 2008 R2.
I am using 3 PowerEdge 840 servers connected to a FreeNAS 8.3 box.
I have created 2 networks: one for iSCSI connectivity, and the other for shared host/guest VM traffic and heartbeat (this is only a test scenario, but I can create a dedicated heartbeat network if needed).
I can successfully create a 2-node cluster with CSV and manually fail over disks / VMs between the 2 nodes. However, the moment I add a third node everything starts falling apart: Cluster Shared Volumes take a long time to fail over and most of the time go offline. The 3-node config does pass cluster validation, including the test for persistent reservations. I have researched this at length, and for most people persistent reservation support seems to have been the issue, but in my case I am uncertain whether that is the culprit.
Part of the cluster log is pasted below:
00000890.0000047c::2012/12/17-20:22:06.439 WARN [RES] Physical Disk <Cluster Disk 2>: SetSharedPRKey: Setting shared PR key on disk that is not Offline on this node.
00000890.0000047c::2012/12/17-20:22:06.439 INFO [RES] Physical Disk: Enter EnumerateDevices: EnumDevice 0
00000890.0000047c::2012/12/17-20:22:06.486 INFO [RES] Physical Disk: Exit EnumerateDevices: status 0
00000890.0000047c::2012/12/17-20:22:06.486 INFO [RES] Physical Disk <Cluster Disk 2>: SetSharedPRKey: registered shared PR key high 00004070 low 0000734D for device 2
00000598.00000958::2012/12/17-20:29:00.641 WARN [RCM] ResourceControl(STORAGE_GET_SHARED_VOLUME_INFO) to Cluster Disk 2 returned 5004.
00000598.00000ad8::2012/12/17-20:29:05.041 INFO [RCM] rcm::RcmApi::OnlineResource: (Cluster Disk 2)
00000598.000006c8::2012/12/17-20:29:11.484 INFO [DCM] Processing message dcm/unmap
00000598.000006c8::2012/12/17-20:29:11.484 INFO [DCM] Push.AsyncUnmapDisk for 8d6c19df-fa55-488d-b0bb-e8db0686a3c0
00000598.00000b3c::2012/12/17-20:29:11.484 INFO [DCM] SyncHandler for 8d6c19df-fa55-488d-b0bb-e8db0686a3c0
00000598.00000b3c::2012/12/17-20:29:11.484 INFO [DCM] Reservation.SetPrKey(Cluster Disk 2,0)
00000890.0000047c::2012/12/17-20:29:11.484 WARN [RES] Physical Disk <Cluster Disk 2>: SetSharedPRKey: Setting shared PR key on disk that is not Offline on this node.
00000890.0000047c::2012/12/17-20:29:11.484 INFO [RES] Physical Disk: Enter EnumerateDevices: EnumDevice 0
00000890.0000047c::2012/12/17-20:29:11.499 INFO [RES] Physical Disk: Exit EnumerateDevices: status 0
00000598.00000244::2012/12/17-20:29:11.499 WARN [RCM] ResourceControl(STORAGE_GET_SHARED_VOLUME_INFO) to Cluster Disk 2 returned 5004.
00000890.0000047c::2012/12/17-20:29:11.499 INFO [RES] Physical Disk <Cluster Disk 2>: SetSharedPRKey: registered shared PR key high 00006DED low 0000734D for device 2
00000598.00000ad8::2012/12/17-20:29:11.593 WARN [RCM] ResourceControl(STORAGE_GET_SHARED_VOLUME_INFO) to Cluster Disk 2 returned 5004.
00000598.00000ad8::2012/12/17-20:29:11.608 WARN [RCM] ResourceControl(STORAGE_GET_SHARED_VOLUME_INFO) to Cluster Disk 2 returned 5004.
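The repeated "returned 5004" entries are the suspicious part (5004 is, if I'm reading winerror.h right, ERROR_RESOURCE_NOT_ONLINE). To see how often each resource control fails across a full cluster.log rather than eyeballing it, a rough parse like the following works; the regex assumes the line format shown above, and the sample lines here are just two of the entries from my excerpt:

```python
import re
from collections import Counter

# Sample lines in the cluster.log format shown above (normally you would
# read the whole file generated by "cluster log /g" instead).
LOG_LINES = [
    '00000598.0000047c::2012/12/17-20:29:00.641 WARN  [RCM] ResourceControl(STORAGE_GET_SHARED_VOLUME_INFO) to Cluster Disk 2 returned 5004.',
    '00000598.00000244::2012/12/17-20:29:11.499 WARN  [RCM] ResourceControl(STORAGE_GET_SHARED_VOLUME_INFO) to Cluster Disk 2 returned 5004.',
    '00000598.00000ad8::2012/12/17-20:29:11.593 INFO  [RCM] rcm::RcmApi::OnlineResource: (Cluster Disk 2)',
]

# Match RCM ResourceControl lines: capture the control code, the resource
# name, and the returned status.
PATTERN = re.compile(r'\[RCM\] ResourceControl\((\w+)\) to (.+?) returned (\d+)\.')

def failing_controls(lines):
    """Count (control, resource, status) triples with a non-zero status."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m and m.group(3) != '0':
            counts[(m.group(1), m.group(2), int(m.group(3)))] += 1
    return counts

print(failing_controls(LOG_LINES))
# e.g. Counter({('STORAGE_GET_SHARED_VOLUME_INFO', 'Cluster Disk 2', 5004): 2})
```

Grouping by resource and status this way quickly shows whether the failures are confined to one disk or hitting every CSV.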
I know the individual nodes themselves are not a problem, as I tried all 3 nodes in various 2-node configs and it worked every time. It is something about the 3-node config that seems to go wrong. In the 3-node config the quorum is set to Node Majority, and in the 2-node config it is Node and Disk Majority.
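On the quorum point, the vote arithmetic is worth spelling out: quorum needs a strict majority of the configured votes, so the 3-node Node Majority cluster needs 2 of 3 node votes, while the 2-node Node and Disk Majority cluster has 2 node votes plus the witness disk and also needs 2 of 3. A tiny sketch of that math (my own illustration, not cluster code):

```python
def votes_needed(total_votes):
    """Quorum requires a strict majority of the configured votes."""
    return total_votes // 2 + 1

# 3-node cluster, Node Majority: 3 node votes, no witness.
assert votes_needed(3) == 2          # survives one node failure

# 2-node cluster, Node and Disk Majority: 2 node votes + 1 disk witness vote.
assert votes_needed(2 + 1) == 2      # survives one node OR the witness failing

print("quorum math checks out")
```

So both configurations tolerate a single failure; the quorum mode difference alone shouldn't explain the CSV failover problems, which points back at storage.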
Other things I tried unsuccessfully:
Removed the old LUNs and provisioned brand-new LUNs.
Reduced the iSCSI initiator name to simply the FQDN of the server.
Reduced the number of LUNs presented to just 1.
Any help appreciated. Eventually, if this works, the aim is to replicate the success in a production environment using a NetApp SAN.