Okay, this has been going on for months, ever since we upgraded our 2008 R2 clusters. We first upgraded our development cluster from 2008 R2 to 2012 (SP1), had no issues, saw great performance increases, and decided to do our production clusters as well. By then 2012 R2 was becoming prominent, so we decided to skip 2012 and go straight to it, thinking the changes in this version weren't that drastic. We were wrong.
The cluster works perfectly as long as all nodes stay up and online. Live migration works great, roles (including disks) flip between machines based on load just fine, etc. The problem appears when a node reboots or the cluster service restarts: as the node goes from "Down" to "Joining" and then "Online", the CSV(s) switch from Online to Online (No Access) and the mount point disappears. If you move the CSV(s) to the node that just rejoined the cluster, the mount point returns and the volume goes back to Online.
Cluster validation passes with flying colors, and Microsoft has not been able to provide any help whatsoever. We have two types of FC storage, one SUN unit that is being retired and one Hitachi unit that we are moving all production machines to, and the problem occurs with both. Since we are moving to Hitachi, we verified that its firmware is up to date (it is), that our drivers are current (they are), and that the unit is fully functional (everything checks out). This did not happen before 2012 R2, and we have proven it by reverting our development cluster to 2012. We have started using features that come with 2012 R2 on our other clusters, so we would like to figure this problem out and keep using this platform.
The cluster logs show no diagnostic information that is of any help. The usual error message is:
Cluster Shared Volume 'Volume3' ('VM Data') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
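For anyone wanting to scan exported event text for this same message, here is a minimal Python sketch that pulls the volume, its friendly name, and the error code out of a line. It assumes the event wording matches the message quoted above exactly; `parse_csv_error` and `WIN32_ERRORS` are hypothetical names made up for this example, not part of any cluster tooling. The only code in the lookup table is 1460, which the Win32 system error code list defines as ERROR_TIMEOUT.

```python
import re

# Pattern built from the single event message quoted in this post.
# Assumption: other occurrences of the event use the same wording.
CSV_ERROR = re.compile(
    r"Cluster Shared Volume '(?P<volume>[^']+)' \('(?P<name>[^']+)'\) "
    r"is no longer accessible from this cluster node "
    r"because of error '\((?P<code>\d+)\)'"
)

# Deliberately tiny lookup table; extend it with other Win32 codes as needed.
WIN32_ERRORS = {
    1460: "ERROR_TIMEOUT",
}


def parse_csv_error(line):
    """Return a dict describing the CSV error in `line`, or None if no match."""
    match = CSV_ERROR.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "volume": match.group("volume"),
        "name": match.group("name"),
        "code": code,
        "meaning": WIN32_ERRORS.get(code, "unknown"),
    }


if __name__ == "__main__":
    sample = (
        "Cluster Shared Volume 'Volume3' ('VM Data') is no longer accessible "
        "from this cluster node because of error '(1460)'. Please troubleshoot "
        "this node's connectivity to the storage device and network connectivity."
    )
    print(parse_csv_error(sample))
```

Running this over the event text from each node makes it easy to see whether every occurrence carries the same code or whether different volumes fail with different errors.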
Per Microsoft, our Hitachi system is certified for use with 2012 R2 and MPIO (we have two paths). This is happening on all three of our clusters (two production and one development). They mostly share the same setup, but at this point we are not sure what could be causing this.