For years now, we have had event ID 1146 crash nodes in the cluster (RHS process crashes). We have had several paid Microsoft cases open, even one with Premier. In fact, we have one open currently with zero progress in 72 hours (115012612321318).
Is anyone out there really running 200+ machines on Hyper-V in production with any level of stability, or do you have a complete host outage (event ID 1146) or volume outage (event ID 5120) every month or so?
We have applied the recommended hotfixes and gone through the configuration many, many times.
My only conclusion is that Hyper-V does not scale. Once we started adding a lot of machines and hosts, we started getting event 5120 (with STATUS_IO_TIMEOUT), which is unacceptable. It causes a huge slowdown or makes an entire volume inaccessible, and it impacts EVERY machine on that volume. The other volumes keep working when this happens. In fact, we have a VMware cluster attached to the same SAN with the same host hardware, and it works flawlessly. Both use MPIO, so the timeout is caused by Hyper-V. The load was nearly identical on VMware and Hyper-V at one time: we had 100 machines on each and the same number of hosts. CPU load is tiny, memory usage is less than 50%, and I/O is spread across 55 disk spindles for normal storage and another 55 for fast storage.
I'm more or less asking the community how to fix this, since support is not working, but I'm guessing there is no fix and this is really not production ready. I would really like to hear from ANYONE (non-sales) who is running 200+ machines without big outages.