Does thinking about HA Host Isolation Response give you heartburn?  Ever had a network or storage problem take down an entire ESXi cluster that was otherwise operating correctly?  To the vSAN and iSCSI users: ever had a network problem trigger Host Isolation Response across the entire cluster and shut down every VM guest?  To the vSAN users running erasure coding (RAID 5/6): ever had enough hosts lose network connectivity at the same time (whether from a network driver issue or problems in the network switching) that all disk IO to the entire vSAN datastore paused because not enough hosts were reachable, even though the hosts and disks were still fully functional and had merely lost the ability to communicate with each other?  If you run high IO loads on a vSAN datastore, do you find that a single pair of 10Gb links is not enough to handle the vSAN replication traffic, leaving you running multiple 10Gb links per host or looking at moving to 40Gb connected clusters to handle the load?  If you have large SQL VM guests with 1TB or more of RAM, are you being forced into 40Gb connectivity to the hosts, or a dedicated 25-40Gb vMotion network, just to migrate one of those very large guests from one host to another in a reasonable amount of time?

Let's look at these problems in more detail.

  1. HA host isolation response
    • Though there is a great amount of customization available in the most recent 6.5 releases, ESXi hosts still require an external network to communicate over in order to determine whether or not they are isolated.
    • There are competing requirements for HA host isolation response.  On one hand, if a single host loses connectivity you want it to shut down its VM guests so they can be restarted on the other hosts in the cluster.  On the other hand, if connectivity for the whole cluster is affected, the settings that previously were a good thing will now cause a very bad day when *every* ESXi host finds itself isolated and every VM guest in the entire cluster is shut down.
    • A full cluster shutdown due to HA host isolation response is a serious enough and common enough problem that VMware introduced datastore heartbeating in ESXi 5.0 so that hosts could use the independent Fibre Channel network to heartbeat each other and determine whether they are isolated.  Alternatively, a lot of VMware administrators purposefully leave HA host isolation response disabled entirely (usually after being burned badly by it once).
  2. Datastore heartbeating
    • This was an amazing feature when it was introduced and it greatly increased the viability of using the HA host isolation response feature.
    • The problem is that it assumes you are using a dedicated storage network, most likely Fibre Channel, to reach that storage.  That is becoming less and less common today, with converged networks running iSCSI alongside everything else, or vSAN using the network to replicate between hosts.
    • It's not uncommon for the case to be made to run iSCSI on dedicated network equipment partly because of how HA host isolation response behaves in VMware.
    • It's also not uncommon to see vSAN deployments running a cheap Fibre Channel network to a low-end SAN, hosting a LUN of minor importance, just to gain the ability to heartbeat against a datastore as well as the network.
  3. Bandwidth requirements increasing
    • Most VMware deployments, and most enterprise environments in general, can comfortably run on 10Gb networking.  Network utilization is often well below the 10Gb link speed of the connected hosts.
    • The first problem is that the amount of RAM that fits in a server keeps increasing, and VM guests are growing along with it.  Take a 1TB SQL VM guest as an example: vMotioning it from one host to another takes over 20 minutes purely because of the limits of 10Gb networking, even assuming you can fully dedicate one of the server's two 10Gb connections to that vMotion.  Putting a host with 3TB of RAM into maintenance mode could take almost an hour over 10Gb (the rough math is sketched after this list).
    • The second problem shows up in vSAN deployments, especially ones that use erasure coding (RAID 5/6).  The distributed nature of vSAN data storage generates a lot of inter-host network traffic; with a RAID 6 erasure coding storage policy (a 4+2 stripe), any and every disk IO touches 6 different hosts.  On high disk IO vSAN clusters this can easily generate a baseline network load upwards of 2-4Gbps just for normal vSAN usage, and during vSAN data rebuilds that number skyrockets.
    • How big of a problem are these increased throughput requirements?
      • There are QoS solutions that both protect production network traffic from vSAN and vMotion flows and, in turn, guarantee a minimum baseline of throughput for vSAN or vMotion.
      • Feel free to disagree, but I am a firm believer that QoS masks the problem rather than solving it, and is more likely to cause additional problems in a trickle-down effect.  QoS is only useful in the single situation where you do not have enough bandwidth to meet your requirements.  It covers up the true problem (lack of sufficient bandwidth) with technical trickery that makes it safer to jam a square peg into a round hole.
      • The alternative is that you are stuck upgrading host connectivity to 40Gb links or building out dedicated vMotion and vSAN replication networks so that you don't congest production workloads on the network.
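To make the timing above concrete, here is a rough back-of-the-envelope sketch in Python.  The ~70% effective throughput figure is my own assumption to account for protocol and vMotion overhead; real times also depend on memory dirty rates and host load.

```python
# Rough vMotion copy-time math behind the figures above.
# Assumes one 10Gb uplink dedicated to vMotion and ~70% effective
# throughput (an assumption to cover protocol/vMotion overhead).

def vmotion_minutes(ram_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Approximate minutes to copy ram_tb of guest memory over one link."""
    gigabits = ram_tb * 8_000          # 1 TB of RAM ~= 8,000 gigabits
    return gigabits / (link_gbps * efficiency) / 60

print(f"1 TB guest over 10Gb: {vmotion_minutes(1, 10):.0f} min")  # ~19 min
print(f"3 TB host over 10Gb : {vmotion_minutes(3, 10):.0f} min")  # ~57 min
```

Which lands roughly in line with the "over 20 minutes" and "almost an hour" figures above.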

So what are the problems?  The real problems?

  1. vMotion and vSAN have massive (relatively speaking) bandwidth requirements between hosts in the same cluster.  The VM guests running on those clusters have comparatively small bandwidth requirements.  There are also next to zero bandwidth requirements between different clusters unless you are doing a Storage vMotion between them.
  2. When networks fail, they run the risk of failing in such a spectacular manner that your entire ESXi cluster goes into HA host isolation mode.  How likely a network is to fail spectacularly is usually directly related to how complex it is: the more complexity, the higher the risk that a failure will be a spectacularly horrible one.  Quite ironically, using features such as QoS to stretch the existing network to do more increases the complexity and therefore increases the risk of a spectacular failure.
  3. The previous solution of datastore heartbeating over an (assumed) dedicated Fibre Channel network is disappearing as an option.  iSCSI is a very viable alternative now (and good golly is Fibre Channel expensive in comparison), and software defined storage options such as vSAN are rapidly increasing in popularity.  The health of the ESXi cluster and its ability to respond properly to host isolation failures are back in the same metaphorical basket of relying on the network to always deliver connectivity.
  4. HA host isolation response in VMware ESXi clusters has always depended on some form of external connectivity (Ethernet or Fibre Channel).  It is difficult to cover every possible failure scenario of that external connectivity, which usually means VMware clusters do not behave exactly the way you want them to during those failures.

Look at a typical vSAN deployment of an ESXi cluster, or any converged iSCSI deployment: the switches are a single point of failure.  Most spectacular network failures are fixed in a fairly short amount of time; the big problem comes when the ESXi cluster reacts to the outage.  A 10 minute network outage can cause the complete shutdown of 1000 VM guests, and booting all of those back up can take hours.  If you are running iSCSI or vSAN, those VM guests were shut down uncleanly, which means data integrity checks once they finish booting (especially if you are running any transactional data such as SQL databases).

NOTE: This is a feature request.  This is not currently possible.

VMware, how can you fix this easily?  Create a full mesh direct connect network for clusters of up to 9 hosts (which covers the majority of deployments) to act as a dedicated back end network.  The front end network is still there (not pictured) for VM guest network connectivity, but back end VMware processes such as cluster state, vSAN, and vMotion use this back end network as their primary connectivity.  Why 9 hosts?  Using dual port QSFP28 network cards with a 4x fanout on each port gives 8x 25Gb connections per host, which covers the current host plus 8 additional hosts (the port and link count is sketched after the steps below).  VMware…it would be nice if you could support this.

  1. Install a dual port 100Gb network card into each host and fit it with 100G-SR4 transceivers.
  2. Install two 4x25Gb fanout fiber cables onto each NIC, one per port.  Each card will then have 8x 25Gb connections coming off of it.
  3. Use inline fiber couplers to connect the hosts to each other.
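As a quick sanity check of the topology, here is the port and link arithmetic for the proposed 9-host mesh (a sketch based only on the parts listed above):

```python
# Port/link arithmetic for the proposed 9-host full mesh.
from math import comb

hosts = 9
ports_per_host = 2 * 4           # dual-port 100Gb NIC, 4x25Gb fanout per port
peers_per_host = hosts - 1       # every host needs a direct link to every other
mesh_links = comb(hosts, 2)      # one point-to-point 25Gb link per host pair

print(f"25Gb connections per host: {ports_per_host}")  # 8
print(f"peers each host reaches  : {peers_per_host}")  # 8 -> every lane is used
print(f"links in the full mesh   : {mesh_links}")      # 36 host-to-host links
```

Every one of the eight fanout lanes on each card is consumed by the mesh, which is why the design tops out at 9 hosts.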

At this point all 9 hosts in the cluster are in a full mesh with each other over 25Gb links; once 200Gb becomes mainstream these become 50Gb links.  VMware could program ESXi to make this full mesh network fully auto configuring, with the ESXi hosts auto discovering their peers (a rough sketch of what that could look like follows).  ESXi would then use this full mesh network as a heartbeat network to keep and share cluster state, and it could also carry vSAN data replication between hosts and serve as the vMotion network.
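To be clear, none of this exists in ESXi today.  Purely as a hypothetical illustration of what "auto discovering their peers" could look like (every class, field, and port name below is invented for the sketch), each host could beacon its UUID out every mesh-facing port and build a peer table from the hellos it receives back:

```python
# Hypothetical illustration only -- not an existing ESXi or vSphere API.
# Each host sends a hello beacon (its UUID) out every mesh port and builds
# a peer table from the hellos it receives back on each port.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class MeshPort:
    name: str                        # invented name, e.g. "vmnic4-lane1"
    peer_uuid: Optional[str] = None  # filled in when a hello arrives on this port

@dataclass
class MeshHost:
    uuid: str
    ports: List[MeshPort] = field(default_factory=list)

    def receive_hello(self, port_name: str, peer_uuid: str) -> None:
        """Record that host `peer_uuid` is directly attached via `port_name`."""
        for port in self.ports:
            if port.name == port_name:
                port.peer_uuid = peer_uuid

    def peer_table(self) -> Dict[str, str]:
        """Peer UUID -> local port it is reachable on (the discovered mesh map)."""
        return {p.peer_uuid: p.name for p in self.ports if p.peer_uuid}

# Example: a host hears two peers on two of its eight mesh ports.
host = MeshHost("esx01", [MeshPort(f"vmnic4-lane{i}") for i in range(1, 9)])
host.receive_hello("vmnic4-lane1", "esx02")
host.receive_hello("vmnic4-lane2", "esx03")
print(host.peer_table())  # {'esx02': 'vmnic4-lane1', 'esx03': 'vmnic4-lane2'}
```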

The entire setup is just network cards and fiber cables, so the only things available to fail are the cards and cables themselves.  Failures are predictable: if a network card fails, a single host goes offline; if a fiber cable breaks, only the connectivity between two hosts is affected.  HA host isolation response can respond accurately.  vSAN and vMotion both support active/passive interface assignments, so should there be a failure on the 25Gb cluster mesh, those vmkernel interfaces can fall back to the 10Gb switched production network.  Most importantly, all of this is available using existing commodity hardware.  The total cost for cards and cables across a cluster of 9 servers would be roughly $7,000.  On a 9 node vSAN cluster that easily costs upwards of $200,000 in total, an additional $7,000 is a bargain for the gains.

Cluster state improvements:

  1. There is a dedicated network for heartbeats with no other active or managed devices in it.  No switches to fail or to have problems.
  2. 100% of the configuration is software settings inside of ESXi.  The cards and cabling are all physical layer 1; there is nothing to configure (or misconfigure).
  3. The link state of the network interfaces themselves can feed into the logic for working out partial or odd failure scenarios, since the ESXi hosts are in effect directly connected to each other with fiber crossover cables (see the sketch below).
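As a hedged illustration of point 3, here is how the pattern of link states alone could narrow down the failure domain.  The thresholds and responses are invented for the sketch; this is not something ESXi does today.

```python
# Hypothetical decision helper -- illustrates how point-to-point link state
# narrows down the failure domain.  Not an existing ESXi feature.

def classify_mesh_failure(link_up: dict) -> str:
    """link_up maps peer hostname -> whether our direct link to it is up."""
    down = [peer for peer, up in link_up.items() if not up]
    if not down:
        return "healthy: all mesh peers reachable"
    if len(down) == len(link_up):
        return "all direct links down: likely our own NIC -> treat as isolation"
    if len(down) == 1:
        return f"only the link to {down[0]} is down: cable or peer issue, do not isolate"
    return "partial failure: cross-check over the front-end network before reacting"

print(classify_mesh_failure({"esx02": True, "esx03": False, "esx04": True,
                             "esx05": True, "esx06": True, "esx07": True,
                             "esx08": True, "esx09": True}))
```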

vSAN improvements:

  1. When using erasure coding (RAID 5/6) there is a large amount of inter-ESXi-host traffic: many-to-many.
  2. For each individual ESXi host, this provides a dedicated 25Gb link to each and every other ESXi host.  For example (worked out after this list):
    • In a cluster of 9, let's assume each host carries roughly 3-4Gbps of baseline vSAN traffic
    • Spread across the 8 other hosts, that works out to roughly 400-500Mbps per peer
    • On the mesh, that is ~500Mbps on each 25Gb link for baseline vSAN traffic, a small fraction of its capacity
  3. The resiliency improvements here are astounding.  If a vSAN cluster loses network connectivity between nodes, good things do not happen.  At best, VM guests go into a suspended state waiting for disk IO.  At worst, there is the potential for data corruption in applications that do not like the storage being yanked out from under them (SQL, for example).
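Here is the per-link math from the example above, shown for both ends of the assumed 3-4Gbps baseline and assuming the traffic spreads roughly evenly across peers:

```python
# Per-link load for the vSAN baseline example above (assumed figures).
peers = 8  # the other hosts in a 9-node cluster

for baseline_gbps in (3.0, 4.0):              # assumed per-host vSAN baseline
    per_link_gbps = baseline_gbps / peers     # traffic spread evenly across peers
    print(f"{baseline_gbps:.0f} Gbps baseline -> "
          f"{per_link_gbps * 1000:.0f} Mbps per mesh link "
          f"({per_link_gbps / 25:.1%} of a 25Gb link)")
# 3 Gbps -> 375 Mbps per link (1.5%); 4 Gbps -> 500 Mbps per link (2.0%)
```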

vMotion improvements:

  1. Moving a single VM guest with 1TB of RAM from one host to another would take around 8 minutes instead of 20.  This is a corner case, but it does happen.
  2. A more normal situation is needing to put a host with 3TB of RAM into maintenance mode and vacate all of its VM guests.  A dual port 100Gb NIC in a x16 PCIe 3.0 slot is limited to a theoretical ~126Gbps by the slot itself.  A full vacate of a host with 3TB of RAM to the 8 other hosts would take about 4 minutes, compared to roughly 60 minutes on traditionally connected 10Gb hosts and 15 minutes on 40Gb hosts (with one of the two uplinks to the switches active for vMotion); the estimate is sketched after this list.
  3. With vMotion on the mesh, that traffic no longer shares links with VM guest traffic, so there is no possibility of congestion.
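The estimate behind that comparison, as a rough sketch.  The ~70% efficiency on switched uplinks and ~90% efficiency on the PCIe-limited mesh path are my own assumptions:

```python
# Rough host-evacuation times behind the comparison above.
def vacate_minutes(ram_tb: float, effective_gbps: float) -> float:
    return ram_tb * 8_000 / effective_gbps / 60   # 1 TB ~= 8,000 gigabits

print(f"3 TB over one 10Gb uplink: {vacate_minutes(3, 10 * 0.7):.0f} min")   # ~57 min
print(f"3 TB over one 40Gb uplink: {vacate_minutes(3, 40 * 0.7):.0f} min")   # ~14 min
print(f"3 TB over the mesh (PCIe-limited at ~126Gbps): "
      f"{vacate_minutes(3, 126 * 0.9):.1f} min")                             # ~3.5 min
```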

NOTE: This is a feature request.  This is not currently possible.
