After doing some research on the PSC architecture and how it relates to vCenter William Lam covers it perfectly, I wanted to experiment further with a dual PSC setup with automatic failover, keeping a manual re-point as a backup method. Manually re-pointing vCenter to an alternate PSC is pretty straight forward:
cmsso-util repoint --repoint-psc systemname_of_second_PSC
C:\Program Files\VMware\vCenter Server\bin\cmsso-util repoint --repoint-psc systemname_of_second_PSC
William Lam has also developed a script that can automate the task as well. For an automated HA type solution, VMWare “officially” supports using either a NSX-v, F5 BIG-IP or Citrix Netscaler as a load balancer. (Note: The CPX is something interesting I want to later explore…) But as far as load balancing – load between multiple PSCs do not get balanced. Configuration documentation show these load balancing appliances actually put multiple PSC deployments in more of an active/passive state. Meaning that at any one time, vCenter will only exchange data with the PSC it is currently pointed to. If that PSC should ever fail, then you can point vCenter to the other existing PSC.
As stated above, to get vCenter to point to the alternate PSC, a manual re-point must be performed on vCenter to reference the new PSC target, PSC2. As also stated above, the process is straight forward, log into vCenter and issue the re-point command. So can we save ourselves some time and get this done automatically for us? Depending on when a failure occurs, it could take 5 – 10 minutes once reported if someone is available. But if a failure happens, during the middle of the night for example, it could take longer. And while vCenter is unable to access the PSC it cannot be used for anything. This potentially means any automated processes will get disrupted as well, like backups. I’ve seen instances where snapshots were left on VMs because the backup process lost connection to vCenter during the a VM backup task. It took a snapshot of the VM while it had access to vCenter, but lost connection before it could consolidate that snapshot – no good. So the faster we can get vCenter pointed to the alternate PSC, the shorter the disruption time. So let’s look into automating the task if possible and know we can re-point manually if needed.
This where the proxy server comes in. By using a proxy server we can point vCenter to it instead of one of actual PSCs. The proxy, in our example, will take care of forwarding all requests between vCenter and PSC1. In the event of a PSC1 failure it will automatically shift requests to PSC2.
For this exercise, I chose to use HAProxy for this experiment. HAProxy is a high performance load balancer and has been around a very long time. However, it is not on VMWare’s supported list of load balancers, so keep this in mind. They will not help, or offer any type of support due to any issues with HAProxy. Thus, you have your backup, manual re-point option if needed. Disclaimer: Use at your own risk.
Andreas Peetz already has a fantastic write-up of getting HAProxy working with vSphere 6.0 PSCs which I’ve confirmed works with both vSphere 6 update 3 and 6.5. However, there is only one downfall here using any proxy to achieve HA between PSCs as we’re trying to do. The proxy itself is now a single point of failure.
So to keep everything HA, which is our goal, we need to make sure the proxy itself is also … highly available. So how do we do this? Well, there are many possible solutions we could explore, but what I decided here was to leverage the power of Docker Swarm. Using Docker Swarm, we can turn our proxy server into a highly available service and take advantage of the built in functionality of the Swarm cluster itself – Desired State, HA, Proxy requests, etc.
Stay tuned for my next post where we’ll go further into this.
Unfortunately, I was unable to finish writing up the details / results of this experiment. However, I can report that using this setup, HAProxy was able to failover succesfully while the HAProxy service itself remained highly available.