Sunday, August 23, 2009

ESX and virtual machines losing time and hanging

I ran into a nasty bug involving VMware ESX and VMware Fusion. The
VMware support guys had no idea what caused it, and after 3 hours on
the phone with them with little progress I started experimenting on my
own and found the solution.

Remember that I migrated my VMware Fusion-hosted VMs to ESX using the
VMware Converter tool. Prior to conversion I removed all snapshots
created by VMware Fusion.

The symptoms were that the machines under ESX were losing time and
network connectivity intermittently with no log entries on ESX and no
errors on the machines other than the massive time jumps. You could
see the machines "disappear" for 2+ minutes at a shot.
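A quick way to catch these hangs from inside a guest is to sleep in a
loop and compare the wall clock before and after. This is only a rough
sketch: the function names and the 30 second threshold are made up for
illustration, and it assumes the guest clock jumps forward once the VM
starts running again, which is what I saw.

```python
import time


def find_stalls(timestamps, expected_interval=1.0, threshold=30.0):
    """Return (index, gap_seconds) for every gap between consecutive
    timestamps that exceeds expected_interval by more than threshold."""
    stalls = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if gap - expected_interval > threshold:
            stalls.append((i, gap))
    return stalls


def watch(interval=1.0, threshold=30.0):
    """Run forever inside the guest; print a line whenever the wall
    clock advances much further than the sleep should have taken."""
    last = time.time()
    while True:
        time.sleep(interval)
        now = time.time()
        if now - last - interval > threshold:
            print("stall: clock advanced %.1f s across a %.1f s sleep"
                  % (now - last, interval))
        last = now
```

Calling watch() in a terminal on the guest and leaving it running
should print one line per hang, which gives you timestamps to line up
against whatever ESX was doing at that moment.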

Turns out the issue was VMware Fusion's AutoProtect feature. I didn't
disable it on the VMs prior to migration, and while ESX doesn't
support that feature, it appears to mishandle the leftover setting.
ESX was frequently creating some kind of snapshots, and there was no
way to disable this from the ESX side.

My solution was to use VMware Converter to go back from ESX, load the
machine into VMware Fusion, turn off the AutoProtect feature, then
reconvert the machine back to ESX. Since then everything has been
perfect.

The VMware support people were friendly but not helpful. Despite the
obvious client hangs and the many snapshots being created by ESX, they
were unwilling to admit it was an ESX issue. Since the fix above
resolved the problem and I made no changes to the guest operating
systems, it was clearly a VMware issue.


Peter Lauterbach said...

Hi Rob,

Once you have figured out it was the autoprotect feature, you can change the setting in the .vmx file on the ESX server, without having to unconvert and reconvert back to Fusion. The .vmx file is in /vmfs/volumes/'datastore'/'vm name' on the ESX host; you can edit it and make changes. Just remember to keep a backup.
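The backup-then-inspect step described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested procedure for ESX: the helper names are hypothetical, it assumes the .vmx file uses plain key = "value" lines, and the key prefix to search for is supplied by the caller because the post does not say which keys the autoprotect feature writes.

```python
import re
import shutil

# Matches lines of the form: key = "value" (comment lines are skipped)
VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Parse .vmx text into a dict of key -> value strings."""
    settings = {}
    for line in text.splitlines():
        m = VMX_LINE.match(line)
        if m:
            settings[m.group(1)] = m.group(2)
    return settings


def keys_with_prefix(settings, prefix):
    """Return the settings whose key starts with prefix (case-insensitive)."""
    prefix = prefix.lower()
    return {k: v for k, v in settings.items() if k.lower().startswith(prefix)}


def backup_and_inspect(vmx_path, prefix):
    """Copy the .vmx to <path>.bak first, then list the matching keys.
    The file itself is only read, never modified."""
    shutil.copy2(vmx_path, vmx_path + ".bak")
    with open(vmx_path) as f:
        return keys_with_prefix(parse_vmx(f.read()), prefix)
```

A call such as backup_and_inspect(path_to_vmx, "rollingTier") would list candidate entries; the "rollingTier" prefix is an assumption about how these settings are named and should be checked against the actual file before anything is edited.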

Rob said...

Yes, the issue though is that there's no single "on/off" switch reflected in the .vmx file. Also, I think that with the way they do snapshots, to get to a complete copy you have to unwind each snapshot back to the last time a whole copy was made. Do you know if that latter part is true?

In other words, my VMDKs were full of references to all these ESX-created snapshots and I had no idea how to unwind all that without losing data.
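For anyone who wants to at least see those references without touching anything, a read-only sketch along these lines could list the chain. It assumes each snapshot disk has a small text descriptor containing a parentFileNameHint line that names its parent; the function names are made up for illustration, and it only reads files, it does not unwind or merge anything.

```python
import os
import re

# Descriptor line that names the parent disk of a snapshot (delta) disk
PARENT_HINT = re.compile(r'^\s*parentFileNameHint\s*=\s*"(.*)"\s*$', re.MULTILINE)


def parent_of(descriptor_text):
    """Return the parent file named in a descriptor, or None for a base disk."""
    m = PARENT_HINT.search(descriptor_text)
    return m.group(1) if m else None


def snapshot_chain(descriptor_path, read_text=None):
    """Follow parent hints from a descriptor back to the base disk.
    Returns the list of descriptor paths, newest first."""
    if read_text is None:
        def read_text(path):
            with open(path) as f:
                return f.read()
    chain, seen = [], set()
    path = descriptor_path
    while path and path not in seen:  # 'seen' guards against a looped chain
        seen.add(path)
        chain.append(path)
        parent = parent_of(read_text(path))
        if parent is None:
            break
        if not os.path.isabs(parent):
            parent = os.path.join(os.path.dirname(path), parent)
        path = parent
    return chain
```

Pointing snapshot_chain at the descriptor the VM currently uses should show how many layers sit between it and the original disk.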

While the VMware Converter app neglected to shut off autoprotect, it did correctly remove all snapshots when going back from ESX to VMware Fusion. I don't know if I could have done that by hand.

Peter Lauterbach said...

You can use the 'snapshot manager' to manage the snapshots (delta VMDK), just right click over the VM -> Snapshot -> Snapshot manager.

If the VM was in a consistent state and powered down when you originally converted it, and you did not make any changes you care about after you powered it up on the ESX host, you can go to the earlier snapshot, and the VM will revert to the state when you first converted it to the ESX host.

If the VM has been running for a while, and people have been doing useful work you don't want to recreate, you can use the snapshot manager to delete the snapshots, and it will roll the state forward into the original VMDK.

Rob said...

Tried that. It appears the snapshots that were getting created by this autoprotect mode were not "normal" ones, in that they couldn't be removed. VMware support was also surprised you couldn't delete them.