Snapshot removal stops a virtual machine for a long time


Details

When a snapshot removal (consolidation) is in progress, you cannot perform other VM tasks, such as power operations or vMotion migration, on the virtual machine. The snapshot removal must run without interruption to ensure data integrity. The time it takes depends on the amount of snapshot delta data to be committed.

This article outlines the actions that take place against the virtual machine's snapshots during removal.

Note: Verify that the virtual machine is responsive by performing a ping test or by trying to access the VM through RDP.
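
The RDP part of that check can be scripted. The following is a minimal sketch, assuming the guest listens for RDP on the default TCP port 3389; the function name is_tcp_port_open and the host name in the usage line are hypothetical. A successful connection only shows that the guest's network stack is answering, not that every service inside the VM is healthy.

```python
import socket


def is_tcp_port_open(host: str, port: int = 3389, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 3389 is the default RDP port; this is an assumption about the guest.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, is_tcp_port_open("vm01.example.com") would return False while the guest is stunned or unreachable, and True once it accepts connections again.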

Solution
For live consolidations, virtual machine activity (specifically disk writes) during this time must also be committed. This delta information is kept through the svm mirror device, which is responsible for copying the data, and it is committed at the end of the snapshot removal.

For busy virtual machines, the volume of activity may consume system resources for longer than usual, resulting in a larger Consolidate Helper snapshot delta.

For example, a virtual machine with one virtual disk (disk.vmdk) and one snapshot has these files:
  • disk.vmdk with extent disk-flat.vmdk
  • disk-000001.vmdk with extent disk-000001-delta.vmdk
If you choose to remove or consolidate the snapshot:
  1. An additional snapshot delta is created, the Consolidate Helper:
     
    • disk.vmdk with extent disk-flat.vmdk
    • disk-000001.vmdk with extent disk-000001-delta.vmdk
    • disk-000002.vmdk with extent disk-000002-delta.vmdk. The virtual machine is no longer writing to the above two files; all current writes while the snapshot removal is in progress are committed to the disk-000002-delta.vmdk extent file via disk-000002.vmdk.
       
  2. The VMware ESXi host's DiskLib API consolidates disk-flat.vmdk with disk-000001-delta.vmdk. Meanwhile, the virtual machine continues writing to disk-000002-delta.vmdk.
     
  3. After completing the consolidation of the snapshot, the ESXi host consolidates the Consolidate Helper disk-000002-delta.vmdk with disk-flat.vmdk.

    The virtual machine is stunned for the duration of this final consolidation. In typical circumstances, this process completes almost immediately. A VM with a considerable amount of delta gathered in the temporary snapshot is stunned for a noticeable or disruptive amount of time, which can have adverse effects on guest applications or services.
     
  4. When all delta information recorded in disk-000002-delta.vmdk has been committed to disk-flat.vmdk, disk-000002-delta.vmdk and its descriptor file disk-000002.vmdk are removed from the datastore. The virtual machine continues from its base disk or selected point.

    For example, in the vmware.log file of the virtual machine, you see entries similar to:

    2017-06-10T23:08:57.330Z| vcpu-0| I120: DISKLIB-CHAINESX : ChainESXOpenSubChain: numLinks = 1, numSubChains = 1
    2017-06-10T23:08:57.330Z| vcpu-0| I120: DISKLIB-CHAINESX : ChainESXOpenSubChain:(0) fid = 71989930, extentType = 0
    2017-06-10T23:08:57.330Z| vcpu-0| I120: DISKLIB-LIB   : Resuming change tracking.
    2017-06-10T23:08:57.331Z| vcpu-0| I120: DISKLIB-CBT   : Initializing ESX kernel change tracking for fid 71989930.
    2017-06-10T23:08:57.332Z| vcpu-0| I120: DISKLIB-CBT   : Successfully created cbt node 2b9e7eac-cbt.
    2017-06-10T23:08:57.332Z| vcpu-0| I120: DISKLIB-CBT   : Opening cbt node /vmfs/devices/cbt/2b9e7eac-cbt
    2017-06-10T23:08:57.332Z| vcpu-0| I120: DISKLIB-LIB   : Opened "/vmfs/volumes/5d6553bc-7efb6b6c-19b9-00505601301f/disk/disk-000002.vmdk" (flags 0x18, type vmfsSparse).
    2017-06-10T23:08:57.332Z| vcpu-0| I120: SnapshotVMXNeedConsolidateIteration: Size of helper disk '/vmfs/volumes/5d6553bc-7efb6b6c-19b9-00505601301f/disk/disk-000002.vmdk' = 1048576 bytes, approx. time required for consolidating helper disk = 0.263418 sec.
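
The last log entry above carries the helper disk size and the estimated consolidation time, which are the two numbers worth watching during a long removal. Here is a minimal sketch for pulling them out of vmware.log, assuming the entry keeps the wording shown in the sample; the function name parse_helper_estimate is hypothetical, and other ESXi builds may word the message differently.

```python
import re
from typing import Optional, Tuple

# Pattern modelled on the SnapshotVMXNeedConsolidateIteration sample entry.
HELPER_RE = re.compile(
    r"SnapshotVMXNeedConsolidateIteration: Size of helper disk '(?P<path>[^']+)' = "
    r"(?P<size>\d+) bytes, approx\. time required for consolidating helper disk = "
    r"(?P<secs>\d+(?:\.\d+)?) sec"
)


def parse_helper_estimate(line: str) -> Optional[Tuple[str, int, float]]:
    """Return (helper disk path, size in bytes, estimated seconds), or None."""
    match = HELPER_RE.search(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("size")), float(match.group("secs"))
```

Running this over every line of the log and printing the non-None results shows whether the helper disk is shrinking or growing from one iteration to the next.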


    Notes
  • Additionally, RDP sessions may drop during a long snapshot removal. This is expected behavior.
  • The virtual machine may become unresponsive because of the stun that is part of the consolidation process. The stun (freeze) allows the changes in the delta disks to be written back to the base disk. If the virtual machine is generating a lot of I/O or the underlying storage is experiencing latency, the stun time can increase. ESXi estimates in advance how long the stun will last. If the estimate is more than 15 seconds, it runs another iteration instead, and it repeats this for up to 10 iterations. On the 10th iteration, it stuns the virtual machine for as long as the commit requires. This process has changed: if the estimate is still over 15 seconds after the 9th iteration, ESXi stops trying and the snapshot cannot be deleted.
In short, the usual reasons a VM freezes for a long time are underlying storage latency or too much I/O to write back at the time.
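
The iteration logic in the note above can be sketched as a toy model. This is only an illustration of the behavior as described here, not ESXi's actual implementation: the function name plan_consolidation, the force_final_stun switch, and the idea of passing in a list of per-iteration estimates are all assumptions made for the example. The 15-second threshold and the iteration counts are taken from the note.

```python
from typing import Iterable, Tuple

STUN_LIMIT_SEC = 15.0  # threshold taken from the note above
MAX_ITERATIONS = 10    # iteration count taken from the note above


def plan_consolidation(estimates: Iterable[float], force_final_stun: bool) -> Tuple[str, int]:
    """Return (outcome, iteration) for a sequence of estimated stun times.

    force_final_stun=True models the older behavior (stun on the 10th pass
    regardless); False models the changed behavior (give up after the 9th).
    The outcome is "stun", "abort", or "undecided" if the estimates run out.
    """
    iteration = 0
    for iteration, estimate in enumerate(estimates, start=1):
        if estimate <= STUN_LIMIT_SEC:
            # Short enough: stun the VM and commit the helper disk now.
            return "stun", iteration
        if force_final_stun and iteration == MAX_ITERATIONS:
            # Older behavior: stun for as long as the commit needs.
            return "stun", iteration
        if not force_final_stun and iteration == MAX_ITERATIONS - 1:
            # Changed behavior: stop trying, the snapshot is not deleted.
            return "abort", iteration
    return "undecided", iteration
```

With estimates of 40, 20, and 9 seconds the model stuns on the third iteration. With ten estimates of 20 seconds each, the older behavior stuns on iteration 10, while the changed behavior aborts on iteration 9.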

Cheers 😉😉😉
Happy Learning 
