OpenZFS provides a slew of features that make it uniquely well suited as a reliable storage back end for virtualization, and for disaster recovery functionality with extremely tight RTOs and RPOs. If you are not familiar with how ZFS can provide Recovery Point Objectives of only a few minutes, you might be interested in our previous article on the topic.
Let’s look at how you can use ZFS to build resilient infrastructure for your VMs that is able to instantly roll back from attacks such as ransomware. This approach forms the backbone of a reliable disaster recovery plan for virtualization, minimizing downtime after failures. We’ll also cover how VMs can be replicated offsite to allow you to get your infrastructure back up and running from a disaster in only a few minutes.
Automation: Snapshots and Replication
The first task any administrator should be thinking of when setting up a new VM server is safety. In this context, “safety” means being sure that you’ve always got a snapshot to roll back to–even if you lose an entire physical machine!
This means setting up some automated tasks. Snapshots are fabulous, but if you didn’t take one before something catastrophically bad happened, you’ve missed your window. Similarly, if you didn’t replicate all your snapshots from your primary machine to a backup before that lightning strike fried the drive(s) in the primary, it’s too late afterward.
So, we need a framework that allows us to do all of the following automatically, repeatedly, and reliably:
- Create new snapshots of important datasets at a specified regular interval
- Destroy unnecessary automatic snapshots after they reach a specified age
- Replicate snapshots from primary sources to backup targets
Ideally, we very much also want a monitoring solution to let us know when any of the above goes wrong.
There are many choices of framework to achieve these goals, but Sanoid is our own favorite. Although we’ve linked to the project’s development environment on GitHub, it is available directly from just about every distribution’s main repositories–so, for example, one can simply pkg install sanoid on FreeBSD, or apt install sanoid on Debian or Ubuntu.
Sanoid itself is a daemonless system intended to be invoked at regular intervals by cron or a systemd timer, in the form sanoid --cron. Syncoid is a replication orchestrator that mimics rsync in functionality and syntax, making automation (again via cron or systemd timer) simple.
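If you prefer systemd timers to cron, a minimal sketch of a unit pair might look like the following (the unit names and the sanoid path are illustrative assumptions; your distribution's sanoid package may already ship its own timer, so check before rolling your own):
# /etc/systemd/system/sanoid-cron.service (illustrative name)
[Unit]
Description=Run sanoid --cron

[Service]
Type=oneshot
ExecStart=/usr/local/bin/sanoid --cron

# /etc/systemd/system/sanoid-cron.timer (illustrative name)
[Unit]
Description=Schedule sanoid --cron every fifteen minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
Enable the timer with systemctl enable --now sanoid-cron.timer, and sanoid will run on whatever schedule you set in OnCalendar.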
A Brief Sanoid Walkthrough
In this article, we unfortunately don’t have the space to go into detail about the care and feeding of a sanoid system. But we suspect the following excerpts from a real-world sanoid config file and root crontab will tell you most of what you need to know:
[data/images]
use_template = backup
recursive = yes
#############################
# templates below this line #
#############################
[template_production]
hourly = 36
daily = 30
monthly = 3
yearly = 0
autosnap = yes
autoprune = yes
[template_backup]
hourly = 36
daily = 30
monthly = 3
yearly = 0
### don't take snapshots - snapshots replicate in from source
### on backup datasets, they're not generated locally
autosnap = no
autoprune = yes
### monitor hourlies and dailies, but don't warn or
### crit until they're over 48h old, since replication
### is typically daily only
hourly_warn = 288
hourly_crit = 360
daily_warn = 40
daily_crit = 60
This real-world VM host uses the backup template–which, as we can see, keeps 36 hourly snapshots, 30 daily snapshots, and 3 monthly snapshots. This machine is a backup target, not the production primary: instead of taking snapshots locally, it replicates them in from the primary host. It still needs to prune those snapshots locally when they get stale, though!
So, although this machine does not take snapshots locally, it does track them. If there are 37 or more hourly snapshots and any of the oldest hourly snapshots are more than 36 hours old, they will be removed until either the oldest is newer than 36 hours, or there are no more than 36 hourly snapshots remaining.
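For comparison, the matching stanza on the production host is even shorter. The following is just a sketch, assuming the production datasets live at zroot/images as the crontab below suggests:
[zroot/images]
use_template = production
recursive = yes
Because the production template sets autosnap = yes, that host is the one actually taking the snapshots; the backup host merely receives and prunes them.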
We can get another view of how this works in the same system’s root crontab:
# m h dom mon dow command
*/12 * * * * /usr/local/bin/sanoid --cron
5 * * * * /usr/local/bin/syncoid -r root@elided-prod0:zroot/images data/images
Every twelve minutes, this system executes sanoid --cron, which quietly tallies up all the snapshots of monitored datasets and applies the policies set in sanoid.conf.
Once per hour, this system replicates new snapshots in from the primary host, elided-prod0. And since these machines are dedicated VM host servers, the actual data we’re moving back and forth is–you guessed it–VM images!
What about monitoring?
All of what we’ve got is great–but we need some way to easily be certain that all of it is working as intended. Monitoring snapshot health is an essential step in maintaining a disaster recovery strategy for virtualization. Luckily, Sanoid has built-in functionality to help with this as well.
Let’s take a look at Sanoid’s two monitoring arguments, as executed on another couple of real-world host systems:
root@elided-dr0:/# sanoid --monitor-health
OK ZPOOL data : ONLINE {Size:7.25T Free:607G Cap:91%}
root@elided-dr0:/# sanoid --monitor-snapshots
OK: all monitored datasets (data/images, data/images/ERP-APPSERVER, data/images/ERP-DBSERVER, data/images/fileserver-storage, data/images/qemu, data/images/zabbix) have fresh snapshots
So far, so good–we’ve got a pair of simple-to-run commands that give us terse but pretty comprehensive diagnostic output. But wait, it gets better. These commands can also serve as tests to be run by Nagios, or to pair with the simple, free service healthchecks.io.
That’s because these commands exit 0, 1, or 2 for OK, Warn(ing), or Crit(ical problem) as appropriate! Let’s check that out. First, we’ll test the error code on the system above:
root@elided-dr0:/# sanoid --monitor-snapshots ; echo $?
OK: all monitored datasets (data/images, data/images/ERP-APPSERVER, data/images/ERP-DBSERVER, data/images/fileserver-storage, data/images/qemu, data/images/zabbix) have fresh snapshots
0
See that trailing “0”? That’s the exit code we got after running sanoid --monitor-snapshots, and it’s telling us–very, very scriptably–that everything was okay, without forcing us to parse all the detail it produced on the console.
If used as a simple Nagios check, this means that you get the 0, 1 or 2 for OK, Warn, or Crit that Nagios requires–but you also get some nice, human-readable and much more detailed output that shows up in the text for that status in Nagios.
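As an example of the Nagios case, wiring these commands in through NRPE can be as simple as a couple of command definitions. This is only a sketch with an illustrative file path, and it assumes the monitoring user is allowed to run sanoid and the underlying zfs/zpool commands (which may mean sudo on your platform):
# /etc/nagios/nrpe.d/sanoid.cfg (illustrative path)
command[check_sanoid_health]=/usr/local/bin/sanoid --monitor-health
command[check_sanoid_snapshots]=/usr/local/bin/sanoid --monitor-snapshots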
What if we look at a system that is currently experiencing some problems, instead?
root@redacted-hs0:~# sanoid --monitor-snapshots ; echo $?
CRIT: data/images newest hourly snapshot is 15h 56m 30s old (should be < 6h 0m 0s), CRIT: data/images/dc0 newest daily snapshot is 30d 16h 56m 29s old (should be < 2d 12h 0m 0s), CRIT: data/images/dc0 newest hourly snapshot is 30d 4h 56m 29s old (should be < 6h 0m 0s), CRIT: data/images/fileserver newest daily snapshot is 30d 16h 56m 28s old (should be < 2d 12h 0m 0s), CRIT: data/images/fileserver newest hourly snapshot is 30d 4h 56m 30s old (should be < 6h 0m 0s), WARN: data/images/mailserver newest monthly snapshot is 38d 16h 56m 27s old (should be < 32d 0h 0m 0s)
2
The storage capacity of this backup host wasn’t upgraded at the same time as that of the production host it replicates from. Over time, that completely filled the backup host’s pool, even though the primary is still trucking along just fine.
The real solution to this problem is simple: the backup host needs a couple more drives tossed into a couple of its open bays, adding the missing capacity. In the meantime, we manually destroyed some of the oldest snapshots on the system–even though they were still within the freshness policy defined in sanoid.conf–in order to allow it to continue receiving new ones.
However, that’s not the point of showing you this output–the point is that in this case, sanoid --monitor-snapshots exited 2 (for Crit) rather than 0 (for OK) or 1 (for Warn). We can also see that this happens despite some of the monitored datasets only reaching Warn status (and others remaining OK)–since Crit was the worst outcome detected, Crit is the result reported.
Demonstration: monitoring sanoid with healthchecks.io
To use these monitoring arguments with healthchecks.io, simply check their exit code, and only run curl against the UUID you created for your test if they exit 0. For example, if you want to check your snapshot freshness and pool health, and ping your healthchecks account once per hour, you might use these crontab entries:
0 * * * * sanoid --monitor-snapshots ; rc=$? ; if [ $rc -eq 0 ] ; then curl https://hc-ping.com/[an-hc-UUID] ; fi
5 * * * * sanoid --monitor-health ; rc=$? ; if [ $rc -eq 0 ] ; then curl https://hc-ping.com/[another-hc-UUID] ; fi
This gets you hourly updates to two separate Healthchecks checks: one for your pool health, and another for your snapshot freshness–the latter letting you know when you don’t have as many, or as recent, snapshots as policy says you should.
If either test returns Warn or Crit, the corresponding healthchecks ping won’t run, and you’ll be notified however you’ve configured your healthchecks account to reach you.
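healthchecks.io also accepts explicit failure signals by appending /fail to the ping URL, so if you’d rather be alerted immediately instead of waiting for a missed ping, a sketch along these lines should work (keep your own UUID in place of the placeholder):
0 * * * * if sanoid --monitor-snapshots ; then curl -fsS https://hc-ping.com/[an-hc-UUID] ; else curl -fsS https://hc-ping.com/[an-hc-UUID]/fail ; fi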
Production Is Anything You Care About
This is pretty much implied by the last few sections, but we wouldn’t be doing our job if we didn’t call it out explicitly as well: you need to back up your production systems. Of course, this raises a question that isn’t answered frequently enough: what is “production?”
You don’t have to be making money on a server for it to be a production system. There doesn’t need to be any business involved, nor does it have to live anywhere outside your home. The real answer to what makes a system “production” is simple: would it hurt if you lost the data on that system?
If the answer is “yes, it would hurt,” then that is a production system, and you need to back it up!
What Is a “Backup?”
Please, everyone, recite the tired old adage with me: RAID is not a backup. Yes, the line is tired–but it’s tired because we need to keep repeating it. There are still those who need to keep hearing it.
Perhaps your pool is composed of RAIDz3 vdevs. Heck, maybe it’s composed of eight-wide mirror vdevs–that’s still not a backup! Let’s look at a short list of potential failures that neither mirrors nor RAIDz can protect you from:
- admin destroys the wrong dataset(s) in error
- power surge fries several drives simultaneously
- deranged SATA/SAS controller streams garbage data to all connected drives simultaneously
- server catches on fire due to improper storage of non-IT materials in IT closet
- entire datacenter (or office building) destroyed due to environmental catastrophe
- entire machine is physically stolen during an office break-in
I have personally weathered all of the above problems without data loss–all because I make certain to set up, monitor, and practice restoration of backups on all my production systems.
So far, we’ve just discussed a couple of things that aren’t a backup.
What is a backup? Ideally, it’s a second copy of all of your data on a separate machine. The backup machine should not trust the production systems it backs up–this means it should pull backups from them, not allow them to push backups to it. This prevents an attacker who compromises your production from destroying your backups before you can make use of them.
And finally, the degree of separation must be considered, and matched to the catastrophes you’re trying to prevent. If the city you live in is destroyed, do you still care about this data? If so, then you need to have a backup that’s geographically distant from that city. But if not–for example, if that data is only relevant to a business which you expect to close doors forever after a major natural disaster–then perhaps a backup location on the other side of town is sufficient.
Similarly, someone who is worried about keeping photos of their family intact whether or not their home burns down absolutely needs an offsite backup–while a business which expects to close shop forever if its office building burns down might not need offsite backup, despite being an actual business.
While most people and organizations will need offsite backup at a minimum, and will probably want wide geographic separation between backup and production if possible, others may genuinely not. Our role here isn’t to give you a list of rules to blindly follow–it’s to give you the tools to know what questions need answering, and how to find those answers!
Backing Up Your VMs
Now that we’ve got the basics down–we need to back up production systems, we know what “production” really means in this context, and we know what a “backup” is and is not–we can talk about how to accomplish these goals using OpenZFS.
The underlying technology we will use is called asynchronous incremental replication, and it’s built right into OpenZFS itself. We don’t really have time to get into how replication works, but we can show you what it does, using syncoid to orchestrate the process.
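Under the hood, syncoid orchestrates ordinary zfs send and zfs receive operations. A hand-rolled equivalent of one full-plus-incremental cycle might look roughly like this (pool, dataset, and snapshot names are illustrative):
# full send of the oldest snapshot creates the target dataset
zfs send pool/source@snap1 | zfs receive pool/target
# incremental send of everything between @snap1 and @snap3, intermediate snapshots included
zfs send -I pool/source@snap1 pool/source@snap3 | zfs receive pool/target
# the same idea pushed to a remote backup host over SSH, assuming it already holds @snap1
zfs send -I pool/source@snap1 pool/source@snap3 | ssh root@backup zfs receive backuppool/target
Syncoid works out the correct common snapshot, takes its own sync snapshot, and adds niceties like compression and progress reporting, which is why we let it do the driving.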
Demonstration: OpenZFS Replication Using Syncoid
In this example, we take a new snapshot each time we create a file in a dataset:
root@elden:/# truncate -s 1G /tmp/1G.bin
root@elden:/# zpool create testpool /tmp/1G.bin
root@elden:/# zfs create testpool/source
root@elden:/# for i in {1..3} ; do touch /testpool/source/$i ; zfs snapshot testpool/source@$i ; done
Now we’ve got a dataset with three test files named 1, 2, and 3. We also have snapshots named 1, 2, and 3–each taken immediately after creating the test file with the same name.
We can verify all of this using the special hidden .zfs directory for testpool/source:
root@elden:/# echo /testpool/source: ; ls /testpool/source ; for i in {1..3} ; do echo /testpool/source/.zfs/snapshot/$i ; ls /testpool/source/.zfs/snapshot/$i ; done
/testpool/source:
1 2 3
/testpool/source/.zfs/snapshot/1
1
/testpool/source/.zfs/snapshot/2
1 2
/testpool/source/.zfs/snapshot/3
1 2 3
Now that we have a source, we need a target. We can see that one doesn’t already exist:
root@elden:/# zfs list -r testpool
NAME USED AVAIL REFER MOUNTPOINT
testpool 264K 832M 24K /testpool
testpool/source 65K 832M 26K /testpool/source
But that’s just fine; we’ll create the target during our first replication. What does that look like?
root@elden:/# syncoid -r testpool/source testpool/target
INFO: Sending oldest full snapshot testpool/source@1 to new target filesystem testpool/target (~ 12 KB):
45.3KiB 0:00:00 [12.7MiB/s] [==================================] 359%
INFO: Sending incremental testpool/source@1 ... syncoid_elden_2025-05-09:18:24:28-GMT-04:00 to testpool/target (~ 22 KB):
22.0KiB 0:00:00 [1.29MiB/s] [==============================> ] 98%
In order to do what we wanted, syncoid first performed a full replication of the oldest snapshot, @1, followed by an incremental replication from @1 up to a newly created syncoid snapshot–carrying @2 and @3 along with it.
But the fun part is, you didn’t actually need to think about whether this would be a full, or an incremental, or (as it turned out) a full followed by an incremental...all you needed to do was invoke syncoid. Know that when it finishes, you have exactly the same data on the target as you had on the source, including your snapshots.
Let’s examine the outcome:
root@elden:/# echo /testpool/target: ; ls /testpool/target ; for i in {1..3} ; do echo /testpool/target/.zfs/snapshot/$i ; ls /testpool/target/.zfs/snapshot/$i ; done
/testpool/target:
1 2 3
/testpool/target/.zfs/snapshot/1
1
/testpool/target/.zfs/snapshot/2
1 2
/testpool/target/.zfs/snapshot/3
1 2 3
Not only do we have files 1, 2, and 3 present in testpool/target, we’ve got the individual snapshots we took as well, in the same condition they were in. What happens if we add another file, another snapshot, and replicate again?
root@elden:/# touch /testpool/source/4 ; zfs snapshot testpool/source@4
root@elden:/# syncoid -r testpool/source testpool/target
INFO: Sending incremental testpool/source@syncoid_elden_2025-05-09:18:29:56-GMT-04:00 ... syncoid_elden_2025-05-09:18:31:30-GMT-04:00 to testpool/target (~ 8 KB):
9.68KiB 0:00:00 [1.04MiB/s] [=====================================================] 117%
root@elden:/# echo /testpool/target: ; ls /testpool/target ; for i in {1..4} ; do echo /testpool/target/.zfs/snapshot/$i ; ls /testpool/target/.zfs/snapshot/$i ; done
/testpool/target:
1 2 3 4
/testpool/target/.zfs/snapshot/1
1
/testpool/target/.zfs/snapshot/2
1 2
/testpool/target/.zfs/snapshot/3
1 2 3
/testpool/target/.zfs/snapshot/4
1 2 3 4
Although the command we issued–syncoid -r testpool/source testpool/target–did not change, its actions did. Since we already have the target in place, and the source and target have at least one matching common snapshot, syncoid simply performed an incremental replication for us–and we can see that, as expected, it brought over the new file and new snapshot just fine.
What Needs Backing Up?
Regardless of which hypervisor you’re using–Linux KVM, FreeBSD’s bhyve, or anything else–there are essentially two things you need to back up: your VM images, and your VM definitions.
VM Images
VM images are, simply put, the “disk drives” associated with the VM. On the back end, these are usually raw flat files, .vmdk files, .qcow2 files, or zvols.
Obviously, if you’ve got the entire drive image of a crashed system, you’ve got all or nearly all of its data–but it may still not be immediately useful. For example, if that data is trapped inside an MSSQL database, it’s going to be very painful to access unless you can successfully boot the entire virtual machine, and bring up the Microsoft SQL Server instance along with it!
VM images can, on most platforms, be placed wherever you’d like to place them. Your VMs know what files to base their virtual drives on according to their hardware definitions, which we’ll talk a bit more about in the next section.
Generally, we recommend using flat sparse files (created using the truncate command) or .qcow2 files (created with the qemu-img command) for most folks on most hypervisor platforms.
Sparse files generally outperform .qcow2 files, but if you’re running under Linux KVM, the .qcow2 files offer some additional functionality revolving around VM hibernation that could be attractive enough to outweigh their performance limitations.
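Creating either image type is a one-liner; the paths and sizes below are only examples:
root@box:~# truncate -s 100G /poolname/images/myvm/myvm.raw
root@box:~# qemu-img create -f qcow2 /poolname/images/myvm/myvm.qcow2 100G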
We strongly recommend an individual dataset per VM–and per VM drive, for those VMs with more than one virtual drive. This enables you to roll back an individual VM without being forced to lose data on other VMs–or, in the case of multi-drive VMs, to roll those drives back independently.
The ability to roll back drives independently might not seem important at first glance–but, for example, if a Windows Server VM suddenly refuses to boot after a Windows Update gone awry, you might be very glad indeed of the ability to roll its C: drive and operating system back without affecting many terabytes of user-stored data on its D: drive!
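Laying this out is quick work. Here is a sketch using hypothetical VM names:
root@box:~# zfs create poolname/images
root@box:~# zfs create poolname/images/fileserver      # single-drive VM
root@box:~# zfs create poolname/images/erp-db          # multi-drive VM
root@box:~# zfs create poolname/images/erp-db/os       # its C: drive image lives here
root@box:~# zfs create poolname/images/erp-db/data     # its D: drive image lives here
A sanoid stanza covering poolname/images with recursive = yes picks up each new child dataset automatically, and zfs rollback on poolname/images/erp-db/os leaves the data drive untouched.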
VM Definitions
This is where backing up the VM definitions comes into play. A VM “definition” may come with different terminology depending on which platform you’re using, but it all boils down to the same thing: a list of all the “hardware” configurations to present to the VM when booting it up.
Although you can typically just wrap a default definition around a backed-up VM image and get it to boot, it may require some annoying extra configuration afterward. For example, a Windows Server VM will notice that the MAC addresses changed on all its network interfaces, incorrectly assume that means they’re entirely different cards, and therefore bring them all up under DHCP regardless of whether they were configured statically or not!
Since the restored interfaces won’t remember their old MAC addresses, they won’t get their old IP addresses even if they were configured to use DHCP and you had made “reservations” in your DHCP server–those reservations match a MAC address to the desired IP. New MAC addresses mean new IP addresses, which in turn means your restored machine isn’t simply ready to go.
Where your hypervisor keeps its VM definitions changes from system to system–but, for example, an Ubuntu system keeps its KVM VM definitions in /etc/libvirt/qemu. If you restore the files in that directory (and its subdirectories) along with the VM images themselves–which may be kept wherever you like–then your VMs will be fully functional and ready to boot right up after you finish restoration.
One last hint: if it isn’t obvious, you really want to keep those VM definitions in a dataset alongside your VM images, so it’s easy to back them all up in one swell foop.
Under Linux KVM, you might do this as follows:
root@box:~# zfs create poolname/images/qemu
root@box:~# rsync -ha /etc/libvirt/qemu/ /poolname/images/qemu
root@box:~# mv /etc/libvirt/qemu /etc/libvirt/qemu-dist
root@box:~# zfs set mountpoint=/etc/libvirt/qemu poolname/images/qemu
Now, assuming you store your VM images in datasets beneath poolname/images, you can back up the entire system in a single step: syncoid -r poolname/images root@othermachine:otherpool/images gets everything you care about with one command!
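Keeping with our earlier advice that the backup machine should pull rather than accept pushes, that job normally lives in the backup host’s crontab rather than production’s. A sketch, with an assumed hostname:
# root crontab on the backup machine
15 * * * * /usr/local/bin/syncoid -r root@prod0:poolname/images otherpool/images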
Restoring your VMs
This is where the magic really comes into play. Essentially, restoration always looks like four simple steps:
1. Force-poweroff the VM if running (no point in babying it, you’re about to blow it away)
2. Restore the VM image file
3. If necessary, restore and import the VM definition file
4. Restart the VM
Step 2 can look a bit different depending on what broke, and how. You might simply zfs rollback poolname/images/VMname@before-stuff-broke for most breakfixes. Or if you need to restore from a backup machine, syncoid -r root@backup:backuppool/images/VMname poolname/images/VMname will do the trick.
If the VM definition is intact, you’re ready to restart the VM. If you lost the VM definition for some reason, restore it much like you would restore the VM image–although if you’re only restoring a single VM rather than an entire host, you probably just want to copy the specific VM definition you need, rather than rolling back or replicating the entire dataset containing all of them.
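If you kept /etc/libvirt/qemu on its own dataset as suggested above, one easy way to grab a single definition is to copy it out of a snapshot via the hidden .zfs directory and then define it. The VM name and snapshot below are illustrative:
root@box:~# cp /etc/libvirt/qemu/.zfs/snapshot/autosnap_2025-05-09_10:00:01_hourly/fileserver.xml /etc/libvirt/qemu/
root@box:~# virsh define /etc/libvirt/qemu/fileserver.xml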
Demonstration: Restoring a Single VM under Linux KVM
In this example, ransomware hit a Windows fileserver. However, the ransomware only gained access to the VM itself–not to the host. Since this is a local issue, and our VM definition is intact, we can fix it with a simple rollback.
root@box:~# virsh destroy fileserver
root@box:~# zfs rollback -r mypool/images/fileserver@autosnap_2025-05-09_10:00:01_hourly
root@box:~# virsh start fileserver
That’s it: our fileserver is up, running, and functional, in precisely the condition it was in at 10:00:01 on the ninth of May, 2025.
You weren’t expecting something more difficult, were you?
Demonstration: Restoring All VMs on a Linux KVM Host
In this example, an environmental issue knocked out an entire physical machine. Thankfully, it was being regularly, automatically backed up as we recommend–the task now is to restore all of our VMs from backup hardware, onto the replacement hardware.
Once Ubuntu and the necessary packages to run KVM have been installed, and the new pool created, we just replicate in the opposite direction. We still trust our backup systems more than our production systems, so we’ll do this by pushing from backup to production, after placing backup’s SSH key into production’s .ssh/authorized_keys:
root@backup:~# syncoid -r backuppool/prodpool/images root@new-prod0:prodpool/images
Now that we’ve restored all of our images–and definitions, since we cleverly had ZFS automount those for us as outlined in the previous section–all we need to do is recreate our mountpoint, import our VM definitions, and start the VMs!
We’re going to be a touch lazy here. While we could individually virsh define each VM definition file and then individually virsh start each VM defined, it’s easier to just bounce the system–when it comes up, it will automatically import the VMs whose definitions are found under /etc/libvirt/qemu, and automatically start any of them which are symlinked into /etc/libvirt/qemu/autostart.
root@new-prod0:~# mv /etc/libvirt/qemu /etc/libvirt/qemu-dist
root@new-prod0:~# zfs set mountpoint=/etc/libvirt/qemu prodpool/images/qemu
root@new-prod0:~# shutdown -r now
Since our original system had all of our VMs defined the way we wanted–with the appropriate ones set to autostart–the new system inherits that behavior. After the reboot, it’s like nothing ever changed.
Demonstration: Promoting a Hotspare Host to Production
Our final demonstration again assumes Linux KVM–but the procedure looks very similar on other distributions and hypervisor platforms.
In this scenario, we lost a production system–but we’ve got a “hotspare” style backup system located on the same network subnet, which took automated replication on an hourly (or potentially more frequent) basis from production.
We want to get our users back up and running as rapidly as possible, so rather than wait for new hardware to be configured, we’re temporarily just going to boot up our VMs directly on the hotspare host itself!
The first step is pausing any replication jobs–and this is very important, because we don’t want incoming replication to blow away a running VM! This is usually going to look like commenting out a crontab entry (or disabling a systemd timer):
# 5 * * * * syncoid root@prod:data/images data/images
In this example, we simply commented out the crontab entry with a leading #.
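If your replication runs from a systemd timer instead, the equivalent is a one-liner (the timer name here is purely illustrative), and you can re-enable it the same way once nothing on this host can be overwritten by incoming replication:
root@hotspare:~# systemctl disable --now syncoid-prod.timer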
Now that we don’t have to worry about inbound replication jobs potentially wiping the state of a running VM, it’s time to make sure we can run those VMs. First, we import VM definitions. Then we make sure the VM images are mounted where the definitions expect them to be. And finally, we start the VMs!
In this example, we’re going to presume that we really only care about a single VM. We’ll mount it where the definition expects to find it, rather than editing the definition to point at where our hotspare system actually keeps it.
root@hotspare:~# mkdir -p /prodpool/images
root@hotspare:~# zfs set mountpoint=/prodpool/images hotsparepool/prodpool/images
root@hotspare:~# virsh define /prodpool/images/qemu/myvm.xml
root@hotspare:~# virsh start myvm
That’s it!
Since this hotspare system backs up more than one production system (as we can infer from the fact that it backs up prodpool/images onto hotsparepool/prodpool/images), we chose to simply “define” the VM using the existing file in place, rather than replacing the entirety of /etc/libvirt/qemu. In practice, this both drops a copy of myvm.xml into /etc/libvirt/qemu and informs KVM of the VM definition’s existence, in a single step.
The best part is, since this hotspare host is on the same network subnet as the failed production host was… these VMs boot up fully reachable, with their original IP addresses, and no further configuration necessary on the host, inside the VM, or on any client systems that access that VM!
Using this technique, you can restore service to an entire stack of virtual machines in literal seconds after their normal production host catastrophically fails.
Conclusion
OpenZFS is an incredible technology to pair with your favorite hypervisor–properly configured, it offers your VMs blistering performance as well as unmatched reliability, efficient backups, and easy migration and management.
If you are looking to finally add the disaster recovery for virtualization component to your infrastructure, or just to review and enhance what is already in place, consider Klara’s ZFS Disaster Recovery Design Solution.