Most filesystems assume that storage hardware will return the correct data unless it reports an error. ZFS makes no such assumption: it refuses to trust anything it cannot verify, and every block must be proven correct. That difference matters, because silent corruption is one of the most dangerous failure modes in modern storage: by the time you notice it, the damage is already done.
ZFS checks everything it stores, keeps the checksums inside the parent block pointers, and leans on redundancy to repair anything that does not match. Scrubs are a specialized patrol read that walks the entire pool and confirms that the data still matches the record of what should be there.
In this article, we will walk through what scrubs do, how the Merkle tree layout lets ZFS validate metadata and data from end to end, how redundancy ties into checksum repair, and why scrubs are not the same as resilvers.
What Are ZFS Scrubs?
A ZFS scrub is a pool-wide verification procedure that reads every allocated block of data and metadata and checks it against its stored checksum. This includes metadata blocks, user data blocks, and even the parity blocks ZFS uses to reconstruct corrupted or missing data. Many descriptions of scrubs incorrectly imply that only user data is checked; in fact, ZFS protects metadata with the same rigor, and a scrub verifies both thoroughly.

During a scrub, ZFS walks the entire tree of block pointers that make up the dataset. ZFS is built around a Merkle tree structure in which each parent block contains block pointers for its children, and each block pointer contains the checksum for the block it references. The parent checksum therefore protects the child metadata. This recursive structure continues down to the physical blocks on disk. If a leaf block is corrupted, the mismatch propagates upward, making it impossible for corruption to hide.
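The Merkle property can be sketched in a few lines of Python. This is a toy model, not ZFS code: each simplified "block" stores the checksums of its children, so recomputing checksums while walking the tree (as a scrub does) exposes corruption anywhere below.

```python
import hashlib

def checksum(data):
    # ZFS defaults to fletcher4 for data (sha256 and others are options);
    # sha256 stands in here for simplicity.
    return hashlib.sha256(data).hexdigest()

class Block:
    """A simplified block: leaf data plus pointers to children, where each
    pointer records the child's checksum (the Merkle property)."""
    def __init__(self, data=b"", children=()):
        self.data = data
        self.children = list(children)
        self.child_cksums = [checksum(c.payload()) for c in self.children]

    def payload(self):
        # A parent's payload covers its children's checksums, so corruption
        # anywhere below changes checksums all the way up to the root.
        return self.data + b"".join(s.encode() for s in self.child_cksums)

def verify(block):
    """Walk the tree the way a scrub does: recompute each child's checksum
    and compare it against the copy stored in the parent."""
    return all(
        checksum(c.payload()) == s and verify(c)
        for c, s in zip(block.children, block.child_cksums)
    )

leaf_a, leaf_b = Block(b"user data A"), Block(b"user data B")
root = Block(children=[leaf_a, leaf_b])
assert verify(root)           # intact tree validates

leaf_b.data = b"bit-rotted!"  # simulate silent corruption in a leaf
assert not verify(root)       # the mismatch is visible from the parent
```

Because the root checksum transitively covers everything beneath it, a scrub that validates every pointer from the top down leaves corruption nowhere to hide.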
When a scrub reads a block, it recalculates the checksum from the data returned by the disk and compares the result to the checksum stored in the parent block pointer. If they match, ZFS knows the block is valid; if they differ, the block is corrupt, and ZFS attempts to repair it using the available redundancy.
Scrubs differ significantly from traditional filesystem checks. Tools such as fsck or chkdsk examine logical structures and attempt to repair inconsistencies related to directory trees, allocation maps, reference counts, and other metadata relationships. ZFS does not need to perform these operations during normal scrubs because its transactional design ensures metadata consistency. Every transaction group moves the filesystem from one valid state to another. The scrub verifies the correctness of the data and metadata at the block level, not logical relationships.
Checksums, Redundancy, and Self Healing
ZFS ensures block correctness through a combination of strong checksums and redundancy. Checksums detect corruption and redundancy makes it possible to repair corruption. Both are necessary and neither alone is sufficient.
HDDs typically specify a BER (Bit Error Rate) of 1 in 10^15, meaning some incorrect data can be expected roughly every 100 TiB read. That used to be an enormous amount of data, but on modern large-capacity drives it amounts to only 3 or 4 full-drive reads. Silent corruption is one of those problems you only notice after it has already done damage.
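The arithmetic behind those figures is quick to check. The sketch below assumes a 32 TB drive as the "modern large-capacity" example; the BER spec of 1 in 10^15 bits comes from typical HDD datasheets.

```python
# One unrecoverable bit error per 10**15 bits read (a common HDD spec).
bits_between_errors = 10**15

# Convert bits -> bytes -> TiB to get the expected interval between errors.
tib_between_errors = bits_between_errors / 8 / 2**40
print(f"{tib_between_errors:.1f} TiB between expected errors")  # ~113.7 TiB

# On a large modern drive, that is only a handful of full-drive reads.
drive_tb = 32  # assumed drive size for illustration
full_reads = (bits_between_errors / 8 / 10**12) / drive_tb
print(f"about {full_reads:.1f} full reads of a {drive_tb} TB drive")  # ~3.9
```

In other words, reading such a drive end to end four times is statistically enough to hit one silent error, which is exactly the kind of event a scrub exists to catch.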
In mirrored configurations, ZFS can read from any of the copies. In a three-way mirror, ZFS can lose two copies and still recover the correct block from the third. Unlike legacy RAID mirrors, the checksum allows ZFS to determine which copies are correct and which are corrupt, and to issue the appropriate repair writes. Hardware RAID or mdraid would simply synchronize the copies to make them identical again, possibly spreading the damage and destroying the remaining correct copy of the data. In RAID-Z configurations, ZFS reconstructs the block from parity and writes the repaired version back to disk, where it is available for future reads. This behavior is the self-healing property of ZFS.
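A toy model makes the difference concrete. In the sketch below, the checksum from the parent block pointer acts as the arbiter that decides which mirror side is correct; legacy RAID has no such arbiter and can only make the copies identical. This is an illustration of the concept, not ZFS internals.

```python
import hashlib

def cksum(data):
    return hashlib.sha256(data).digest()

def mirror_read(copies, expected):
    """Return the block's data, self-healing any copy whose checksum
    does not match `expected` (the checksum stored in the parent block
    pointer). Raises if no copy is correct, mirroring ZFS's refusal to
    return data it cannot verify."""
    good = next((bytes(c) for c in copies if cksum(bytes(c)) == expected), None)
    if good is None:
        raise IOError("unrecoverable: no mirror copy matches the checksum")
    for c in copies:
        if cksum(bytes(c)) != expected:
            c[:] = good  # repair write back to the corrupt mirror side
    return good

block = b"important data"
expected = cksum(block)
side_a = bytearray(block)
side_b = bytearray(b"important dat\x00")  # silently corrupted copy

assert mirror_read([side_a, side_b], expected) == block
assert bytes(side_b) == block             # the bad copy was healed in place
```

The key design point: because the checksum lives in the parent pointer rather than next to the data, a corrupted block cannot vouch for itself.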
When ZFS reads any block, if it detects data corruption, it will automatically issue additional reads for other copies or parity to reconstruct the correct data, then write it back to maintain integrity. Scrubs extend the same behavior across the entire pool, even the parity itself, which is not normally read. Scrubs ensure that blocks are corrected before corruption can accumulate to a dangerous level.
Small errors appear naturally over time due to cosmic radiation, media decay, and mechanical issues. A system that never performs scrubs allows these small errors to accumulate. Once accumulated, they may exceed redundancy capacity and lead to data loss. A system that scrubs regularly prevents the accumulation from ever reaching that threshold.
ZFS follows a clear rule: if it does not have enough redundancy to rebuild the correct data, it reports an error instead of returning corrupted data. You see the errors listed in zpool status and can act accordingly. This combination of detection, repair, and strict failure behavior forms the foundation of ZFS reliability.
Interpreting zpool status
The zpool status command provides insight into scrub progress, scrub results, and pool health. A scrub report includes the number of blocks examined, the number of blocks repaired, the duration of the scrub, and the average scan rate. It also includes error counts for each device.
Repaired blocks indicate that ZFS found mismatched checksums and corrected the underlying data. A small number of repaired blocks is normal for large storage pools, and occasional bit rot is expected. A rising number of repaired blocks over multiple scrubs signals a problem.
Checksum errors that occur during normal operation are often more serious. If ZFS repairs the data using redundancy, the pool remains healthy, but the device involved should be examined carefully. A single device that repeatedly produces checksum errors often requires replacement.
The scan rate during a scrub is another value that administrators and storage engineers frequently examine. However, this value requires interpretation. Scrub progress estimates can be misleading because the scrub process moves through different phases that involve varying amounts (and sizes) of I/O.
Early in the scrub, the estimate may promise an unrealistically short completion time. Later, it may predict an extremely long duration. Both extremes are normal. This is why it’s better to focus on long-term patterns rather than instantaneous values.
Historical scrub data is often more informative than a single scrub result. A significant increase in duration may indicate fragmentation, device slowdowns, workload changes, or controller problems. Monitoring systems that record scrub durations can reveal trends that manual inspections overlook.
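A monitoring system can apply a simple trend check to recorded scrub durations. The sketch below flags a scrub that takes significantly longer than the recent baseline; the window size and threshold factor are illustrative choices, not ZFS defaults.

```python
def duration_trend(durations_h, window=3, factor=1.5):
    """Return True if the latest scrub took more than `factor` times the
    average of the previous `window` scrub durations (in hours).
    A crude trend detector for monitoring dashboards."""
    if len(durations_h) <= window:
        return False  # not enough history to establish a baseline
    baseline = sum(durations_h[-window - 1:-1]) / window
    return durations_h[-1] > factor * baseline

history = [10.2, 10.8, 10.5, 10.9]
assert not duration_trend(history)           # stable month to month
assert duration_trend(history + [17.5])      # sudden jump worth investigating
```

A check like this catches the gradual slowdowns (fragmentation, a degrading disk, controller trouble) that a single scrub report would not reveal.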
Below is an example of a zpool status report:
  pool: tank
 state: ONLINE
  scan: scrub in progress since Wed Dec 10 10:14:27 2025
        1.23T scanned at 1.38G/s, 842G issued, 31.2G repaired
        10.00% done, 1 days 02:13:48 to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     5
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors
- In this report, the repaired count identifies mismatched blocks that ZFS corrected. Occasional repaired blocks are normal in large pools, but rising totals across scrubs indicate degrading hardware.
- The READ and WRITE counters indicate device-level I/O errors. Persistent values on a single device suggest cable problems, controller instability, or a failing disk.
- The CKSUM column indicates checksum mismatches detected during normal reads.
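For automated monitoring, the per-device counters can be pulled out of the config table with a small parser. This sketch assumes the stock column layout (NAME STATE READ WRITE CKSUM) and only handles ONLINE devices; a production parser would cover the other states as well.

```python
def parse_cksum_errors(status):
    """Extract per-device CKSUM counts from the config table of
    `zpool status` output. Assumes the default five-column layout and
    ONLINE devices; anything else ends the table scan."""
    counts, in_table = {}, False
    for line in status.splitlines():
        fields = line.split()
        if fields[:1] == ["NAME"]:
            in_table = True          # header row marks the start of the table
            continue
        if in_table:
            if len(fields) == 5 and fields[1] == "ONLINE":
                counts[fields[0]] = int(fields[4])
            else:
                in_table = False     # past the device table
    return counts

sample = """\
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     5
            sdb     ONLINE       0     0     0
"""
errors = parse_cksum_errors(sample)
assert errors["sda"] == 5 and errors["sdb"] == 0
```

Feeding these counts into a metrics system makes the "rising totals across scrubs" pattern visible at a glance.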
ZFS also provides the zpool events command, which records asynchronous pool events. These events capture issues like checksum errors, failed reads, or device removals, giving context to scrub results by showing which files or blocks were affected. ZFS integrates with ZED (or zfsd on FreeBSD), which can take automatic action, such as replacing a failed disk with a spare or onlining a disk that is reattached. These tools can also be extended with administrator-defined responses to events, such as triggering locate LEDs or sending notifications. While zpool events does not report hardware-level occurrences such as controller resets, combining its output with scrub observations can help identify underlying data integrity or device problems.
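A custom ZED response can be as simple as a small script. ZED passes event details to its zedlets through ZEVENT_* environment variables (for example ZEVENT_CLASS and ZEVENT_POOL); the sketch below reacts only to checksum events, and its notify() body is a placeholder you would wire to email, Slack, or a ticketing system. The exact variable names available depend on the event class, so treat this as an outline rather than a drop-in zedlet.

```python
#!/usr/bin/env python3
"""Sketch of a custom ZED zedlet that alerts on checksum errors.
Install as an executable alongside the stock zedlet scripts."""
import os

def format_alert(env):
    """Build an alert message from ZEVENT_* variables, or return None
    for event classes we do not care about."""
    if env.get("ZEVENT_CLASS", "") != "ereport.fs.zfs.checksum":
        return None  # ignore everything except checksum error reports
    pool = env.get("ZEVENT_POOL", "?")
    vdev = env.get("ZEVENT_VDEV_PATH", "?")
    return f"checksum error on pool {pool}, vdev {vdev}"

def notify(message):
    print(message)  # placeholder: send email / Slack / open a ticket here

if __name__ == "__main__":
    msg = format_alert(dict(os.environ))
    if msg:
        notify(msg)
```

Separating format_alert() from notify() keeps the event-filtering logic testable without a live pool.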
Scrubs vs Resilvers
Scrubs and resilvers are both scanning operations, but they serve different purposes and exhibit different behaviours.
A scrub verifies the integrity of blocks that already exist on disk. It reads all allocated blocks and verifies their checksums. Scrub repairs blocks that do not match their stored checksums and is a maintenance process designed to detect corruption.
A resilver rebuilds one or more devices after a failure or replacement. During a resilver, ZFS identifies which blocks need to be reconstructed on the replacement device. If there are multiple vdevs, a resilver can skip data on healthy vdevs, making it much faster than a scrub, which always reads all data, including parity, to verify that it is correct.
The performance profile of scrubs differs from resilvers in focus rather than scope. Both traverse the entire metadata tree, but a resilver only processes blocks belonging to missing or replaced devices, skipping healthy data, while a scrub examines all allocated blocks in the pool to detect silent corruption.
Automation and Monitoring
Most systems that include ZFS schedule scrubs once per month. This frequency is appropriate for many environments, but high-churn systems may require more frequent scrubs. Archival systems that contain largely static data also benefit from regular scrubbing, because infrequently accessed data is the most vulnerable to silent corruption.
Administrators can adjust scrub schedules through cron or through native periodic maintenance frameworks. When adjusting schedules, administrators must consider workload patterns, peak I/O periods, and the impact of scrubs on latency-sensitive applications. Scrubs consume read bandwidth across all devices, so scheduling scrubs during low activity periods often produces better performance and more predictable estimates.
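Alongside the schedule itself, a monitoring check can verify that scrubs are actually happening. The sketch below flags a pool whose last scrub is older than a threshold; the 35-day default is an illustrative choice (a monthly cadence with slack), and the last-scrub timestamp would come from parsing the scan line of zpool status.

```python
from datetime import datetime, timedelta

def scrub_overdue(last_scrub, now, max_age=timedelta(days=35)):
    """True if the pool has gone longer than `max_age` without a scrub.
    `last_scrub` would be parsed from the `scan:` line of zpool status."""
    return now - last_scrub > max_age

now = datetime(2025, 12, 10)
assert not scrub_overdue(datetime(2025, 11, 20), now)  # scrubbed 20 days ago
assert scrub_overdue(datetime(2025, 10, 1), now)       # 70 days: overdue
```

Run from cron or a monitoring agent, a check like this turns "we forgot to scrub for a quarter" from a silent risk into an actionable alert.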
Monitoring systems should record scrub duration, repair counts, error counts, and device anomalies. ZED provides immediate notifications for pool events, including scrub completion and errors encountered during scrubs. Integrating ZED with email, Slack, or a ticketing system ensures that administrators receive timely alerts.
When automation and monitoring are combined, scrubs become a predictable part of storage operations rather than an occasional task.
Best Practices
Administrators can significantly improve ZFS reliability by following consistent best practices related to scrubs, redundancy, monitoring, and hardware management.
- Run Scrubs On a Regular Basis
Monthly scrubs are the most common, but every environment is different. No environment should go more than 4 months without a scrub. The goal is to prevent the accumulation of corruption rather than to react to it.
- Respond to Scrub Findings Immediately
Even small increases in repaired counts should be investigated. Repeated checksum errors on a device indicate instability, and devices that produce repeated errors should be replaced before they cause degraded performance or unrecoverable damage.
- Maintain Hardware Properly
Faulty or loose cables cause checksum errors that resemble disk failure, but they are only one category of problems. I have also run into controller firmware bugs that caused intermittent errors even when the cabling was solid. Similarly, drive trays with vibration issues can also degrade device performance. You should verify hardware health regularly, especially when repeated scrub anomalies appear.
- Maintain Separate Backups
Although ZFS provides strong protection against many forms of data loss, it is not a substitute for backups and proper disaster recovery policies. Scrubs cannot repair data that was written in a corrupted state or data that was overwritten accidentally.
Wrapping Up
ZFS scrubs conduct complete, block-level verification of the entire pool. They validate checksums, repair corruption through redundancy, and warn you about developing problems.
A well planned ZFS deployment includes scrub automation, monitoring, redundancy planning, hardware validation, and operational discipline. This combination ensures predictable behavior across years of service.
Klara supports countless organizations that rely on ZFS for production workloads. Our team assists with designing scrub policies, pool architecture, and long-term scaling strategies. With Klara’s ZFS Storage Design service, work with ZFS engineers to ensure you make the right choices and are well served by your storage in the long term.