Monoidal Hashing for Data Deduplication
scannedinavian.com

Posted on 2025-12-03

The Problem

Can I beat rsync on both CPU and network costs when the copies of a file on the server and the client differ? That is, can I find the smallest difference between the server's copy and the client's copy, and send only the change, like rsync does?

As of <2025-12-03 Wed>, I don’t know for sure, but I think it’s likely.

rsync

Originally, rsync hashed a sliding window, moving byte by byte. This was very slow (but could use many cores/SIMD!).
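To make the sliding window concrete, here is a minimal sketch of a rolling checksum modeled on the weak checksum from the rsync technical report. The function names (`weak_checksum`, `roll`, `rolling_checksums`) are made up for illustration, and this is not rsync's actual code. The point is that moving the window by one byte updates the checksum in constant time instead of rehashing the whole window.

```python
M = 1 << 16  # modulus used for both halves of the checksum


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute the two 16-bit sums (a, b) for one window from scratch."""
    n = len(block)
    a = sum(block) % M
    # b weights earlier bytes more heavily: n*x0 + (n-1)*x1 + ... + 1*x(n-1)
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide the window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def rolling_checksums(data: bytes, n: int) -> list[int]:
    """Return the 32-bit checksum of every n-byte window in data."""
    if len(data) < n:
        return []
    a, b = weak_checksum(data[:n])
    sums = [a | (b << 16)]
    for i in range(n, len(data)):
        a, b = roll(a, b, data[i - n], data[i], n)
        sums.append(a | (b << 16))
    return sums
```

Each window still needs a lookup against the other side's block checksums, so the byte-by-byte walk remains the expensive part.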

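The naive approach of breaking a file into fixed 8k chunks fails if someone adds a single new byte at the beginning: every later chunk boundary shifts by one byte, so none of the chunk hashes match the old ones. This is called the boundary shift problem.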
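The boundary shift problem is easy to reproduce. The sketch below assumes SHA-256 as the chunk hash and uses hypothetical helper names (`sample_data`, `fixed_chunks`, `shared_chunks`). It cuts a buffer into fixed 8k chunks and counts how many chunk hashes two versions of the buffer have in common.

```python
import hashlib


def sample_data(blocks: int = 2048) -> bytes:
    """Deterministic pseudo-random test data (32 bytes per block)."""
    return b"".join(
        hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(blocks)
    )


def fixed_chunks(data: bytes, size: int = 8192) -> list[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_hashes(data: bytes, size: int = 8192) -> set[str]:
    """Hash every fixed-size chunk of data."""
    return {hashlib.sha256(c).hexdigest() for c in fixed_chunks(data, size)}


def shared_chunks(old: bytes, new: bytes, size: int = 8192) -> int:
    """Count chunk hashes that appear in both versions."""
    return len(chunk_hashes(old, size) & chunk_hashes(new, size))


if __name__ == "__main__":
    old = sample_data()           # 64 KiB, i.e. eight 8k chunks
    print(shared_chunks(old, old + b"\x00"))  # byte appended at the end
    print(shared_chunks(old, b"\x00" + old))  # byte added at the beginning
```

Appending a byte leaves the existing chunks intact, while prepending one moves every boundary, so the second count should drop to zero even though the two files differ by a single byte.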

Content Defined Chunking

[Content defined chunking](https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC…
