Sun’s ZFS now has built-in deduplication, using a hash function to map duplicate blocks of data to a single stored block instead of storing multiple copies.
Deduplication is the process of eliminating duplicate copies of data: identical blocks are stored once and referenced from every location that uses them. When data is highly replicated, as is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples.
What to dedup: Files, blocks, or bytes?
Dedup is generally either file-level, block-level, or byte-level. Chunks of data — files, blocks, or byte ranges — are checksummed using a hash function that identifies the data uniquely with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256, or roughly 10^-77.
For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.
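To make the idea concrete, here is a minimal sketch (not ZFS code) of block-level dedup: data is split into fixed-size blocks, each block is hashed with SHA256, and only one copy per unique digest is kept, with the logical layout recorded as a list of digests. The function name and 4 KB block size are illustrative assumptions.

```python
import hashlib

def dedup_blocks(data: bytes, block_size: int = 4096):
    """Illustrative block-level dedup: store one copy per unique
    SHA-256 digest; the digest list acts as the 'block pointers'."""
    store = {}     # digest -> block contents, stored once
    pointers = []  # logical layout: one digest per block
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep first copy only
        pointers.append(digest)
    return store, pointers

# Highly replicated data: 8 logical blocks, only 2 unique ones stored.
data = b"A" * 4096 * 4 + b"B" * 4096 * 4
store, pointers = dedup_blocks(data)
print(len(pointers), len(store))  # → 8 2
```

Rebuilding the original data is just a matter of walking the pointer list and concatenating the referenced blocks, which is how the space saving comes for free: the same two stored blocks serve all eight logical positions.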
File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedup requires more processing power, and is said to be good for virtual machine images. Byte-range dedup uses the most processing power and is ideal for small pieces of data that may be replicated but are not block-aligned, such as e-mail attachments; Sun reckons such deduplication is best done at the application level, since the application knows the structure of its data. ZFS provides block-level deduplication using SHA256 hashing, which maps naturally onto ZFS’s 256-bit block checksums. The deduplication is done inline, with ZFS assuming it is running on a multi-threaded operating system and a server with lots of processing power. A multi-core server, in other words.
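In practice, turning this on is a one-line administrative change. A sketch of the relevant commands, assuming a hypothetical pool named `tank`:

```shell
# Enable inline block-level dedup on the pool's root dataset;
# with dedup=on, ZFS uses SHA256 block checksums for matching.
zfs set dedup=on tank

# Check how much space dedup is saving across the pool.
zpool get dedupratio tank
```

The `dedupratio` property reports the ratio of logically referenced data to physically stored data, so a value well above 1.00x indicates the workload is benefiting.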
More information can be found at Jeff Bonwick’s blog.