Friday, October 14, 2011

Chucking your duplicate chunks... Microsoft Research, "Eliminating Duplicated Primary Data"

Microsoft Research - Eliminating Duplicated Primary Data

"The amount of data created and stored in the world doubles about every 18 months. Some of that data is distinctive—but by no means all of it. A PowerPoint presentation might start bouncing around a work group, and within a week, many nearly identical copies could be scattered across an enterprise’s desktops or servers.

Eliminating redundant data is a process called data deduplication. It’s not new—appliances that scrub hard drives and delete duplicate data have existed for years. But those appliances work across multiple backup copies of the data. They do not touch the primary copy of the data, which exists on a live file server.

Deduplicating as data is created and accessed—primary data, as opposed to backup data—is challenging. The process of deduplicating data consumes processing power, memory, and disk resources, and deduplication can slow data storage and retrieval when operating on live file systems.

...

Sengupta and Li next tackled the problem of detecting duplicated data. That required building and maintaining an index of existing data fragments—also called “chunks”—in the system. Their goal was to make the indexing process perform well with low resource usage. The Microsoft Research team’s solution is based on a technology they designed called ChunkStash, for “chunk metadata store on flash.” ChunkStash stores the chunk metadata on flash memory in a log-structured manner, accesses it using a low RAM-footprint index, and exploits the fast-random-access nature of the flash device to determine whether new data is unique or duplicate. Not all of the performance benefits of ChunkStash are dependent on the use of flash memory, and ChunkStash also greatly accelerates deduplication when hard disks alone are used for storage, which is the case in most server farms.
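(An aside from me, not the article: here's a toy Python sketch of the ChunkStash idea as I read it. The class and method names are made up for illustration; if I'm reading the ChunkStash paper right, the real thing keeps compact ~2-byte signatures in a RAM-resident cuckoo hash table and appends the full chunk metadata to a log on flash, but the shape of the trade is the same -- a few bytes of RAM per chunk instead of a full hash plus pointer.)

    # Toy sketch of a ChunkStash-style index -- my own illustration, not
    # Microsoft's code. Full chunk metadata lives in an append-only log
    # (flash, in ChunkStash); RAM holds only a compact signature -> offset map.
    import struct

    class ChunkIndex:
        def __init__(self, log_path):
            self.log = open(log_path, "a+b")   # append-only metadata log
            self.ram = {}                      # 2-byte signature -> [log offsets]

        def _sig(self, chunk_hash):
            # Keep only a short prefix of the 20-byte SHA-1 in RAM to cut
            # the footprint; collisions are resolved by reading the log.
            return chunk_hash[:2]

        def is_duplicate(self, chunk_hash):
            for off in self.ram.get(self._sig(chunk_hash), []):
                self.log.seek(off)
                if self.log.read(20) == chunk_hash:   # full hash from the log
                    return True
            return False

        def insert(self, chunk_hash, disk_location):
            # Append metadata (full hash + where the chunk lives on disk).
            self.log.seek(0, 2)                # jump to end of log
            off = self.log.tell()
            self.log.write(chunk_hash + struct.pack("<Q", disk_location))
            self.ram.setdefault(self._sig(chunk_hash), []).append(off)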

...

Product-Team Engagement

Sengupta and Li’s work on deduplication caught the eye of the Windows Server team, which was in the early stages of working on Windows Server 8. The opportunity to include deduplication in the release was tempting, driven by customer needs and industry trends.

“Storage deduplication,” says Thomas Pfenning, general manager for Windows Server, “is the No. 1 technology customers are considering when investing in file-based storage solutions.”

The process of deduplication breaks up data into smaller fragments that become the targets for deduplication. These fragments could be entire files or “chunks” of a few kilobytes. Because data is subject to edits and modifications over time, breaking data into smaller chunks and deduplicating those smaller pieces might be more effective than finding and deduplicating entire files.
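(Another aside from me: the usual trick for the "chunks of a few kilobytes" part is content-defined chunking -- slide a rolling hash over the data and cut a chunk wherever the hash hits a magic bit pattern, so boundaries follow the content rather than fixed byte offsets. A toy sketch, with made-up constants; real systems use something like a Rabin fingerprint:)

    # Toy content-defined chunker -- my own sketch, not the shipping
    # algorithm. Cut a chunk wherever the low bits of a rolling hash are
    # all zero; min/max bounds keep chunks from getting tiny or giant.
    import hashlib

    MASK = (1 << 13) - 1                 # 13 zero bits -> ~8 KB average chunk
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

    def chunks(data):
        start, h = 0, 0
        for i, byte in enumerate(data):
            # Cheap shift-based hash: a byte's influence shifts out of the
            # 32-bit state after 32 steps, so it acts like a sliding window.
            h = ((h << 1) + byte) & 0xFFFFFFFF
            size = i - start + 1
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]           # trailing partial chunk

    def chunk_hashes(data):
        return [hashlib.sha1(c).digest() for c in chunks(data)]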

...

They discovered that chunking data resulted in significantly larger savings compared with deduplication of entire files.

..."

We're going to be hearing more and more about deduplication (dedupe), single instance storage (SIS), etc. in the coming years. Windows Server 8 is betting heavily on it. In my day job, dedupe and single instance storage are huge, and it's good to see this getting built into the OS, with so many big brains working on it. :)

What intrigued me in this article (and what we're seeing in Windows Server 8) is that they are looking beyond files and into dedupe/SIS for "chunks". These chunks can be on the file system or even in memory. Interesting...
