Windows Server 2012's Data Deduplication - What is it, how will it help you and why you should care... - Greg's Cool [Insert Clever Name] of the Day

The Storage Team at Microsoft - File Cabinet Blog - Introduction to Data Deduplication in Windows Server 2012

"Hi this is Scott Johnson and I’m a Program Manager on the Windows File Server team. I’ve been at Microsoft for 17 years and I’ve seen a lot of cool technology in that time. Inside Windows Server 2012 we have included a pretty cool new feature called Data Deduplication that enables you to efficiently store, transfer and backup less data.

This is the result of an extensive collaboration with Microsoft Research and after two years of development and testing we now have state-of-the-art deduplication that uses variable-chunking and compression and it can be applied to your primary data. The feature is designed for industry standard hardware and can run on a very small server with as little as a single CPU, one SATA drive and 4GB of memory. Data Deduplication will scale nicely as you add multiple cores and additional memory. This team has some of the smartest people I have worked with at Microsoft and we are all very excited about this release.

Does Deduplication Matter?

Hard disk drives are getting bigger and cheaper every year, why would I need deduplication? Well, the problem is growth. Growth in data is exploding so much that IT departments everywhere will have some serious challenges fulfilling the demand. Check out the chart below where IDC has forecasted that we are beginning to experience massive storage growth. Can you imagine a world that consumes 90 million terabytes in one year? We are about 18 months away!

...

Welcome to Windows Server 2012!

This new Data Deduplication feature is a fresh approach. We just submitted a Large Scale Study and System Design paper on Primary Data Deduplication to USENIX to be discussed at the upcoming Annual Technical Conference in June.

...

5) Sub-file chunking: Deduplication segments files into variable-sized chunks (32-128 KB) using a new algorithm developed in conjunction with Microsoft Research. The chunking module splits a file into a sequence of chunks in a content-dependent manner. The system uses a Rabin fingerprint-based sliding window hash on the data stream to identify chunk boundaries. The chunks have an average size of 64KB, and they are compressed and placed into a chunk store located in a hidden folder at the root of the volume called System Volume Information, or the “SVI folder”. The normal file is replaced by a small reparse point, which has a pointer to a map of all the data streams and chunks required to “rehydrate” the file and serve it up when it is requested.

...

It slices, it dices, and it cleans your floors!

Well, the Data Deduplication feature doesn’t do everything in this version. It is only available in certain Windows Server 2012 editions and has some limitations. Deduplication was built for NTFS data volumes and it does not support boot or system drives and cannot be used with Cluster Shared Volumes (CSV). We don’t support deduplicating live VMs or running SQL databases. See how to determine which volumes are candidates for deduplication on TechNet.

Try out the Deduplication Data Evaluation Tool

To aid in the evaluation of datasets we created a portable evaluation tool. When the feature is installed, DDPEval.exe is installed to the \Windows\System32\ directory. This tool can be copied and run on Windows 7 or later systems to determine the expected savings that you would get if deduplication was enabled on a particular volume. DDPEval.exe supports local drives and also mapped or unmapped remote shares. You can run it against a remote share on your Windows NAS, or an EMC / NetApp NAS and compare the savings.

Summary:

I think that this new deduplication feature in Windows Server 2012 will be very popular. It is the kind of technology that people need and I can’t wait to see it in production deployments. I would love to see your reports at the bottom of this blog of how much hard disk space and money you saved. Just copy the output of this PowerShell command: PS> Get-DedupVolume

30-90%+ savings can be achieved with deduplication on most types of data. I have a 200GB drive that I keep throwing data at and now it has 1.7TB of data on it. It is easy to forget that it is a 200GB drive.

Deduplication is easy to install and the default settings won’t let you shoot yourself in the foot.

Deduplication works hard to detect, report and repair disk corruptions.

You can experience faster file download times and reduced bandwidth consumption over a WAN through integration with BranchCache.

Try the evaluation tool to see how much space you would save if you upgrade to Windows Server 2012!

..."

I still think the coolest part of this feature, besides the fact that we finally get data dedupe baked into a normal edition of Windows Server (and not just the Storage Server editions), is the sub-file or chunk deduping. It's not file-level but block-level dedupe. So the files do not need to be duplicates to take advantage of this, just parts of them need to be duplicated...
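
To make the sub-file idea concrete, here is a toy sketch in Python of content-defined chunking plus a chunk store. It is not Microsoft's implementation: it swaps a simple polynomial rolling hash in for the Rabin fingerprint, uses tiny chunk sizes (64 to 1024 bytes rather than 32-128 KB), skips compression, and keeps the chunk store in a dictionary. All names and constants (chunk, store_file, rehydrate, WINDOW, MASK and so on) are made up for illustration.

```python
import hashlib
import random

# Illustrative parameters only; the real feature works with 32-128 KB chunks.
WINDOW = 16           # bytes covered by the sliding window
MASK = (1 << 8) - 1   # cut when the low 8 bits of the hash are zero
MIN_CHUNK = 64        # never cut before this many bytes
MAX_CHUNK = 1024      # always cut at this many bytes
BASE = 257
MOD = (1 << 31) - 1


def chunk(data):
    """Split data into chunks whose boundaries depend on the content."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        pos = i - start
        if pos >= WINDOW:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store_file(data, chunk_store):
    """Put a file's chunks in chunk_store and return its map (list of hashes)."""
    file_map = []
    for c in chunk(data):
        digest = hashlib.sha256(c).hexdigest()
        chunk_store.setdefault(digest, c)  # a chunk already present is not stored again
        file_map.append(digest)
    return file_map


def rehydrate(file_map, chunk_store):
    """Rebuild the original bytes from a file map."""
    return b"".join(chunk_store[d] for d in file_map)


# Two "files" that are not duplicates: the second has a few bytes inserted.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = original[:5000] + b"a few inserted bytes" + original[5000:]

store = {}
map_a = store_file(original, store)
map_b = store_file(edited, store)
stored = sum(len(c) for c in store.values())
print("logical bytes:", len(original) + len(edited))
print("stored bytes: ", stored)
```

Because the cut points follow the content instead of fixed offsets, the insertion in the second file only disturbs the chunks around it, and the rest are found in the store already. With fixed-size blocks, everything after the insertion would shift and stop matching.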