Tuesday, January 14, 2014

How Many "Documents" in a Gigabyte? It depends (and it's going up)

E-Discovery Search Blog - How Many Documents in a Gigabyte? An Updated Answer to that Vexing Question

For an industry that lives by the doc but pays by the gig, one of the perennial questions is: “How many documents are in a gigabyte?” Readers may recall that I attempted to answer this question in a post I wrote in 2011, “Shedding Light on an E-Discovery Mystery: How Many Docs in a Gigabyte.”

At the time, most people put the number at 10,000 documents per gigabyte, with a range of between 5,000 and 15,000. We took a look at just over 18 million documents (5+ terabytes) from our repository and found that our numbers were much lower. Despite variations among different file types, our average across all files was closer to 2,500. Many readers told us their experience was similar.

Just for fun, I decided to take another look. I was curious to see what the numbers might be in 2014 with new files and perhaps new file sizes. So I asked my team to help me with an update. Here is a report on the process we followed and what we learned.[1]

How Many Docs 2014?

For this round, we collected over 10 million native files (“documents” or “docs”) from 44 different cases....

...

Including all files gets us awfully close to 5,000 documents per gigabyte, which was the lower range of the industry estimates I found. If you pull out the EML files, the number drops to 3,594.39, which is midway between our 2011 estimate (2,500) and 5,000 documents per gigabyte.

Which is the right number for you? That depends on the type of files you have and what you are trying to estimate. What I can say is that for the types of office files typically seen in a review, the number isn’t 10,000 or anything close. We use a figure closer to 3,000 for our estimates.
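To make the spread concrete, here is a minimal Python sketch of the arithmetic, assuming nothing more than "documents = gigabytes x documents per gigabyte." The rates are the ones quoted above; the labels, the function names (`estimate_docs`, `estimate_table`) and the 50 GB collection size are hypothetical and only there for illustration, not part of the original post's method.

```python
# Docs-per-gigabyte rates quoted in the post. The labels are shorthand
# invented for this sketch, not terms used by the original author.
DOCS_PER_GB = {
    "2011 average": 2_500,
    "2014 working figure": 3_000,
    "2014 without EML files": 3_594.39,
    "2014 all files": 5_000,  # the post says "awfully close to" this
    "old rule of thumb": 10_000,
}


def estimate_docs(gigabytes: float, docs_per_gb: float) -> int:
    """Return a rounded document-count estimate for a collection size."""
    if gigabytes < 0 or docs_per_gb < 0:
        raise ValueError("gigabytes and docs_per_gb must be non-negative")
    return round(gigabytes * docs_per_gb)


def estimate_table(gigabytes: float) -> dict[str, int]:
    """Estimate the document count under every rate in DOCS_PER_GB."""
    return {
        label: estimate_docs(gigabytes, rate)
        for label, rate in DOCS_PER_GB.items()
    }


if __name__ == "__main__":
    # 50 GB is an arbitrary example size, not a figure from the post.
    for label, count in estimate_table(50).items():
        print(f"{label:>24}: {count:,} docs")
```

Run against the same collection, the old 10,000 rule of thumb predicts more than three times as many documents as the 3,000 working figure, which is the whole reason the choice of rate matters for review estimates.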


If you're in my industry, you'll have heard this question a thousand times and seen about a million calculators and a zillion charts attempting to answer it, and the answer in the end is usually "it depends." Yet we've been doing this for more than a decade now and are getting better at answering it. This post does a great, vendor-neutral job of trying to answer it.

You're not in the eDiscovery/ESI/LitSupport biz? I still think you might find this data interesting as it's something you might not have normally asked or considered...
