Tuesday, March 02, 2010

Stacks and stacks of data - Your copy of Stack Overflow’s (and family’s) public data is a download away

stackoverflow blog - Creative Commons Data Dump Mar 10

“The latest version of the Stack Overflow Trilogy Creative Commons Data Dump is now available. This reflects all public data in …

  • Stack Overflow
  • Server Fault
  • Super User
  • Meta Stack Overflow

… up to March 2010.”


stackoverflow blog - Stack Overflow Creative Commons Data Dump

“We decided early on that all user-generated content on Stack Overflow would be under a Creative Commons license.

All those great Stack Overflow questions, answers, and comments, so generously contributed by all of you, are licensed under cc-wiki:

The current anonymized public data dump is ~500 megabytes, 7zipped, and contains these files:

  1. badges.xml
  2. comments.xml
  3. posts.xml
  4. users.xml
  5. votes.xml

All four Trilogy sites are now included in the data dump: Stack Overflow, Server Fault, Super User, and Meta Stack Overflow.”


I have to wonder if I can use this data…

In my day job, we process large amounts of production data, but testing and demoing that processing is hard. Finding enough “open” and “safe” data is tough. You want “real” data, but of course you can’t use client/production data, so you’re kind of stuck between a rock and a hard place.

So while this data isn’t a normal EDD/ESI discovery target (i.e. it isn’t email or loose files), I find the thought of its content interesting. Think about language analysis and detection. Think about conversation thread management. Think about using it to test at scale.
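As a rough idea of what the thread management angle might look like, here is a minimal Python sketch that streams a posts.xml-style file and groups answers under their question. The element and attribute names (`row`, `Id`, `PostTypeId`, `ParentId`, with 1 meaning question and 2 meaning answer) are my assumptions about the dump's layout and should be checked against the actual files; `group_threads` and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def group_threads(source):
    """Map each question Id to the list of its answer Ids.

    ASSUMPTION: every post is a <row> element carrying Id, PostTypeId and
    ParentId attributes (PostTypeId 1 = question, 2 = answer).
    """
    threads = {}
    # iterparse reads the file incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        post_type = elem.get("PostTypeId")
        if post_type == "1":
            # Register the question even if it has no answers yet.
            threads.setdefault(elem.get("Id"), [])
        elif post_type == "2" and elem.get("ParentId") is not None:
            threads.setdefault(elem.get("ParentId"), []).append(elem.get("Id"))
        elem.clear()  # drop the row's attributes once it has been handled
    return threads


# Tiny made-up sample standing in for the real posts.xml.
SAMPLE = b"""<posts>
  <row Id="1" PostTypeId="1" />
  <row Id="2" PostTypeId="2" ParentId="1" />
  <row Id="3" PostTypeId="2" ParentId="1" />
  <row Id="4" PostTypeId="1" />
</posts>"""

threads = group_threads(io.BytesIO(SAMPLE))
for question_id, answer_ids in threads.items():
    print(question_id, answer_ids)
```

For the real dump you would pass the path of the extracted posts.xml instead of the in-memory sample. Streaming matters here because the extracted files are far larger than the ~500 megabyte archive.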



Related Past Post XRef:
Need a ton of email data (10’s of gig’s)? Need it in PST form? Need it to be public data? Want to look behind the curtain into Enron? The EDRM Data Set Project is for you…
Need test/sample/demo data that’s safe for public (and/or client) consumption? Then GenerateData.com!
Data, data, everywhere free data… At least in the Guardian’s Data Store – Tons of data, all free and all delivered via Google Spreadsheets (get your mashup engines started)
