Tuesday, March 02, 2010

Stacks and stacks of data - Your copy of Stack Overflow’s (and family’s) public data is a download away

stackoverflow blog - Creative Commons Data Dump Mar 10

“The latest version of the Stack Overflow Trilogy Creative Commons Data Dump is now available. This reflects all public data in …

  • Stack Overflow
  • Server Fault
  • Super User
  • Meta Stack Overflow

… up to March 2010.”


stackoverflow blog - Stack Overflow Creative Commons Data Dump

“We decided early on that all user-generated content on Stack Overflow would be under a Creative Commons license.

All those great Stack Overflow questions, answers, and comments, so generously contributed by all of you, are licensed under cc-wiki:

The current anonymized public data dump is ~500 megabytes, 7zipped, and contains these files:

  1. badges.xml
  2. comments.xml
  3. posts.xml
  4. users.xml
  5. votes.xml

All four Trilogy sites are now included in the data dump: Stack Overflow, Server Fault, Super User, and Meta Stack Overflow.”


I have to wonder if I can use this data…

In my day job, we process large amounts of production data, but testing and demoing that processing is hard. Finding enough “open” and “safe” data is tough. You want “real” data, but of course you can’t use client/production data, so you’re kind of stuck between a rock and a hard case.

So while this data isn’t a normal EDD/ESI discovery target (i.e., not emails or loose files), I find its content interesting. Think about using it for language analysis and detection, for conversation thread management, or for testing at scale.
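
As a rough illustration of the conversation-thread idea, here is a minimal Python sketch that streams a posts file and groups answers under their question. It rests on assumptions the post above does not confirm: that posts.xml holds one `row` element per post, and that each row carries `Id` and (for answers) `ParentId` attributes. The `group_threads` function and the inline `SAMPLE` data are made up for the example, so check the real file layout before relying on any of it.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_threads(source):
    """Stream a posts file and map each question id to its answer ids.

    Assumed layout (not confirmed by the post): one <row> element per
    post, with an Id attribute and, for answers, a ParentId attribute.
    """
    threads = defaultdict(list)
    # iterparse reads the file incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        post_id = elem.get("Id")
        parent_id = elem.get("ParentId")
        if parent_id:
            # Assumed: an answer points at the question it belongs to.
            threads[parent_id].append(post_id)
        else:
            # A post without a parent starts its own thread.
            threads.setdefault(post_id, [])
        elem.clear()  # drop the row's data once it has been processed
    return dict(threads)


# Hypothetical sample data, standing in for the real posts.xml.
SAMPLE = b"""<posts>
  <row Id="1" Body="How do I parse XML?" />
  <row Id="2" ParentId="1" Body="Use a streaming parser." />
  <row Id="3" ParentId="1" Body="Or load it into a database." />
  <row Id="4" Body="What is a data dump?" />
</posts>"""

threads = group_threads(io.BytesIO(SAMPLE))
print(threads)
```

To try it on the real dump, pass the path of the extracted posts file to `group_threads` in place of the in-memory sample. Streaming matters here because the uncompressed files are far larger than the ~500 MB archive.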



