Wednesday, September 29, 2010

And even more Enron (PSTs, that is). We’re talking 107 GB, compressed, of data…

EDRM - Data Set

“The EDRM Data Set Project provides industry-standard, reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of e-discovery software and services, through three initiatives:

  • EDRM ESI Reference Data Sets
  • EDRM Software Reference Data Set
  • EDRM Probabilistic Hash Data Set

PLEASE NOTE: These files may contain viruses, as can be the case with any set of files collected during discovery. Appropriate caution should be used when handling the files.

EDRM ESI Reference Data Sets

This initiative collects, evaluates, and publishes ESI data sets for use in testing e-discovery software and services. There are currently four data sets available:

EDRM Enron Email Data Set v2: An updated set of Enron e-mail messages and attachments:

  • More custodians (150), more email
  • 153 zipped .pst and 159 zipped .xml files
  • Approximately 107 GB zipped
  • Email now organized by custodian folder, not by collection + custodian folder; to remove duplicates that occurred in the collection process and make the set appear more like users’ standard mailboxes
  • Email now fixed to handle multi-line MIME headers
  • Now with corresponding xml files in EDRM XML format

…”

EDRM - EDRM Enron Email Data Set v2

“The EDRM Enron Email Data Set v2 consist of Enron e-mail messages and attachments in two sets of downloadable compressed files: XML and PST.

Files in each group are organized by custodian and listed alphabetically with compressed file sizes in parentheses. Materials for some custodians are spread across more than a single XML or PST file.

Select any combination of XML files or any combination of PST files to download.

…”

EDRM - EDRM Enron Email Data Set v2 Information Files Available

“Three files documenting the EDRM Enron Email Data Set v2 are now available.  They are:

  • edrm-enron-v2_pst-md5.txt: This file includes a listing of the PST ZIP files and the PST files themselves along with MD5 hashes
  • edrm-enron-v2_xml-md5.txt: This file includes a listing of the XML ZIP files along with MD5 hashes
  • edrm-enron-v2_dataset-info-md5.json: This JSON file includes information on the dataset including file counts, document counts, files by custodian, and MD5 hashes for PST ZIP, PST, and XML ZIP files

…”
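Since MD5 listings are published, it's worth checking each download against them, especially at these file sizes. Here is a minimal Python sketch of that check. The function names are made up for illustration, and because the exact layout of the edrm-enron-v2_*-md5.txt files isn't shown here, the sketch assumes you copy the expected hash for a given ZIP out of the listing by hand rather than parsing the file.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte ZIPs.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """Compare a file's MD5 with a hash copied from the published listing.

    The comparison ignores surrounding whitespace and letter case.
    """
    return md5_of_file(path) == published_md5.strip().lower()
```

Call matches_published_md5() with the path to a downloaded ZIP and the hash string from the listing; a False result suggests a corrupt or incomplete download. Keep in mind that MD5 only catches accidental corruption, so the virus warning above still applies to files that verify cleanly.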

That’s officially “one big boat load” of data…

 

Related Past Post XRef:
EDRM Enron Reference Data v2 now available
Need a ton of email data (10’s of gig’s)? Need it in PST form? Need it to be public data? Want to look behind the curtain into Enron? The EDRM Data Set Project is for you…
EDRM - Electronic Discovery Reference Model
