Tuesday, July 27, 2010

Data, files and drive images… Computer Forensic Reference Data Sets (CFReDS), Digital Corpora and “1 million files.” Oh my

SANS Computer Forensic Investigations and Incident Response Blog - I’m here! Now what?

“…

Staying sharp can be tough.  There are many high quality blogs and forums that are fantastic resources for learning and exchanging information, but I’m the type of person who learns by doing, not just reading.  However, you can only image your own hard drive and examine it for practice so many times before you’re bored to death with it.  Fortunately, in addition to the free and low cost tools out on the net, there are also a number of freely available disk images available for download.  There are images available in several different file system formats, so you won’t find yourself limited to just one type. The images have documented content which can be used to compare against the data your tools produce.

The site I’ve most taken advantage of when downloading images is …

…”

Computer Forensic Reference Data Sets (CFReDS)

“NIST is developing Computer Forensic Reference Data Sets (CFReDS) for digital evidence. These reference data sets (CFReDS) provide to an investigator documented sets of simulated digital evidence for examination.  Since CFReDS would have documented contents, such as target search strings seeded in known locations of CFReDS, investigators could compare the results of searches for the target strings with the known placement of the strings. Investigators could use CFReDS in several ways including validating the software tools used in their investigations, equipment check out, training investigators, and proficiency testing of investigators as part of laboratory accreditation. The CFReDS site is a repository of images. Some images are produced by NIST, often from the CFTT (tool testing) project, and some are contributed by other organizations. National Institute of Justice funded this work in part through an interagency agreement with the NIST Office of Law Enforcement Standards.

In addition to test images, the CFReDS site contains resources to aid in creating your own test images. These creation aids will be in the form of interesting data files, useful software tools and procedures for specific tasks.

…”
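The "compare your search results against the known placement of the strings" idea is easy to sketch in a few lines of Python. This is only a rough illustration, not anything from NIST: the function names, the `(string, byte offset)` layout for the documented placements, and the tiny in-memory "image" are all my own assumptions. A real CFReDS image comes with its own documentation that you would load instead.

```python
def find_string_offsets(image_bytes, targets):
    """Return every (target, byte_offset) hit found in a raw image buffer."""
    hits = []
    for target in targets:
        needle = target.encode("ascii")
        start = image_bytes.find(needle)
        while start != -1:
            hits.append((target, start))
            # Resume one byte later so overlapping occurrences are not skipped.
            start = image_bytes.find(needle, start + 1)
    return hits


def score_search_hits(expected, found):
    """Compare tool output against documented placements.

    Both arguments are iterables of (search_string, byte_offset) pairs.
    This pair layout is an assumption made for the sketch.
    """
    expected, found = set(expected), set(found)
    return {
        "matched": sorted(expected & found),
        "missed": sorted(expected - found),
        "unexpected": sorted(found - expected),
    }


# Hypothetical stand-in for a drive image with two seeded strings.
image = b"\x00" * 16 + b"SEEDONE" + b"\x00" * 9 + b"SEEDTWO"
documented = [("SEEDONE", 16), ("SEEDTWO", 32), ("SEEDTHREE", 100)]

found = find_string_offsets(image, ["SEEDONE", "SEEDTWO", "SEEDTHREE"])
report = score_search_hits(documented, found)
```

Anything that lands in "missed" or "unexpected" is where you would start asking questions about the tool (or about your own search settings).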

Digital Corpora 

“DigitalCorpora.org is a website of digital corpora for use in computer forensics research. Some of the corpora on this website are freely available, while others are only available to researchers under special arrangement.

From here you can view the available:

…”

Digital Corpora  - Govdocs1 — (nearly) 1 million freely-redistributable files

“In recent years a significant amount of forensic research has involved the analysis of files or file fragments. In the absence of such corpora, researchers and students who wish to work with files first need to collect files—a surprisingly difficult task if one wishes a large number of files of many types from a variety of sources. Although many files can be freely downloaded from the web, building and running a high performance document discovery and downloading tool is not a trivial task. Once files are downloaded they need to be analyzed, characterized and curated. Finally, many corpora that might be assembled cannot be easily redistributed due to privacy or copyright concerns.

For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain using the Yahoo and Google search engines.

Each file in the corpus is presented as a numbered file with a file extension (e.g. 0000001.jpg). The file extension is typically the file extension that was provided to us when the file was downloaded. The file extension is a suggestion—it is not part of the corpus.

We are making the corpus available in several ways:

  • Distributed as a set of 1000 directories, with 1000 files in each directory, …
  • Distributed as a set of 1000 ZIP files, each with 1000 files …
  • As a set of 10 subset “threads” (subset0.zip through subset9.zip), each one containing 1000 randomly chosen documents. …
  • Through a search interface that allows searching for any file by search term …

…”
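Since the extension on each file is only a suggestion, one of the first things you might do with a pile of these files is check what each one looks like from its leading bytes. Here is a minimal sketch of that, assuming the files have been unpacked under some local directory. The signature table is my own short, incomplete list of well-known magic numbers, and `sniff_type` and `extension_mismatches` are made-up names, not part of the corpus or its tooling.

```python
from pathlib import Path

# A few well-known file signatures (far from complete).
MAGIC = [
    (b"\xff\xd8\xff", "jpg"),
    (b"%PDF", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),
]


def sniff_type(header):
    """Guess a type from the leading bytes, or None if nothing matches."""
    for magic, kind in MAGIC:
        if header.startswith(magic):
            return kind
    return None


def extension_mismatches(root):
    """List (file name, claimed extension, detected type) where they differ."""
    mismatches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            detected = sniff_type(fh.read(16))
        claimed = path.suffix.lstrip(".").lower()
        # Files with no recognized signature are skipped, not flagged.
        if detected is not None and detected != claimed:
            mismatches.append((path.name, claimed, detected))
    return mismatches
```

Treat the output as a list of leads and not a verdict. A .docx file, for example, is a ZIP container and would show up here as a "mismatch", as would a .jpeg extension on a perfectly good JPEG.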

While most of you won’t care about this, I do and since it’s my blog… :p

Sometimes you just need a lot of files, and finding enough that are “safe” can be hard. You can create them yourself, but that can be a pain. What you really want is a collection of thousands to hundreds of thousands of files, of mixed types, that are a known quantity and are freely redistributable.

Looks like what you want is right here… :)

 

Related Past Post XRef:
Need a ton of email data (10’s of gig’s)? Need it in PST form? Need it to be public data? Want to look behind the curtain into Enron? The EDRM Data Set Project is for you…
EDRM Enron Reference Data v2 now available
Stacks and stacks of data - Your copy of the Stack Overflow’s (and family) public data is a download away
