Wednesday, September 19, 2012

CopyFile will always copy a file or raise an error if not, right? It verifies that the data was really, really copied, right? [Hint: Nope]

The Old New Thing - Does the CopyFile function verify that the data reached its final destination successfully?

A customer had a question about data integrity via file copying.

I am using File.Copy to copy files from one server to another. If the call succeeds, am I guaranteed that the data was copied successfully? Does the File.Copy method internally perform a file checksum or something like that to ensure that the data was written correctly?

The File.Copy method uses the Win32 CopyFile function internally, so let's look at CopyFile.

CopyFile just issues ReadFile calls from the source file and WriteFile calls to the destination file. (Note: Simplification for purposes of discussion.) It's not clear what you are hoping to checksum. If you want CopyFile to checksum the bytes when they return from ReadFile, and checksum the bytes as they are passed to WriteFile, and then compare them at the end of the operation, then that tells you nothing, since they are the same bytes in the same memory.
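To make that point concrete, here is a rough Python sketch of the simplified read/write loop described above. It is not how CopyFile is actually implemented; the function name `copy_with_buffer_checksums` and the chunk size are made up for illustration. It hashes each buffer once "after the read" and once "before the write", and the two digests can never differ, because both are computed from the same bytes in memory.

```python
import hashlib


def copy_with_buffer_checksums(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src to dst, hashing each buffer after reading and before writing.

    Hypothetical helper: a stand-in for the simplified copy loop, not a real API.
    """
    read_hash = hashlib.sha256()
    write_hash = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            buf = src.read(chunk_size)
            if not buf:
                break
            read_hash.update(buf)   # checksum of the bytes as returned by the read
            write_hash.update(buf)  # checksum of the very same bytes handed to the write
            dst.write(buf)
    return read_hash.hexdigest(), write_hash.hexdigest()
```

Comparing the two returned digests says nothing about what ended up on the destination media, which is exactly the objection raised above.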

...

But wait, there's also the problem of caching controllers. Even when you tell the hard drive, "Now read this data from the physical media," it may decide to return the data from an onboard cache instead. You would have to issue a "No really, flush the data and read it back" command to the controller to ensure that it's really reading from physical media.

And even if you verify that, there's no guarantee that the moment you declare "The file was copied successfully!" the drive platter won't spontaneously develop a bad sector and corrupt the data you just declared victory over.

This is one of those "How far do you really want to go?" type of questions. You can re-read and re-validate as much as you want at copy time, and you still won't know that the file data is valid when you finally get around to using it.

Sometimes, you're better off just trusting the system to have done what it says it did.

If you really want to do some sort of copy verification, you'd be better off saving the checksum somewhere and having the ultimate consumer of the data validate the checksum and raise an integrity error if it discovers corruption.
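One way to read that suggestion, sketched in Python under some assumptions of my own: the producer stores a SHA-256 digest in a sidecar file next to the data (the `.sha256` suffix is just a convention chosen here), the sidecar travels with the file, and the consumer recomputes the digest right before using the data. The names `write_checksum`, `read_verified`, and `IntegrityError` are hypothetical.

```python
import hashlib
from pathlib import Path


class IntegrityError(Exception):
    """Raised when a file's contents do not match its saved checksum."""


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path):
    """Producer side: save the digest in a sidecar file next to the data."""
    digest = file_sha256(path)
    Path(str(path) + ".sha256").write_text(digest)
    return digest


def read_verified(path):
    """Consumer side: check the data against the sidecar, then return the bytes."""
    expected = Path(str(path) + ".sha256").read_text().strip()
    actual = file_sha256(path)
    if actual != expected:
        raise IntegrityError(f"{path}: expected {expected}, got {actual}")
    return Path(path).read_bytes()
```

The check happens at the point of use rather than at copy time, so it also catches corruption that creeps in after the copy reported success.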

I've run into this issue... Using the standard file copy APIs, files were not getting fully copied, but somehow corrupted in the middle somewhere. And due to the pipeline, it wasn't being found for a while. No errors, not disk issues, not anti-virus, but the 0's and 1's were not the same (nor, of course, were their hashes). And it wasn't consistent. For months, no issues, and then a bunch. And it wasn't all or even many files, a few in a million, but enough to be a headache. So I built in a series of validations, checks, retries and such to try to keep this from happening, or at least to recover when it did. And all that time I really didn't know why it was happening. I mean, I thought I should be good using FileCopy/CopyFile, right? I thought I was crazy...

Ah... then I read Raymond's post. See, I wasn't crazy!
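For the curious, the kind of validate-and-retry wrapper mentioned above could look roughly like this in Python. This is a simplified sketch, not the code from that pipeline: `copy_verified`, the hash choice, and the retry count are all made up for illustration, and `shutil.copyfile` stands in for whatever copy API is in use. As Raymond's post points out, reading the destination back may be served from a cache, so a match here is a sanity check rather than proof the bytes are safely on the media.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src, dst, attempts=3):
    """Copy src to dst, re-copying if the destination hash does not match.

    Returns the number of attempts used; raises OSError if all attempts fail.
    """
    expected = sha256_of(src)
    for attempt in range(1, attempts + 1):
        shutil.copyfile(src, dst)
        if sha256_of(dst) == expected:
            return attempt
    raise OSError(f"{dst} did not match {src} after {attempts} attempts")
```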
