Windows file copy bug revisited | Jim's Random Notes

Operating systems use file caching to prevent having to read commonly-accessed information from the disk repeatedly. Depending on the situation, the operating system might keep the most recently read stuff in memory, or it might keep the most commonly accessed stuff in memory. Either way, the system dynamically adjusts how much memory it uses for caching. The idea is that if your programs aren’t using the memory, the OS can use it for cache with the understanding that if your programs need it, the cache gets flushed.

It’s kind of like using your neighbor’s covered parking place when he’s away, with the understanding that when he comes back, the space is his again.

A good caching scheme can make a big difference in the speed of disk access. Main memory is slow compared to the CPU, but the disk drive is glacial. If you can avoid hitting the disk for data that’s already been read, you’re way ahead.

A poor caching scheme can result in slower disk access, and an idiotic caching scheme can bring your computer to a halt. In at least one instance, Windows has an idiotic caching scheme that is in a very real sense a security vulnerability.

I’ve mentioned several times the idiotic file caching bug that causes Windows to come to a screeching halt when copying large files. I did a bit more research and testing, and I can say with some certainty what’s happening, although I don’t know all the internals of why. Here’s what appears to be happening.

Program requests the first block of the file.
The server reads the first block into memory, and sends it to the client.
Steps 1 and 2 are repeated many times, and each time the server saves the sent data in memory.
Over time, memory begins to fill with data that the server is caching. I can’t tell if the server is reading ahead in order to have the next block ready to send, or if it’s holding on to data already sent, just in case somebody else might want to read it.
When the cache fills free memory, the operating system starts looking for other memory to use. It starts paging idle programs to virtual memory. Then active programs. And then, as bizarre as it may seem, it looks like the operating system starts paging the cache to memory!

In the past, I thought that Windows was reading ahead, caching parts of the file that hadn’t yet been sent to the client. But my test results point more towards the other conclusion: that Windows is buffering things that it’s already sent. I freely admit that I could be wrong, as I haven’t yet been able to say with certainty which of the two is true. Read-ahead caching at least makes some limited amount of sense. Paging a disk cache to disk would be especially idiotic, but it would definitely explain why the system’s responsiveness continues to degrade after memory fills up. Still, I can’t prove that it’s actually doing that.

Without some serious kernel-level debugging, I don’t have any way to determine exactly what is being cached, but it mostly doesn’t matter. The result is the same: Windows pages out working programs and active data in favor of the file cache. Ultimately, this makes the server unusable.

It doesn’t take an exceptionally large file by today’s standards in order for this to happen. For example, trying to copy a 50 gigabyte file from a server that has 16 gigabytes of memory will exhibit this behavior. I’ve experienced this problem on Windows XP, Server 2008, and Windows Server 2008 R2.

This is a security vulnerability because a very large file in a shared directory is an invitation to a denial of service attack. All a user has to do is start copying that file. In short order, the file server’s memory will fill up, it will start swapping, and will stop serving requests. Worse, because the machine becomes non-responsive, there will be no way to see what’s causing the problem or to disconnect the offending computer. The system administrator is left with the options of 1) trying to find and disconnect the perpetrator at the other end; or 2) reset the file server. The first can be very difficult in a large organization, and the second will cause data corruption if there are any files open for writing.

Understand, a user doesn’t have to be a “black hat” in order to perpetrate this “attack”. He could be reading a file that he is authorized to read. The operating system is the culprit here.

What causes this?

In the past, I thought that the problem was caused by the CopyFile and CopyFileEx API functions, which the standard Windows copy commands (COPY, XCOPY, and ROBOCOPY) call to do their jobs. Whereas it’s true that those two functions do exhibit the problem, it’s not limited to those two functions.

It turns out that any program that reads through a large file can trigger this problem. Any program that uses the CreateFile API function to open a network file can cause this to happen. And CreateFile is what ends up being called by the runtime libraries of almost every Windows programming language. Certainly the C++ I/O subsystem calls it, as does the .NET runtime. If you’re writing C# code that accesses very large files across the network, you will see this happen.

I don’t have any way to say for certain what’s going on inside of Windows. That is, I don’t know the internal mechanism that causes this idiotic caching behavior. But I do know that calling CreateFile with default parameters will trigger it, and CopyFile and CopyFileEx appear to call CreateFile with default parameters.

What can you do about it?

From a user’s standpoint, the only option is to find some other program to copy files. TeraCopy works well as a replacement for copying with Windows Explorer. I’ve verified that it does not trigger the caching problem. Although TeraCopy has a command line interface, it just starts the GUI with the parameters you give it. Because it creates a new window, it doesn’t block the command interpreter. As a result, TeraCopy is useless in scripts. At least, I don’t see how to make it work well in a script.

I’ve heard good things about FastCopy, but I haven’t tried it yet. It might be a good replacement for the Windows tools. I don’t know whether FastCopy triggers the caching problem, but I strongly suspect that it doesn’t. I will know more once I test it. Be forewarned: the program has an unusual command line syntax.

Programs that need to copy large files can call CopyFileEx, and pass the new (as of Windows Vista and Server 2008) COPY_FILE_NO_BUFFERING flag in the last parameter. As the documentation for CopyFileEx says:

The copy operation is performed using unbuffered I/O, bypassing system I/O cache resources. Recommended for very large file transfers.

It really does work. If you have a way to call the native CopyFileEx API function, then the copy file problem is solved. C and C++ programmers are in the clear. I’m building a replacement for the .NET File.Copy method, which will allow you to include this parameter. It’s not an ideal solution, but it’s not terribly painful, either. I should have that code available soon.

The bigger problem is how to handle file streams in general. As I said above, any program that reads large files across the network can trigger the bad caching behavior on the file server. This common pattern (in C#), for example, can cause the file server to lock up:

    using (var reader = new StreamReader(@"\\server\data\bigfile.txt"))
    {
        // read lines from the file
    }

This code can cause the problem, too:

    using (var reader = new BinaryReader(File.Open(@"\\server\data\bigfile.bin")))
    {
        // read binary file here
    }

I’ve never run into the problem using StreamReader, but then I’ve never had to stream a 50 gigabyte text file. I have, however, been tripped up by this problem when using BinaryReader.

The good news for C and C++ programmers and others who use native file I/O (i.e. CreateFile, ReadFile, etc.) is that there’s a solution to the problem, although that solution is somewhat inconvenient. The bad news for .NET programmers is that there is no solution that doesn’t involve unsafe code and some serious mucking about with native memory allocations.

Programmers who call the Windows API functions directly can pass the FILE_FLAG_NO_BUFFERING flag to CreateFile. As the documentation says:

The file or device is being opened with no system caching for data reads and writes. This flag does not affect hard disk caching or memory mapped files.

There are strict requirements for successfully working with files opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for details see File Buffering.

The important parts in the File Buffering article have to do with the alignment and file access requirements. Specifically, the buffer size has to be a multiple of the volume sector size and should be aligned on addresses in memory that are integer multiples of the sector size. The second requirement is hardware dependent. So your system might work okay if the alignment is off, but other systems might crash or corrupt data.

So any time you call ReadFile, the address you specify in the lpBuffer parameter must be properly aligned. You can’t pass an arbitrary offset into the buffer like &buffer[15]. You have to do your own buffering. It’s slightly inconvenient, but it’s easy enough to create a wrapper with a buffer and the requisite logic to handle things properly.

.NET programmers are stuck. .NET file I/O ultimately goes through the FileStream class, which knows nothing about the FILE_FLAG_NO_BUFFERING option. There is a FileStream constructor that allows you to specify some options, but the FileOptions enumeration doesn’t include a NoBuffering value. And if you examine the source code for FileStream, you’ll see that it has no special code for ensuring the alignment of the allocated buffer, and no facility for ensuring that reads are done at properly aligned addresses.

You can fake it by creating your own NoBuffering value and passing it to the constructor, but there’s no guarantee that it will work because NET programs can’t control memory allocation alignment. At best, it would work sometimes.

It appears that the only solution for .NET programmers is to create an UnbufferedFileStream class that uses VirtualAlloc to allocate the buffer, and has code to properly handle reads and writes. That’s a big job, especially if you want it to serve as a plug-in replacement for the standard FileStream.

Even if you were to write that proposed UnbufferedFileStream class, it wouldn’t handle all cases, and some others would be a bit inconvenient. For example, the code to open a StreamReader would be:

    var ufs = new UnbufferedFileStream(filename, bufferSize);
    using (var reader = new StreamReader(ufs))

You can even put it on a single line, since StreamReader will automatically close the base stream:

    using (var reader = new StreamReader(new UnbufferedFileStream(filename, bufferSize)))

But other things would still be a problem. For example, .NET 4.0 introduced the File.ReadLines method, which returns an enumerator that reads the file a line at a time. It makes quick work out of reading a text file:

    for (var line in File.ReadLines(@"\\server\data\bigfile.txt"))
    {
        // do something with the line
    }

But that will undoubtedly exhibit the bad caching behavior if the file you’re reading is very large, and there’s nothing you can do about it because there’s no corresponding method (at least, not that I know of) that will enumerate the lines in a stream that you pass to it. You can certainly create such a method, but then you have to remember to use it rather than the standard File.ReadLines.

This is a serious bug

I don’t know if this bug still exists in Windows 7. I find very little information about it online, although there are some reports of users experiencing it when copying relatively small files (four gigabytes, for example) from computers that have only one or two gigabytes of RAM. I expect the problem to get increasingly worse as file sizes and hard drive capacities increase.

I’m somewhat surprised that the bug exists at all. I can’t imagine building an operating system that swaps program code and active data to disk in favor of a file cache. That seems like an incredibly stupid thing to do. But I understand that bugs happen. From what I can glean, though, this bug has existed at least since Windows 2000, which would be more than 10 years. And that really surprises me. You would think that in 10 years they’d be able to come up with a solution that would provide the benefits of file caching without the disastrous side effect of bringing a server to its knees with large files.

There is some chatter online about being able to set caching limits, but I haven’t seen anything like a definitive solution, or even anything that looks promising. Certainly nothing I’ve tried solves the problem. If there is a configuration setting that will prevent this problem, it should be the default. Users shouldn’t have to go digging through tech notes in order to disable an optimization (file caching) that by default is the equivalent of a ticking time bomb.