This is the second in a short series about reading data from streams in C#. In the first post I showed that you can’t depend on a Stream to give you all the bytes you ask for when you call Read, even if you know that those bytes are or will soon be available.
A lot of programmers don’t understand that point when they first start working with streams other than FileStream, and when they figure it out or somebody explains it to them, they often ask something like:
Why don’t streams work like files?
Which is the wrong question to ask. FileStream is a special case of Stream, not the other way around. FileStream works the way it does because it can. But not all streams can work the way file streams can, because not all types of streams have all the information that a file system has. So let’s forget about file streams for a bit and talk about why the Stream API works the way it does, which is the real question those programmers are asking.
A Stream is, well, a stream or river of bytes. Imagine sitting at the mouth of the Mississippi River, watching the water flow into the Gulf of Mexico. Does that stream of water have a beginning? Certainly the river has a beginning somewhere up in northern Minnesota, but that’s not where the stream of water begins. Nor is it where all the water comes from. The water comes from all over the country and has been flowing down to the Gulf in its current form since the last Ice Age. And although I’m certain there will be an end to that mighty river at some point in the future, I’m convinced that I won’t see it. The water just keeps on flowing. For all practical purposes, the Mississippi River had no beginning and has no end.
A Stream is just like that. There’s a beginning, but once you’ve consumed data from a stream, you can’t get it back. Just like you can’t get the water back from the Mississippi once it’s flowed into the Gulf. You also can be assured that there will be an end to the stream. Eventually. But you don’t know where that end is. The stream might end after the next byte, or there might be several terabytes still to be read. You can’t know, in the general case.
Let’s say you write a program that reads data from a stream in 64 kilobyte chunks, and processes those chunks. Imagine that there’s a “give me everything I asked for” function that always returns exactly the number of bytes that you ask for, unless the stream ends, in which case it gives you everything up to the end of the stream. In short, it works the way that FileStream.Read appears to work. I’ll call the function ReadExact, since “give me everything I asked for” is a handful to type and typing it annoys me because I rarely get everything I ask for anyway. Digressions aside, the program would look like this.
    byte[] buffer = new byte[65536];
    // ReadExact returns 0 when you try to read
    // after end of stream has been reached.
    int bytesRead = theStream.ReadExact(buffer);
    while (bytesRead != 0)
    {
        // process the buffer
        ProcessData(buffer, bytesRead);
        // and refill the buffer
        bytesRead = theStream.ReadExact(buffer);
    }
With that code, you would end up calling ProcessData with a long sequence of buffers containing 65,536 bytes each, and one final buffer that contains fewer bytes than that, unless the stream’s length happened to be an exact multiple of 65,536 bytes.
And there’s my first point. Unless you can guarantee that the stream’s length will be an exact multiple of whatever buffer size you choose, your program has to handle short buffers. That is, you must be able to handle the case where ReadExact returns fewer bytes than you asked for. In this case, ReadExact doesn’t provide any benefit over the existing Read functionality.
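To make that concrete, here’s a minimal sketch of the same loop written against plain Read. The ChunkReader class, the ProcessAll method, and the processChunk callback (standing in for ProcessData) are names I made up for illustration; only Stream.Read is the real API.

```csharp
using System;
using System.IO;

static class ChunkReader
{
    // Reads the stream to the end, handing each chunk to processChunk
    // (a hypothetical callback standing in for ProcessData) as it arrives.
    // Returns the total number of bytes read.
    public static long ProcessAll(Stream stream, Action<byte[], int> processChunk, int bufferSize = 65536)
    {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int bytesRead;

        // With a non-empty buffer, Read returns 0 only at end of stream.
        // Any positive count is valid, even one smaller than the buffer.
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            processChunk(buffer, bytesRead);
            total += bytesRead;
        }
        return total;
    }
}
```

The body of the loop is the same as before. The code already had to cope with a short final buffer, so coping with short buffers along the way costs nothing extra.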
Still, there might be use for ReadExact when you only need 100 bytes and you know that the stream has 100 bytes. Except you can’t know. Maybe somebody cut the network cable and only 50 bytes came across. Or maybe the sender got hung up or there was a bug and although it said 100 bytes were to follow, only 23 bytes came down the wire. Any number of things can happen. How should ReadExact react in any of those situations? Should it time out or wait forever? Is a timeout or end of stream an error or an expected condition? The desired behavior is different for every application. There just isn’t a generally accepted way for such a method to work. That’s another reason why it isn’t supplied.
Not only are streams essentially infinite, they’re also arbitrarily fast (or slow). Many programs (perhaps most?) can process data a whole heck of a lot faster than they can read the data. It makes sense to process data in small chunks as it comes in rather than to read everything and then process it. It might not even be possible to read everything because, again, the stream is essentially infinite.
For all those reasons, it just doesn’t make sense to have a general ReadExact method. You’re free to create one, but I think you’ll find that it’s not as useful as you thought it would be when you started. It’s just easier to embrace the way that streams work, and write your programs so that they can handle data as it trickles in.
Last time I said that FileStream.Read appears to always fill the buffer, except when it reaches end of file. I went digging in the .NET Framework Source, to see what FileStream.Read really does, and I found this comment:
// We may have read less than the number of bytes the user asked
// for, but that is part of the Stream contract. Reading again for
// more data may cause us to block if we're using a device with
// no clear end of file, such as a serial port or pipe. If we
// blocked here & this code was used with redirected pipes for a
// process's standard output, this can lead to deadlocks involving
// two processes. But leave this here for files to avoid what would
// probably be a breaking change. --
Then follows some code that refreshes the stream buffer if the file isn’t a pipe handle, and tries to fulfill the read request.
In other words, if you’re using FileStream.Read on a normal disk or network file, then the current implementation ensures that you get what you ask for. But if you just have a FileStream from some unknown source, there is no such guarantee. And, as I pointed out last time and the comment verifies, that behavior is just an implementation detail. It’s unlikely that the .NET Framework team will change this behavior, so you’re probably safe writing code that assumes FileStream.Read will fill the buffer if it can, provided you know that the stream is wrapping a file handle. But if you don’t know where the FileStream came from (it might be wrapping a pipe, for example), then counting on that behavior could get you into trouble.
Overall, I’d say you’re better off writing code that checks the return value of FileStream.Read, and reads more if the buffer wasn’t full and end of stream wasn’t reached.
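A minimal sketch of that check-and-read-again loop might look like the following. The StreamExtensions class and the ReadUntilFull name are mine, not part of the framework, and the sketch makes no attempt at timeouts: if the other end stalls, it waits for as long as the underlying Read waits.

```csharp
using System;
using System.IO;

static class StreamExtensions
{
    // Hypothetical helper: keeps calling Read until count bytes have
    // arrived or the stream ends. Returns the number of bytes placed in
    // the buffer, which is less than count only if end of stream came first.
    public static int ReadUntilFull(this Stream stream, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = stream.Read(buffer, offset + total, count - total);
            if (n == 0)
            {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }
}
```

The caller still has to check the return value, because a short count now means the stream ended early. Whether that’s an error or a normal condition is for the application to decide.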
The real reason the Stream API works the way it does is that it has to work that way. The API exactly reflects how data streams work.
FileStream works the way it does because it can. The file system contains a lot of information about a file (the size, in particular) that can’t be known about streams in general, and because the file is held there on disk you can seek to the start or end of the file, or anywhere in the middle to read or write whatever you want. And the file system is fast when compared to streams in general. A FileStream is really a bunch of file-related functions that, among the many things it does, lets you read the file in the same way you would read any other type of Stream. Expecting the Stream API to work like a FileStream is kind of like expecting a flintlock to work like an AK-47.
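The Stream class does let you ask whether a particular stream has that file-like knowledge. Here’s a small sketch, using a helper I’ve called TryGetRemaining (my name, not the framework’s), that reports how many bytes are left only when the stream can actually say. It relies on the CanSeek property; Length and Position throw NotSupportedException on streams that don’t support seeking.

```csharp
using System;
using System.IO;

static class StreamInfo
{
    // Hypothetical helper: returns the number of bytes between the current
    // position and the end of the stream, or null if the stream can't seek
    // and therefore can't report its length.
    public static long? TryGetRemaining(Stream s)
    {
        return s.CanSeek ? s.Length - s.Position : (long?)null;
    }
}
```

A MemoryStream or a FileStream over a disk file will typically give you a number. A network stream or a decompression stream will give you null, and that is the general case the Stream API is designed around.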
In this entry and the first one, I’ve been discussing streams of bytes without assuming any format to those bytes. Most streams aren’t just bytes. Rather, they’re bytes with some defined format. Maybe they’re lines of text, binary records that are 37 bytes long each, or commands separated by newlines, semicolons, or some other delimiters. Very often, those lines, records, or commands are part of a client-server setup that uses a request/response model: the client makes a request, the server sends a response, client makes another request, etc. Because of the way streams work, handling those types of requests can be a little more involved than you might expect. That’s what I’ll talk about next time.