Compression confusion

While we’re on the subject of compression, I thought I’d do a little experiment.  I ran a program executable through a filter that turns it into a .HEX file.  That is, every byte of the file is represented as a 2-digit hexadecimal number in the output file.  The resulting file is exactly twice the size of the original executable, but is really just a different representation of the same information.  I then compressed both files into a single archive using WinZip’s maximum compression.  Here are the results:

File          Original size    Compressed size
executable     8,588 bytes      4,441 bytes
hex file      17,176 bytes      4,951 bytes

I’ll admit my ignorance and say that I’m slightly puzzled by these results.  While I didn’t expect a general-purpose compressor to figure out that the two files are essentially the same thing and perform the hex-to-binary translation before compressing the .hex file, I certainly didn’t expect an 11.5% difference in the sizes of the compressed files.  I guess this just shows that there’s still room for general-purpose compressors to improve.
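
If you want to reproduce the experiment, here’s a rough sketch in Python.  It uses zlib at maximum compression rather than WinZip, so the exact byte counts will differ, but both are deflate-based and the gap between the two files shows up the same way.  The input filename is just a placeholder.

```python
import zlib

# Read the original executable (the filename here is just a placeholder).
with open("program.exe", "rb") as f:
    binary = f.read()

# The "hex filter": every byte becomes its 2-digit hex representation,
# exactly doubling the size without adding any information.
hex_text = binary.hex().encode("ascii")

for name, data in (("executable", binary), ("hex file", hex_text)):
    packed = zlib.compress(data, 9)  # level 9 = maximum compression
    print(f"{name}: {len(data):,} bytes -> {len(packed):,} bytes")
```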

Saving bandwidth by compressing HTML

I’ve been wondering recently why HTML traffic on the web isn’t transmitted in a compressed form.  It should be an easy matter, I thought, to either store HTML in compressed form on the server, or compress it on the fly during transmission.  The client browser could then decompress the stream as it’s received, and render it.  This may have been unthinkable in 1994, when a 90 MHz Pentium with 16 megabytes of RAM was top-of-the-line hardware.  But today you can buy a 1 GHz machine with 128 megabytes of RAM for $500.  Hardware isn’t the issue.
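
To make the idea concrete, here’s a minimal sketch of the round trip: the server gzips the page, and the client inflates the stream a chunk at a time, the way a browser could decompress while it’s still receiving.  The HTML filename and the 1 KB chunk size are made up for illustration.

```python
import gzip
import zlib

with open("index.html", "rb") as f:  # any HTML page; the filename is made up
    html = f.read()

wire_bytes = gzip.compress(html, compresslevel=9)  # what the server would send

# Browser side: inflate the stream incrementally as chunks arrive, so
# rendering could begin before the whole page has been downloaded.
inflater = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # 16+ means gzip framing
received = b""
for i in range(0, len(wire_bytes), 1024):  # pretend 1 KB arrives at a time
    received += inflater.decompress(wire_bytes[i:i + 1024])
received += inflater.flush()

assert received == html  # the round trip is lossless
print(f"{len(html):,} bytes of HTML sent as {len(wire_bytes):,} bytes on the wire")
```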

As it turns out, I’m not the first one to come up with the idea.  The HTTP 1.1 specification includes compression (either gzip or Unix compress format), and most browsers have supported the compressed formats since 1998.  So what’s the problem?  Servers.  Neither Apache nor Microsoft’s IIS supports on-the-fly compression of content.  According to Peter Cranstone’s article “HTTP Compression speeds up the Web,” the potential bandwidth savings from HTTP compression are 30%!  (That’s typical when you factor in graphics and other formats.  For pure text, the savings would be more like 50 to 75%.)  It’s criminal that servers don’t currently support this compression.
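
The negotiation itself is just a pair of headers.  Here’s a sketch of the client side: request a page with “Accept-Encoding: gzip”, and if the server answers with “Content-Encoding: gzip”, inflate the body, which is what a browser does for you transparently.  The URL is a placeholder.

```python
import gzip
import urllib.request

# Advertise that we can handle gzip; a server that does HTTP/1.1 compression
# replies with "Content-Encoding: gzip" and a gzipped body.
req = urllib.request.Request(
    "http://example.com/",  # placeholder URL
    headers={"Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)  # what the browser does transparently

print(f"received {len(body):,} bytes of HTML")
```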

Things may be changing.  The article mentions an Apache mod (mod_gzip) that performs the compression.  IIS 5.0 also supports some compression, as discussed in this article from Microsoft TechNet.
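
I haven’t dug into how mod_gzip or IIS implement it, but the idea is simple enough to sketch: check the request’s Accept-Encoding header, and gzip the response body only if the client can take it.  This toy server (in Python, with a made-up page filename and port) is just an illustration of the principle, not a substitute for either product.

```python
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer

with open("index.html", "rb") as f:  # the page to serve; filename is made up
    PAGE = f.read()

class GzipHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE
        encoding = None
        # Compress only if the client said it can handle gzip.
        if "gzip" in self.headers.get("Accept-Encoding", ""):
            body = gzip.compress(body, compresslevel=6)
            encoding = "gzip"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        if encoding:
            self.send_header("Content-Encoding", encoding)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8080), GzipHandler).serve_forever()  # port 8080 is arbitrary
```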

In some ways it’s unfortunate that browsers already support HTTP 1.1 compression.  There are much better and more efficient compression methods than gzip, but we’re pretty much stuck with it unless we come up with a new standard.  Either that, or come up with a thin client of some kind that can decompress any new format before it gets to the browser.  That would be browser-specific, though, with all the associated problems.  But even gzip compression is better than no compression.  If the server that hosts my web site did compression, this web page would take 20 seconds rather than 60 to download over a 28.8 Kbps modem.
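
The arithmetic behind that estimate, in case you want to plug in your own numbers (the 3:1 ratio is my assumption for gzip on typical HTML):

```python
link = 28_800 / 8   # 28.8 Kbps modem moves roughly 3,600 bytes per second
page = 60 * link    # a 60-second download works out to about 216,000 bytes
ratio = 1 / 3       # assume gzip shrinks the HTML to about a third of its size

print(f"uncompressed: {page / link:.0f} seconds")
print(f"gzipped:      {page * ratio / link:.0f} seconds")
```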

Yes, I know many of you have fat pipes and couldn’t care less about people with modems.  But before you scoff at the idea of using compression, realize three things.

  1. You are in the minority.  Broadband just hasn’t taken off like I and many other people thought it would.  (Ask me sometime what my @Home stock is worth.)
  2. A connection is only as fast as its weakest link.  All those people pulling down big pages over slow modems tie up server connections longer, which hurts web server performance.  If the servers supported compression, those people would get their pages faster and response would improve for all users.
  3. You may not notice the difference if text gets there 30% faster, but large web site operators and small companies would certainly notice if their bandwidth requirements were cut by a third.