HashSet limitations

Version 3.5 of the .NET runtime class library introduced the HashSet generic collection type. HashSet represents a set of values that you can quickly query to determine if a value exists in the set, or enumerate to list all of the items in the set. You can also perform the standard set operations: union, intersection, subset and superset tests, and so on. HashSet is a very handy thing to have. Simulating the same functionality in prior versions of .NET was very difficult.
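For anyone who hasn't used it, here's a quick sketch of the kind of thing HashSet makes trivial (the names and values are just for illustration):

using System;
using System.Collections.Generic;

class HashSetDemo
{
    static void Main()
    {
        HashSet<int> evens = new HashSet<int> { 2, 4, 6, 8, 10 };
        HashSet<int> small = new HashSet<int> { 1, 2, 3, 4 };

        Console.WriteLine(evens.Contains(6));        // True -- fast membership test
        Console.WriteLine(small.IsSubsetOf(evens));  // False

        // The set operations modify the set in place.
        HashSet<int> union = new HashSet<int>(evens);
        union.UnionWith(small);                      // { 1, 2, 3, 4, 6, 8, 10 }

        HashSet<int> intersection = new HashSet<int>(evens);
        intersection.IntersectWith(small);           // { 2, 4 }

        Console.WriteLine(union.Count);              // 7
        Console.WriteLine(intersection.Count);       // 2
    }
}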

I’ve made heavy use of HashSet in my code since it was introduced, and I’ve been very happy with its performance. Until today. Today I ran into a limitation that makes HashSet (and the generic Dictionary collection type, as well) useless for moderately large data sets. It’s a memory limitation, and how many items you can store in the HashSet depends on how large your key is.

I’ve mentioned before that the .NET runtime has a 2 gigabyte limit on the size of a single object. Even in the 64-bit version, you can’t make a single allocation that’s larger than 2 gigabytes. I’ve bumped into that limitation a few times in the past, but have been able to work around it by restructuring some things. I thought I was safe with the HashSet, though. Even with an 8-byte key, I figured I should be able to store on the order of 250 million items. I found out today that the number is quite a bit lower: a little less than 50 million. 47,995,853 to be exact. After I figured out what was causing my problem, I verified it with this program:

static void Main(string[] args)
{
    HashSet<long> bighash = new HashSet<long>();
    for (long i = 0; i < 50000000; ++i)
    {
        if ((i % 100000) == 0)
        {
            // Carriage return so the progress count overwrites itself on one line.
            Console.Write("\r{0:N0}", i);
        }
        bighash.Add(i);
    }
    Console.WriteLine();
    Console.Write("Press Enter");
    Console.ReadLine();
}

The program throws OutOfMemoryException when it tries to add the 47,995,853rd (or perhaps the 47,995,854th) item: at that point it has to increase the capacity of an internal data structure, and the new allocation pushes that structure past 2 gigabytes.
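If you want to pin the number down exactly, a small variant that catches the exception and reports Count will tell you how many items actually made it in before the failed resize (this is just a quick diagnostic sketch, not the program above):

static void Main(string[] args)
{
    HashSet<long> bighash = new HashSet<long>();
    try
    {
        for (long i = 0; i < 100000000; ++i)
        {
            bighash.Add(i);
        }
    }
    catch (OutOfMemoryException)
    {
        // Count is the number of items successfully added before the resize blew up.
        Console.WriteLine("Out of memory after {0:N0} items", bighash.Count);
    }
    Console.ReadLine();
}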

If I reduce the key from a long (8 bytes) to an int (4 bytes), then I can add just a little less than 100 million items before hitting the limit. Let’s think about that a little bit.

50 million keys of 8 bytes each should take up about 400 megabytes. 100 million keys of 4 bytes each should take up about 400 megabytes. I realize that there’s some overhead in a hash table to deal with collisions, but five times is excessive! I can’t imagine a hash table implementation that has an overhead of five times the total key size. And yet, that’s what we have in .NET.
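Here’s the arithmetic behind that “five times,” along with a guess at where the overhead goes. The per-entry layout below — a cached hash code, a collision-chain index, and a bucket entry alongside each key — is an assumption about how a chained hash table like this is typically built, not something I’ve verified against the actual runtime source:

using System;

class OverheadEstimate
{
    static void Main()
    {
        const long twoGig = 2L * 1024 * 1024 * 1024;  // the single-object ceiling
        const long items = 47995853;                  // where the 8-byte-key run died

        long rawKeyBytes = items * 8;                 // ~384 MB of actual key data
        Console.WriteLine("raw keys: {0:N0} bytes", rawKeyBytes);
        Console.WriteLine("ceiling:  {0:N0} bytes", twoGig);
        Console.WriteLine("ratio:    {0:N1}x", (double)twoGig / rawKeyBytes);  // ~5.6x

        // Assumed per-entry cost: 8-byte key + 4-byte cached hash code
        // + 4-byte chain index + 4-byte bucket entry = 20 bytes per item,
        // before any spare capacity or the doubling that happens on a resize.
        const long assumedPerItem = 8 + 4 + 4 + 4;
        Console.WriteLine("assumed:  {0:N0} bytes", items * assumedPerItem);
    }
}

Even with that fairly generous accounting, 48 million 8-byte keys should fit in about a gigabyte of table, which is exactly why the failure point surprised me.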

It’s bad enough in today’s world, where a machine with 16 gigabytes of RAM can be had for under $2,000, that we have to deal with the 2-gigabyte-per-object limitation in .NET. But to have the runtime library’s implementation of a critical data structure squander memory in this way is too much.

Any workaround is very painful. We’ll have to write our own hash table implementation that allocates unmanaged memory and mucks around with pointers in unsafe code. We’re old C programmers, so that’s not beyond our capabilities. But it sure makes me wonder why I selected .NET for this project. In the process, we’re going to lose a lot of the functionality of Dictionary and of HashSet.
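To give a flavor of what that looks like, here’s a bare-bones sketch of the direction I’m talking about: a fixed-capacity, open-addressed set of long keys living in unmanaged memory, with linear probing (compile with /unsafe). It has none of HashSet’s conveniences — no resizing, no removal, no enumeration, and it steals long.MinValue as an “empty slot” marker — so treat it as an illustration of the approach, not a finished implementation:

using System;
using System.Runtime.InteropServices;

// Sketch only: fixed capacity, linear probing, no resize, no delete, and
// long.MinValue is reserved as the "empty slot" sentinel. The caller is
// responsible for keeping the load factor well below 1, or Add will spin.
unsafe sealed class UnmanagedLongSet : IDisposable
{
    private const long Empty = long.MinValue;
    private readonly long capacity;
    private long* slots;

    public UnmanagedLongSet(long capacity)
    {
        this.capacity = capacity;
        // A single allocation outside the managed heap, so the 2-gigabyte
        // object limit doesn't apply; you're limited by address space and RAM.
        slots = (long*)Marshal.AllocHGlobal(new IntPtr(capacity * sizeof(long)));
        for (long i = 0; i < capacity; ++i)
            slots[i] = Empty;
    }

    public bool Add(long key)
    {
        if (key == Empty)
            throw new ArgumentException("The sentinel value cannot be stored.");
        long index = (key & long.MaxValue) % capacity;
        while (slots[index] != Empty)
        {
            if (slots[index] == key)
                return false;                // already present
            index = (index + 1) % capacity;  // linear probe to the next slot
        }
        slots[index] = key;
        return true;
    }

    public bool Contains(long key)
    {
        long index = (key & long.MaxValue) % capacity;
        while (slots[index] != Empty)
        {
            if (slots[index] == key)
                return true;
            index = (index + 1) % capacity;
        }
        return false;
    }

    public void Dispose()
    {
        if (slots != null)
        {
            Marshal.FreeHGlobal((IntPtr)slots);
            slots = null;
        }
    }
}

A table of 250 million slots is a single two-gigabyte unmanaged block, and nothing stops you from going far beyond that on a 64-bit machine — which is the whole point. What you give up is everything else HashSet gives you for free.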

I can’t be the only one running up against these kinds of limitations. 10 years ago, a data set of 100 million items may have been considered large. Today 100 million is, at best, moderately large. There are plenty of applications that work with billions of items and today’s computers have the capacity to store them all in RAM. We damned well should be able to index them in RAM using modern tools.

I hope the .NET team is working on a solution to the 2-gigabyte limit, and I’d strongly suggest that they take a very close look at their hash table implementation.