Last month in HashSet Limitations, I noted what I thought was an absurd limitation on the maximum number of items that you can store in a .NET HashSet or Dictionary collection. I did more research on all the major collection types and wrote a series of articles on the topic for my .NET Reference Guide column. If you’re interested in how many items of a particular type can fit into one of the .NET collection types, you’ll want to read those four or five articles.
I mentioned last month that the HashSet and Dictionary collections appeared to have a limit of 47,995,854 items. Whereas that really is the limit if you use the code that I showed, the real limit is quite a bit larger: about 89.5 million with the 64-bit runtime (61.7 million on the 32-bit runtime) if you have String keys and your values are references. The difference has to do with the way that the data structure grows as you add items to it.
Like all of the .NET collection types, HashSet and Dictionary grow dynamically as you add items. They start out with a very small capacity–fewer than 16 items. As you add items and overflow the capacity, the collection is resized by doubling. But the capacity is not exactly doubled. It appears that the new capacity is set to the first prime number that is larger than twice the current capacity. (Some hashing algorithms perform better when the number of buckets is a prime number.) I don’t know the exact resize sequence, but I do know that there are noticeable pauses at around 5 million, 11 million, and 23 million, and then the program crashes at 47 million. The reason for the crash is that the collection is trying to resize its internal array to the next prime number that’s somewhere around 96 million. And that’s larger than the empirically determined maximum size of about 89.5 million.
So how does one get a collection to hold more than that 48 million items? By pre-allocating it. If you know you’re going to need 50 million items, you can specify that as the initial capacity when you create the object:
Dictionary<string, object> myItems = new Dictionary<string, object>(50000000);
You can then add up to 50 million items to the collection. But if you try to add more, you’ll get that OutOfMemory exception again.
That works well with Dictionary, but HashSet doesn’t have a constructor that will let you specify the initial capacity. When I was writing my .NET column on this topic, I discovered that you can create a larger HashSet indirectly, by building a List and then passing that to the HashSet constructor that takes an IEnumerable parameter. Here’s how:
static void Main(string[] args) { int maxSize = 89478457; Console.WriteLine("Max size = {0:N0}", maxSize); // Initialize a List<long> to hold maxSize items var l = new List<long>(maxSize); // now add items to the HashSet for (long i = 0; i < maxSize; i++) { if ((i % 1000000) == 0) { Console.Write("\r{0:N0}", i); } l.Add(i); } Console.WriteLine(); // Construct a HashSet from that list var h = new HashSet<long>(l); Console.WriteLine("{0:N0} items in the HashSet", h.Count); Console.Write("Press Enter:"); Console.ReadLine(); }
That works fine if you know in advance what items you want to put into your HashSet, but it doesn’t do you much good if you want to add things one at a time. At the time I wrote the article, I didn’t have a solution to the problem.
I later discovered that after creating the collection, you can remove all the items (either by calling Remove 89 million times, or by calling Clear) and you’re left with an empty HashSet that has a capacity of 89 million items. It’s a roundabout way to get some functionality that I think should have been included with the HashSet class, but if you need it, that’s how you do it. Understand, though, that it’s undocumented behavior that might change with the next release of the .NET Framework.
In last month’s note, I also grumbled a bit about the perceived 5x memory requirement of Dictionary and HashSet. It turns out that I was off base there. Internally, these collections store a key/value pair for each item. Since typically both the key and the value are references, that adds up to 16 bytes per item, giving a best case of about 134 million* items fitting into the .NET maximum 2 gigabyte allocation. The real limit of 89,478,457 indicates that the per-item overhead is 24 bytes (2 GB / 24 = 89,478,485), which is actually pretty reasonable.
*The number is 128 * 1024 * 1024. It seems to be understood that when we programmers say “128 K items,” we really mean 131,072 items (128 * 1,024). But there doesn’t seem to be a generally accepted terminology for millions or billions of items. I don’t hear “128 M items” or “128 G items” in conversation, nor do I recall seeing them in print. I’ve heard (and used) “128 mega items” or “128 giga items,” but those sound clunky to me. We don’t say “128 kilo items.” Is there generally accepted nomenclature for these larger magnitudes? If not, shouldn’t there be? Any suggestions?