Some years back I did a lot of research on and experimentation with the binary heap data structure. During that time I noticed that publicly available implementations placed the root of the heap at `array[1]`, even in languages where arrays are 0-based. I found that incredibly annoying and wondered why they did it.
As far as I’ve been able to determine, the binary heap data structure originated with Robert W. Floyd in his article “Treesort” (Algorithm 113), published in the August 1962 edition of Communications of the ACM. As with all algorithms published in CACM at the time, it was expressed in Algol: a language with 1-based arrays. The calculation to find the left child of any parent is:
`childIndex = parentIndex * 2`. The right child is at `(parentIndex * 2) + 1`, and the parent of any node can be found by dividing the node’s index by 2.
When C gained popularity as a teaching language in the 1980s (for reasons that escape me, but I digress), authors of algorithms texts converted their code from Pascal (again, 1-based arrays) to C (0-based arrays). And instead of taking the time to modify the index calculations, whoever converted the code made a terrible decision: keep the 1-based indexing and allocate an extra element in the array.
It’s almost certain that the person who converted the code was smart enough to change the index calculations. I think the reason they didn’t was laziness: they’d have to change the code and likely also change the text to match the code. So they took the easy route. And made a mess.
In C, an array of 100 integers is declared `int a[100];`, and the indexes run from 0 through 99. This is ingrained in our brains very early on. And then comes this one data structure, the binary heap, in which, if you want to hold 100 items, you have to allocate 101! The element at index 0 isn’t used. That’s confusing, especially to beginning programmers, who are the primary consumers of those algorithms texts.
Some argue that placing the root at `array[1]` makes the child and parent calculations faster. And that’s likely true. In a 0-based heap, the left child index is `(parentIndex * 2) + 1`, and the parent index of any child is `(childIndex - 1) / 2`. There’s an extra addition when calculating the left child index, and an extra subtraction when calculating the parent. The funny thing is that in Algol, Pascal, and other languages with 1-based arrays, those “extra” operations existed but were hidden: the compiler inserted them when converting the language’s 1-based indexing to the computer’s 0-based addressing.
But in the larger context of what a binary heap does, those two instructions will make almost no difference in the running time. I no longer have my notes from when I made the measurements myself, but as I recall the results were noisy. The difference was so small as to be insignificant. Certainly it wasn’t large enough, in my opinion, to justify replacing the code. If my program’s performance is gated on the performance of my binary heap, then I’ll replace the binary heap with a different type of priority queue altogether. A pairing heap, perhaps, or a skip list. The few nanoseconds potentially saved by starting my heap at index 1 just aren’t worth making the change.
The C++ STL `priority_queue`, the Java `PriorityQueue`, the .NET `PriorityQueue`, and Python’s `heapq` all use 0-based binary heaps. The people who wrote those packages understand performance considerations. If there were a significant performance gain to be had from a 1-based binary heap, they would have taken it. That they went with 0-based heaps should tell you that any performance gain from a 1-based heap is illusory.
I am appalled that well-known authors of introductory algorithms texts have universally made this error. They should be the first to insist on clear code that embraces the conventions of the languages in which they present their code. They should be ashamed of themselves for continuing to perpetuate this abomination.