Null is evil, Chapter 9346

If we’ve learned anything from 45 years of C coding, it’s that NULL is evil. At least relying on NULL return values is usually evil, especially when it forces you to write contorted special case handling code. I ran across a case in point today that surprised me because the programmer who wrote it should have known better. And for sure the senior programmer who reviewed the code should have flagged it for rewriting.

There is a method that takes an existing object and updates it based on fields in another object. The code writes a change log entry for each field updated. The code is similar to this:


    private bool UpdatePerson(Person person, Models.Person personUpdate)
    {
        var changes = new List<ChangeLogEntry>();
        
        changes.Add(CompareValues(person.FirstName, personUpdate.FirstName, "First Name"));
        changes.Add(CompareValues(person.LastName, personUpdate.LastName, "Last Name"));
        // ... other fields here
        
        changes = changes.Where(x => x != null).ToList();  // 

The CompareValues function returns a ChangeLogEntry if the field has changed. If the field didn’t change, CompareValues returns null.

This is Just Wrong. It works, but it’s wonky. The code is adding null values to a list, knowing that later they’ll have to be removed. Wouldn’t it make more sense not to insert the null values in the first place?

One way to fix the code would be to write a bunch of conditional statements:


    ChangeLogEntry temp;
    temp = CompareValues(person.FirstName, personUpdate.FirstName, "First Name");
    if (temp != null) changes.Add(temp);
    
    temp = CompareValues(person.LastName, personUpdate.LastName, "Last Name");
    if (temp != null) changes.Add(temp);
    
    // etc., etc.

That’s kind of ugly, but at least it doesn’t write null values to the list.

The real problem is the CompareValues function:


    private ChangeLogEntry CompareValues(string oldValue, string newValue, string columnName)
    {
        if (oldValue != newValue)
        {
            var change = new ChangeLogEntry(columnName, oldValue, newValue);

            return change;
        }

        return null;
    }

Yes, it’s poorly named. The method compares the values and, if the values are different, returns a ChangeLogEntry instance. If the values are identical, it returns null. Perhaps that’s the worst thing of all: returning null to indicate that the items are the same. This model of returning null to indicate status is a really bad idea. It forces clients to do wonky things. Of all the horrible practices enshrined in 40+ years of C-style coding, this one is probably the worst. I’ve seen and had to fix (and, yes, written) too much terrible old-style C code to ever advocate this design pattern. null is evil. Avoid it when you can.

And you usually can. In this case it’s easy enough. First, let’s eliminate that CompareValues function:


    if (person.FirstName != personUpdate.FirstName) changes.Add(new ChangeLogEntry(person.FirstName, personUpdate.FirstName, "First Name"));
    if (person.LastName != personUpdate.LastName) changes.Add(new ChangeLogEntry(person.LastName, personUpdate.LastName, "Last Name"));

That’s still kind of ugly, but it eliminates the null values altogether. And we can create an anonymous method that does the comparison and update, thereby removing duplicated code. The result is:


    private void UpdatePerson(Person person, Models.Person personUpdate)
    {
        var changes = new List<ChangeLogEntry>();
        
        // Anonymous method logs changes.
        var logIfChanged = new Action<string, string, string>((oldValue, newValue, columnName) =>
        {
            if (oldValue != newValue)
            {
                changes.Add(new ChangeLogEntry(columnName, oldValue, newValue));
            }
        }
        
        logIfChanged(person.FirstName, personUpdate.FirstName, "First Name");
        logIfChanged(person.LastName, personUpdate.LastName, "Last Name");
        // ... other fields here

        if (changes.Any())
        {
            // update record and write change log
        }
    }

That looks a lot cleaner to me because there are no null values anywhere, and no need to clean trash out of the list.

Here’s the kicker: I didn’t rewrite the code. If the code were as simple as what I presented here, I probably would have changed it. But the code is a bit more complex and it’s currently running in a production environment. I hesitate to change working code, even when its ugliness causes me great mental pain. I’m almost certain that I can make the required changes without negatively affecting the system, but I’ve been burned by that “almost” too many times with this code base. If I had a unit test for this code, I’d change it and depend on the test to tell me if I screwed up. But there’s no unit test for this code because … well, it doesn’t matter why. Let’s just say that this project I inherited has a number of “issues.” The one I’ll relate next time will boggle your mind.

Efficiency of Pairing heap

Last time, I introduced the Pairing heap, and showed an example of its structure and how it changes as items are added and removed. At first glance it seems unlikely that this can be faster than the binary heap I spent so much time exploring three years ago. But it can be. Let me show you how.

As I mentioned in the discussion of the binary heap, inserting an item takes time proportional to the base-2 logarithm of the heap size. If there are 100 items in the heap, inserting a new item will take, worst case, O(log2(100)) operations. Or, about 7 swaps. It could be fewer, but it won’t be more. In addition, removing the minimum item in a min-heap will require O(log2(n)) operations. In a binary heap, insertion and removal are O(log n).

As you saw in my introduction to the Pairing heap, inserting an item is an O(1) operation. It never involves more than a comparison and adding an item to a list. Regardless of how many items are already in the heap, adding a new item will take the same amount of time.

Removal, though, is a different story altogether. Removing the smallest item from a Pairing heap can take a very long time, or almost no time at all.

Consider the Pairing heap from the previous example, after we’ve inserted all 10 items:

    0
    |
    6,4,3,5,8,9,7,2,1

It should be obvious that removing the smallest item (0) and re-adjusting the heap will take O(n) operations. After all, we have to look at every item during the pair-and-combine pass, before ending up with:

    1
    |
    2, 8, 3, 4
    |  |  |  |
    7  9  5  6

But we only look at n/2 items the next time we remove the smallest item. That is, we only have to look at 2, 8, 3, and 4 during the pair-and-combine pass. The number of items we look at during successive removals is cut in half with each removal, until we get to two or three items per removal. Things fluctuate a bit, and it’s interesting to write code that displays the heap’s structure after every removal.

By analyzing the mix of operations required in a very large number of different scenarios, researchers have determined that the amortized complexity of removal in a Pairing heap is O(log n). So, asymptotically, removal from a Pairing heap is the same complexity as removal from a binary heap.

It’s rather difficult to get a thorough analysis of the Pairing heap’s performance characteristics, and such a thing is way beyond the scope of this article. If you’re interested, I suggest you start with Towards a Final Analysis of Pairing Heaps.

All other things being equal, Pairing heap should be faster than binary heap, simply because Pairing heap is O(1) on insert, and binary heap is O(log n) on insert. Both are O(log n) on removal. Keep in mind, though, that an asymptotically faster algorithm isn’t always faster in the real world. I made that point some time back in my post, When theory meets practice. On average, Pairing heap does fewer operations than does binary heap when inserting an item, but Pairing heap’s individual operations are somewhat more complex than those performed by binary heap.

You also have to take into account that Pairing heap can do some things that binary heap can’t do, or can’t do well. For example, the merge (or meld) function in Pairing heap is an O(1) operation. It’s simply a matter of inserting the root node of one heap into another. For example, consider these two pairing heaps:

    1                  0
    |                  |
    2, 8               5
    |
    4

To merge the two heaps, we treat the root nodes just as we would two siblings after removing the smallest item. We pair them to create:

    0
    |
    1, 5
    |
    2, 8
    |
    4

With binary heap, we’d have to allocate a new array that’s the size of both heaps combined, copy all items from both heaps to it, and then call the Heapify method to rebuild the heap. That takes time proportional to the combined size of both heaps.

And, as I mentioned previously, although changing the priority of an item in a binary heap isn’t expensive (it’s an O(log n) operation), finding the node to change is an O(n) operation. With Pairing heap, it’s possible to maintain a node reference, and it’s been shown (see the article linked above) that changing the priority of a node in a Pairing heap is O(log log n).

Another difference between Pairing heap and binary heap is the way they consume memory. A binary heap is typically implemented in a single contiguous array. So if you want a heap with two billion integers in it, you have to allocate a single array that’s 8 gigabytes in size. In a Pairing heap, each node is an individual allocation, and each node contains two pointers (object references). A Pairing heap of n nodes will occupy more total memory than a binary heap of the same size, but the memory doesn’t have to be contiguous. As a result, you might be able to create a larger pairing heap than you can a binary heap. In .NET, for example, you can’t create an array that has more than 2^31 (two billion and change) entries. A Pairing heap’s size, however, is limited only by the available memory.

The Pairing Heap, Part 1

A key takeaway from my early discussion of heaps is that if all we need is the ability to get the smallest item whenever we reach into the bag, then there’s no need to keep the entire bag in sorted order if some less expensive alternate ordering will work. Because of that relaxed requirement, we can implement a priority queue with a heap a whole lot more efficiently than with a sorted list.

I like to call it creative laziness.

Binary heaps are useful and quite efficient for many applications. For a pure priority queue, it’s hard to beat the combination of simplicity and efficiency of a binary heap. It’s easy to understand, easy to implement, and performs quite well in most situations. But the simplicity of the binary heap makes it somewhat inflexible, and also can prove to be inefficient when heaps become very large.

Even though the binary heap isn’t sorted, its ordering rules are pretty strict. If you recall, a valid binary heap must satisfy two properties:

  • The heap property: the key stored in a node is greater than or equal to the key stored in the parent node.
  • The shape property: it’s a complete binary tree.

Maintaining the shape property is required if you want to efficiently implement the binary heap in a list or array, because it lets you implicitly determine the locations of the child or parent nodes. But nothing says that you have to implement a binary heap in an array. You could use a more traditional tree representation, with explicit nodes that have right, left, and parent node references. That has some real benefits.

For example, a common extension to the priority queue is the ability to change the priority of an item in the queue. It’s possible to do that with a binary heap, and in fact isn’t difficult. It’s inefficient, though, because in a binary heap it takes a sequential search to find the item before you can change its priority.

But if you have a traditional tree representation, then you can maintain direct references to the nodes in the heap. At that point, there’s no longer a need for the sequential search to find the node, and changing an item’s priority becomes a pure O(log n) operation. Algorithms such as the A* search algorithm typically use something other than a binary heap for their priority queues, specifically because changing an item’s priority is a common operation.

Another common heap operation is merging (also known as melding): combining two heaps into one. If you have two binary heaps, merging them takes time proportional to the combined size of both heaps. That is, it takes O(n + m) time. The basic idea is to allocate an array that’s the size of both heaps, put all the items into it, and then call a BuildHeap function to construct the heap. With a different heap representation, you can do the merge in constant time.

It turns out that maintaining the shape property when you go to a more traditional tree representation is needlessly expensive. If you eliminate the need for the shape property, it becomes possible to indulge in even more creative laziness.

And, my oh my, have people been creative. Here’s a partial list of heap implementations.

There’s a lot of discussion about which of those is the “fastest” or “best” heap implementation. A lot of it involves theoretically optimum asymptotic behavior, and can get pretty confusing for those of us who just want a working heap data structure that performs well in our applications. Without going into too much detail, I’ll just say that some of those heap implementations that have theoretically optimum performance in some areas (the Brodal queue, for example) are difficult to implement and not at all efficient in real-world situations.

Of those that I’ve worked with, I particularly like Pairing heap because of its simplicity.

The pairing heap’s shape is much different than that of a binary heap. Rather than having right and left child references, a pairing heap node maintains child lists. It maintains the heap property in that any child in a min-heap is greater than or equal to its parent, but the concept of shape is thrown out the window. For example, consider this binary heap of the numbers from 0 to 9.


                    0
              1           3
          5       2   7       4
        9   6   8

Again, in the binary heap, each node has up to two children, and the children are larger than the parents. Although there are multiple valid arrangements of values possible for a binary heap, they all maintain that shape property.

A Pairing heap’s structure is much more flexible. The smallest node is always the root, and children are always larger than their parents, but that’s where the similarity ends. The best way to describe the structure is to go through an exercise of adding and removing items.

If you start with an empty heap and add an item, then that item becomes the root. So, let’s add 6 as the first item in the heap:

    6

Now, we add the value 3. The rules for adding are:

  1. If there is no root, then the new item becomes the root.
  2. If the new item is greater than or equal to the root, then the new item is added to the root item’s child list.
  3. If the new item is smaller than the root, then the new item becomes the root, the old root’s child list is appended to the new root’s child list, and the old root is appended to that child list.

In this case, 3 is smaller than 6, so 3 becomes the root and 6 is added as a child of 3.

    3
    |
    6

Here, the vertical bar character (|) is used to denote a parent-child relationship. So node 6 is a child of node 3.

Now we add 4 to the heap. Since 4 is larger than 3, 4 is added to the child list.

    3
    |
    6,4

When we add 2 to the heap, 2 becomes the new root, and 3 is appended to the child list:

    2
    |
    6,4,3

I’ll add 5, 8, 9, and 7 in that order, giving:

    2
    |
    6,4,3,5,8,9,7

Adding 0 makes 0 the root, and then adding 1 appends to the child list:

    0
    |
    6,4,3,5,8,9,7,2,1

Note that the heap property is maintained. The smallest item is at the root, and child items and their siblings are larger than their parents. Because the smallest item is at the root, we can still execute the get-minimum function in constant time.

And in fact, insert is also a constant time operation. Inserting an item consists of a single comparison, and appending an item to a list.

Things get interesting when we want to remove the smallest item. Removing the smallest item is no trouble at all, of course. It’s figuring out the new root that gets interesting. This proceeds in two steps:

First, process the children in pairs from left to right, creating 2-node heaps with the lowest item at the root, and the other item as a child. So using the child list above, we would create:

    4 3 8 2 1
    | | | |
    6 5 9 7

So 6 is a child of 4, 5 is a child of 3, etc. In this case, 1 doesn’t have a child.

Then, we process from right to left, taking the lower item as the root and adding the other item as a child of the root. Starting with 1 and 2, we combine them to create:

    4 3 8 1
    | | | |
    6 5 9 2
          |
          7

Proceeding to compare 1 and 8:

    4 3 1
    | | |
    6 5 2, 8
        |  |
        7  9

Note here that 2 and 8 are siblings, but 7 is a child of 2, and 9 is a child of 8. 9 and 7 are not siblings.

When we’re done processing, we have:

    1
    |
    2, 8, 3, 4
    |  |  |  |
    7  9  5  6

Again, the heap property is maintained.

It’s very important to understand that the heap property is maintained. All child nodes are larger than their parent nodes. Siblings (children of the same node) are represented in an unordered list, which is just fine as long as each sibling is larger than the parent node. Also, a node’s children only have to be larger than their parent node. Being smaller than a parent’s sibling does not violate the heap property. In the figure above, for example, 5 is larger than its parent, 3, but smaller than its parent’s sibling, 8.

Removing 1 from the heap leaves us with

    2, 8, 3, 4
    |  |  |  |
    7  9  5  6

Again, we pair from left to right. 2 is smaller than 8, so 2 becomes the root of the subtree. But what happens to the children? They’re appended to the child list of the other node. So pairing 2 and 8 results in:

    2
    |
    8, 7
    |
    9

And pairing the others gives:

    2      3
    |      |
    8, 7   4, 5
    |      |
    9      6

And combining from right to left makes 2 the root of the new tree:

    2
    |
    8, 7, 3
    |     |
    9     4, 5
          |
          6

Again, you can see that the heap property is maintained.

Now, when we remove 2 from the heap, we first pair 8 and 7 to create:

    7     3
    |     |
    8     4, 5
    |     |
    9     6

And then combine to form:

    3
    |
    4, 5, 7
    |     |
    6     8
          |
          9

When 3 was selected as the smallest item, node 7 was appended to the child list.

That’s the basics of the Pairing heap. You might be asking yourself how that can possibly be faster than a binary heap. That will be the topic for next time.