Skip list revisited

In C# versus the data structure, I described the skip list data structure and the difficulty I saw with implementing it in .NET. I published a C# implementation a couple of months ago, and it did indeed suffer from excessive memory usage. It performed quite well, but the memory requirement of 100 bytes per node blew my memory budget.

With a little thought and some indirection, I was able to reduce that to an average of about 20 bytes per node. I published that implementation in A More Memory-Efficient Skip List. The resulting data structure is faster than my initial implementation and uses about one-fifth the memory. Overall, I’m happy with the way it works. I think the design is clever, perhaps even elegant, but the implementation is, of necessity, somewhat involved and a little messy. That said, it works well and I’m glad to have it.

My friend and development team member (co-founder, business partner, whatever) Joe Langeway floated the idea one day of building something similar to a skip list, but with only two pointers per node: a Next pointer and a FarForward pointer. The idea was to create a simple linked data structure that had much of the benefit of a skip list, but without all that complexity. We tossed ideas around for a day or two, and then I sat down to code it up.

The node structure is very simple. It contains the data and two link references. I’ll illustrate here with a node that contains integers, but it’s easy enough to extend this to make it work with any data type.

class Node
{
    public int Value;        // the data stored in the node
    public Node Next;        // the next node, in sorted order
    public Node FarForward;  // a node somewhere further ahead in the list
}

The idea, then, is to have the data structure assign far forward pointers as items are inserted. How to assign those far forward pointers was (and perhaps still is) a matter of some disagreement. Since I was implementing the thing, I chose what I thought was the simplest approach: when traversing the list to find the insertion spot, make the first unused far forward pointer encountered in the traversal reference the new node.

Imagine you were inserting the numbers 1, 3, and 2, in that order. The first node is inserted trivially and its far forward pointer is null. When we try to insert the second node, the code has to traverse the list (following Next and FarForward pointers) to find the insertion spot. Along the way, the code takes the first null (unused) FarForward pointer it encounters and makes it reference the new node. When the value 3 is inserted, the first node is modified so that its Next and FarForward pointers both reference the new node.

When 2 is inserted, no unused FarForward pointer is encountered before the insertion spot, so the only thing referencing the new node is the first node’s Next pointer. The result is that node 1 has a Next value of 2 and a FarForward value of 3. Node 2’s Next is 3, and its FarForward is null.
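
Here’s a rough sketch of how that insertion rule might look in code, using the Node class above. The class name, the head field, and the method are just illustrative; this isn’t the finished implementation:

class FarForwardList
{
    Node head;

    public void Insert(int value)
    {
        var newNode = new Node { Value = value };

        // An empty list, or a new smallest value: the new node becomes the head.
        if (head == null || value < head.Value)
        {
            newNode.Next = head;
            head = newNode;
            return;
        }

        Node current = head;
        Node firstUnused = null;    // first visited node whose FarForward is null

        while (true)
        {
            if (firstUnused == null && current.FarForward == null)
                firstUnused = current;

            // Follow FarForward when it doesn't overshoot the new value,
            // otherwise follow Next.
            if (current.FarForward != null && current.FarForward.Value < value)
                current = current.FarForward;
            else if (current.Next != null && current.Next.Value < value)
                current = current.Next;
            else
                break;              // insertion spot is immediately after current
        }

        // Ordinary linked-list insertion after current.
        newNode.Next = current.Next;
        current.Next = newNode;

        // The first unused far forward pointer found during the traversal
        // now references the new node.
        if (firstUnused != null)
            firstUnused.FarForward = newNode;
    }
}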

This data structure has two nice properties. First, it’s always in sorted order: traversing the list in that order is a simple matter of following the Next links. Second, finding a particular value isn’t necessarily a sequential search. The Find method can use the FarForward pointers to skip through the list much more quickly than examining every single node. In this particular case, for example, it only has to examine two nodes to find the value 3; it never has to examine the second node.
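
Continuing the sketch above, the search might look something like this (again, just an illustration of the idea, with Contains as my placeholder for the Find method):

// A member of the FarForwardList sketch above.
public bool Contains(int value)
{
    Node current = head;
    while (current != null)
    {
        if (current.Value == value)
            return true;
        if (current.Value > value)
            return false;           // passed the spot where value would be

        // Jump via FarForward when it doesn't go past the target value.
        if (current.FarForward != null && current.FarForward.Value <= value)
            current = current.FarForward;
        else
            current = current.Next;
    }
    return false;
}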

Implementing the first version of this took very little time, and performance is surprisingly good in the general case. I found that search takes, on average, about 10 * (log10(n) - 1) probes, where n is the number of items in the list. So finding an item in a collection of one million items takes approximately 50 probes (10 * (log10(1000000) - 1)). That’s not as fast as a skip list or a balanced binary tree, both of which would take about log2(n) probes, or about 20 in this case. But it’s a huge improvement over sequential search (n/2, or 500,000 probes).

Unfortunately, the data structure has a few drawbacks. It exhibits pathological behavior when the input data is sorted or reverse-sorted, and it is pretty sensitive to ordered subranges. If you insert items in sorted order, the FarForward link in every node points to the next node. And if you insert items in reverse order, all of the FarForward links are null. In either case, finding an item becomes a sequential scan of the list, taking on average n/2 probes.

The other drawback is that the number of probes required to find an item in the general case is quite variable. The average is reasonable: about 2.5 times the worst case for binary search. But it varies quite a bit. In my testing, the maximum number of probes required to find an item is approximately 3.5 times the average. So, whereas the average for a list of one million items is about 50, the maximum is somewhere around 175.

I’ve explored several different ways to alleviate the problems associated with ordered data. Preventing pathological behavior when data is ordered is not very difficult, although the result is still unacceptable, with a search time that’s about 10 times what it is for a list built from unordered data. Solving the ordered and reverse-ordered cases also complicates the code quite a bit. It’s still manageable, but not nearly as elegant as the initial implementation.

You might rightly ask why I’m even bothering with a data structure that is almost certain to be slower than existing solutions I already have access to. There are several reasons. First (and perhaps foremost), it’s a really interesting puzzle. Just a little thought produced a simple data structure that has logarithmic complexity in the general case. It comes within spitting distance of the skip list and is much easier to understand and implement. Also, there are certain benefits to having a linked data structure whose nodes are of a fixed size, even if it’s somewhat slower than one whose nodes vary in size.

An interesting use of the insertion algorithm for this data structure would be to build a double-linked list from a stream of unordered data. The idea would be to treat the linked list node’s Prev pointer as a far forward pointer during construction. When the last data item has been received, you make a single sequential pass through the list to set the Prev pointers to their correct values. You could instead accumulate the items in a heap, but that would require more space: if your desired result is a double-linked list, this algorithm needs no memory beyond the list nodes themselves, whereas building a heap first would require O(n) extra memory.
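
Here’s a rough sketch of that final fix-up pass, assuming a node type with Value, Next, and Prev fields (the names are mine; Prev is the field that served as the far forward pointer while the list was being built):

class DoubleNode
{
    public int Value;
    public DoubleNode Next;
    public DoubleNode Prev;     // doubles as the far forward pointer during construction
}

static void FixUpPrevPointers(DoubleNode head)
{
    // One sequential pass: overwrite each construction-time Prev value
    // with a reference to the node that actually precedes it.
    DoubleNode previous = null;
    for (DoubleNode current = head; current != null; current = current.Next)
    {
        current.Prev = previous;
        previous = current;
    }
}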

I won’t post the code yet, since I’m still tinkering and the code is kind of ugly right now. But if you let me know that you’re interested in puzzling on this, you might give me some incentive to clean things up sooner. I’d be interested to see if a combined effort could come up with solutions to some of the problems.