An interesting programming puzzle

I ran across this puzzle yesterday.

Write a function that, given a positive integer, will express that number as a series of simple mathematical operations (add, subtract, multiply, divide) using only single-digit numbers. Output of the function is a string containing a postfix expression.

So, for example, 36 would result in 66* or 94*. 11 would be 92+ or 74+, etc. The number 101 could be expressed as 554**1+.

It didn’t take long to come up with a recursive function that meets the requirements. Here it is in C#:

    string GetExpression(int val)
    {
        if (val < 10)
        {
            return val.ToString();
        }
        int quo, rem;
        // first see if it's evenly divisible
        for (int i = 9; i > 1; --i)
        {
            quo = Math.DivRem(val, i, out rem);
            if (rem == 0)
            {
                if (val >= 90 || (val < 90 && quo <= 9))
                {
                    // value is (i * quo)
                    return i + GetExpression(quo) + "*";
                }
            }
        }

        quo = Math.DivRem(val, 9, out rem);
        // value is (9 * quo) + rem
        // optimization reduces (9 * 1) to 9
        var s1 = "9" + ((quo == 1) ? string.Empty : GetExpression(quo) + "*");
        var s2 = GetExpression(rem) + "+";
        return s1 + s2;
    }

The overall idea here is that first I try to divide the number evenly by a one-digit number. If it does divide evenly, then the result is (x * quotient), where x is the divisor. If the number isn’t evenly divisible by a one-digit number, then the function divides the value by nine and generates (9 * quotient) + remainder.

I made one simple optimization based on the observation that any number in the range 0-90 inclusive can be expressed either as the product of two single-digit numbers, or with an expression of the form (9 * x) + y, where x and y are one-digit numbers. 31, for example, is (9 * 3) + 4, or 93*4+ in postfix.
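That observation is easy to verify with a quick brute-force check. This little helper is mine, not part of the puzzle solution:

```csharp
// Brute-force check of the claim: every n in 0..90 is either the
// product of two single-digit numbers or (9 * x) + y with one-digit
// x and y. (My own verification sketch, not part of the solution.)
static bool ExpressibleSimply(int n)
{
    for (int a = 0; a <= 9; ++a)
        for (int b = 0; b <= 9; ++b)
            if (a * b == n || 9 * a + b == n)
                return true;
    return false;
}
```

Every n from 0 through 90 passes, and 91 is the first value that fails: (9 * 9) + 9 = 90 is the largest number the second form can reach, and 91 has no single-digit factorization.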

The function above generates the shortest possible expressions for all numbers in the range 0 through 90. I think its expressions for 91 through 101 are as short as possible, if non-obvious. But I know that it doesn’t always generate optimum expressions. For example, the generated expression for 10,000 is 85595*5+*** (it works out to (8 * 1250)). But 10,001 results in 99394*5+**4+*2+ (postfix for (9 * 1111) + 2). A shorter expression would be (10000 + 1) or, in postfix, 85595*5+***1+.
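When checking cases like these by hand gets tedious, a small postfix evaluator helps. Again, this is just a testing aid I use, not part of the puzzle solution:

```csharp
// Evaluates a postfix expression made of single digits and the four
// operators, e.g. "93*4+" evaluates to 31. A testing helper only.
static int EvaluatePostfix(string expression)
{
    var stack = new Stack<int>();
    foreach (char c in expression)
    {
        if (char.IsDigit(c))
        {
            stack.Push(c - '0');
            continue;
        }
        // Operator: pop the right operand first, then the left.
        int right = stack.Pop();
        int left = stack.Pop();
        switch (c)
        {
            case '+': stack.Push(left + right); break;
            case '-': stack.Push(left - right); break;
            case '*': stack.Push(left * right); break;
            case '/': stack.Push(left / right); break;
            default: throw new ArgumentException("bad operator: " + c);
        }
    }
    return stack.Pop();
}
```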

I’m at a loss as to how I would go about ensuring an optimum expression for any positive integer. I know that I could add more special cases, but I’m already disappointed by the existing special case in my algorithm for numbers 0 through 90. Even the check for even divisibility kind of bothers me, but that’s the only way I could find to make things reasonable at all.

If the puzzle interests you, I’d be interested in any insights.

Making me crazy

I don’t travel as much as I used to, so I’m not up on all the latest changes. The last time I traveled by air was two years ago, and somebody else made the reservations. I don’t remember the last time I booked a flight.

This evening I was making reservations to go to southern California. I typically just go to Southwest Airlines because their rates are competitive if not always the lowest, and I always get good service. But seeing the cost of the flight I thought I’d shop around. Several carriers showed much lower prices for the trip I wanted. Until I took into account restrictions, extra charges, etc. Their Web sites don’t make it easy for me to feel confident that I’m getting what I think I’m getting, and by the time I added everything up it wasn’t a whole lot cheaper than Southwest.

I entered my Rapid Rewards account number (Rapid Rewards is Southwest’s frequent flier program) with my flight information so I’d get points for the flight. Why not, right? But then I couldn’t check out. You see, my Rapid Rewards account has my name as Jim Mischel. But new (new to me, at least) government regulations (“Safe Travel” or some such) insist that the name on the ticket match the name on my driver’s license, which is James Mischel. Uff. Southwest’s helpful Web site suggested that I change the name on my Rapid Rewards account.

But the Rapid Rewards account information page says:

For security purposes, we do not allow name changes online. Instead, please forward your Rapid Rewards card, along with photocopies of legal documentation (ex. driver license, marriage certificate, etc.) and an informal letter indicating your legal name, to Rapid Rewards, P.O. Box 36657, Dallas, TX 75235.

Just shoot me.

The only solution I could come up with was to remove the Rapid Rewards number. So I won’t get my points for the flight. Probably wouldn’t matter, anyway; I don’t fly enough for the points to mean anything.

Ain’t technology wonderful?

Bird in a branch

A carving friend who vacations in Colorado brought me a piece of Bristlecone pine. Fred is quite an accomplished carver who I think started about the time I did. He’s concentrated on carving faces, mostly out of Aspen. He also does stylized carvings from many different types of wood. The sycamore gecko I carved is from one of his patterns, and his cedar armadillo was the inspiration for my mesquite armadillo, although I didn’t use his pattern.

The Bristlecone pine stayed on the floor of my truck for a couple of months while I tried to figure out what to do with it. I thought about carving one of my birds to add to the collection, but for some reason couldn’t bring myself to chop up that piece of wood just for a bird. Last week I finally figured it out: I’d carve a bird in the branch. It’s a kind of carving I hadn’t attempted before.

I think what convinced me to try it was the fragment of a small limb that was sticking out from what was otherwise a fairly straight and boring branch. I decided to use that limb as part of the bird’s tail. I wish I’d taken pictures from start to finish. Unfortunately, I just got two after spending some time roughing it out.

Rough carved bird. Note that the carving is just sitting on the bandsaw table. I did the rough carving with my Foredom power carver. Do not get the idea that I cut this out with the bandsaw.

As I said, this was new territory for me. I’d always started my bird carvings with a bandsaw cutout. Not here. I had to carve around the bird figure with my Foredom power carver. Carving a figure that remains part of the larger piece of wood is quite different. It can be difficult at times because I can’t just turn the thing over and carve from a different direction. The uncut part of the branch often got in the way. Detailing the beak was particularly difficult because I couldn’t easily get the Foredom in there, even with the detailing handpiece.

Roughing out took me an hour or two on Saturday. Sunday I spent two or three more hours roughing out and then detailing the bird. The result is a little thin and not quite symmetrical, but I thought it turned out pretty nice.

The bird figure is very smooth and sanded with 220 grit. The rest of the carved wood is sanded with 120 grit, and not perfectly smooth. I left some dips. I thought about trying to texture it like a nest, but didn’t have a good idea of what that should look like. Rather than do something that detracted from the carving, I just sanded and called it good enough.

The finish is two coats of Watco Natural Danish Oil, which I applied to the entire piece–including the uncarved wood. I haven’t yet decided if I should add a couple of coats of spray polyurethane to give it a little shine. We’ll see.

I made plenty of mistakes on this piece, especially in the bird shape. But I understand how and why I made them, and figure I can do better next time. I especially liked doing this with the bird shape because it’s a familiar subject, just rendered a little differently. I find trying to do something familiar in a new way to be an effective learning experience.

All things considered, it was a fun project and a good learning experience. And I like the way it turned out.

Heaps: constructors for the DHeap class

Continuing with my implementation of a D-ary heap, after a rather long break during which I was busy with other things . . .

The hardest part of implementing the DHeap class was getting the constructors right. Once those are right, the rest of the class just falls into place. The constructors were difficult because there are four different parameters that control construction, and I wanted to avoid creating a different constructor for every possible combination of parameters. With a little thought and judicious use of default parameters, I was able to do everything with three constructors: one private and two internal.

Internally, the DHeap class depends on four fields:

    // Heap item comparer
    private readonly IComparer<T> _comparer;

    // Result comparer (for Min-heap and Max-heap)
    private readonly Func<int, bool> _resultComparer;

    // Number of children for each node.
    private readonly int _ary;

    // List that holds the heap items
    private readonly List<T> _items;

I used List<T> here rather than an array (i.e. T[]) to take advantage of the automatic growth that the List class provides. I could have implemented my own dynamic array and saved a few cycles, but the slight performance gain really isn’t worth the extra effort.

The _comparer is an IComparer<T> implementation that compares two items. Clients can pass a custom comparer if their type doesn’t implement IComparable<T>, or if they want to compare items differently than the default comparer does. If no custom comparer is supplied, the constructors will use Comparer<T>.Default.

We discussed the _resultComparer in a previous post. It’s there so that we can use the same code for a Min-heap and for a Max-heap. The MinDHeap<T> and MaxDHeap<T> derived classes pass the appropriate comparison function as a parameter to the constructor.

The _ary field determines the “arity” of the heap: passing 2 creates a binary heap, passing 4 creates a 4-heap, etc.
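The comparison code itself comes later in the implementation, but to make the two-comparer trick concrete, here is roughly the shape it takes (a sketch of my own, with a hypothetical ShouldPromote name; the actual DHeap method may differ): the item comparer produces an int, and the result comparer turns that int into a heap-ordering decision.

```csharp
// Hypothetical sketch of how the two comparers combine. A Min-heap
// passes r => r > 0 as the result comparer: promote the child when
// the parent compares greater. A Max-heap passes r => r < 0,
// reversing the order with no branching on heap type.
static bool ShouldPromote<T>(T child, T parent,
    IComparer<T> comparer, Func<int, bool> resultComparer)
{
    return resultComparer(comparer.Compare(parent, child));
}
```

With Comparer&lt;int&gt;.Default, ShouldPromote(3, 7, cmp, r => r > 0) is true (3 rises in a Min-heap), and ShouldPromote(7, 3, cmp, r => r < 0) is true (7 rises in a Max-heap).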

The constructors allow the client to supply values for _comparer, _ary, and _resultComparer. In addition, clients can specify a capacity, which determines the initial capacity of the heap. Doing so prevents unnecessary reallocations, in the same way that the capacity parameter to the List&lt;T&gt; constructor does.

Given that, the class with its constructors and internal fields looks like this:

    [DebuggerDisplay("Count = {Count}")]
    public class DHeap<T>: IHeap<T>
    {
        private readonly IComparer<T> _comparer;
        private readonly Func<int, bool> _resultComparer;

        private readonly int _ary;

        private readonly List<T> _items = new List<T>();

        private DHeap(int ary, Func<int, bool> resultComparer,
            IComparer<T> comparer = null)
        {
            _ary = ary;
            _comparer = comparer ?? Comparer<T>.Default;
            _resultComparer = resultComparer;
        }

        internal DHeap(int ary, Func<int, bool> resultComparer, int capacity,
            IComparer<T> comparer = null)
            : this(ary, resultComparer, comparer)
        {
            if (ary < 2)
            {
                throw new ArgumentOutOfRangeException("ary",
                    "must be greater than or equal to two.");
            }
            if (capacity < 0)
            {
                throw new ArgumentOutOfRangeException("capacity",
                    "may not be negative.");
            }
            _items = new List<T>(capacity);
        }

        internal DHeap(int ary, Func<int, bool> resultComparer,
            IEnumerable<T> collection, IComparer<T> comparer = null)
            : this(ary, resultComparer, comparer)
        {
            if (ary < 2)
            {
                throw new ArgumentOutOfRangeException("ary",
                    "must be greater than or equal to two.");
            }
            if (collection == null)
            {
                throw new ArgumentNullException("collection",
                    "may not be null.");
            }
            _items = new List<T>(collection);
            Heapify();
        }
    }

DHeap&lt;T&gt; is a base class that is intended to be used only by the MinDHeap&lt;T&gt; and MaxDHeap&lt;T&gt; derived classes. The private constructor sets the common fields from the supplied parameters, and the two internal constructors are called by the derived classes’ constructors. Because there are no public or protected constructors, no class outside the assembly can derive from this one.

I considered making the constructors public so that any client could derive from this class, but doing so would necessitate making most of the methods virtual, and the private fields protected so that derived classes could manipulate them. But if a derived class wanted to override, say, Insert, it would likely be changing almost all of the implementation. At that point, is it really a derived class or a new class entirely? It seemed more reasonable to provide the IHeap<T> interface for those who want to create a different type of heap.

The ary and resultComparer parameters are common to both of the accessible constructors, and must be specified. The comparer parameter also is common to both, but it can be defaulted (i.e. not supplied). The only real difference between the two constructors is that one takes an initial capacity and the other takes a collection from which the heap is initialized. Both constructors call the private constructor to set the common values, and then initialize the heap.

Clients don’t create instances of DHeap<T>. Instead, they create instances of MinDHeap<T> and MaxDHeap<T>. Those classes each consist of a static comparison function (the result comparer) and three constructors that simply chain to the DHeap<T> constructors.

    public class MinDHeap<T> : DHeap<T>
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static bool MinCompare(int r)
        {
            return r > 0;
        }

        /// <summary>
        /// Initializes a new instance that is empty
        /// and has the specified initial capacity.
        /// </summary>
        public MinDHeap(int ary, int capacity, IComparer<T> comparer = null)
            : base(ary, MinCompare, capacity, comparer)
        {
        }

        /// <summary>
        /// Initializes a new instance that is empty
        /// and has the default initial capacity.
        /// </summary>
        public MinDHeap(int ary, IComparer<T> comparer = null)
            : this(ary, 0, comparer)
        {
        }

        /// <summary>
        /// Initializes a new instance that contains
        /// elements copied from the specified collection.
        /// </summary>
        public MinDHeap(int ary, IEnumerable<T> collection,
            IComparer<T> comparer = null)
            : base(ary, MinCompare, collection, comparer)
        {
        }
    }

    public class MaxDHeap<T> : DHeap<T>
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static bool MaxCompare(int r)
        {
            return r < 0;
        }

        /// <summary>
        /// Initializes a new instance that is empty
        /// and has the specified initial capacity.
        /// </summary>
        public MaxDHeap(int ary, int capacity, IComparer<T> comparer = null)
            : base(ary, MaxCompare, capacity, comparer)
        {
        }

        /// <summary>
        /// Initializes a new instance that is empty
        /// and has the default initial capacity.
        /// </summary>
        public MaxDHeap(int ary, IComparer<T> comparer = null)
            : this(ary, 0, comparer)
        {
        }

        /// <summary>
        /// Initializes a new instance that contains
        /// elements copied from the specified collection.
        /// </summary>
        public MaxDHeap(int ary, IEnumerable<T> collection, IComparer<T> comparer = null)
            : base(ary, MaxCompare, collection, comparer)
        {
        }
    }

I marked the result comparison methods (MinCompare and MaxCompare) with an attribute that tells the compiler to inline them aggressively. Considering that comparison is what determines how fast the heap runs, it pays to make those comparisons as inexpensive as possible, within certain limits. Adding the attribute provides a measurable performance gain at very little cost in code size or complexity.

Clients that want to create a binary Min-heap of integers can write:

    var myHeap = new MinDHeap<int>(2);

To create a ternary (3-ary) Max-heap of some custom type, with an initial capacity of 1,000 and a custom comparison function:

    var myHeap = new MaxDHeap<MyType>(3, 1000, new CustomComparer());

That assumes, of course, that somewhere there is a class called CustomComparer that implements IComparer&lt;MyType&gt;.
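MyType and CustomComparer are just placeholder names for the example; such a comparer might look like this:

```csharp
// Placeholder types for the example above; the names are
// illustrative, not part of the heap library itself.
public class MyType
{
    public int Priority { get; set; }
}

public class CustomComparer : IComparer<MyType>
{
    public int Compare(MyType x, MyType y)
    {
        // Order heap items by Priority; the heap itself never
        // needs to know how items are compared.
        return x.Priority.CompareTo(y.Priority);
    }
}
```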

Next time I’ll complete implementation of the DHeap<T> class and give you a link to where you can download the code and some examples.

Fun with text encodings

Text encoding is a programmer’s nightmare. Life was much simpler when I didn’t have to know anything other than the ASCII character set. Even having to deal with extended ASCII–characters with values between 128 and 255–wasn’t too bad. But when we started having to work with international character sets, multibyte character sets, and the Unicode variants, things got messy quickly. And they’re going to remain messy for a long time.

When I wrote a Web crawler to gather information about media files on the Web, I spent a lot of time making it read the metadata (ID3 information) from MP3 music files. Text fields in ID3 Version 2 are marked with an encoding byte that says which character encoding is used. The recognized encodings are ISO-8859-1 (an 8-bit character set for Western European languages, including English), two 16-bit Unicode encodings, and UTF-8. Unfortunately, many tag editors would write the data in the computer’s default 8-bit character set (Cyrillic, for example) and mark the fields as ISO-8859-1.

That’s not a problem if the resulting MP3 file is always read on that one computer. But then people started sharing files and uploading them to the Web, and the world fell apart. Were I to download a file that somebody in Russia had added tags to, I would find the tags unreadable because I’d be trying to interpret his Cyrillic character set as ISO-8859-1. The result is commonly referred to as mojibake.
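You can reproduce the effect in a few lines. This sketch assumes Windows-1251 for the Cyrillic side; note that on .NET Core and later, code-page encodings require registering the provider from the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ so that code-page encodings
        // like Windows-1251 are available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Tag writer: encodes Cyrillic text with the local code page
        // (Windows-1251 here) but marks the field as ISO-8859-1.
        byte[] raw = Encoding.GetEncoding(1251).GetBytes("привет");

        // Tag reader: believes the marker and decodes as ISO-8859-1.
        string garbled = Encoding.GetEncoding("ISO-8859-1").GetString(raw);

        Console.WriteLine(garbled); // prints "ïðèâåò" -- mojibake
    }
}
```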

The Cyrillic-to-English problem isn’t so bad, by the way, but when the file was originally written with, say, ShiftJIS, it’s disastrous.

My Web crawler would grab the metadata, interpret it as ISO-8859-1, and save it to our database in UTF-8. We noticed early on that some of the data was garbled, but we didn’t know quite what the problem was or how widespread it was. Because we were a startup with too many things to do and not enough people to do them, we just let it go at the time, figuring we’d get back to it.

When we did get back to it, we discovered that we had two problems. First, we had to figure out how to stop getting mangled data. Second, we had to figure a way to un-mangle the millions of records we’d already collected.

To correct the first problem, we built trigram models for a large number of languages and text encodings. Whenever the crawler ran across a field that was marked as containing ISO-8859-1, it would run the raw, undecoded bytes through the language models to determine the likely encoding. The crawler then used that encoding to interpret the text. That turned out to be incredibly effective, and almost eliminated the problem of adding new mangled records.
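The idea can be sketched in a few lines. This toy version (the names and shape are mine; the production system used smoothing, priors, and far larger models) decodes the bytes under each candidate encoding and picks the one whose trigrams the model finds least surprising:

```csharp
// Toy trigram scorer: given candidate encoding names mapped to
// per-trigram log-probabilities, decode the raw bytes under each
// candidate and return the name with the highest total score.
// A sketch of the idea only, not the actual crawler code.
static string GuessEncoding(byte[] raw,
    Dictionary<string, Dictionary<string, double>> trigramLogProbs)
{
    string best = null;
    double bestScore = double.NegativeInfinity;

    foreach (var model in trigramLogProbs)
    {
        string text = Encoding.GetEncoding(model.Key).GetString(raw);
        double score = 0.0;
        for (int i = 0; i + 3 <= text.Length; ++i)
        {
            // Unknown trigrams get a flat penalty rather than
            // driving the score to negative infinity.
            score += model.Value.TryGetValue(text.Substring(i, 3), out double lp)
                ? lp : -20.0;
        }
        if (score > bestScore)
        {
            bestScore = score;
            best = model.Key;
        }
    }
    return best;
}
```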

Fixing the existing mangled records turned out to be a more difficult proposition. The conversion from mangled ISO-8859-1 to UTF-8 resulted in lost data in some circumstances, and we couldn’t do anything about that. In other cases the conversion resulted in weird accented characters intermixed with what looked like normal text. It was hard to tell for sure sometimes because none of us knew Korean or Chinese or Greek or whatever other language the original text was written in. Un-mangling the text turned out to be a difficult problem that we never fully solved in the general case. We played with two potential solutions.

The first step was the same for both solutions: we ran text through the trigram model to determine if it was likely mangled, and what the original language probably was.

For the first attempt, we stepped through the text character by character, using the language model to tell us which trigrams were likely at each position, and compared them against the trigram that actually appeared. If we ran across a trigram that was very unlikely or totally unknown (for example, the trigram “zqp” is not likely to occur in English), we’d replace the offending trigram with one of the highly likely ones. This required a bit of backtracking, and it often generated some rather strange results. The solution worked, after a fashion, but not well enough.

For the second attempt we selected several mangled records for every language we could identify and then re-downloaded the metadata. By comparing the mangled text with the newly downloaded and presumably unmangled text, we created a table of substitutions. So, for example, we might have determined that “zqp” in mangled English text should be replaced by the word “foo.” (Not that we had very much mangled English text.) We would then go through the database, identify all the mangled records, and make the substitutions as required.

That approach was much more promising, but it didn’t catch everything and we couldn’t recover data that was lost in the original translation. Ultimately, we decided that it wasn’t a good enough solution to go through the effort of applying it to the millions of mangled records.

Our final solution was extremely low-tech. We generated a list of the likely mangled records and created a custom downloader that went out and grabbed those records again. It took a lot longer, but the result was about as good as we were likely to get.