Good riddance to rubbish writing

In “Gen Z never learned cursive. The effects of this are more widespread than you think” (which references an October 2022 Atlantic article, “Gen Z Never Learned To Read Cursive”), the author describes the potential effects of discontinuing cursive writing instruction. In truth, the first article just expands a bit on one or two of the points made in the Atlantic piece.

A note before I continue. Cursive is actually any form of connected writing. What most Americans refer to as “cursive” is the Palmer Method script that was introduced in the early 20th century, which most of us learned to fear and loathe, and which many of us (myself included) can’t write legibly today. In the text below, I use “cursive” to mean the Palmer Method script and its variants.

The Palmer Method, by the way, is a refinement of the Spencerian Method of writing that was introduced in the 1840s. Both teach a method of writing that’s optimized for 17th century pens. A primary consideration was keeping the pen on the paper, because every time the pen was lifted and then placed back onto the paper, the ink tended to blot. These methods go to sometimes absurd lengths to keep the pen on the paper, even when doing so is less than optimal. The introduction of the inexpensive mass-produced ball point pen eliminated the ink blotting problem and thus the necessity of always keeping the pen on the paper. But the script lives on.

The big complaint about no longer teaching cursive is that “the past is presented to us indirectly.” That is, because historical documents are written in script that students aren’t taught to read and write, they have to depend on transcription if they want to read the documents. There’s also some complaint that kids won’t be able to read letters from grandma.

And it’s true: without instruction and practice, cursive writing is difficult or impossible to read. Fortunately, it takes about a week to learn how to read it. Maybe a bit of practice after that, but if a child knows how to read block printing, learning to read the cursive script that’s been the mainstay of elementary writing instruction for a century is very easy.

In addition, the cursive script that’s been taught since the early 1900s isn’t even the same script that was used in our founding documents. The Declaration of Independence, perhaps the historical document most often cited in this argument, was written in Round hand, a script that originated in England in the 17th century. Learning to read modern cursive writing certainly helps in reading that document, but it still requires a bit more puzzling out.

The point is, the important historical documents have all been rendered in print. There’s little or nothing to be gained for most people in reading the originals. There’s the incredibly weak argument that scholars will have to be taught cursive the way they’re taught Elizabethan, medieval, or ancient cuneiform scripts. My response is, “duh.” Digging deeply into the past requires specialized skills. The ability to read cursive today is just a little bit more important than knowing how to hitch a horse to a buggy.

The “past is presented indirectly” argument for learning cursive writing just doesn’t fly. It’s as relevant to day-to-day life as the idiotic argument that those who can’t drive a car with a manual transmission are somehow not “real” drivers. It ranks right up there with the “when I was a boy, we had to trudge through three feet of snow, in the dark, uphill both ways to use the bathroom” stories we laugh about.

To be fair, there are benefits of learning cursive. First, it’s much faster than block printing, and there is that ability to read letters from grandma. It helps in developing fine motor skills, and studies show that students with dyslexia find learning cursive helps with the decoding process. However, people with dysgraphia are hindered by being forced to write in cursive.

Although I haven’t had to depend on my ability to write in cursive for at least 40 years, I do agree that many people must be able to write legibly and more quickly than they can with just block printing. But we should be teaching kids a different method. There are other connected writing systems that are easier to learn, faster, and easier to write legibly than the antiquated loopy Palmer Method that was designed 100 years ago for writing with a quill pen: a technology that was obsolete before the Palmer Method was introduced. (The ball point pen was invented in the 1880s.) For example, Getty-Dubay Italic, introduced in 1976, is much easier to read and write, and doesn’t require the loops and other forms that were necessary in older scripts to prevent the ink from blotting. There are other, similar, writing styles that are optimized for today’s writing instruments.

Whether those newer methods carry with them the benefits of developing fine motor skills and assistance to dyslexic students is an open question. I think it likely. I also suspect that it would have less of a negative impact on those with dysgraphia because the letter forms are so similar to the printed letter forms. As for letters from grandma, that’s becoming irrelevant, too. At 62, I’m older than a lot of grandmas and I’ll tell you from experience that a lot of them write cursive as well and as often as I do: illegibly and almost never. Letters from grandma that are written in cursive will pretty much cease to exist in my lifetime.

It’s true that trying to write those modern scripts using a 17th century quill pen would be as disastrous as taking a horse and buggy on an Interstate highway. Sometimes you have to discard old things and embrace the unquestionable benefits of new technology. Unless you like having to check for black widow spiders after trudging through the snow to the outhouse.

Posted in Uncategorized | Tagged , , | Comments Off on Good riddance to rubbish writing

Translation difficulties

I get it: translation is hard. Heck, I’m a reasonably bright native English speaker and often have difficulty translating my own thoughts into understandable English.

This is a message that was posted in a woodcarving group:

“Hello, I am writing a message to help my father. And I see myself. Only on the American or Canadian woodcarving site and no response. It’s just for the books. And politeness. It’s when it’s repetitive that it’s not funny. But you how many millions to be connected. I find that very embarrassing. Administrators must take their jobs seriously. I have already reported them, I pass the imfermire contest as if I was going to sew up a person at any time, have a nice day everyone.”

The author’s native language is, I think, Italian. Or perhaps French. I suspect he’s not an English speaker, although it’s possible that his grasp of English is better than my grasp of his native language. I cannot tell if the message is the result of automatic translation, or if the author did the translation himself with the help of an Italian-to-English dictionary. Either way, I cannot make any sense of it.

Which is weird. I’ve seen bad translations before. But usually I can get the gist of a message that’s been automatically translated: a “hook” that gives me a broad idea, and from there puzzle out a few details. For example, the word “imfermire” in the above text looked promising. It looks like a misspelling of the Italian word “infermiere,” meaning “nurse.” The best I can guess is that the author is having trouble getting some woodcarving books for his father. Not sure where the nurse comes in.

The author’s responses to comments provide no useful information. Which isn’t too much of a surprise. I imagine he has to translate the question, then write and translate a response. The combined errors inherent in that process aren’t conducive to understanding. Automatic translation software is especially bad at round-tripping because errors accumulate very quickly.

Can the technology that powers the new crop of generative AIs be put to good use in the automatic translation space? I imagine feeding an Italian-to-English translation to a tool that can leverage its knowledge of translation errors and spit out a short and meaningful summary. Is such a tool within our grasp?

Posted in AI | Tagged , | Comments Off on Translation difficulties

Ask an AI: whittling or wood carving?

A friend of mine asked a generative AI (Nova: Powered by ChatGPT & GPT 3.5 Turbo) about the difference between whittling and wood carving. This is the response she got:

Wood carving and whittling have been practised for thousands of years, dating back to ancient civilizations such as the Egyptians and Greeks. The early humans used wood carving to make various objects like weapons, tools, utensils, and figurines.

It is believed that the practice of carving wood began as an extension of the use of stone tools, as people realized that wood was a more versatile material than stone. Over time, wood carving became a widely practiced art form, with artisans developing diverse styles and techniques.

Whittling, on the other hand, is a type of wood carving that involves using a knife to remove a small amount of wood from a larger piece. It is believed to have originated in Europe during the Middle Ages, where it was primarily used to make simple household objects like wooden spoons, bowls, and cups.

Whittling gained popularity in the United States during the 18th and 19th centuries, when it became a popular hobby for colonists and pioneers. Today, wood carving and whittling remain popular artistic expressions, and practitioners continue to develop new styles, techniques, and forms.

This reminds me of something a not-very-bright fourth grader would turn in after consulting the encyclopedia and visiting the local branch library. Bad paraphrasing of randomly selected factual (or maybe not) statements strung together into paragraphs that are tenuously cohesive and sometimes self-contradictory. With a little fluff thrown in to show that it really was the student’s work. The student would receive a “C” for such work if the teacher were especially kind and impressed with the diligent research, good spelling (although there’s that curious use of “practised” rather than “practiced”), and penmanship.

That said, the article does answer the question: whittling is a type of wood carving. The rest of the article is mostly gibberish, sprinkled as it is with unsupported factual statements, uninformed speculation, and known falsehoods. But whittling really is a type of wood carving.

Exactly what constitutes whittling is an open question. Merriam-Webster defines “whittle” as a transitive verb:

1a. to pare or cut off chips from the surface of (wood) with a knife
1b. to shape or form by so paring or cutting
2.  to reduce, remove, or destroy gradually as if by cutting off bits with a knife

By that definition, whittling is wood carving done with a knife. If you are carving wood with a knife, you are whittling. According to the dictionary. But that definition is not universally accepted. If you ask five wood carvers the difference, you’re going to get at least five answers. In my experience, most of those answers are of the “I know it when I see it” variety. Some say that it has to do with the level of planning involved. But everybody’s line is set differently. To some, anything more complex than a sharpened stick is “carving.” To others, anything carved from a stick found on the ground is “whittling.” Some put a time limit on it. Others base their judgement on the quality or purpose of the final product. My primitive carved knives and forks might be “whittling,” for example, but my friend’s beautifully carved and decorated (all using just a knife) replica dagger is a “carving.”

I like the dictionary definition. All the other definitions implicitly and sometimes not so implicitly make value judgements that amount to “whittling is just passing time, whereas carving is creating something of value.”

In any case, I’d be interested to know if anybody would find the AI-generated response to be anything other than gibberish. Elementary and secondary educators should be exposing students to this type of answer and pointing out the obvious flaws (unsupported and contradictory statements, wandering paragraphs, etc.) so that students can learn to spot them. It’ll be a while (decades, at least) before these generative AIs can write a freshman term paper that would get past an instructor who’s paying attention. It’s probably a good idea to be able to spot AI-generated content so you don’t make the mistake of depending on it.

Posted in Software development, Wood carving | Tagged , , , | Comments Off on Ask an AI: whittling or wood carving?

A fun little puzzle

I ran across this puzzle about 10 years ago. Although it didn’t take me long to come up with the optimum solution, I find the puzzle interesting because it just might be a good interview question for programmers at different levels of experience.

The puzzle:

Given an unsorted array of distinct integers, arrange them such that a1 < a2 > a3 < a4 > a5 < a6 ... So, for example, given the array [9,6,3,1,4,8], one valid arrangement would be [6,9,1,8,3,4]

The obvious solution is to sort the numbers and then interleave them. Put the smallest number at a1, the largest at a2, second smallest at a3, etc. Sorted, our sample array is [1,3,4,6,8,9]. If we interleave the numbers as I describe, the result is [1,9,3,8,4,6]. It’s easy enough to check if that’s right: just replace the commas with alternating < and >: [1<9>3<8>4<6].
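As a rough C# sketch of that approach (the method name is mine, and it assumes distinct values and the usual using System;):

static int[] SortAndInterleave(int[] a)
{
    int[] sorted = (int[])a.Clone();
    Array.Sort(sorted);

    int[] result = new int[a.Length];
    int lo = 0, hi = sorted.Length - 1;
    for (int i = 0; i < result.Length; i++)
    {
        // Even positions take the next-smallest value, odd positions the next-largest.
        result[i] = (i % 2 == 0) ? sorted[lo++] : sorted[hi--];
    }
    return result;
}

Running it on [9,6,3,1,4,8] produces [1,9,3,8,4,6], matching the interleaving described above.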

Any programmer fresh out of school should be able to come up with that solution and code it, and tell me that the complexity is O(n log n) because of the sort. If a junior programmer supplied that solution to the problem, I’d be satisfied.

But there is an O(n) solution that I’d expect an intermediate or senior engineer to discover if prompted. That is, if they told me the solution was to sort and interleave, I’d ask them if they could think of a solution with lower expected running time.

Given any three successive numbers in the array, there are four possible relationships:

  1. a < b < c
  2. a < b > c
  3. a > b < c
  4. a > b > c

The relationship we desire is (2).

In case (1), the first condition is already met (a < b), and we can swap b and c to give us a < c > b.

In case (2) both conditions are met so we don’t have to do anything.

In case (3), we swap a and b to give us b < a ? c. But we already know that b < c, so if we have to swap a and c to meet the second condition, the first condition is still met.

In case (4), we know that a > c, so if we swap a and b to meet the first condition, the second condition is still met.

Now, add a fourth number to the sequence. You have a < b > c ? d. If it turns out that c < d then there’s nothing to do. If c > d then we have to swap them. But that doesn’t mess up the previous condition because if b > c > d then by definition b > d.

You use similar logic to add the fifth number. You have a < b > c < d ? e. If d > e then there’s nothing to do. If d < e then by definition c < e, so swapping d and e doesn’t invalidate anything that comes before.

That understanding leads to this pseudocode that makes a single pass through the array, swapping adjacent values as required:

for i = 0 to n-2
    if i is even
        if (a[i] > a[i+1])
            swap(a[i], a[i+1])
        end if
    else // i is odd
        if (a[i] < a[i+1])
            swap(a[i], a[i+1])
        end if
    end if
end for

I find that very elegant.
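Here’s a minimal C# rendering of that pseudocode (the method name is mine):

static void WiggleSort(int[] a)
{
    for (int i = 0; i < a.Length - 1; i++)
    {
        // At even i we need a[i] < a[i+1]; at odd i we need a[i] > a[i+1].
        bool outOfOrder = (i % 2 == 0) ? a[i] > a[i + 1] : a[i] < a[i + 1];
        if (outOfOrder)
        {
            int tmp = a[i];
            a[i] = a[i + 1];
            a[i + 1] = tmp;
        }
    }
}

On the sample array [9,6,3,1,4,8] it produces [6,9,1,4,3,8], another valid arrangement: 6<9>1<4>3<8.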

Posted in Software development | Tagged | Comments Off on A fun little puzzle

Birthdays, random numbers, and hash keys

You’re probably familiar with the Birthday Paradox, either from your introductory statistics class or from hanging around computer programmers for too long. Briefly, the math shows that the odds of any two people in a group having the same month and day of birth are quite a bit higher than you would intuitively expect. How high? Given a group of just 23 people, the odds are better than 50% that two of those people will share the same birthday. With 50 people in a group, the odds of two sharing the same birthday are above 97%, and by the time you have 60 people, it’s nearly a sure thing.

Testing the birthday paradox

This falls into the realm of what I call “math that just don’t feel right.” Nevertheless, it’s true. We can demonstrate it with a simple program that selects random numbers in the range of 0 to 364 until it selects a duplicate.

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    const int NumKeys = 365;
    const int Thousand = 1000;
    const int Million = Thousand * Thousand;
    const int NumTries = 10 * Million;
    static void Main(string[] args)
    {
        // List contains number of items to collision for each iteration.
        List<int> Totals = new List<int>();
        // Do the test NumTries times, saving the result from each attempt.
        for (int i = 0; i < NumTries; ++i)
        {
            int n = DoTest(NumKeys);
            Totals.Add(n);
            if ((i % 100000) == 0)
            {
                Console.WriteLine("{0,5:N0} - {1:N0}", i, n);
            }
        }
        // Group the items by their result.
        var grouped = from cnt in Totals
                      group cnt by cnt into newGroup
                      orderby newGroup.Key
                      select newGroup;
        // Output a table of results and counts, with running percentage.
        int runningTotal = 0;
        foreach (var grp in grouped)
        {
            runningTotal += grp.Count();
            Console.WriteLine("{0,3:N0}: {1,7:N0} ({2,8:P2})  ({3,8:P6})", grp.Key, grp.Count(),
                (double)grp.Count() / NumTries, (double)runningTotal / NumTries);
        }
        Console.WriteLine("Min items to collision = {0:N0}", Totals.Min());
        Console.WriteLine("Max items to collision = {0:N0}", Totals.Max());
        Console.WriteLine("Average items to collision = {0:N0}", Totals.Average());
        Console.ReadLine();
    }
    // Random number generator used for all attempts.
    static Random rnd = new Random();
    static int DoTest(int nkey)
    {
        HashSet<int> keys = new HashSet<int>();
        // Pick random numbers in the range [0..nkey-1]
        // until a duplicate is selected.
        int key;
        do
        {
            key = rnd.Next(nkey);
        } while (keys.Add(key));
        return keys.Count;
    }
}

The program isn’t as complicated as it looks. I structured it so that you can run the test many times and then aggregate the results. In this particular case, I run 10 million iterations.

The result of each iteration (the count of random numbers selected before a duplicate was found) is saved in the Totals list. When all iterations are done, the code groups the results so that we know how many times each result occurred. Finally, the grouped results are output with a running percentage. Here’s the output from a sample run:

  1:  27,322 (  0.27 %)  (0.273220 %)
  2:  54,962 (  0.55 %)  (0.822840 %)
  3:  81,735 (  0.82 %)  (1.640190 %)
  4: 107,442 (  1.07 %)  (2.714610 %)
  5: 132,999 (  1.33 %)  (4.044600 %)
  6: 157,841 (  1.58 %)  (5.623010 %)
  7: 181,920 (  1.82 %)  (7.442210 %)
  8: 203,602 (  2.04 %)  (9.478230 %)
  9: 223,787 (  2.24 %)  (11.716100 %)
 10: 241,462 (  2.41 %)  (14.130720 %)
 11: 258,365 (  2.58 %)  (16.714370 %)
 12: 274,581 (  2.75 %)  (19.460180 %)
 13: 286,809 (  2.87 %)  (22.328270 %)
 14: 298,008 (  2.98 %)  (25.308350 %)
 15: 306,072 (  3.06 %)  (28.369070 %)
 16: 314,766 (  3.15 %)  (31.516730 %)
 17: 318,592 (  3.19 %)  (34.702650 %)
 18: 323,361 (  3.23 %)  (37.936260 %)
 19: 323,649 (  3.24 %)  (41.172750 %)
 20: 322,924 (  3.23 %)  (44.401990 %)
 21: 319,847 (  3.20 %)  (47.600460 %)
 22: 316,430 (  3.16 %)  (50.764760 %)
 23: 310,096 (  3.10 %)  (53.865720 %)
 24: 303,122 (  3.03 %)  (56.896940 %)
 25: 295,341 (  2.95 %)  (59.850350 %)
 26: 285,219 (  2.85 %)  (62.702540 %)
 27: 276,433 (  2.76 %)  (65.466870 %)
 28: 265,044 (  2.65 %)  (68.117310 %)
 29: 253,646 (  2.54 %)  (70.653770 %)
 30: 241,188 (  2.41 %)  (73.065650 %)
 31: 228,559 (  2.29 %)  (75.351240 %)
 32: 216,584 (  2.17 %)  (77.517080 %)
 33: 203,608 (  2.04 %)  (79.553160 %)
 34: 190,222 (  1.90 %)  (81.455380 %)
 35: 177,870 (  1.78 %)  (83.234080 %)
 36: 165,091 (  1.65 %)  (84.884990 %)
 37: 153,658 (  1.54 %)  (86.421570 %)
 38: 141,501 (  1.42 %)  (87.836580 %)
 39: 129,984 (  1.30 %)  (89.136420 %)
 40: 118,849 (  1.19 %)  (90.324910 %)
 41: 108,598 (  1.09 %)  (91.410890 %)
 42:  98,732 (  0.99 %)  (92.398210 %)
 43:  89,560 (  0.90 %)  (93.293810 %)
 44:  80,692 (  0.81 %)  (94.100730 %)
 45:  72,640 (  0.73 %)  (94.827130 %)
 46:  65,559 (  0.66 %)  (95.482720 %)
 47:  58,525 (  0.59 %)  (96.067970 %)
 48:  51,877 (  0.52 %)  (96.586740 %)
 49:  46,329 (  0.46 %)  (97.050030 %)
 50:  40,333 (  0.40 %)  (97.453360 %)
 51:  35,917 (  0.36 %)  (97.812530 %)
 52:  31,270 (  0.31 %)  (98.125230 %)
 53:  27,409 (  0.27 %)  (98.399320 %)
 54:  23,349 (  0.23 %)  (98.632810 %)
 55:  20,482 (  0.20 %)  (98.837630 %)
 56:  17,441 (  0.17 %)  (99.012040 %)
 57:  15,390 (  0.15 %)  (99.165940 %)
 58:  13,378 (  0.13 %)  (99.299720 %)
 59:  11,302 (  0.11 %)  (99.412740 %)
 60:   9,638 (  0.10 %)  (99.509120 %)
 61:   8,199 (  0.08 %)  (99.591110 %)
 62:   6,851 (  0.07 %)  (99.659620 %)
 63:   5,876 (  0.06 %)  (99.718380 %)
 64:   4,843 (  0.05 %)  (99.766810 %)
 65:   4,077 (  0.04 %)  (99.807580 %)
 66:   3,433 (  0.03 %)  (99.841910 %)
 67:   3,098 (  0.03 %)  (99.872890 %)
 68:   2,388 (  0.02 %)  (99.896770 %)
 69:   1,903 (  0.02 %)  (99.915800 %)
 70:   1,541 (  0.02 %)  (99.931210 %)
 71:   1,308 (  0.01 %)  (99.944290 %)
 72:   1,210 (  0.01 %)  (99.956390 %)
 73:     893 (  0.01 %)  (99.965320 %)
 74:     666 (  0.01 %)  (99.971980 %)
 75:     588 (  0.01 %)  (99.977860 %)
 76:     463 (  0.00 %)  (99.982490 %)
 77:     369 (  0.00 %)  (99.986180 %)
 78:     290 (  0.00 %)  (99.989080 %)
 79:     235 (  0.00 %)  (99.991430 %)
 80:     185 (  0.00 %)  (99.993280 %)
 81:     161 (  0.00 %)  (99.994890 %)
 82:     102 (  0.00 %)  (99.995910 %)
 83:     103 (  0.00 %)  (99.996940 %)
 84:      59 (  0.00 %)  (99.997530 %)
 85:      55 (  0.00 %)  (99.998080 %)
 86:      48 (  0.00 %)  (99.998560 %)
 87:      40 (  0.00 %)  (99.998960 %)
 88:      29 (  0.00 %)  (99.999250 %)
 89:      16 (  0.00 %)  (99.999410 %)
 90:      15 (  0.00 %)  (99.999560 %)
 91:      11 (  0.00 %)  (99.999670 %)
 92:       8 (  0.00 %)  (99.999750 %)
 93:       4 (  0.00 %)  (99.999790 %)
 94:       3 (  0.00 %)  (99.999820 %)
 95:       7 (  0.00 %)  (99.999890 %)
 96:       3 (  0.00 %)  (99.999920 %)
 97:       2 (  0.00 %)  (99.999940 %)
 98:       1 (  0.00 %)  (99.999950 %)
 99:       1 (  0.00 %)  (99.999960 %)
100:       1 (  0.00 %)  (99.999970 %)
101:       2 (  0.00 %)  (99.999990 %)
102:       1 (  0.00 %)  (100.000000 %)
Min items to collision = 1
Max items to collision = 102
Average items to collision = 24

This result agrees with the math. It shows some other interesting things, too.

Look at how many times the second number selected was a duplicate of the first: 0.27%. And in more than one percent of the iterations, the fifth number selected was a duplicate! All told, the odds of finding a duplicate within the first 10 numbers selected are about 11%.

You can argue that birthdays aren’t exactly random, and you’d be right. Some days have more births than others. But that non-uniformity only makes a collision slightly more likely; the break-even point stays right around 23.

You can also argue that the numbers returned by System.Random aren’t truly random. Again, you’d be right. You could code up a better random number generator that uses the cryptographic services to generate numbers that are more nearly random, but you’ll find that it won’t have much of an effect on the results.

What does that have to do with hash keys?

Let’s say you’re writing a program that has to keep track of a few million strings and you want a quick way to index them and eliminate duplicates. A fast and easy solution is to use hash codes as keys, right? After all, there are four billion different 32-bit hash codes, and you’re just going to use a few million.

You can probably see where this is going. What’s the likelihood that you’ll run into a duplicate? That’s right, the Birthday Paradox raises its ugly head here. Let’s look at the math and see what it tells us.

What is the probability of getting a collision when selecting one million items out of a pool of four billion (actually, 2^32, or 4,294,967,296)?

According to the Wikipedia article, a coarse approximation, which is good enough for our purposes, is:

p = 1 - e^(-n^2 / (2 * d)), where d is the size of the field (2^32 in this case)

We can wrap that up in a neat little method:

static double DuplicateProb(long fieldSize, long nItems)
{
    double minusNSquared = -(nItems * nItems);
    return 1 - Math.Exp(minusNSquared / (2 * fieldSize));
}

A few tests will tell us if it’s working:

DuplicateProb(365, 10) = 0.128017828988458
DuplicateProb(365, 20) = 0.421863457482716
DuplicateProb(365, 23) = 0.515509538061517
DuplicateProb(365, 30) = 0.708547055710068
DuplicateProb(365, 50) = 0.967439570087906
DuplicateProb(365, 57) = 0.988329429309313
DuplicateProb(365, 100) = 0.999998876014983

Those numbers don’t agree exactly with the calculated probabilities in the Wikipedia article or with my experimental results above, but they’re close, within the range of what you’d expect for a coarse approximation.

So, then, about our one million items selected from a field of four billion:

DuplicateProb(4294967296, 1000000) = 1

What?!?!

How can that be possible? The probability of getting a duplicate within the first million items is so close to certainty that the computer calls it 1.0? This has to be a mistake, right?

It happens to be true. Unfortunately, it’s a bit difficult to prove with the birthday test program because Random.Next() only has a range of 0..2^31. I could modify the program to extend its range, but that’s really not necessary. Instead, I’ve run the program for increasing powers of two and present the results here:

Field Size       Min        Max    Average
2^16               4      1,081        316
2^20              21      4,390      1,281
2^30             342    142,952     41,498
2^31           1,036    193,529     57,606

These are the results of 10,000 runs each. Doing 10 million runs of the 2^16 case took quite a while, and the numbers weren’t much different. The only number that changed significantly was the Max value: the maximum number of items selected before a duplicate was found.

The table is more interesting when you replace the Min, Max, and Average values with their corresponding powers of two:

Field Size      Min      Max    Average
16                4    10.08       8.30
20             4.39    12.10      10.32
30             8.42    17.13      15.34
31            10.02    17.56      15.81

For every field size, the exponent of the average number of items selected before a collision is approximately half the exponent of the field size. In other words, the average is just a little more than the square root of the field size. That holds true for the birthday problem, too: the square root of 365 is about 19.1, and the 50% probability point is 23.

A really quick way to estimate your 50% probability point is to take the square root of the field size. That’s a bit aggressive, but it’ll put you in the ballpark.
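If you want something a little tighter than the bare square root, the standard birthday-problem approximation puts the 50% point near sqrt(2 * d * ln 2), or about 1.18 times the square root of the field size d. A quick sketch (this helper is mine, not part of the program above):

static double FiftyPercentPoint(long fieldSize)
{
    // Solve 0.5 = 1 - e^(-n^2/(2d)) for n, giving n = sqrt(2 * d * ln 2).
    return Math.Sqrt(2.0 * fieldSize * Math.Log(2.0));
}

FiftyPercentPoint(365) comes out to about 22.5 (the true answer is 23), and FiftyPercentPoint(4294967296) comes out to about 77,000.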

More interesting is the Max value in the table. If you think in powers of two, then the Max value is approximately two bits more than the 50% value. That is, 2^30 is a billion items. The square root is 2^15, or 32,768. Two more bits is 2^17, or 128K (131,072). The Max value for 2^30, above, is 142,952: close enough to 2^17 for me.

Given the above, what do you think the numbers for 2^32 would be? The average will be around 83,000, and the Max will be close to 300,000.

That’s right, by selecting only 300,000 items from a field of four billion, you’re almost guaranteed to find a duplicate. You have a 1% probability of collision with fewer than 10,000 items, and a 10% probability of collision with about 30,000 items selected.

It’s true that hash codes are not purely random, but a good hash function will give you very close to a uniform distribution. The problem is worse if the hash codes aren’t evenly distributed.

In case you haven’t got the point yet, don’t try to use a hash code as a unique identifier. The likelihood of getting a duplicate key is just too high.

I know what you’re thinking: what about making a 64-bit hash code? Before you get too excited about the idea, take a look at the Probability Table from that Wikipedia article. With a 64-bit hash code, you have one chance in a million of getting a duplicate when you’re hashing six million items. For one million items, it’d be closer to one chance in 40 million. That sounds pretty unlikely, but it really isn’t when you consider how often some programs run.

Granted, if you were only indexing a million items, you’d probably use something like a hash table that can resolve collisions. It’s when you get into very large numbers of items that you begin worrying about the overhead imposed by your indexing method. Using a hash code sounds like a good option, until you discover the likelihood of generating a duplicate. Even with a 64-bit hash code, you have a 1% chance of generating a duplicate with only 610 million items.

Hash codes are great for quickly determining if two items can’t possibly match. That is, if two strings hash to different values, then they positively are not the same. However, hash codes are not a good solution for uniquely identifying items, because the likelihood of two different items hashing to the same value is surprisingly high.

Posted in Software development | Tagged | Comments Off on Birthdays, random numbers, and hash keys

Subtraction is not the same as comparison

(Originally posted in somewhat different form on 11/21/2016.)

All too often, I run across integer comparison functions that work, but have a latent bug. It’s not a bug that you’ll run into very often but when you do run into it, it’s potentially catastrophic.

A common idiom for comparison is to have a function that, given two integers x and y will return a negative number if x < y, a positive number if x > y, and 0 if x == y. For example:

  1. Compare(1, 2) returns -1 because 1 < 2
  2. Compare(1, 1) returns 0 because 1 == 1
  3. Compare(1, 0) returns 1 because 1 > 0

The correct way to implement such a function is with cascading comparisons, like this:

static int WorkingCompare(int x, int y)
{
    if (x > y) return 1;
    if (x < y) return -1;
    return 0;
}

A clever programmer might figure out that you can get the same result by subtracting y from x. After all, (1-2) == -1, (1-1) == 0, and (1-0) == 1. So the comparison function becomes:

static int BrokenCompare(int x, int y)
{
    return x - y;
}

The aforementioned clever programmer will probably test it with a few mixtures of positive and negative numbers, get the right answers, and drop it into his program. Why not? After all, it’s less code and a single subtraction is going to be a whole lot faster than a couple of comparisons and branches. Right?

Except that it’s broken. It doesn’t always work. Imagine if x = int.MinValue and y = 1. Or x = int.MaxValue - 100 and y = -1000. Integer overflow rears its ugly head and you get the wrong answer! Don’t believe me? Here’s a program to test it.

static void Main(string[] args)
{
    DoCompare(1, 2);
    DoCompare(-1, -2);
    DoCompare(int.MinValue, 4);
    DoCompare(int.MaxValue-100, -1000);
}

static void DoCompare(int x, int y) 
{
    Console.WriteLine("Compare {0} and {1}.", x, y);
    Console.WriteLine("  Working comparison: {0}", WorkingCompare(x, y));
    Console.WriteLine("  Broken comparison: {0}", BrokenCompare(x, y));
}

And the output:

Compare 1 and 2.
  Working comparison: -1
  Broken comparison: -1
Compare -1 and -2.
  Working comparison: 1
  Broken comparison: 1
Compare -2147483648 and 4.
  Working comparison: -1
  Broken comparison: 2147483644  // Incorrect: -2147483648 is not greater than 4.
Compare 2147483547 and -1000.
  Working comparison: 1
  Broken comparison: -2147482749  // Incorrect: 2147483547 is not less than -1000

If x is sufficiently negative and y is sufficiently positive, or x is sufficiently positive and y is sufficiently negative, integer overflow will result in an incorrect result. It’s almost a certainty that the clever programmer who wrote that broken comparison function never even considered integer overflow.

“But,” you say, “I know my program won’t be dealing with numbers that big (or small).” Yeah, sure. Not now. But it might in the future. Worse, that comparison function is probably in some class that three years from now will be lifted out of your program into another that doesn’t have the same limits on the input data. Or some other programmer is looking for an example of how to write a custom comparer for his class, sees yours, and copies it to his program. Everything seems fine. Until, a few weeks or months or years later, the program runs into this bug and something breaks. Horribly. Perhaps it’s a custom comparer passed to a sorting function, and things aren’t sorted correctly. Or, perhaps worse, the sort doesn’t terminate because a > b and b > c, but for some odd reason a < c. I’ve seen something similar happen. Let me tell you that tracking down that bug wasn’t fun.

Part of our job as programmers is to write code that’s bulletproof whenever practical. And I’m not talking bulletproof just now, but also in the future. We can’t foresee all possible scenarios, but code re-use is a very common thing. Sometimes that means that we forego optimization opportunities because the risk of implementing them is too high.

Not that I’ve seen many programs in which the speed of an integer comparison function makes a big enough difference that I’d risk shipping broken code. If I ever did find myself in that situation (and there would have to be a very compelling reason), I would certainly put some big scary comment in there, explaining the limitations and warning others not to use the function for general purposes.
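For what it’s worth, in C# there’s rarely a reason to hand-roll the cascade at all: int.CompareTo already does a true comparison with no overflow risk. A trivial sketch:

static int SafeCompare(int x, int y)
{
    // Int32.CompareTo compares; it does not subtract, so it cannot overflow.
    return x.CompareTo(y);
}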

Posted in Software development | Tagged | Comments Off on Subtraction is not the same as comparison

Computer programs are not cats

(Originally published February 1, 2013)

I see questions like this fairly regularly:

I need to write a program that checks a database every five minutes, and processes any new records. What I have right now is this:

void Main()
{
    while (true)
    {
        CheckDatabaseForUpdates();
        Thread.Sleep(300000); // 5 minutes, in milliseconds
    }
}

Is this the best way, or should I be using threads?

I’ve said before that Thread.Sleep is a red flag indicating that something is not quite right. And when a call to Sleep is in a loop, it’s a near certainty that there is a real problem. Most Sleep loops can be re-coded to use a timer, which then frees the main thread to do other things.

In this example, the “other thing” the main thread could be doing is waiting for a prompt from the user to cancel the program. As it stands, the only way to close the program would be to abnormally terminate the process (Control+C, for example, or through Task Manager). That could be a very bad thing, because the program might be in the middle of updating the database when you terminate it. That’s a virtually guaranteed path to data corruption.

I’m not going to show how to refactor this little program to use a timer because doing so is fixing a solution that is fundamentally broken at a higher level.

How can this program be broken?

I’m so glad you asked.

The program is broken because it’s doing a poor job of implementing a feature that already exists in the operating system. The program is occupying memory doing nothing most of the time. It naps, wakes up periodically, looks around, maybe does some work, and then goes back to sleep. It’s like a cat that gets up to stretch every five minutes and maybe bats a string around for a bit before going back to sleep.

Operating systems already have a way to implement catnap programs. In the Linux world, the cron daemon does it. Windows has Task Scheduler and the schtasks command line utility. These tools let you schedule tasks to execute periodically. Using them has several advantages over rolling your own delay loop.

  • You can change or terminate the schedule without having to recompile the program. Yes, you could make the program read the delay time from a configuration file, but that just makes the fundamentally broken program a little more complicated.
  • Task schedules are remembered when the computer is restarted, freeing you from having to start it manually or figure out how to put it in the startup menu.
  • If a catnap program crashes, it must be restarted manually. A scheduled task, on the other hand, will log the error and run again at the next scheduled time.
  • The program doesn’t consume system resources when it’s not doing useful work. It’s like a cat that’s not even there when it’s sleeping. (Which would make for a cat that you don’t see very often, and only for very brief periods.)

Computer programs aren’t cats. If a program isn’t doing active work or maintaining some in-memory state that’s required in order to respond to requests, then it shouldn’t be running. If all your program does is poll a resource on a regular schedule, then you’ve written a catnap that is better implemented with the OS-provided task scheduler.
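For the sake of concreteness, here’s roughly what that looks like. The program path and task name below are made up for illustration:

Linux (crontab entry, runs the checker every five minutes):
    */5 * * * * /usr/local/bin/check-database

Windows (creates an equivalent scheduled task):
    schtasks /create /tn "CheckDatabase" /tr "C:\Tools\CheckDatabase.exe" /sc minute /mo 5

The polling program itself then shrinks to “check once and exit,” and the scheduler handles the every-five-minutes part.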

Posted in Software development | Tagged | Comments Off on Computer programs are not cats

Bit twiddling for the win!

(I don’t recall if I wrote about this before. I really need to resurrect my old blog entries.)

This was an interesting puzzle to work on. Some of the programmers I subsequently explained it to just got that glazed look and figured I was invoking black magic. Or higher math.

Imagine you have a list of structures, each of which consists of a string key, two integer values, and some other stuff. In essence:

struct Foo
{
    string Name;
    int key1;
    int key2;
    // other stuff that's not relevant to this discussion
}

For this discussion, let’s assume that int is a 32-bit signed integer.

You have a lot of these Foo structures, more than you can keep in memory, but for reasons we don’t need to get into you maintain an index that’s sorted by key1 descending, and key2 ascending. That is, in C#:

index = list.OrderByDescending(x => x.key1)
            .ThenBy(x => x.key2);

Except that (again, for reasons we don’t need to discuss) the key has to be a single value rather than two separate values.

Packing two int values into a long is no problem. But doing it so that the result sorts with key1 descending and key2 ascending? When I first heard this proposed I wasn’t sure that it was possible. But there was a glimmer of an idea . . .

What I’m about to discuss depends on something called “two’s complement arithmetic,” which defines how most computers these days work with positive and negative integers. I’m going to illustrate with 4-bit numbers but the concepts are the same for any size integer value, including the 32-bit and 64-bit numbers we work with regularly.

With 4 bits we can express 16 possible integer values. Unsigned, that’s values 0 through 15. Signed, using the two’s complement convention, the values we can represent are -8 through +7. The table below shows that in detail.

Binary                       Binary
Value   Unsigned  Signed     Value   Unsigned  Signed
0000    0         0          1000    8         -8
0001    1         +1         1001    9         -7
0010    2         +2         1010    10        -6
0011    3         +3         1011    11        -5
0100    4         +4         1100    12        -4
0101    5         +5         1101    13        -3
0110    6         +6         1110    14        -2
0111    7         +7         1111    15        -1

When viewed as signed numbers, the high bit (the bit in the leftmost position) is called the sign bit. When it’s set to 1, the number is considered negative.

Now, the problem. We’re given two signed integers, call them key1 and key2, each of which is four bits in length. We want to combine them into a single 8-bit value so that a natural sort (i.e. sorting the single 8-bit quantity) will result in the records being ordered with key1 descending and key2 ascending.

The first thought is to just pack key1 into the high four bits, and key2 into the low four bits, and treat it as an unsigned quantity. For example, imagine you have these records:

                         Binary      Unsigned
key1 = 4, key2 = 1     0100 0001       65
key1 = 4, key2 = 7     0100 0111       71
key1 = 4, key2 = -8    0100 1000       72
key1 = 4, key2 = -1    0100 1111       79
key1 = -5, key2 = 2    1011 0010       178

Sorting as unsigned quantities, the ordering is wonky. Both key1 and key2 go from 0 to 7, then from -8 to -1.

We need to do some bit twiddling. If we flip the sign bit on key2 and then sort, the table becomes:

                         Binary      Unsigned
key1 = 4, key2 = -8    0100 0000       64
key1 = 4, key2 = -1    0100 0111       71
key1 = 4, key2 = 1     0100 1001       73
key1 = 4, key2 = 7     0100 1111       79
key1 = -5, key2 = 2    1011 1010       186

key2 is now in ascending order! That is, for records that have key1 = 4, the key2 values are in ascending order.

So how do we make key1 sort in descending order? More bit twiddling!

Consider what happens if you flip all of the bits in key1:

Decimal  Binary  Inverted  Result   Decimal  Binary Inverted  Result
  -8      1000    0111        7        0      0000    1111      -1
  -7      1001    0110        6        1      0001    1110       -2
  -6      1010    0101        5        2      0010    1101       -3
  -5      1011    0100        4        3      0011    1100       -4
  -4      1100    0011        3        4      0100    1011       -5
  -3      1101    0010        2        5      0101    1010       -6
  -2      1110    0001        1        6      0110    1001       -7
  -1      1111    0000        0        7      0111    1000       -8 

If we invert all of the bits in key1, then a natural sort orders the keys in descending order.

It’s not magic at all. Given an understanding of two’s complement arithmetic and a little study, there’s nothing to it. I won’t claim to be the first person to come up with this, but I did develop it independently.

The explanation above is for two 4-bit quantities combined into a single 8-bit quantity, but it works just as well with two 32-bit quantities combined into a 64-bit quantity.

Putting it all together, we come up with this C# function. It should be easy enough to port to any other language that allows bit twiddling:

public static long getKey(int key1, int key2)
{
    long key = (uint)~key1;  // invert all the bits of key1
    key <<= 32;

    // flip the sign bit of key2
    uint flipped = (uint)key2 ^ 0x80000000;
    key |= flipped;
    return key;
}

And of course you can write the reverse function to retrieve the two original keys from the combined key.
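For completeness, here’s a sketch of that reverse function (the name is mine), undoing the two transformations in getKey:

public static void getKeys(long key, out int key1, out int key2)
{
    // The high 32 bits hold ~key1, so invert them again to recover key1.
    key1 = ~(int)(key >> 32);

    // The low 32 bits hold key2 with its sign bit flipped, so flip it back.
    key2 = (int)((uint)key ^ 0x80000000);
}

Feeding it the output of getKey returns the original key1 and key2.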

Posted in Software development | Tagged | Comments Off on Bit twiddling for the win!

It really isn’t that difficult

I’ve been contributing to Stack Overflow for 14 years: pretty much ever since it started. And every year there are Computer Science students who come up with novel ways of screwing up parsing and evaluating arithmetic expressions.

It’s a problem common to all Computer Science curricula: given a string that contains an arithmetic expression (something like 2*(3+1)/4), write a program that evaluates the expression and outputs the answer. There are multiple ways to solve the problem but the easiest as far as I’m concerned is called the Shunting yard algorithm, developed by Edsger Dijkstra. It’s an elegantly simple algorithm that’s very easy to understand and almost trivial to implement once you understand it.

Here’s the funny thing. Computer Science courses often teach postfix expression evaluation (i.e. evaluating “Reverse Polish Notation” like 2 3 1 + * 4 /) as an example of how to use the stack data structure. Then they teach the Shunting Yard and show that by employing two stacks you can turn an infix expression into a postfix expression. (For example, 2*(3+1)/4 becomes 2 3 1 + * 4 /). Then they give the students an assignment that says, “Write a program to evaluate an infix expression.”

The students dutifully go home and write a program that glues the two examples together. That is, the first part converts the infix to postfix, and the second part evaluates the postfix expression to get the answer. Which works. But it’s kind of silly, because a few simple changes to the output method for Shunting Yard let you evaluate the expression directly, without converting to postfix. In my experience, very few students figure this out.
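To illustrate (this is my own sketch, not anything from a textbook or from the interview described below): the operator stack works exactly as in the standard Shunting yard, but wherever the algorithm would emit a postfix token, you instead keep a second stack of operand values and apply the operator immediately. Something like this, handling only the four basic operators, parentheses, and unsigned integer operands:

using System;
using System.Collections.Generic;

static class InfixEvaluator
{
    static int Precedence(char op)
    {
        return (op == '+' || op == '-') ? 1 : 2;
    }

    // Pop two operands, apply the operator, push the result.
    // This replaces "output the operator" in the classic Shunting yard.
    static void Apply(Stack<double> operands, char op)
    {
        double right = operands.Pop();
        double left = operands.Pop();
        switch (op)
        {
            case '+': operands.Push(left + right); break;
            case '-': operands.Push(left - right); break;
            case '*': operands.Push(left * right); break;
            case '/': operands.Push(left / right); break;
            default: throw new ArgumentException("unknown operator " + op);
        }
    }

    public static double Evaluate(string expr)
    {
        var operands = new Stack<double>();
        var operators = new Stack<char>();
        int i = 0;
        while (i < expr.Length)
        {
            char c = expr[i];
            if (char.IsWhiteSpace(c))
            {
                i++;
            }
            else if (char.IsDigit(c))
            {
                // Collect a whole number and push it onto the operand stack.
                int start = i;
                while (i < expr.Length && char.IsDigit(expr[i])) i++;
                operands.Push(double.Parse(expr.Substring(start, i - start)));
            }
            else if (c == '(')
            {
                operators.Push(c);
                i++;
            }
            else if (c == ')')
            {
                // Apply everything back to the matching open parenthesis.
                while (operators.Peek() != '(') Apply(operands, operators.Pop());
                operators.Pop();    // discard the '('
                i++;
            }
            else    // one of + - * /
            {
                while (operators.Count > 0 && operators.Peek() != '('
                       && Precedence(operators.Peek()) >= Precedence(c))
                {
                    Apply(operands, operators.Pop());
                }
                operators.Push(c);
                i++;
            }
        }
        while (operators.Count > 0) Apply(operands, operators.Pop());
        return operands.Pop();
    }
}

InfixEvaluator.Evaluate("2*(3+1)/4") returns 2, and no postfix string is ever produced.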

I actually had a job interview where they asked me to write an expression evaluator. I wrote it to evaluate directly from infix, without going through the postfix step. Those whiteboard inquisitions are always weird. At one point, as I was furiously scribbling on the whiteboard, the interviewer asked me where my postfix output was. When I told him that I wouldn’t be producing any, he insisted that I must. I don’t argue with an interviewer (more on that in a separate entry) and usually take their advice, but here I told him to just sit tight; I knew this was going to work.

The post-coding walkthrough was fun. Usually it’s a formality in which I prove to the interviewer that the code I wrote works as intended. The interviewer already knows whether it works or not, because he’s seen the same algorithm written dozens of different times, knows all the different ways it can be screwed up, and knows what to expect from the walkthrough. But this time the interviewer was following along intently as I explained to him how it all worked. A senior software developer at a major company had never before seen the Shunting yard implemented without that unnecessary conversion to postfix. Weird.

I wonder why this oddity exists. Do Computer Science instructors not tell their students that there’s a way to do the evaluation directly? Or do the instructors themselves not know? And then there’s the other question: why would legions of C.S. students every year come to Stack Overflow looking for the answer to a question that’s easily answered with a simple Google search?

Well, okay, I have an answer to the last one. Laziness. It’s easier to post a question asking, “What’s wrong with my code?” than it is to understand how what the code actually does differs from what it’s supposed to be doing.

Posted in Software development | Tagged | Comments Off on It really isn’t that difficult

But that’s the way we’ve always done it!

Some years back I did a lot of research on and experimentation with the binary heap data structure. During that time I noticed that publicly available implementations placed the root of the heap at array[1], even in languages where arrays are 0-based. I found that incredibly annoying and wondered why they did that.

As far as I’ve been able to determine, the binary heap data structure originated with Robert W. Floyd in his article “TreeSort” (Algorithm 113) in the August 1962 edition of Communications of the ACM. As with all algorithms published in CACM at the time, it was expressed in Algol: a language with 1-based arrays. The calculation to find the left child of any parent is: childIndex = parentIndex * 2. The right child is (parentIndex * 2) + 1, and the parent of any node can be found by dividing the node index by 2.

When C gained popularity as a teaching language in the 1980s (for reasons that escape me, but I digress), authors of algorithms texts converted their code from Pascal (again, 1-based arrays) to C (0-based arrays). And instead of taking the time to modify the index calculations, whoever converted the code made a terrible decision: keep the 1-based indexing and allocate an extra element in the array.

It’s almost certain that the person who converted the code was smart enough to change the index calculations. I think the reason they didn’t was laziness: they’d have to change the code and likely also change the text to match the code. So they took the easy route. And made a mess.

Programmers in C-derived languages (C++, Java, C#, JavaScript, Python, etc.) learn early on to start counting at 0 because arrays start at 0. If you want an array that holds 100 items, you allocate it as int a[100];, and the indexes run from 0 through 99. That convention is ingrained in our brains. And then comes this one data structure, the binary heap, in which, if you want to hold 100 items, you have to allocate 101! The element at index 0 isn’t used.

That’s confusing, especially to beginning programmers who are the primary consumers of those algorithms texts.

Some argue that placing the root at array[1] makes the child and parent calculations faster. And that’s likely true. In a 0-based heap, the left child node index is at (parentIndex * 2) + 1, and the parent index of any child is (childIndex - 1)/2. There’s an extra addition when calculating the left child index, and an extra subtraction when calculating the parent. The funny thing is that in Algol, Pascal, and other languages with 1-based arrays, those “extra” operations existed but were hidden. The compiler inserted them when converting the 1-based indexing of the language to the 0-based indexing of the computer.
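In code, the 0-based arithmetic is as trivial as the 1-based version; here’s a quick sketch of the helpers:

// 0-based binary heap index calculations.
static int LeftChild(int parent)  { return 2 * parent + 1; }
static int RightChild(int parent) { return 2 * parent + 2; }
static int Parent(int child)      { return (child - 1) / 2; }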

But in the larger context of what a binary heap does, those two instructions make almost no difference in the running time. I no longer have my notes from when I made the measurements myself, but as I recall the results were noisy. The difference was so small as to be insignificant. It certainly wasn’t large enough, in my opinion, to justify replacing the code. If my program’s performance is gated on the performance of my binary heap, then I’ll replace the binary heap with a different type of priority queue. A pairing heap, perhaps, or a skip list. The few nanoseconds potentially saved by starting my heap at index 1 just aren’t worth making the change.

The C++ STL priority_queue, the Java PriorityQueue, the .NET PriorityQueue, and python’s heapq all use 0-based binary heaps. The people who wrote those packages understand performance considerations. If there was a significant performance gain to going with a 1-based binary heap, they would have done so. That they went with 0-based heaps should tell you that any performance gain from a 1-based heap is illusory.

I am appalled that well-known authors of introductory algorithms texts have universally made this error. They should be the first to insist on clear code that embraces the conventions of the languages in which they present their code. They should be ashamed of themselves for continuing to perpetuate this abomination.

Posted in Software development | Tagged | Comments Off on But that’s the way we’ve always done it!