Coffee scoop and hillbilly

A few new carvings recently.  Debra wanted a new coffee scoop, so I made her two of them.

The scoop is carved from basswood, and finished with something called “butcher block conditioner”–a mix of beeswax and food-grade mineral oil.  I figured there wasn’t much sense in using an exotic wood for the scoop, as it’s going to get blackened from the coffee.  The bowl holds about a tablespoon.  The size comparison object is one of the new Presidential dollar coins.

This little hillbilly (about 3½ inches tall) is carved from a maple scrap.  Very hard wood.  I think I’ll stick to softer woods for my caricatures in the future and save the harder woods for things like dolphins and other forms that have fewer fine details.

Potential ambiguity is a mess

The idea behind data format standards is to ensure that disparate tools can communicate using a common “language.”  When the standard includes potential ambiguity, things get messy.

A case in point is JSON (JavaScript Object Notation).  The standard for escaping characters in strings looks like this:

In English, that says that the only characters that must be escaped are quote (“), backslash¹ (\), and control characters.  An escaped character starts with backslash and the following character(s) must be one of those identified.  That’s all good except for one item:  the “solidus” (/).  That character can be escaped, but it does not have to be.  So in a string, “/” and “\/” mean the same thing.
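One way to see that the two spellings are equivalent is to parse both and compare the results.  This is only a minimal sketch: it assumes the System.Text.Json library that ships with current .NET (not something the text above uses), and it is written as C# top-level statements.

```csharp
using System;
using System.Text.Json;

// Two JSON documents, each holding a single string value.
// In a C# verbatim string, "" is one quote character and \ is literal.
string plain   = @"""a/b""";    // JSON text: "a/b"
string escaped = @"""a\/b""";   // JSON text: "a\/b"

var fromPlain   = JsonSerializer.Deserialize<string>(plain);
var fromEscaped = JsonSerializer.Deserialize<string>(escaped);

// A conforming parser treats the escaped and unescaped solidus alike.
Console.WriteLine(fromPlain == fromEscaped);   // prints True
```

The sketch only covers reading.  Whether a given writer emits “/” or “\/” is up to that writer, which is exactly the kind of choice the rule leaves open.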

Why is that a problem?  The exception to the rule is where the potential ambiguity creeps in.  Somebody implementing a JSON processor might easily read it to mean that the slash (I’m sorry, “solidus”) must be escaped.


¹“solidus”? Learn something new every day. It’s too bad that they didn’t come up with a unique name for “\”.  “Reverse solidus” is harder to say than “backslash.”

Walnut dolphin

Somebody gave me a piece of black walnut that had a dolphin outline drawn on it.  She didn’t want to work with the walnut, which is somewhat harder than most carvers are comfortable with.  I cut out the pattern at the beginning of December, but didn’t get around to carving on it until last week.

Although the carving is done, I still have a bit of work to do.  I need to make a final pass with fine sandpaper, apply the finish (probably an oil/wax mixture), and mount it on the permanent base.  That little disk of walnut is just temporary.   The glue joint is Elmer’s Glue (the white stuff), so it should come off pretty easily if I soak it in water for a few minutes.

I goofed a bit on the proportions here.  The poor dolphin looks like everything behind the dorsal fin has atrophied.  The pattern was fine–I just took too much wood off and didn’t maintain the curve of the back end.  Live and learn.  The next one will be better.

I have three dolphin cutouts cut from mahogany that should be beautiful when I get around to carving them.  The largest one is three inches thick and about eight inches long.  But I’m going to do a few more small practice pieces from this walnut, and maybe a slightly larger one from basswood or mesquite before I attack the mahogany. I want to make sure I have my proportions right so I don’t goof up the larger pieces.

HttpListenerContext and Url-encoded query parameters

The HttpListener class in .NET lets you create a lightweight HTTP server without having to go through all the rigmarole¹ of installing and managing IIS. It’s incredibly easy to get a simple HTTP server up and working with HttpListener.

But when it comes to handling query parameters, things break in a very strange way. The server I’m working on currently accepts requests to search for strings in video titles. Last night I got a bug report saying that it didn’t work when searching for Kanji (Japanese) characters. David was looking for the string “尺八”², and got no results back even though he knew that there were matching titles in the database.

When a browser sends a query string to the server, it encodes the string using the UTF-8 character encoding. So David’s search for “尺八” resulted in this request to my server: “/?q=%E5%B0%BA%E5%85%AB”. Which is correct.

Then things get weird.

The HttpListenerContext.Request object contains all the information about the request that came to the server. If I look at the relevant properties, I see the following:

Request.RawUrl = "/?q=%E5%B0%BA%E5%85%AB"
Request.QueryString["q"] = "尺八"
Request.Url = {http://localhost:8080/?q=尺八}

The problem here is that the Request.QueryString property is apparently interpreting the encoded query string parameter using something other than UTF-8. And, looking at the code for HttpListenerRequest.QueryString (part of the .NET runtime library) confirms that:

public NameValueCollection QueryString
{
    get
    {
        NameValueCollection nvc = new NameValueCollection();
        Helpers.FillFromString(nvc, this.Url.Query, true, this.ContentEncoding);
        return nvc;
    }
}

The problem is the this.ContentEncoding argument, which says, “use the Request object’s encoding to interpret this string.” That’s pretty strange. It’s hard to be sure, but I think that the current standard (RFC 3986) says that query strings should be UTF-8 encoded. If that’s true, then this is a bug in the HttpListenerRequest implementation.
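The effect of decoding with the wrong encoding can be reproduced outside of HttpListener.  The sketch below is an illustration only: it assumes .NET 5 or later (for Encoding.Latin1 and top-level statements), and Latin-1 is just a stand-in for whatever single-byte encoding the request’s ContentEncoding might resolve to, which the text above does not state.

```csharp
using System;
using System.Text;
using System.Web;

// The percent-encoded bytes from the request shown above.
string encoded = "%E5%B0%BA%E5%85%AB";

// Interpreting the bytes as UTF-8 recovers the two characters that were typed.
var asUtf8 = HttpUtility.UrlDecode(encoded, Encoding.UTF8);

// Assumption: a single-byte encoding (Latin-1 here) is used instead.
// Each of the six bytes then becomes its own unrelated character.
var asLatin1 = HttpUtility.UrlDecode(encoded, Encoding.Latin1);

Console.WriteLine(asUtf8 == "\u5C3A\u516B");   // prints True
Console.WriteLine(asUtf8 == asLatin1);         // prints False
```

A search for the Latin-1 version of the string would, of course, find nothing in a database of correctly encoded titles.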

Fortunately, there’s an easy workaround. The Request.Url property is properly formed, so I can use its Query property to construct my own queryString collection and ignore Request.QueryString:

var queryString = HttpUtility.ParseQueryString(context.Request.Url.Query);
string q = queryString["q"];

As far as I know, this is the only way to properly handle encoded query strings in HttpListener. If you know of some way to make Request.QueryString work as expected (or can tell me why the current behavior isn’t a bug), I’d sure like to hear about it.


¹I always pronounced that word “rig-a-ma-role.”  But the word is “rig-ma-role”. Learn something new every day.

²WordPress lets you add Unicode characters when adding a new post, but if you pull the post up to edit it afterwards, the Unicode characters get turned into question marks. Also, the “new post” editor accepts Unicode characters directly but any editing done after that requires you to input the characters using HTML Unicode escapes, like &#x5C3A;&#x516B;.

New dog

This dog is carved from a piece of wood that I picked up from the discard pile at the local Woodcraft store.  I thought it was red oak, but it might instead be padauk.  Or something else.  Whatever it is, it was a pleasure to carve.  A little on the hard side, but no match for a sharp knife.

Update 01/15: The wood is called Lyptus.  It’s a commercial (plantation-grown) hybrid of two types of Eucalyptus tree.

For the eyes, I carved a thin dowel from basswood, rounded the end, cut off a thin slice, and glued it onto the carving.

More fun with regular expressions

I’ve said before that regular expressions make my brain hurt.  I’ve also been rather outspoken on a number of occasions regarding the misuse of regular expressions.  All too often, programmers faced with any kind of parsing problem immediately reach for their regex hammer and then spend an inordinate amount of time trying to use it like a pair of pliers.

That said, regular expressions do have their uses, especially when throwing together a quick and dirty prototype.  In my case, I have a list of about 40 million titles I want to search for occurrences of user-input text strings.  I’m building a prototype, so I’m not terribly worried at the moment with things like stemming or fuzzy matches.  I just want to know how many of the titles in my collection contain the terms “the beatles” or “stairway to heaven”, for example.

I could use naive string searching.  That is, just use the built in String.IndexOf method to do a case-insensitive search of each title for each text string.  So, for instance, I’d have:

if ((title.IndexOf("the beatles", StringComparison.InvariantCultureIgnoreCase) != -1) ||
    (title.IndexOf("stairway to heaven", StringComparison.InvariantCultureIgnoreCase) != -1))
{
    // found a match
}

That mostly works, and would probably be “good enough” for most cases.  But if I were searching for “the who”, it would match “the whosits”, as well.  In addition, if I’m searching for multiple terms, I have to do each search individually.
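The false positive is easy to reproduce.  In this small sketch the title is made up for illustration, and the code is written as C# top-level statements:

```csharp
using System;

// Hypothetical title, invented for this example.
string title = "Meet the Whosits";

// Plain substring search: "the who" is found inside "the Whosits".
bool found =
    title.IndexOf("the who", StringComparison.InvariantCultureIgnoreCase) != -1;

Console.WriteLine(found);   // prints True, although the title is not about The Who
```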

Regular expressions let me solve both of those problems.  I can search for multiple terms with a single call by creating this regular expression:

    (the beatles)|(stairway to heaven)

which says, “match ‘the beatles’ or ‘stairway to heaven’”.

That doesn’t solve the “whosits” problem, though.  So I need to specify that I want to match only on word boundaries.

In the language of regular expressions, the character escape “\W” says, “match a non-word character.”  For simplicity, let’s just say that a “word character” is an alphanumeric character:  A through Z, a through z, and 0 through 9.  Everything else–white space, punctuation, and special characters–is a “non-word character”.  So my regular expression becomes:

    \W((the beatles)|(stairway to heaven))\W

which you can read as, “match a non-word character, followed by ‘the beatles’ or ‘stairway to heaven’, followed by a non-word character.”

All well and good, right?  Not quite.  It won’t match if the string is found at the beginning or end of the title.  So you’d get a match if the title was, “Nowhere Man by The Beatles (1965)”, but it won’t match “The Beatles – Nowhere Man”.
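The two titles above can be checked directly.  This is a minimal sketch: RegexOptions.IgnoreCase is an assumption (the text doesn’t say how case is handled, but the pattern is lowercase and the titles are not).

```csharp
using System;
using System.Text.RegularExpressions;

var rx = new Regex(@"\W((the beatles)|(stairway to heaven))\W",
                   RegexOptions.IgnoreCase);

// A space sits on both sides of the term, so both \W's have something to match.
bool middle = rx.IsMatch("Nowhere Man by The Beatles (1965)");

// Nothing comes before "The", so the leading \W has no character to match.
bool start = rx.IsMatch("The Beatles – Nowhere Man");

Console.WriteLine(middle);   // prints True
Console.WriteLine(start);    // prints False
```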

That’s easily fixed.  We can say, “match at the beginning of the string OR a non-word character,” and something similar for the end:

    (^|\W)((the beatles)|(stairway to heaven))(\W|$)

The character “^” is regex-ese for “match at the beginning of the string,” and “$” is for matching at the end of the string.

That regular expression works.  Slowly.  It takes about 90 seconds to search 10 million titles.

It turns out that there’s another way to search only on word boundaries.  The metasequence “\b” says, “match a word boundary.”  Note that this is a metasequence rather than a character escape.  In a sense, “\b” matches the transitions between words and non-words.  My regular expression above can be rewritten as:

    \b((the beatles)|(stairway to heaven))\b

That’s right, the beginning of the string is one of those word/non-word transitions, as is the end of the string.
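To compare the two patterns side by side, they can be run over a handful of titles.  The titles below are invented for illustration (apart from the two quoted earlier), RegexOptions.IgnoreCase is an assumption, and four titles say nothing about speed; the sketch only shows that the two patterns agree on these cases.

```csharp
using System;
using System.Text.RegularExpressions;

const string terms = "((the beatles)|(stairway to heaven))";

// Pattern with explicit start/end alternation, and the \b rewrite.
var anchored = new Regex(@"(^|\W)" + terms + @"(\W|$)", RegexOptions.IgnoreCase);
var boundary = new Regex(@"\b" + terms + @"\b", RegexOptions.IgnoreCase);

string[] titles =
{
    "The Beatles – Nowhere Man",           // term at the start
    "Nowhere Man by The Beatles (1965)",   // term in the middle
    "Led Zeppelin: Stairway to Heaven",    // term at the end
    "The Beatlesque Sound",                // term runs into a longer word
};

foreach (string t in titles)
{
    // The two columns agree on every line: True for the first
    // three titles, False for the last.
    Console.WriteLine($"{anchored.IsMatch(t)}\t{boundary.IsMatch(t)}\t{t}");
}
```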

In .NET, the new regular expression produces exactly the same result as the old one.  The difference is in how long it takes:  15 seconds rather than 90 seconds.  That’s six times as fast!

Moral of the story: avoid alternation in regular expressions whenever possible.  Here, the only alternation left is the one between the search terms themselves.

By the way, “\b” doesn’t work the same in all regex flavors.  In PHP, Ruby, and Java, “\b” only works right when applied to ASCII characters.  Multi-byte (i.e. Unicode) encodings cause “\b” to break.

Regular expressions still make my brain hurt.
