Odds ‘n Ends

A few notes after a day of knocking things off the “to do” list.

  • I’ve used QUIKRETE before, but never for setting a post. Just pour the dry concrete mix into the hole (after placing the post), and add one gallon of water for every 50 lbs of mix. The stuff sets in about 45 minutes, and you can apply stress to the post after only four hours. No mixing required. Ain’t technology wonderful?
  • Seeing as how I had only one hole to dig, I did it the old-fashioned way: with a post hole digger and a Texas toothpick. Note to self: wear gloves next time.
  • From the hammer’s point of view, a thumb looks just like a fence staple.
  • It’s always a good idea to remove the old part and take it to the auto parts store when you go shopping for its replacement. It’ll save you from having to make another trip when you realize that the part you got isn’t the part you need.
  • I shouldn’t have to remove a dozen screws with three different tools in order to replace a relay.
  • A PVC union is an ingenious device. But remember to put thread compound on the threads of the device itself, in addition to the threads of the two pipes you’re attaching it to.

Major search engines support robots.txt standard

Google, Yahoo, and Microsoft’s Live Search recently announced standard support for the major robots.txt directives. This means that you can use the same syntax for robots.txt to control the activities of those three major search engine crawlers. The common directives are: Disallow, Allow, and Sitemaps. In addition, all three support the use of wildcards (* and $) in specifying paths for Allow and Disallow. It’s interesting to note that Yahoo says they support “$ Wildcards,” whereas Google and Microsoft say that they support “* Wildcards” as well as “$ Wildcards.” From reading Yahoo’s documentation, though, I’d say that they also support “* Wildcards.”
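
As an illustration (my own, not lifted from any of the three announcements), a robots.txt file that uses the common directives and both wildcard forms might look something like this. The paths and site name are made up:

User-agent: *
Disallow: /drafts/               # keep crawlers out of this directory
Disallow: /*.pdf$                # "$" anchors the pattern to the end of the URL
Allow: /drafts/public-*.html     # exception carved out with the "*" wildcard
Sitemap: http://www.example.com/sitemap.xml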

All three also support several HTML META tags, such as NOINDEX and NOFOLLOW, that give content authors much tighter control over crawlers than can be accomplished with robots.txt. 
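
For example, a page that should stay out of the index entirely, with none of its links followed, carries a tag like this in its head section:

<meta name="robots" content="noindex, nofollow">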

This isn’t exactly a new step. The three major search engines have been collaborating for the last few years, trying to make Webmasters’ jobs easier with respect to the major search engines. For example, back in February they announced common support for cross-submission of Sitemaps.

Unfortunately, all three also support their own individual extensions to the Robots Exclusion Protocol. For example, Yahoo and Microsoft support the Crawl-Delay directive, which Google does not. And Google and Yahoo each support some unique META tags that the others don’t.

Even with the incompatibilities, this is a big step in the right direction. With unified support of the major robots.txt directives among the three major search engine crawlers, we can expect to see more support by smaller crawlers. I know that many authors of smaller-scale crawlers look to the majors to see what they should support. Having all three support the same directives in the same way makes other developers’ jobs (including mine!) easier.

But ultimately it’s the Webmasters who benefit the most: they now have a standard way to control crawlers’ access to their sites.

One more time, the Internet is public

[Note:  As Michael Covington pointed out, there’s plenty of privacy on the Internet–just not on the World Wide Web.]

I know I’ve mentioned this before, but I keep running across people who don’t understand that there is no privacy on the Internet. If you’ve uploaded something to your Web site, it’s highly likely that Google, MSN, Yahoo, or any (or all) of the many other search engines out there has found it.  Even our Web crawler–a small-scale operation–finds things in hidden nooks and crannies of the Web that most people with browsers would never stumble upon.

For example, the other day a coworker was spot-checking some of the crawler’s latest finds and stumbled upon a site where the owner had uploaded what looks like (from examining the file names) a bunch of very private stuff. This all in an unprotected directory.  A person with a browser could go to that URL, get a listing of all files, and then browse to his heart’s content. Although it’s unlikely that a person browsing would stumble upon the directory, a crawler almost certainly will. Eventually.

When we run across something like that, we don’t actually browse, but rather find out how to contact the site owner and send him a very nice email suggesting that he either protect the directory or delete the files.

The day after discovering the site I mentioned above, we ran across the story of Alex Kozinski, a judge in the 9th Circuit whose personal porn stash was found publicly accessible online:

Kozinski, 57, said that he thought the site was for his private storage and that he was not aware the images could be seen by the public, although he also said he had shared some material on the site with friends. After the interview Tuesday evening, he blocked public access to the site.

Of particular interest in this case is that the judge was presiding over an obscenity trial (now postponed) that involves material that’s apparently similar to some of the material on the judge’s site. The judge also had some copyrighted music on the site, opening up the possibility of copyright violation.

No matter how far out in the country you live, if you stand naked in front of an uncovered window, somebody will eventually see you. Similarly, if you upload something to your Web site and don’t take active measures to prevent access, it will be found. Do not assume that it can’t be found because you never told anybody about it. That’s like putting a key under the doormat and figuring it’s safe because only you know it’s there.

Can’t configure Windows DNS resolver cache

In experimenting with the program I described yesterday, I got to fiddling with the DNS resolver cache, called dnscache. Briefly, dnscache saves the results from recent DNS queries so that it doesn’t have to keep querying the DNS server. Considering that a DNS query can take 100 milliseconds or more to resolve, this can save considerable time. For example, for your browser to load this Web page, it has to make many different requests to my server: one for the base page, one for the stylesheet, one for each image, etc. It wouldn’t be uncommon to require a dozen separate requests to get all the resources that make up the page. If each resource required a separate DNS request, it would take more than a second just for DNS!

I got to wondering just how large the DNS cache is. A little bit of searching brings up any number of pages claiming that you can “speed up your connection” by tweaking the DNS resolver cache parameters. Specifically, they talk about changing registry keys for the cache hash table size, maximum time to live, etc. There’s even a Microsoft TechNet article describing these parameters for Windows Server 2003 (and, by extension, Windows XP). It’s interesting to note that the information on most of the pages claiming to speed things up conflicts rather badly with the information in the TechNet article.

After reading the tweaks and the TechNet article, I figured I’d give it a shot. I fired up the Registry Editor, made the changes, and … is it working? How can I tell? I tried browsing a few Web sites, but I couldn’t see any difference.

A little more searching and I found the command ipconfig /displaydns. This writes the contents of the DNS resolver cache to the console. A little work with the FIND utility, and I was able to count the number of entries in the cache. 34 on my Windows XP box. Interesting, considering that I set the CacheHashTableSize registry entry to over 7,000. I fiddled and tweaked, restarted the DNS Client service, flushed the cache, rebooted my computer, faced Redmond and cursed, and generally tried everything I could think of. No matter what settings I used, I always ended up with between 30 and 40 entries in my DNS cache.
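
For the record, the fiddling cycle looks something like this from a command prompt. The value 7001 is just an arbitrary number for illustration, and the count at the end is rough, since a single cached name can hold more than one record:

rem Set the cache hash table size under the DNS Client service's parameters key
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters" /v CacheHashTableSize /t REG_DWORD /d 7001 /f

rem Restart the DNS Client service and flush the cache
net stop dnscache
net start dnscache
ipconfig /flushdns

rem ...browse for a while, then get a rough count of cached entries
ipconfig /displaydns | find /c "Record Name"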

On my Windows Server 2008 machine at the office, I always got between 270 and 300 entries, no matter what I tried.

So that leaves me with the following possibilities:

  1. It’s not possible to change the size of the DNS resolver cache in Windows XP or Windows Server 2008.
  2. It is possible, but the documentation is wrong.
  3. The documentation is correct as far as it goes, but it’s incomplete.
  4. The documentation is correct and complete, but I’m too dumb to make sense of it.
  5. The documented registry entries actually changed the size of the cache, but ipconfig isn’t showing me all the entries that are in the cache.

At this point, all possibilities seem almost equally likely. I could do some indirect testing based on the amount of time it takes to resolve a series of DNS requests, but even that would be inconclusive. There are no documented API calls that allow me to examine the DNS cache or its size. (And the undocumented ones aren’t described well enough to be worth checking out.) My only means of seeing what’s in the cache is the ipconfig tool.

So I ask: does anybody know how to change the size of the Windows DNS resolver cache and prove that those changes actually work? Do I have to restart the DNS Client service? Reboot the machine? Set some super magic registry entry?

Any information greatly appreciated.

Is this really asynchronous?

I’ve been working on a relatively simple program whose purpose is to see just how fast I can issue Web requests. The idea is to get one machine hooked directly to an Internet connection and see how many concurrent connections it can maintain and how much bandwidth it can consume. A straight bandwidth test is easy: just start three or four Linux distribution downloads from different sites. That’ll usually max out a cable modem connection.

But determining the sustained concurrent connection rate is a bit more difficult. It requires that you issue a lot of requests, very quickly, for an extended period of time. By slowly increasing the number of concurrent connections and monitoring the bandwidth used, I should be able to find an optimum range of request rates: one that makes maximum use of bandwidth, but doesn’t cause requests to time out.

My Web crawler does something similar, but it also does a whole lot of other things that make it impractical for use as a diagnostic tool.

I got the program up and limping today, and was somewhat surprised to find that it couldn’t maintain more than 15 concurrent connections for any length of time. Considering that my crawler can maintain 200 or more connections without a problem, I found that quite curious. It had to be something about the different way I was issuing requests.

Because this is a simple tool, I figured I’d use the .NET Framework’s WebClient component to issue the requests. In order to avoid the overhead of constructing a new WebClient for every request, I initialized 100 WebClient instances to be served from a queue, and then issued the requests in a loop, kind of like this:

while (!shutdown)
{
    if (currentConnections < MaxConnections)
    {
        // Pull a pre-constructed WebClient from the pool and fire off the next request.
        WebClient cli = GetClientFromQueue();
        ++currentConnections;
        cli.DownloadStringAsync(GetNextUrlFromQueue());
    }
}

The actual code is a bit more involved, of course, but that’s the gist of it. The currentConnections counter gets decremented in the download completed event handler.
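
Each WebClient in the queue gets a completed handler wired up along these lines. ReturnClientToQueue is a placeholder name of my own, and a real handler would also want error checking and a synchronized decrement:

// Sketch of the completion wiring (ReturnClientToQueue is a placeholder).
cli.DownloadStringCompleted += delegate(object sender, DownloadStringCompletedEventArgs e)
{
    --currentConnections;                    // free a slot for the issuing loop
    ReturnClientToQueue((WebClient)sender);  // recycle the WebClient instance
};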

The important thing to note here is that I’m issuing asynchronous requests. The call to DownloadStringAsync is supposed to return almost immediately, with the actual download executing on a thread pool thread. This code should issue requests at a blindingly fast rate, and keep the number of concurrent connections right near the maximum. Even with MaxConnections set to 50, the best I could do was 20 concurrent, and that for only a very short time. Most often I had somewhere between 10 and 15 concurrent connections.

After eliminating everything else, I finally got around to timing just how long it takes to issue that asynchronous request. The result was pretty surprising: in my brief tests, it took anywhere from 0 to 300 milliseconds to issue those requests. The average seemed to be around 100 or 150 ms. That would explain why I could only keep 10 or 15 connections open. If it takes 100 ms to issue a request, then I can only make 10 requests per second. Since it takes about 2 seconds (on average) to complete a request, the absolute best I’ll be able to do is 20 concurrent requests.
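
The measurement itself was nothing exotic. Stripped down, it was just a Stopwatch wrapped around the call that issues the request, something like this (using the same placeholder names as the loop above):

// Rough sketch of the timing: wrap just the call that issues the request.
Stopwatch sw = Stopwatch.StartNew();
cli.DownloadStringAsync(GetNextUrlFromQueue());
sw.Stop();
Console.WriteLine("DownloadStringAsync returned after {0} ms", sw.ElapsedMilliseconds);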

So I got to thinking, why would it take 100 milliseconds or more to issue an asynchronous Web request? And the only reasonable answer I could come up with was DNS: resolving the domain name. And it turns out I was right. I flushed the DNS cache and ran my test by requesting a small number of URLs from different domains. Sure enough, it averaged about 150 ms per request. I then ran the program again and it took almost no time at all to issue the requests. Why? Because the DNS cache already had those domain names resolved. Just to make sure, I flushed the DNS cache again and re-ran the test, and the delays came right back.

By the way, the HttpWebRequest.BeginGetResponse method (the low-level counterpart to WebClient.DownloadStringAsync) exhibits the same behavior. That’s not surprising, considering that WebClient calls HttpWebRequest to do its thing.

This is a fatal flaw in the design of the .NET Framework’s support for asynchronous Web requests. The whole idea of supplying asynchronous methods for I/O requests is to push the waiting off on to background threads so that the main thread can continue processing. What’s the use of providing an asynchronous method if you have to wait for a high latency task like DNS resolution to complete before the asynchronous request is issued? Why can’t the DNS resolution be done on the thread pool thread, just like the actual Web request is?

There is a way around the problem: queue a background thread to issue the asynchronous request. Yes, I know it sounds crazy, but it works. And it’s incredibly easy to do with anonymous delegates:

// Issue the request from a thread pool thread so the main loop
// doesn't have to wait out the DNS lookup.
ThreadPool.QueueUserWorkItem(delegate(object state)
    {
        cli.DownloadStringAsync((Uri)state);
    }, GetNextUrlFromQueue());

That hands the work off to a thread pool thread, which then issues the asynchronous Web request. The time waiting for the DNS lookup is spent on the background thread rather than on the main processing thread. It looks pretty goofy, and unless it’s commented well somebody six months from now will wonder what I was smoking when I wrote it.

The perfect ground cover?

A few years back, Debra and I started adding large mulch areas around the trees in the yard. This was an effort to make things look a little better, as well as to reduce lawn maintenance. More mulch means less grass to mow. And mulch around the trees means that I don’t have to run the weed eater to knock down the grass that normally would grow around the trunks. The problem is that weeds and grass grow in the mulch, and if you don’t keep up with pulling them and adding a new layer of mulch every year or two, the grass will take over again.

Another option is to plant a good thick ground cover that will prevent grass and weeds from growing. One of the best such plants for this area is Asian jasmine. Getting it established might be a challenge, but once established it’s very drought tolerant and requires little maintenance. Just trim it with the weed eater, or run the mower over it on the highest setting once or twice a year. The only thing that concerns me is the stated requirement of “moist, well-drained, well-prepared soil” for establishment. Such soil is in short supply in our yard.

As far as I’m concerned, the perfect ground cover would be grass that never grows higher than an inch or two. Why can’t some of these genetic engineering whizzes get to work on such a thing? Forget Frankenfoods. I imagine just about any homeowner would kill for a lush green lawn that he never had to mow.

Internet Explorer clipboard protection is broken

This morning I copied a URL from the browser to the clipboard and then tried to paste it into the email message I was writing in another browser window. Internet Explorer popped up a confirmation box asking whether I wanted to allow the webpage to access my Clipboard.

I wouldn’t mind so much if it showed this box one time. But it shows the box for every new email I try to paste stuff to.

There are two things that annoy me about this confirmation box. The first is that the default button is “Don’t allow”. Obviously, somebody has a much higher opinion of the threat posed by indiscriminate clipboard pasting than I do. I just don’t agree that IE should be holding my hand here and trying to dissuade me from pasting data into an email. The default should be “Allow access”. For dang sure, I should be able to change the default. Better yet, I’d like to just turn the silly notification off. Does Windows have a, “Yes, I know what I’m doing” mode?

Worse, this confirmation box is broken for keyboard users. I’m pretty keyboard-centric, especially when I’m writing. I don’t need to remove my fingers from the keyboard in order to copy a URL from one browser window (or tab) to another. Alt+Tab, Ctrl+D, Ctrl+C, Alt+Tab, Ctrl+V. Done. When this confirmation box pops up, it changes “Done” into:

  1. “What the heck?”
  2. Press Enter before fully realizing that I just prevented myself from pasting into the email.
  3. Copy the draft email to the clipboard.
  4. Open Notepad.
  5. Paste the draft into Notepad.
  6. Close the draft email.
  7. Open a new email message or reply.
  8. Paste the draft back into the new email.
  9. Go find the URL I wanted to paste, and copy it to the clipboard.
  10. Attempt to paste the URL into the email.
  11. Read the confirmation box and press the left arrow key to highlight the “Allow access” button.
  12. Nothing happens.
  13. Press the right arrow.
  14. Press Enter.

Whoever coded up this particular confirmation box got his arrow keys backwards.

I guess I am more secure with this new setup. It’s so painful that I’ll stop trying to paste things into my emails.

I understand that security is an issue, and to some extent IE has to protect users from themselves. But this is broken. Horribly. At minimum, the confirmation should have a link or checkbox that lets me turn the message off for pages that I identify. Like the “new email” page that I use dozens of times a day.

Webbots, Spiders, and Screen Scrapers

Considering what I’m doing for work, you can imagine that when I ran across Michael Schrenk’s Webbots, Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection of data from the Web.

Most of the book focuses on screen scrapers that download data from previously identified Web sites, parse the pages, and then store and present the data. There’s a little information on “spidering”–automatically following links from one page to another–but that’s not the primary purpose of the book. Which is probably a good thing. A Web-scale spider (or crawler) is fundamentally different than a screen scraper or a special-purpose spider that’s written to gather information from a small set of domains or very narrowly-defined pages.

The first six chapters explain why Web bots are useful, and walk you through the basics: downloading Web pages, parsing the contents, automating login and form submission, and many other tasks that are involved in automated data collection. With plenty of PHP code examples, these chapters provide a good foundation for the next 12 chapters: Projects. In this section, we see examples of real Web bots that monitor prices, capture images, verify links, aggregate data, read email, and more. Again, with many code examples.

The first two sections cover about three-fifths of the book. If you read and follow along by trying the code examples, you’ll have a very good understanding of how to build many different types of Web bots.

The remainder of the book is divided into two sections. Part 3, Advanced Technical Considerations, briefly explains spiders, and then discusses some of the technical issues such as authentication and cookie management, cryptography, and scheduling your bots. This section has some code examples, but they aren’t the primary focus.

The fourth section, Larger Considerations, focuses on things like how to keep your bots out of trouble, legal issues, designing Web sites that are friendly to bots, and how to prevent bots from scraping your site. Again, these chapters have a few code samples, but the emphasis is on the larger issues–things to think about when you’re writing and running your bots.

Overall, I like the book. The writing is conversational, and the author obviously has a lot of experience building useful bots. The many code samples do a good job illustrating the concepts, and the projects cover the major types of bots most people would be interested in writing. Reading about the projects and some of the other ideas he presents opens up all kinds of possibilities.

The book succeeds very well in its stated mission: explaining how to build simple Web bots and operate them in accordance with community standards. It’s not everything you need to know, but it’s the best introduction I’ve seen. The focus is on simple, single-threaded bots. There’s some small mention of using multiple bots that store data in a central repository, but there’s no discussion of the issues involved in writing multithreaded or distributed bots that can process hundreds of pages per second.

I recommend that you read this book if you’re at all interested in writing Web bots, even if you’re not familiar with or intending to use PHP. But be sure not to expect more than the book offers.

.NET regular expressions are slow?

One of the benefits–or curses, depending on my mood and how urgently I need a solution–of programming computers is that I often start working on one thing and end up getting sidetracked by a piece of the problem.

Take today’s distraction, for example. I’m writing a program to experiment with some text classification using the downloaded Wikipedia database. A major part of the program is extracting terms from the individual articles. Since I don’t need anything too fancy (at least not quite yet), I figured that this would be a perfect application for regular expressions. So I coded up my term extractor and let it loose on the Wikipedia data, figuring it’d take an hour or two to process the 13 gigabytes of text.

It’s a good thing I’ve gotten into the habit of instrumenting my code and periodically outputting timing information. It was taking an average of 10 milliseconds to process each Wikipedia article. The file has about 5.8 million articles. A quick back of the envelope calculation says that’ll take 58,000 seconds, or 16 hours. I’ve been wrong before, but not often by an order of magnitude.

After removing some unnecessary code and postponing some processing that can be done against the aggregate, I cut the required time per page to about 4 milliseconds. Better, but still too much. Through the process of elimination, I finally narrowed it down to the loop that extracts terms from the document text using a regular expression. Stripping everything but the critical loop, it looks like this:

// Matches a run of one or more Unicode letter characters.
static Regex reTerm = new Regex("\\p{L}+", RegexOptions.Compiled);
static Stopwatch reElapsed = new Stopwatch();

static void DoReParse(string pageText)
{
    // Time extraction with the regular expression
    reElapsed.Start();

    Match m = reTerm.Match(pageText);
    while (m.Success)
    {
        // the complete code adds m.Value to the term list here
        m = m.NextMatch();
    }

    reElapsed.Stop();
}

The regular expression, \p{L}+, says, “match a string of one or more contiguous Unicode Letter characters.” reElapsed is a Stopwatch object that accumulates the time taken between the Start and Stop calls. When my program is done, I divide the accumulated time by the number of documents processed to get the average time per page. For the first 100,000 documents in the Wikipedia download, it averages about 1.5 milliseconds per page, which is pretty close to what I thought all of the processing would take–parsing included.

That seemed unreasonable, so I wrote my own parser that reads each character of the document text and pulls out the substrings of contiguous Unicode letter characters. It’s more code than the regular expression version, but it’s still pretty simple:

static void DoJmParse(string pageText)
{
    // Time extraction with direct parsing
    jmElapsed.Start();

    int i = 0;
    int len = pageText.Length;
    bool inWord = false;
    int start = 0;
    for (i = 0; i < len; i++)
    {
        if (!inWord)
        {
            if (char.IsLetter(pageText[i]))
            {
                // Start of a run of letter characters
                start = i;
                inWord = true;
            }
        }
        else if (!char.IsLetter(pageText[i]))
        {
            // End of the run: extract the term
            string term = pageText.Substring(start, i - start);
            inWord = false;
        }
    }
    if (inWord)
    {
        // Handle a term that runs to the end of the text
        string term = pageText.Substring(start);
    }

    jmElapsed.Stop();
}

Both of the methods shown above are stripped-down versions of the code. The complete code adds the extracted terms to a list so that I can compare what’s extracted by each method. In all cases (the first 100,000 pages in the Wikipedia download), both methods extract the same terms. But the hand-coded parser takes an average of 0.25 milliseconds per page. The regular expression parser takes six times as long.

I could understand my hand-coded routine being somewhat faster because it avoids the overhead of constructing Match objects and some of the other regular expression overhead. But six times? Something smells wrong. I have to think that I’m missing something.

I tried the usual things: fiddling with the regular expression options, calling Matches to get all of the matches in a single call, etc. All to no avail. Everything I tried had either no effect or increased the running time.
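
The Matches variant, for example, was just the obvious rewrite of the loop above, stripped down the same way:

// One of the variations I tried: pull all matches in a single call. It didn't help.
foreach (Match m in reTerm.Matches(pageText))
{
    // the complete code adds m.Value to the term list here
}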

Then I thought that the difference had to do with Unicode combining characters or surrogate pairs. That is, characters that take up two code units rather than just one. My parser treats each code unit as a character, whereas the regular expression parser might be taking those multi-code unit characters into account. But a simple test doesn’t bear that out.

Consider this character string defined in C#:

const string testo = "\u0061\u0300xyz";

The first two characters, “\u0061\u0300”, define the character à, a lower-case “a” with a grave accent. From my understanding of Unicode, this should be treated as a single character. But when I run that string through the regular expression term extractor, I get two strings: “a” and “xyz”. If the regular expression engine supported combining characters, I would get one string: “àxyz”. The documentation is suspiciously silent on the matter, but a close reading of Jeffrey Friedl’s Mastering Regular Expressions indicates that I shouldn’t expect the .NET regular expression engine to give me just the one string.
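
Here’s that check reduced to a standalone snippet, for anybody who wants to try it:

using System;
using System.Text.RegularExpressions;

class CombiningCharTest
{
    static void Main()
    {
        const string testo = "\u0061\u0300xyz";  // 'a' + combining grave accent + "xyz"

        // \p{L}+ stops at the combining accent (Unicode category Mn, not a Letter),
        // so this prints "a" and then "xyz".
        foreach (Match m in Regex.Matches(testo, @"\p{L}+"))
            Console.WriteLine(m.Value);
    }
}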

So I’m at a loss. I have no idea why the regular expression version of the term extractor is so much slower than my hand-coded version. For now, I’m going to abandon the regular expression, but I’d sure like to hear from anybody who can shed some light on this for me.