Jim’s Random Notes

June 5th, 2008

Webbots, Spiders, and Screen Scrapers

Considering what I’m doing for work, you can imagine that when I ran across Michael Schrenk’s Webbots, Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection of data from the Web.

Most of the book focuses on screen scrapers that download data from previously identified Web sites, parse the pages, and then store and present the data. There’s a little information on “spidering”–automatically following links from one page to another–but that’s not the primary purpose of the book. Which is probably a good thing. A Web-scale spider (or crawler) is fundamentally different from a screen scraper or a special-purpose spider that’s written to gather information from a small set of domains or very narrowly defined pages.

The first six chapters explain why Web bots are useful, and walk you through the basics: downloading Web pages, parsing the contents, automating log in and form submission, and many other tasks that are involved in automated data collection. With plenty of PHP code examples, these chapters provide a good foundation for the next 12 chapters: Projects. In this section, we see examples of real Web bots that monitor prices, capture images, verify links, aggregate data, read email, and more. Again, with many code examples.
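To give a feel for the "parse the contents" step, here is a minimal sketch of pulling the links out of a page. It is written in Python rather than the book's PHP and is not taken from the book; the class name, function name, sample page, and example.com URLs are all made up for illustration. A real bot would download the page first, and a spider would then repeat the process on each link it found.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real bot would download this first.
SAMPLE_PAGE = """
<html><body>
  <a href="/prices.html">Prices</a>
  <a href="http://example.org/about">About</a>
  <p>No link here.</p>
</body></html>
"""

if __name__ == "__main__":
    for url in extract_links(SAMPLE_PAGE, "http://example.com/index.html"):
        print(url)
```

A screen scraper would stop here and store what it found; a spider would feed each returned URL back into the same routine, which is where the politeness and scale issues come in.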

The first two sections cover about three-fifths of the book. If you read and follow along by trying the code examples, you’ll have a very good understanding of how to build many different types of Web bots.

The remainder of the book is divided into two sections. Part 3, Advanced Technical Considerations, briefly explains spiders, and then discusses some of the technical issues such as authentication and cookie management, cryptography, and scheduling your bots. This section has some code examples, but they aren’t the primary focus.

The fourth section, Larger Considerations, focuses on things like how to keep your bots out of trouble, legal issues, designing Web sites that are friendly to bots, and how to prevent bots from scraping your site. Again, these chapters have a few code samples, but the emphasis is on the larger issues–things to think about when you’re writing and running your bots.

Overall, I like the book. The writing is conversational, and the author obviously has a lot of experience building useful bots. The many code samples do a good job illustrating the concepts, and the projects cover the major types of bots most people would be interested in writing. Reading about the projects and some of the other ideas he presents opens up all kinds of possibilities.

The book succeeds very well in its stated mission: explaining how to build simple Web bots and operate them in accordance with community standards. It’s not everything you need to know, but it’s the best introduction I’ve seen. The focus is on simple, single-threaded bots. There’s some small mention of using multiple bots that store data in a central repository, but there’s no discussion of the issues involved in writing multi-threaded or distributed bots that can process hundreds of pages per second.

I recommend that you read this book if you’re at all interested in writing Web bots, even if you’re not familiar with or intending to use PHP. But be sure not to expect more than the book offers.

May 26th, 2008

Clearing the book list

I’ve been meaning to review or at least mention the books I’ve been reading lately. I realized after I posted my negative review of Infinite Ascent that there are plenty of good books that I haven’t mentioned. So, here are capsule reviews of five books I’ve read recently–all picked up at either the remainder table at Half Price Books, or the bargain table at the big box retailer in the local mega shopping center.

Mario Livio’s The Golden Ratio: The Story of Phi, the World’s Most Astonishing Number is an engaging story. The book begins with a brief history of early arithmetic before diving into the discovery of and usefulness of what has become known as The Golden Ratio. From its first use in constructing pentagrams and the Platonic solids, to its uncanny appearance in nature, Livio shows the significance of the number 1.6180339…–the positive number that satisfies the equation x² – x = 1. Perhaps just as importantly, he debunks many myths about the Golden Ratio and its supposed mystical properties. Altogether a delightful read, and one that I recommend highly.
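As a quick check on that number (a standard derivation, not quoted from the book), moving the 1 to the left side and applying the quadratic formula gives two roots, and the positive one is the Golden Ratio:

```latex
x^2 - x - 1 = 0
\quad\Longrightarrow\quad
x = \frac{1 \pm \sqrt{5}}{2},
\qquad
\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.6180339
```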

I’ve always been curious about how words come into being, how they change meanings, and how they eventually fall out of favor. I’m not a huge word nerd (and I mean that in the best possible way) like some of my friends, but I do enjoy learning about them. In The Life of Language: The fascinating ways words are born, live & die, authors Sol Steinmetz and Barbara Ann Kipfer take us on a tour through the English language, describing the many different ways words come into the language and how their pronunciations and meanings change over time. They also explain how to read the etymological information found in dictionaries–something my high school and college English teachers never bothered to teach. If they even knew. The writing style is a little bit dry in places, and the book is probably a third larger than it really has to be, but I quite enjoyed the read.

Who would have thought that jigsaw puzzles had such a rich history? Did you know that there are manufacturers of custom jigsaw puzzles that cost $5.00 or more per piece? People will pay $2,500 for a high-quality wooden jigsaw puzzle of 500 pieces. I always thought that a jigsaw puzzle was little more than a trinket–something to pass the time. Anne D. Williams’ The Jigsaw Puzzle: Piecing Together a History opened my eyes to a whole new world of jigsaw puzzles, puzzle collectors and enthusiasts, and custom manufacturers. I’m not a huge jigsaw puzzle fan, but it was kind of interesting learning about this particular obsession that’s shared by a surprising number of people. Well written and mostly engaging, it was a good way to pass a few hours.

It’s hard to characterize Dava Sobel’s The Planets. It’s a tour of all the planets in our solar system, plus the Sun and Earth’s Moon, and including the recently demoted Pluto. The “tour” is somewhat superficial in that it doesn’t go into a whole lot of detail about any of the planets, but it does give the basic facts: size, distance from the Sun, orbital period, etc. For the planets known to the ancients, we learn how they were viewed throughout history. She also describes how the moons of other planets were discovered, and gives us some history of the discovery of Uranus, Neptune, Pluto, and many other objects. It’s a light read, well written and enjoyable.

I have to include one stinker in the list. I don’t know what possessed me to buy Apocalypse 2012: An Investigation into Civilization’s End, by Lawrence E. Joseph, and I can’t give a real good reason why I actually read it. But I’m kind of glad I did. Not because I believe the “prophecies” of doom, but because it’s such a fascinating mix of superstition, science, faulty reasoning, and plain old scare mongering. Looked at critically, the arguments just don’t hold water: there’s nothing there. But the bullshit is so skillfully disguised and beautifully rendered that the book is hard to put down. I was just amazed at how well the author was able to weave the story together. He didn’t do a perfect job, though. In several places I got the distinct impression that he was laughing his ass off as he wrote. It’s impossible that the person who wrote this book actually believes what he’s peddling. I won’t recommend the book as anything but an interesting and somewhat amusing study in pandering to a deluded audience. At that, it succeeds brilliantly.

May 23rd, 2008

Infinite Annoyance

Browsing the remainder table in Half Price Books a few weeks ago, I ran across David Berlinski’s Infinite Ascent: A short history of mathematics. The cover copy looked good, and a quick flip through a few pages was enough to convince me that it was worth the three bucks. At 180 pages, you’d expect it to be a pretty short read, and it might be for some. I found it tough going.

The book focuses on what the author (and others, I gather) considers “the ten most important breakthroughs in mathematics,” giving some biographical information about the people most closely associated with those discoveries, the historical context, and also an explanation of why the breakthroughs are important. At least, that’s how the first five chapters (Number, Proof, Analytic Geometry, The Calculus, and Complex Numbers) went. The next five chapters (Groups, Non-Euclidean Geometry, Sets, Incompleteness, The Present) seemed much less approachable.

I freely admit that some of my difficulty could be that I’m fairly comfortable with the topics discussed in the first five chapters, but with the exception of Sets I have no experience with or more than passing knowledge of the topics discussed in the later chapters. Somehow, though, I get the feeling that the fault is not entirely mine. I didn’t expect to gain a detailed understanding of Gödel’s incompleteness theorems by reading a short chapter, but I had hoped to learn something. Instead, I’m treated to prose like this:

The final cut–the director’s cut–now follows by means of the ventriloquism induced by Gödel numbering. This same formula just seen making an arithmetical statement in that subtle shade of fuchsia now acquires a palette of quite hysterical reds and sobbing violets, those serving to highlight the metamathematical scene presently unfolding, for while Bew(x) says something about the numbers, it also says that

x is a provable formula,

meaning that honey the number x is the number associated under the code with a provable formula, whereupon the director, lost in admiration for his own art, can mutter only that deep down it’s a movie about a movie.

That’s all pretty writing, but by the time I wade through the director’s psychedelic visions I’ve totally lost track of whatever mathematical subject we’re talking about. The first time I read that chapter, I put my lack of understanding down to having read it in bed, just before I fell asleep. The author’s point continues to elude me after a second reading. I learned more by skimming the Wikipedia article on Gödel’s incompleteness theorems than I did trying to puzzle out whatever Berlinski was trying to say.

Flipping through the book again after finishing it, I noticed that the style is pretentious throughout. The book suffers from too many inappropriate and incomprehensible metaphors, too much temporal hopping around in its short biographies, and too many paragraphs that jump off the page screaming, “Look, Ma, at how pretty I can write!” Like the director in the excerpt above, Berlinski seems lost in admiration of his own writing.

All in all, I’d say you’d be much better off reading Wikipedia articles about mathematics than trying to decipher the word splatter that Berlinski is trying to pass off as intelligent writing in Infinite Ascent. Not only is Wikipedia free, but you’ll learn a lot more and you won’t be tempted to track down the author and smack him upside the head for killing trees and wasting your time with his drivel.

Sometimes there’s a very good reason for a book to be on the remainder table.
