Webbots, Spiders, and Screen Scrapers

Considering what I’m doing for work, you can imagine that when I ran across Michael Schrenk’s Webbots, Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection of data from the Web.

Most of the book focuses on screen scrapers that download data from previously identified Web sites, parse the pages, and then store and present the data. There’s a little information on “spidering”–automatically following links from one page to another–but that’s not the primary purpose of the book. Which is probably a good thing. A Web-scale spider (or crawler) is fundamentally different from a screen scraper or a special-purpose spider written to gather information from a small set of domains or narrowly defined pages.
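To make the distinction concrete, here’s a rough sketch of the link-following step that turns a scraper into a spider. This is my own illustration, not code from the book, and the URL is just a placeholder:

    <?php
    // A rough sketch of the "spidering" step: fetch a page and
    // collect its links so they can be queued for later visits.
    // (Assumes allow_url_fopen is enabled; the URL is a placeholder.)
    $html = file_get_contents('http://example.com/');

    // Pull href attributes out with a simple regular expression.
    // A real spider needs proper HTML parsing and URL resolution.
    preg_match_all('/href="([^"]+)"/i', $html, $matches);

    $queue = array_unique($matches[1]);  // links still to be visited
    print_r($queue);
    ?>

A real spider also has to resolve relative URLs, avoid revisiting pages, and decide when to stop, which is where the Web-scale complexity comes in.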

The first six chapters explain why Web bots are useful and walk you through the basics: downloading Web pages, parsing the contents, automating login and form submission, and the other tasks involved in automated data collection. With plenty of PHP code examples, these chapters provide a good foundation for the next 12 chapters: Projects. In this section, we see examples of real Web bots that monitor prices, capture images, verify links, aggregate data, read email, and more–again, with many code examples.
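As a taste of what those early chapters cover, here’s a minimal sketch in the same spirit (my own PHP, not the book’s code) that downloads a page with cURL and pulls out its title. The URL is a placeholder:

    <?php
    // Download a page with cURL -- the basic building block of
    // every bot of this kind. The target URL is a placeholder.
    $ch = curl_init('http://example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    $page = curl_exec($ch);
    curl_close($ch);

    // Parse out the page title -- a simple example of the kind of
    // extraction the parsing chapters cover in more depth.
    if (preg_match('/<title>(.*?)<\/title>/is', $page, $m)) {
        echo "Title: " . trim($m[1]) . "\n";
    }
    ?>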

The first two sections cover about three-fifths of the book. If you read along and try the code examples, you’ll come away with a very good understanding of how to build many different types of Web bots.

The remainder of the book is divided into two sections. Part 3, Advanced Technical Considerations, briefly explains spiders and then discusses technical issues such as authentication and cookie management, cryptography, and scheduling your bots. This section has some code examples, but they aren’t the primary focus.
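To give a flavor of the authentication material, here’s a rough sketch (mine, not Schrenk’s) of the usual cURL approach: POST a login form, keep the session cookie in a cookie-jar file, and send it back on later requests. The URLs and form field names are made up:

    <?php
    // Log in by POSTing the form fields, storing any session
    // cookies in a local "cookie jar" file. The URLs and field
    // names here are hypothetical.
    $jar = '/tmp/bot_cookies.txt';
    $ch = curl_init('http://example.com/login');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, 'user=me&password=secret');
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);   // write cookies here
    curl_exec($ch);
    curl_close($ch);

    // Later requests send the saved cookies back, so the bot
    // stays logged in across pages.
    $ch = curl_init('http://example.com/members-only');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);  // read cookies from the jar
    $page = curl_exec($ch);
    curl_close($ch);
    ?>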

The fourth section, Larger Considerations, covers keeping your bots out of trouble, the legal issues involved, designing Web sites that are friendly to bots, and preventing bots from scraping your own site. Again, these chapters have a few code samples, but the emphasis is on the larger issues–things to think about when you’re writing and running your bots.
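In that spirit, the simplest courtesy a bot can pay is to respect robots.txt and to pause between requests. A crude sketch of the idea (again mine, with placeholder URLs, and deliberately naive about robots.txt parsing):

    <?php
    // A crude politeness check, not the book's code: skip a path
    // if the site's robots.txt appears to disallow it, and sleep
    // between fetches so the bot doesn't hammer the server.
    // A real bot should parse robots.txt properly, matching
    // user-agents and path prefixes; the URLs are placeholders.
    $robots = @file_get_contents('http://example.com/robots.txt');
    $blocked = ($robots !== false) &&
               (strpos($robots, 'Disallow: /private') !== false);

    if (!$blocked) {
        $page = file_get_contents('http://example.com/private/report.html');
        // ... process $page ...
    }
    sleep(2);  // wait a couple of seconds before the next request
    ?>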

Overall, I like the book. The writing is conversational, and the author obviously has a lot of experience building useful bots. The many code samples do a good job of illustrating the concepts, and the projects cover the major types of bots most people would be interested in writing. Reading about the projects, and some of the other ideas he presents, opens up all kinds of possibilities.

The book succeeds very well in its stated mission: explaining how to build simple Web bots and operate them in accordance with community standards. It’s not everything you need to know, but it’s the best introduction I’ve seen. The focus is on simple, single-threaded bots. There’s brief mention of using multiple bots that store data in a central repository, but no discussion of the issues involved in writing multithreaded or distributed bots that can process hundreds of pages per second.

I recommend this book if you’re at all interested in writing Web bots, even if you’re not familiar with PHP or don’t intend to use it. Just don’t expect more than the book offers.