Category Archives: Web Crawling
Hey, you deleted my files!
We got a rather strongly worded message the other day from a Webmaster who was threatening legal action because our crawler deleted a bunch of files from his site. The news that our crawler is capable of deleting files was … Continue reading
Posted in Idiocy, Web Crawling
3 Comments
More URL Filtering
Last week I mentioned proxies and other URL filtering issues that we’ve encountered when crawling the Web. A problem that continually plagues us is repeated path components–URLs like these:
http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3
http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3
I don’t know why some sites do that, but … Continue reading
Posted in Web Crawling
Comments Off
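To make the repeated-path problem in the excerpt above concrete, here is a minimal Python sketch of one way a crawler might flag such URLs. The excerpt does not say how the author's crawler filters them, so this is only an illustration: the function names and the threshold of three repeats are assumptions.

```python
from urllib.parse import urlsplit


def max_repeated_run(url: str) -> int:
    """Return the length of the longest run of identical, consecutive path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    longest = 0
    run = 0
    previous = None
    for segment in segments:
        # Extend the current run if this segment repeats the previous one.
        run = run + 1 if segment == previous else 1
        longest = max(longest, run)
        previous = segment
    return longest


def looks_like_path_loop(url: str, threshold: int = 3) -> bool:
    """Hypothetical filter: treat `threshold` or more repeats as a likely loop."""
    return max_repeated_run(url) >= threshold
```

A filter like this would reject both example URLs above while letting through a path such as /music/mp3/song.mp3. The threshold is a judgment call, since some legitimate sites do repeat a directory name once or twice.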
Proxy fits
Three years ago I mentioned anonymous proxies as a way to “anonymize” your Internet access. At the time I neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT … Continue reading
Posted in Web Crawling
1 Comment
Crawler versus the URLs
When you start crawling the Web on even a small scale, you quickly learn that things aren’t nearly as neat and tidy as the RFCs would have you believe. After just a few weeks of writing code to handle all … Continue reading
Posted in Web Crawling
1 Comment
Major search engines support robots.txt standard
Google, Yahoo, and Microsoft’s Live Search recently announced standard support for the major robots.txt directives. This means that you can use the same syntax for robots.txt to control the activities of those three major search engine crawlers. The common directives … Continue reading
Posted in Web Crawling
Comments Off
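As a rough illustration of what honoring robots.txt looks like from the crawler's side, the sketch below uses Python's standard urllib.robotparser module. The sample file, the ExampleBot user agent, and the is_allowed helper are made up for the example. It shows only the basic User-agent and Disallow lines, not the full set of directives the announcement covers.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that keeps all crawlers out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the sample file, is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/private/page.html") is False, while the same call for http://www.example.com/index.html is True.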
One more time: the Internet is public
[Note: As Michael Covington pointed out, there's plenty of privacy on the Internet--just not on the World Wide Web.] I know I’ve mentioned this before, but I keep running across people who don’t understand that there is no privacy on … Continue reading
Posted in Internet, Web Crawling
2 Comments
Webbots, Spiders, and Screen Scrapers
Considering what I’m doing for work, you can imagine that when I ran across Michael Schrenk’s Webbots, Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection … Continue reading
Posted in Book Reviews, Programming, Web Crawling
Comments Off
Reducing bandwidth used by crawlers
Some site operators block web crawlers because they’re concerned that the crawlers will use too much of the site’s allocated bandwidth. What they don’t realize is that most companies that operate large-scale crawlers are much more concerned with bandwidth usage … Continue reading
Posted in Web Crawling
Comments Off
A variation on the homegrown DOS attack
Tuesday, in How to DOS yourself, I described how to erroneously configure an Apache server and cause what appears to be a denial of service attack. There’s another way to do it that is even more insidious. In Tuesday’s post … Continue reading
Posted in Web Crawling
Comments Off
How to DOS yourself
It’s surprising the things you’ll learn when you write a Web crawler. Today’s lesson: how to be both perpetrator and victim of your own denial of service attack. Not everybody likes crawlers accessing their sites. Most will modify their robots.txt … Continue reading
Posted in Web Crawling
1 Comment
Opt in or opt out?
I mentioned before that there is a small but very vocal group of webmasters who say that crawlers should stay off their sites unless specifically invited. It is their opinion that they shouldn’t have to include a robots.txt file in … Continue reading
Posted in Web Crawling
2 Comments
Why every site should have a robots.txt file
People often ask if they need a robots.txt file on their sites. I’ve seen some Web site tutorials that say, in effect, “don’t post a robots.txt file unless you really need it.” I think that is bad advice. In my … Continue reading
Posted in Web Crawling
Comments Off
More On Robots Exclusion
As I mentioned yesterday, the Robots Exclusion Standard is a very simple protocol that lets webmasters tell well-behaved crawlers how to access their sites. But the “standard” isn’t as well defined as some would have you think, and there’s plenty … Continue reading
Posted in Web Crawling
Comments Off
Struggling with the Robots Exclusion Standard
The Internet community loves standards. We must. We have so many of them. Many of those “standards” are poorly defined or, even worse, ambiguous. Or, in the case of robots.txt, subject to a large number of extensions that have become … Continue reading
Posted in Web Crawling
Comments Off
Web Search Ramblings
Most people reading this blog understand conceptually how Google and other search engines work. In brief, they have a program called a Web crawler that goes from one Web site to the next, downloading and storing pages, and extracting links … Continue reading
Posted in Computers, Web Crawling
Comments Off
You want it when?
The web crawler I’m working on, as I’ve mentioned before, is a distributed application. Currently it consists of a URL Server and multiple Crawlers. The basic idea is that the URL Server is a traffic director that tells each Crawler … Continue reading
Posted in Programming, Web Crawling
Comments Off
Bloom Filters in C#
As I’ve pointed out before, writing a Web crawler is conceptually simple: read a page, extract the links, and then go visit those links. Lather, rinse, repeat. But it gets complicated in a hurry. The first thing that comes to … Continue reading
Posted in Programming, Web Crawling
Comments Off
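The "read a page, extract the links, and then go visit those links" loop from the excerpt above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the author's crawler: the fetch callable, the PAGES table standing in for the network, and max_pages are all hypothetical. The post's title points to Bloom filters, but this sketch tracks visited URLs with an ordinary in-memory set instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    seen = {start_url}          # URLs already queued, so none is queued twice
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://www.example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://www.example.com/a": '<a href="/">home</a>',
    "http://www.example.com/b": '<a href="/a">A</a>',
}
```

Calling crawl("http://www.example.com/", PAGES.get) visits each of the three pages once, in breadth-first order. The seen set is the part that stops scaling: it has to hold every URL ever queued.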
Multi-threaded programming
I’ve been head-down here working on the Web crawler and haven’t had much occasion to sit down and write blog entries. It’s been a very busy but interesting and rewarding time. A high performance distributed Web crawler is a rather … Continue reading
Posted in Programming, Web Crawling
1 Comment
Crawling Along
After you get your basic web crawler downloading pages and extracting links, you find yourself having to make a decision: how do you feed the harvested URLs back into the crawler? For instance, if I visit www.mischel.com and extract a … Continue reading
Posted in Web Crawling
3 Comments
Crawling the Web
I’m writing a Web crawler. Yeah, I know. It’s already been done. It seems like everybody’s done some Web crawling. But there’s a huge difference between dabbling at it and writing a scalable, high-performance Web crawler that can pull down … Continue reading
Posted in Web Crawling
1 Comment