<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jim's Random Notes &#187; Web Crawling</title>
	<atom:link href="http://blog.mischel.com/category/web-crawling/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mischel.com</link>
	<description></description>
	<lastBuildDate>Wed, 01 Sep 2010 17:17:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Hey, you deleted my files!</title>
		<link>http://blog.mischel.com/2008/08/08/hey-you-deleted-my-files/</link>
		<comments>http://blog.mischel.com/2008/08/08/hey-you-deleted-my-files/#comments</comments>
		<pubDate>Fri, 08 Aug 2008 15:42:53 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Idiocy]]></category>
		<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/?p=157</guid>
		<description><![CDATA[We got a rather strongly worded message the other day from a Webmaster who was threatening legal action because our crawler deleted a bunch of files from his site.  The news that our crawler is capable of deleting files was &#8230; <a href="http://blog.mischel.com/2008/08/08/hey-you-deleted-my-files/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>We got a rather strongly worded message the other day from a Webmaster who was threatening legal action because our crawler deleted a bunch of files from his site.  The news that our crawler is capable of deleting files was quite a surprise to us.  Like other crawlers, ours just downloads HTML files, extracts links, and then visits those links.  There is no &#8220;delete a file&#8221; logic in there.  But if the crawler stumbles upon a link whose action is to delete a file, then visiting that link will indeed delete the file.</p>
<p>Further investigation in this particular case revealed a file management page that includes, among other things, links that have the form:  <span style="text-decoration: underline;">www.example.com/files/?delete=filename.txt</span>.  Surprisingly enough, clicking on that link deletes the file.  The file management page is not protected by a password, nor is there any kind of confirmation displayed before the file is permanently deleted.</p>
<p>Examining the logs, we saw accesses from other search engine crawlers.  We also learned from the Webmaster that some time back, a kid had &#8220;hacked in&#8221; to the site and deleted a bunch of files.</p>
<p>I&#8217;m a little surprised that anybody would create such a page and not provide any protection.  I&#8217;m <em>very</em> surprised to find out that a supposedly professional Web developer would do such a thing and not learn the lesson when a random surfer came in and deleted files.  And I&#8217;m shocked that, even after we explained this to the Webmaster, he insists that we can take this as an opportunity to learn from our &#8220;mistake&#8221; and &#8220;fix&#8221; the crawler so that it doesn&#8217;t happen again.</p>
<p>It&#8217;s unfortunate that our crawler visited those links, causing the files to be deleted.  But the mistake was on the part of the person who posted those destructive links.  The crawler was operating exactly as it should.  Exactly, in fact, as every major search engine crawler acts.  It&#8217;d be nice if we could imbue the crawler with enough intelligence to &#8220;understand&#8221; Web pages and know in advance what the effects of clicking a link will be.  But that kind of machine intelligence is far, far in the future.</p>
<p>If you post something on the Web, <em>it will be found</em>, unless you take active measures to protect it.  Posting a destructive link on an unprotected page and then blaming somebody else when the link is clicked by an &#8220;unauthorized&#8221; person is akin to running out into a busy street and then blaming your injuries on the driver of the bus that hits you.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/08/08/hey-you-deleted-my-files/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>More URL Filtering</title>
		<link>http://blog.mischel.com/2008/07/19/more-url-filtering/</link>
		<comments>http://blog.mischel.com/2008/07/19/more-url-filtering/#comments</comments>
		<pubDate>Sat, 19 Jul 2008 15:55:01 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/?p=140</guid>
		<description><![CDATA[Last week I mentioned proxies and other URL filtering issues that we&#8217;ve encountered when crawling the Web.  A problem that continually plagues us is repeated path components&#8211;URLs like these: http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3 http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3 I don&#8217;t know why some sites do that, but &#8230; <a href="http://blog.mischel.com/2008/07/19/more-url-filtering/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Last week I mentioned <a href="http://blog.mischel.com/2008/07/10/proxy-fits/">proxies</a> and other <a href="http://blog.mischel.com/2008/07/09/crawler-versus-the-urls/">URL filtering issues</a> that we&#8217;ve encountered when crawling the Web.  A problem that continually plagues us is repeated path components&#8211;URLs like these:</p>
<p><span style="text-decoration: underline;">http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3</span><br />
<span style="text-decoration: underline;">http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3</span></p>
<p>I don&#8217;t know why some sites do that, but a crawler can easily get caught in a trap and will generate such URLs indefinitely.  Or until our self-imposed URL length limit kicks in.  Most of the time when that happens, we discover that all the URLs resolve to the same file, and removing the repeated path component (i.e. creating <span style="text-decoration: underline;">http://www.example.com/mp3/song.mp3</span>) is the right thing to do.</p>
<p>A single repeated component is by far the most common, but we frequently see two or three repeated components:</p>
<p><span style="text-decoration: underline;">http://www.example.com/mp3/download/mp3/download/mp3/download/song.mp3</span><br />
<span style="text-decoration: underline;">http://www.example.com/mp3/Rush/download/mp3/Rush/download/song.mp3</span></p>
<p>It&#8217;s easy enough to write regular expressions that identify the repeated path components, and replacing the repeats with a single copy is trivial.  But it&#8217;s not a good general solution.  For example, this blog (and many others) uses URLs of the form <span style="text-decoration: underline;">blog.mischel.com/yyyy/mm/dd/post-name/</span>, so the entry for July 7 is <span style="text-decoration: underline;">blog.mischel.com/2008/07/07/post-name/</span>, in which the repeated <tt>07</tt> is legitimate.  Globally applying the repeated component removal rules would break a very large number of URLs.</p>
<p>This is one of the many URL filtering problems for which there is no good global solution.  Sometimes, repeated path components are legitimate.  We can use some heuristics based on the crawl history (e.g. if <span style="text-decoration: underline;">/mp3/song.mp3</span> generates <span style="text-decoration: underline;">/mp3/mp3/song.mp3</span>) to identify problem sites, but in the end we have to write domain-specific filtering rules.  Manually identifying and coding around the dozen or so worst offenders makes a big dent in the problem.</p>
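<p>The regular expression approach can be sketched roughly as follows. This is a minimal illustration, not our crawler&#8217;s actual code: the name <tt>collapse_repeats</tt> and the limit of three components per repeated run are assumptions made for the example.</p>

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a run of one to three path components that is immediately
# repeated one or more times, e.g. "/mp3/mp3" or "/mp3/download/mp3/download".
# The limit of three components is an assumption made for this sketch.
REPEATED_RUN = re.compile(r"((?:/[^/]+){1,3}?)(?:\1)+(?=/|$)")

def collapse_repeats(url: str) -> str:
    """Replace each repeated run of path components with a single copy."""
    parts = urlsplit(url)
    path = REPEATED_RUN.sub(r"\1", parts.path)
    return urlunsplit(parts._replace(path=path))

print(collapse_repeats("http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3"))
print(collapse_repeats("http://blog.mischel.com/2008/07/07/post-name/"))
```

<p>The first call produces <tt>http://www.example.com/mp3/song.mp3</tt>, as intended. The second turns a valid blog URL into <tt>http://blog.mischel.com/2008/07/post-name/</tt>, which is why a rule like this can only be applied to domains known to have the problem.</p>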
<p>Another per-domain problem is that of session IDs encoded within the path, or with uncommon parameter names.  For example, we can easily identify and remove common IDs like <tt>PHPSESSID=</tt> and <tt>sessionid=</tt>, but these URLs will escape the filter unscathed:</p>
<p><span style="text-decoration: underline;">http://www.example.com/file.html?exSession=123456xyzzy</span><br />
<span style="text-decoration: underline;">http://www.example.com/file.html?exSession=845038plugh</span></p>
<p><span style="text-decoration: underline;">http://www.example.com/coolstuff/123456xyzzy/index.html</span><br />
<span style="text-decoration: underline;">http://www.example.com/coolstuff/845038plugh/index.html</span></p>
<p>It&#8217;s easy for humans to look at the first two URLs and determine that they likely go to the same place.  Same for the second pair.  The computer isn&#8217;t quite that smart, though, and making it that smart is very difficult.</p>
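<p>To make the limitation concrete, here is a rough sketch of a filter for the common session parameters. The function name <tt>strip_session_ids</tt> and the contents of <tt>SESSION_PARAMS</tt> are illustrative assumptions, not a complete list of what a real crawler would need.</p>

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names commonly used for session IDs, compared case-insensitively.
# Illustrative only; a real list would be longer.
SESSION_PARAMS = {"phpsessid", "sessionid"}

def strip_session_ids(url: str) -> str:
    """Drop query parameters whose names are known session-ID names."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session_ids("http://www.example.com/file.html?PHPSESSID=abc123"))
print(strip_session_ids("http://www.example.com/file.html?exSession=123456xyzzy"))
```

<p>The first URL comes back without its query string. The second comes back unchanged, because nothing in the code knows that <tt>exSession</tt> is a session ID, and a session ID embedded in the path is never examined at all. Note that <tt>urlencode</tt> re-encodes the remaining values, so the result is not always byte-for-byte identical to the input.</p>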
<p>Developing a system that automatically identifies problem URLs and generates filtering rules is a &#8220;big-R&#8221; research project&#8211;something that we don&#8217;t have time to work on at the moment.  Even if we were to develop such a thing, it&#8217;d be pretty fragile and would require constant monitoring and tweaking.  If a site&#8217;s URL format changes (something that happens with distressing frequency), the filtering rules become invalid.  Usually the effect will be letting through some stuff that should have been filtered, but in rare cases a change in the input data can lead to the filter rejecting a large number of URLs that it should have passed.</p>
<p>When I started this project, I knew that crawling the Web was non-trivial.  But it turns out that the URL filtering problem is much more complex than I expected the entire Web crawler to be.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/07/19/more-url-filtering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Proxy fits</title>
		<link>http://blog.mischel.com/2008/07/10/proxy-fits/</link>
		<comments>http://blog.mischel.com/2008/07/10/proxy-fits/#comments</comments>
		<pubDate>Fri, 11 Jul 2008 03:41:26 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/?p=134</guid>
		<description><![CDATA[Three years ago I mentioned anonymous proxies as a way to &#8220;anonymize&#8221; your Internet access. At the time I neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT &#8230; <a href="http://blog.mischel.com/2008/07/10/proxy-fits/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Three years ago I mentioned <a href="http://www.mischel.com/diary/2005/03/16.htm">anonymous proxies</a> as a way to &#8220;anonymize&#8221; your Internet access. At the time I neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT department. For example, I know of at least one company that blocks access to <a href="http://slashdot.org">slashdot.org</a>.</p>
<p>You can often go around such blocks (not that I&#8217;m advocating such behavior) by using services such as <a href="http://sureproxy.com/">SureProxy.com</a>. When you go to SureProxy and enter the URL for slashdot, SureProxy fetches the page from slashdot and sends it to you. The URL you see will look something like this: <a href="http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/">http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/</a>. If SureProxy isn&#8217;t blocked by your IT department, then you end up seeing the slashdot page. (Along with whatever advertisements SureProxy adds to the page.)</p>
<p>I&#8217;m sure this kind of thing gives corporate IT departments headaches. Their headaches are nothing compared to the problems proxies pose for Web crawlers.</p>
<p>The primary problem is that the proxy changes the URLs in the returned HTML page. Every link on the page is modified so that it, too, goes through the proxy. If the crawler starts crawling those URLs, it will just build more and more, all of which go through the proxy. And since the proxy URL doesn&#8217;t look anything like the <em>real</em> URL (at least, not to the crawler), the crawler will end up viewing the same page many times: once through the real link, and once through every proxy that the link appears in.</p>
<p>Fortunately, it&#8217;s pretty easy to write code that will identify and eliminate the vast majority of proxy URLs. Most of the proxies I&#8217;ve encountered use <a href="http://www.jmarshall.com/tools/cgiproxy/">CGIProxy</a>&#8211;a free proxy script. The script itself is usually called <tt>nph-proxy.cgi</tt> or <tt>nph-proxy.pl</tt>, although I&#8217;ve also seen <tt>nph-go</tt> and <tt>nph-proy</tt>, among others. It&#8217;s easy enough to write a regular expression that looks for those file names, extracts the real URL, and discards the proxy URL. That takes care of the simple cases. The rest I&#8217;ll have to find and block manually.</p>
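<p>A regular expression along these lines might look like the sketch below. It assumes the URL layout seen in the SureProxy example above (script name beginning with <tt>nph-</tt>, then a flags segment, then the scheme, then the real host and path); other proxies may lay out their URLs differently, and <tt>unwrap_proxy_url</tt> is a name made up for the example.</p>

```python
import re
from typing import Optional

# Assumed layout, taken from the SureProxy example:
#   http://proxy-host/nph-SCRIPT/FLAGS/SCHEME/real-host/real-path
PROXY_URL = re.compile(
    r"^https?://[^/]+/nph-[^/]+/[^/]+/(https?)/(.+)$",
    re.IGNORECASE,
)

def unwrap_proxy_url(url: str) -> Optional[str]:
    """Return the real URL hidden in a CGIProxy-style URL, or None."""
    match = PROXY_URL.match(url)
    if match is None:
        return None
    scheme, rest = match.groups()
    return f"{scheme.lower()}://{rest}"

print(unwrap_proxy_url("http://sureproxy.com/nph-index.cgi/011110A/http/slashdot.org/"))
```

<p>For the SureProxy URL this yields <tt>http://slashdot.org/</tt>, which the crawler can queue in place of the proxy URL. A URL that doesn&#8217;t match the pattern returns <tt>None</tt> and passes through the rest of the filters as usual.</p>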
<p>I&#8217;ve also seen proxies (<a href="http://www.invisiblesurfing.com/">Invisible Surfing</a> is one) that use a completely different type of proxy script. They supply the target URL as an encoded query string parameter that looks something like this: <span style="text-decoration: underline;">http://www.invisiblesurfing.com/surf.php?q=aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0=</span>. I&#8217;m sure that with some effort I could decode the URLs hidden in the query string, once I determined that the URL was a proxy URL. <em>That</em> turns out to be a rather difficult problem. Until I come up with a reliable way for the crawler to identify these types of proxy URLs, I do some manual spot-checking of the URLs myself and manually block the domains. It&#8217;s like playing <a href="http://en.wikipedia.org/wiki/Whack-a-mole">Whac-A-Mole</a>, though, because new proxies appear all the time.</p>
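<p>The decoding step, at least, is simple when the parameter turns out to be plain Base64, as it appears to be in the example above. The sketch below covers only that step; the parameter name <tt>q</tt> comes from the example URL, and <tt>decode_proxy_target</tt> is a hypothetical helper. It does nothing to solve the harder problem of deciding which URLs to try it on.</p>

```python
import base64
import binascii
from typing import Optional
from urllib.parse import parse_qs, urlsplit

def decode_proxy_target(url: str, param: str = "q") -> Optional[str]:
    """Try to Base64-decode one query parameter into an http(s) URL."""
    values = parse_qs(urlsplit(url).query).get(param)
    if not values:
        return None
    try:
        decoded = base64.b64decode(values[0], validate=True).decode("ascii")
    except (binascii.Error, UnicodeDecodeError):
        return None
    # Only accept the result if it looks like a Web URL.
    if decoded.startswith(("http://", "https://")):
        return decoded
    return None

print(decode_proxy_target(
    "http://www.invisiblesurfing.com/surf.php?q=aHR0cDovL3d3dy5taXNjaGVsLmNvbS9pbmRleC5odG0="
))
```

<p>The example URL decodes to <tt>http://www.mischel.com/index.htm</tt>. Ordinary query strings fail the Base64 check or the prefix check and return <tt>None</tt>.</p>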
<p>The other problem with crawling through proxies is that it makes the crawler ignore the <tt>robots.txt</tt> file on the target Web site. Since the crawler thinks it&#8217;s accessing the proxy site, it checks the proxy&#8217;s <tt>robots.txt</tt>. As a result, the crawler undoubtedly ends up accessing (and the indexer indexing) files that it never should have crawled.</p>
<p>Perhaps most surprising is that proxy sites don&#8217;t have <tt>robots.txt</tt> files that disallow all crawlers. I can see no benefit for the proxy site to allow crawling. The crawlers aren&#8217;t viewing the Web pages, so the proxy site doesn&#8217;t get the benefit of people clicking on its ads. All the crawler does is waste the proxy site&#8217;s bandwidth. If somebody out there understands the business of proxy sites and can explain why they don&#8217;t take the simple step of writing a <tt>robots.txt</tt> file, please explain that to me in the comments, or by email. I&#8217;m very curious.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/07/10/proxy-fits/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Crawler versus the URLs</title>
		<link>http://blog.mischel.com/2008/07/09/crawler-versus-the-urls/</link>
		<comments>http://blog.mischel.com/2008/07/09/crawler-versus-the-urls/#comments</comments>
		<pubDate>Wed, 09 Jul 2008 16:04:04 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/?p=133</guid>
		<description><![CDATA[When you start crawling the Web on even a small scale, you quickly learn that things aren&#8217;t nearly as neat and tidy as the RFCs would have you believe.  After just a few weeks of writing code to handle all &#8230; <a href="http://blog.mischel.com/2008/07/09/crawler-versus-the-urls/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>When you start crawling the Web on even a small scale, you quickly learn that things aren&#8217;t nearly as neat and tidy as the RFCs would have you believe.  After just a few weeks of writing code to handle all the special cases and ambiguities that crop up, you&#8217;ll start to wonder how the Web manages to work at all.  Nowhere is this more evident than when working with URLs.</p>
<p>It&#8217;s a pleasant fantasy to believe that a document on the Web can be reached through one and only one URL.  That is, our training as programmers pushes us into the belief that the URL <span style="text-decoration: underline;">http://www.example.com/docs/resume.html</span> is <em>the</em> way to reference that particular document.  It might be the <em>preferred</em> way, but it&#8217;s certainly not the <em>only</em> way.  On most servers, for example, you can drop the &#8220;www&#8221;, so that <span style="text-decoration: underline;">http://example.com/docs/resume.html</span> will get you to the same place. We call this &#8220;the www problem.&#8221;</p>
<p>That&#8217;s just the simplest example.  Did you know that multiple slashes are irrelevant?  That is, <span style="text-decoration: underline;">http://www.example.com/////docs////resume.html</span> will go to the same place as the two URLs above. You can also do some path navigation within the URL so that <span style="text-decoration: underline;">http://www.example.com/docs/../docs/resume.html</span> goes to the same place as all the other examples I&#8217;ve shown.</p>
<p>You can also &#8220;escape&#8221; characters within a URL. For example, you can replace the letter r with the character string %72, turning the original URL above into this: <span style="text-decoration: underline;">http://www.example.com/docs/%72esume.html</span>. Reserved characters are a different matter: under RFC 3986, an escaped slash (%2F) is not equivalent to a literal slash, so escaping the path separators does not reliably give you the same URL. Most often, escaping is used to encode embedded spaces and special characters that have particular meanings in URLs. Sometimes escaping is done automatically when a user copies a link from a browser and pastes it into an HTML authoring program.</p>
<p>Above are just some of the simplest examples. I haven&#8217;t even started on query strings&#8211;parameters that you can pass after the path part of a URL. But even without query strings, the number of different ways you can address a particular document on the Web is essentially infinite. And yet a crawler is expected to, as much as possible, determine the &#8220;canonical&#8221; form of a URL and crawl only that. Crawling the same document multiple times wastes bandwidth (for both the crawler and the crawlee), and results in duplicate data that can only cause more problems for the processes that come along after the crawler has stored the page.</p>
<p>If you haven&#8217;t written a crawler, you might think I&#8217;m just contriving examples. I&#8217;m not. The www problem in particular is a very real issue that if not addressed can cause a crawler to read a very large number of pages twice: once with the www and once without the www. The other issues are not nearly as prevalent, but they are significant&#8211;so significant that every crawler author spends a huge amount of time trying to develop heuristics for URL canonicalization. Simply following the specification in <a href="http://www.rfc-editor.org/rfc/rfc3986.txt">RFC 3986</a> will get you <em>most</em> of the way there, but there are ambiguities that simply cannot be resolved.  So we do the best we can.</p>
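<p>As a rough illustration of the simpler rules mentioned above, here is a sketch that strips the www, collapses repeated slashes, and resolves path navigation. It is not our canonicalization code, and it bakes in one large assumption: that the site really does serve the same content with and without the www. The function name <tt>canonicalize</tt> is made up for the example.</p>

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Apply a few simple canonicalization rules to an http(s) URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: the www and non-www hosts serve identical content.
    if host.startswith("www."):
        host = host[4:]
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Collapse runs of slashes, then resolve "." and ".." segments.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    keep_trailing_slash = path.endswith("/")
    path = posixpath.normpath(path)
    if keep_trailing_slash and path != "/":
        path += "/"
    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

for candidate in (
    "http://www.example.com/docs/resume.html",
    "http://www.example.com/////docs////resume.html",
    "http://www.example.com/docs/../docs/resume.html",
):
    print(canonicalize(candidate))
```

<p>All three inputs come out as <tt>http://example.com/docs/resume.html</tt>. The sketch leaves query strings and escaped characters alone, and those are where most of the real work is.</p>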
<p>You might also wonder where these weird URLs come from.  The answer is, &#8220;everywhere.&#8221;  Scripts are high on the lists of culprits.  They can mangle URLs beyond belief.  For example, one script I encountered had the annoying feature of re-escaping a parameter in the query string.  The percent sign (%) is one of those characters that gets escaped because it has special meaning in URLs.</p>
<p>So imagine a script reached from the URL <span style="text-decoration: underline;">http://www.example.com/script.php?page=1&amp;username=Jim%20Mischel</span>. The script appends the username variable to the query string for all links when it generates the page, but it escapes the already-escaped string again. So links harvested from the page have this form: <span style="text-decoration: underline;">http://www.example.com/script.php?page=2&amp;username=Jim%2520Mischel</span>. &#8220;%25&#8221; is the escape code for the percent sign. Now imagine following a chain of 10 links all generated by that script. You end up with <span style="text-decoration: underline;">http://www.example.com/script.php?page=10&amp;username=Jim%2525252525252525252520Mischel</span>.</p>
<p>What&#8217;s a poor crawler to do?</p>
<p>We do the best we can, and we have measures in place to identify such situations so that we can improve our canonicalization code. But it&#8217;s a never-ending battle. Whenever we think we&#8217;ve seen it all, we run into another surprise.</p>
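<p>One possible measure for the re-escaping case is sketched below. It is a heuristic under a stated assumption, namely that a <tt>%25</tt> followed by two hex digits is always the result of escaping an escape. A URL that legitimately contains the literal text &#8220;%20&#8221; would be mangled by it, so treat it as an illustration rather than a rule to apply everywhere. The name <tt>undo_repeated_escaping</tt> is hypothetical.</p>

```python
import re

# "%25" followed by two hex digits: a percent sign that was itself escaped.
DOUBLE_ESCAPED = re.compile(r"%25([0-9A-Fa-f]{2})")

def undo_repeated_escaping(url: str, max_rounds: int = 20) -> str:
    """Peel off one level of re-escaping per round until nothing changes."""
    for _ in range(max_rounds):
        fixed = DOUBLE_ESCAPED.sub(r"%\1", url)
        if fixed == url:
            break
        url = fixed
    return url

print(undo_repeated_escaping("Jim%2520Mischel"))
print(undo_repeated_escaping("Jim%25252520Mischel"))
```

<p>Both calls print <tt>Jim%20Mischel</tt>. Each round removes one layer, so the loop stops as soon as the string is stable or the round limit is reached.</p>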
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/07/09/crawler-versus-the-urls/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Major search engines support robots.txt standard</title>
		<link>http://blog.mischel.com/2008/06/19/major-search-engines-support-robotstxt-standard/</link>
		<comments>http://blog.mischel.com/2008/06/19/major-search-engines-support-robotstxt-standard/#comments</comments>
		<pubDate>Thu, 19 Jun 2008 16:07:31 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/?p=128</guid>
		<description><![CDATA[Google, Yahoo, and Microsoft&#8217;s Live Search recently announced standard support for the major robots.txt directives.  This means that you can use the same syntax for robots.txt to control the activities of those three major search engine crawlers.  The common directives &#8230; <a href="http://blog.mischel.com/2008/06/19/major-search-engines-support-robotstxt-standard/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html">Google</a>, <a href="http://www.ysearchblog.com/archives/000587.html">Yahoo</a>, and <a href="http://blogs.msdn.com/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx">Microsoft&#8217;s Live Search</a> recently announced standard support for the major <tt>robots.txt</tt> directives.  This means that you can use the same syntax for <tt>robots.txt</tt> to control the activities of those three major search engine crawlers.  The common directives are: <tt>Disallow</tt>, <tt>Allow</tt>, and <tt>Sitemap</tt>.  In addition, all three support the use of wildcards (* and $) in specifying paths for <tt>Allow</tt> and <tt>Disallow</tt>.  It&#8217;s interesting to note that Yahoo says they support &#8220;$ Wildcards,&#8221; whereas Google and Microsoft say that they support &#8220;* Wildcards&#8221; as well as &#8220;$ Wildcards.&#8221;  From reading Yahoo&#8217;s documentation, though, I&#8217;d say that they also support &#8220;* Wildcards.&#8221;</p>
<p>All three also support several HTML META tags, such as <tt>NOINDEX</tt> and <tt>NOFOLLOW</tt>, that give content authors much tighter control over crawlers than can be accomplished with <tt>robots.txt</tt>.</p>
<p>This isn&#8217;t exactly a new step.  The three major search engines have been collaborating for the last few years, trying to make Webmasters&#8217; jobs easier with respect to the major search engines.  For example, back in February they <a href="http://www.searchenginejournal.com/google-yahoo-microsoft-unite-on-cross-submission-of-sitemaps/6435/">announced common support</a> for cross-submission of <a href="http://www.sitemaps.org/protocol.php">Sitemaps</a>.</p>
<p>Unfortunately, all three also support individual extensions to the Robots Exclusion Protocol.  For example, Yahoo and Microsoft support the <tt>Crawl-Delay</tt> directive, which Google does not support. Both Google and Yahoo support some unique META tags that the others don&#8217;t support.</p>
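<p>For a feel of how a small crawler might honor these directives, here is a sketch using Python&#8217;s standard <tt>urllib.robotparser</tt> module. The <tt>robots.txt</tt> contents and the user agent name <tt>ExampleBot</tt> are invented for the example. As far as I know, this module does plain prefix matching and does not implement the * and $ wildcards, so it illustrates only <tt>Allow</tt>, <tt>Disallow</tt>, and <tt>Crawl-delay</tt>.</p>

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt contents for illustration.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-report.html
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical user agent name.
for path in ("/index.html", "/private/secret.html", "/private/public-report.html"):
    url = "http://www.example.com" + path
    print(path, parser.can_fetch("ExampleBot", url))

print("delay:", parser.crawl_delay("ExampleBot"))
```

<p>This parser applies the first rule that matches, which is why the <tt>Allow</tt> line is listed before the broader <tt>Disallow</tt>. Not every crawler resolves conflicts between the two in the same way, so it pays to keep such rules simple.</p>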
<p>Even with the incompatibilities, this is a big step in the right direction. With unified support of the major <tt>robots.txt</tt> directives among the three major search engine crawlers, we can expect to see more support by smaller crawlers. I know that many authors of smaller-scale crawlers look to the majors to see what they should support. Having all three support the same directives in the same way makes other developers&#8217; jobs (including mine!) easier.</p>
<p>But ultimately it&#8217;s the Webmasters who benefit the most, because they get a standard way to control crawlers&#8217; access to their sites.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/06/19/major-search-engines-support-robotstxt-standard/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>One more time: the Internet is public</title>
		<link>http://blog.mischel.com/2008/06/16/one-more-time-the-internet-has-no-window-shades/</link>
		<comments>http://blog.mischel.com/2008/06/16/one-more-time-the-internet-has-no-window-shades/#comments</comments>
		<pubDate>Mon, 16 Jun 2008 17:11:05 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/?p=127</guid>
		<description><![CDATA[[Note:  As Michael Covington pointed out, there's plenty of privacy on the Internet--just not on the World Wide Web.] I know I&#8217;ve mentioned this before, but I keep running across people who don&#8217;t understand that there is no privacy on &#8230; <a href="http://blog.mischel.com/2008/06/16/one-more-time-the-internet-has-no-window-shades/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><span style="color: #ff3f3f;">[Note:  As </span><a href="http://www.covingtoninnovations.com/michael/blog/0806/#080618"><span style="color: #ff6347;">Michael Covington pointed out</span></a><span style="color: #ff3f3f;">, there's plenty of privacy on the Internet--just not on the World Wide Web.]</span></p>
<p>I know I&#8217;ve mentioned this before, but I keep running across people who don&#8217;t understand that <a href="http://www.mischel.com/diary/2004/12/01.htm">there is no privacy on the Internet</a>.  If you&#8217;ve uploaded something to your Web site, it&#8217;s highly likely that Google, MSN, Yahoo, or any (or all) of the many other search engines out there has found it.  Even our Web crawler&#8211;a small-scale operation&#8211;finds things in hidden nooks and crannies of the Web that most people with browsers would never stumble upon.</p>
<p>For example, the other day a coworker was spot-checking some of the crawler&#8217;s latest finds and stumbled upon a site where the owner had uploaded what looks like (from examining the file names) a bunch of very private stuff.  This all in an unprotected directory.  A person with a browser could go to that URL, get a listing of all files, and then browse to his heart&#8217;s content.  Although it&#8217;s unlikely that a person browsing would stumble upon the directory, a crawler almost certainly will.  Eventually.</p>
<p>When we run across something like that, we don&#8217;t actually browse, but rather find out how to contact the site owner and send him a very nice email suggesting that he either protect the directory or not upload that information.</p>
<p>The day after discovering the site I mentioned above, we ran across the story of Alex Kozinski, a judge in the 9th Circuit whose <a href="http://www.latimes.com/news/local/la-me-kozinski12-2008jun12,0,6220192.story">personal porn stash was found publicly accessible online</a>:</p>
<blockquote><p>Kozinski, 57, said that he thought the site was for his private storage and that he was not aware the images could be seen by the public, although he also said he had shared some material on the site with friends. After the interview Tuesday evening, he blocked public access to the site.</p></blockquote>
<p>Of particular interest in this case is that the judge was presiding over an obscenity trial (now postponed) that involves material that&#8217;s apparently similar to some of the material on the judge&#8217;s site.  The judge also had some copyrighted music on the site, opening up the possibility of copyright violation.</p>
<p>No matter how far out in the country you live, if you stand naked in front of an uncovered window, somebody will eventually see you.  Similarly, if you upload something to your Web site and don&#8217;t take active measures to prevent access, it <em>will</em> be found.  Do not assume that it can&#8217;t be found because you never told anybody about it.  That&#8217;s like putting a key under the doormat and figuring it&#8217;s safe because only you know it&#8217;s there.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/06/16/one-more-time-the-internet-has-no-window-shades/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Webbots, Spiders, and Screen Scrapers</title>
		<link>http://blog.mischel.com/2008/06/05/webbots-spiders-and-screen-scrapers/</link>
		<comments>http://blog.mischel.com/2008/06/05/webbots-spiders-and-screen-scrapers/#comments</comments>
		<pubDate>Thu, 05 Jun 2008 17:06:04 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Book Reviews]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/?p=121</guid>
		<description><![CDATA[Considering what I&#8217;m doing for work, you can imagine that when I ran across Michael Schrenk&#8217;s Webbots, Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection &#8230; <a href="http://blog.mischel.com/2008/06/05/webbots-spiders-and-screen-scrapers/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Considering what I&#8217;m doing for work, you can imagine that when I ran across <a href="http://www.schrenk.com/">Michael Schrenk</a>&#8217;s <a href="http://www.amazon.com/gp/product/1593271204/">Webbots, Spiders, and Screen Scrapers</a> recently, I ordered a copy.  The book is a tutorial on writing small Web bots that automate the collection of data from the Web.</p>
<p>Most of the book focuses on screen scrapers that download data from previously identified Web sites, parse the pages, and then store and present the data.  There&#8217;s a little information on &#8220;spidering&#8221;&#8211;automatically following links from one page to another&#8211;but that&#8217;s not the primary purpose of the book.  Which is probably a good thing.  A Web-scale spider (or crawler) is fundamentally different from a screen scraper or a special-purpose spider that&#8217;s written to gather information from a small set of domains or very narrowly defined pages.</p>
<p>The first six chapters explain why Web bots are useful, and walk you through the basics:   downloading Web pages, parsing the contents, automating log in and form submission, and many other tasks that are involved in automated data collection.  With plenty of PHP code examples, these chapters provide a good foundation for the next 12 chapters:  Projects.  In this section, we see examples of real Web bots that monitor prices, capture images, verify links, aggregate data, read email, and more.  Again, with many code examples.</p>
<p>The first two sections cover about three-fifths of the book.  If you read and follow along by trying the code examples, you&#8217;ll have a very good understanding of how to build many different types of Web bots.</p>
<p>The remainder of the book is divided into two sections.  Part 3, Advanced Technical Considerations, briefly explains spiders, and then discusses some of the technical issues such as authentication and cookie management, cryptography, and scheduling your bots.  This section has some code examples, but they aren&#8217;t the primary focus.</p>
<p>The fourth section, Larger Considerations, focuses on things like how to keep your bots out of trouble, legal issues, designing Web sites that are friendly to bots, and how to prevent bots from scraping your site.  Again, these chapters have a few code samples, but the emphasis is on the larger issues&#8211;things to think about when you&#8217;re writing and running your bots.</p>
<p>Overall, I like the book.  The writing is conversational, and the author obviously has a lot of experience building useful bots.  The many code samples do a good job illustrating the concepts, and the projects cover the major types of bots most people would be interested in writing.  Reading about the projects and some of the other ideas he presents opens up all kinds of possibilities.</p>
<p>The book succeeds very well in its stated mission:  explaining how to build simple web bots and operate them in accordance with community standards.  It&#8217;s not <em>everything</em> you need to know, but it&#8217;s the best introduction I&#8217;ve seen.  The focus is on simple, single-threaded bots.  There&#8217;s some small mention of using multiple bots that store data in a central repository, but there&#8217;s no discussion of the issues involved in writing multi-threaded or distributed bots that can process hundreds of pages per second.</p>
<p>I recommend that you read this book if you&#8217;re at all interested in writing Web bots, even if you&#8217;re not familiar with or intending to use PHP.  But be sure not to expect more than the book offers.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/06/05/webbots-spiders-and-screen-scrapers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reducing bandwidth used by crawlers</title>
		<link>http://blog.mischel.com/2008/05/23/reducing-bandwidth-used-by-crawlers/</link>
		<comments>http://blog.mischel.com/2008/05/23/reducing-bandwidth-used-by-crawlers/#comments</comments>
		<pubDate>Fri, 23 May 2008 15:40:20 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/?p=114</guid>
		<description><![CDATA[Some site operators block web crawlers because they&#8217;re concerned that the crawlers will use too much of the site&#8217;s allocated bandwidth. What they don&#8217;t realize is that most companies that operate large-scale crawlers are much more concerned with bandwidth usage &#8230; <a href="http://blog.mischel.com/2008/05/23/reducing-bandwidth-used-by-crawlers/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Some site operators block web crawlers because they&#8217;re concerned that the crawlers will use too much of the site&#8217;s allocated bandwidth.  What they don&#8217;t realize is that most companies that operate large-scale crawlers are <em>much</em> more concerned with bandwidth usage than the people running the sites that the crawlers visit.  There are several reasons for this concern:</p>
<ul>
<li>The visible Web is so large that no crawler can examine the entire thing in any reasonable amount of time.  By the best estimates, even Google covers only about 25% of the visible Web.</li>
<li>The Web grows faster than the ability to crawl it grows.</li>
<li>It takes time (on average between one and two seconds) to find, download, and store a Web page.  Granted, a large crawler can download thousands of pages per second, but it still takes time.</li>
<li>It requires more time, storage, and CPU power to store, parse, and index a downloaded page.</li>
</ul>
<p>I suspect that the large search engines can give you a per-page dollar cost for locating, downloading, storing, and processing.  That per-page cost would be very small, but when you multiply it by 25 billion (or more!) pages it&#8217;s a staggering amount of money&#8211;a cost that&#8217;s incurred every time they crawl the Web.  As you can imagine, they have ample incentive to reduce unnecessary crawling as much as possible.  In addition, time and bandwidth spent downloading unnecessary pages means that some previously undiscovered pages are not visited.</p>
<p>The HTTP specification includes something called a <em>conditional GET</em>.  It&#8217;s a way for a client to request that the server send the page only if it meets some criteria.  The specification identifies several different criteria, one of which is called <em>If-Modified-Since</em>.  If the client has seen the page before and has saved the page and the date it received the page, then the client can send a request to the server that says, in effect, &#8220;If the page has changed since <em>this date</em>, then send me the page.  Otherwise just tell me that the page hasn&#8217;t changed.&#8221;  The <em>this date</em> would be replaced with the actual date that the client last saw the page.</p>
<p>If the server supports <em>If-Modified-Since</em> (which almost all do), there is a big difference in how much bandwidth is used.  If the Web page has not been modified, the server responds with a standard header and a 304 Not Modified status code:  total payload maybe a few hundred bytes.  That&#8217;s a far cry from the average 30 kilobytes for an HTML page, or the hundreds of kilobytes for a page that has complicated scripts and lots of content.</p>
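<p>As a rough illustration of the client side of a conditional GET, here is a minimal Python sketch.  The function names <tt>conditional_headers</tt> and <tt>needs_download</tt> are made up for this example, and the date is just a sample value; a real crawler would store the date alongside each saved page.</p>

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(last_seen):
    """Build request headers for a conditional GET.

    last_seen is a timezone-aware datetime recording when the client
    last received the page (an assumption of this sketch).
    """
    # usegmt=True produces the HTTP date form, which ends in "GMT".
    as_utc = last_seen.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}


def needs_download(status_code):
    """Return True if the response carries a fresh copy of the page."""
    # 304 Not Modified means the saved copy is still current.
    return status_code != 304


headers = conditional_headers(datetime(2008, 5, 23, 15, 40, 20, tzinfo=timezone.utc))
print(headers["If-Modified-Since"])  # Fri, 23 May 2008 15:40:20 GMT
```

<p>The resulting dictionary can be handed to whatever HTTP client the crawler uses.  When the status code comes back as 304, the crawler keeps the copy it already has and skips the download.</p>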
<p>The only catch is that server software (Apache, IIS, etc.) supports <em>If-Modified-Since</em> only for static content:  pages that you create and store as HTML on your site.  If your site is dynamically generated with PHP, ASP, Java, etc., then the script itself has to determine if the content has changed since the requested date, and act accordingly by sending the proper response.  If your site is dynamically generated, it&#8217;s a good idea to ask your developers if it supports <em>If-Modified-Since</em>.</p>
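<p>The decision a dynamic page has to make can be sketched in a few lines.  This is Python rather than PHP or ASP, and the function name <tt>respond</tt> and the idea that the script knows a <tt>last_changed</tt> timestamp for its content are assumptions made for the example, not a description of any particular framework.</p>

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def respond(if_modified_since, last_changed):
    """Choose a status code for a dynamically generated page.

    if_modified_since is the raw header value (or None if absent);
    last_changed is a timezone-aware datetime for the last content change.
    """
    if if_modified_since is None:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # An unparseable date is ignored and the full page is sent.
        return 200
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if since >= last_changed.replace(microsecond=0):
        return 304  # Not Modified: send headers only, no body
    return 200
```

<p>A script that returns 304 here should skip generating the page body altogether, which saves CPU time as well as bandwidth.</p>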
<p>Crawlers aren&#8217;t the only clients that use <em>If-Modified-Since</em> to save bandwidth.  All the major browsers cache content, and can be configured to do conditional GETs.</p>
<p>The direct savings of using <em>If-Modified-Since</em> can be small when compared to the indirect savings.  Imagine that your site&#8217;s front page contains links to all the other pages on your site.  If a crawler downloads the main page, it&#8217;s going to extract the links to all the other pages and attempt to visit them, too.  If you don&#8217;t support <em>If-Modified-Since</em>, the crawler will end up downloading every page on your site.  If, on the other hand, you support <em>If-Modified-Since</em> and your front page doesn&#8217;t change, the crawler won&#8217;t download the page and thus won&#8217;t see links to the other pages on the site.</p>
<p>Don&#8217;t take the above to mean that your site won&#8217;t be indexed if you don&#8217;t change the main page.  Large-scale crawlers keep track of the things they index, and will periodically check to see that those things still exist.  The larger systems even keep track of how often individual sites or pages change, and will check for changes on a fairly regular schedule.  If their crawling history shows that a particular page changes every few days, then you can expect that page to be visited every few days.  If history shows that the page changes very rarely, it&#8217;s likely that the page won&#8217;t be visited very often.</p>
<p>Smaller-scale crawlers that don&#8217;t have the resources to keep track of the change frequency for billions of Web sites will typically institute a blanket policy that controls the frequency that they revisit pages&#8211;once per day, once per week, etc.</p>
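<p>To make the idea of history-based revisiting concrete, here is one simple possibility, sketched in Python.  It is only an illustration and not how any particular search engine works:  the halving and doubling rule, the one-day and 64-day limits, and the name <tt>next_interval</tt> are all assumptions of the sketch.</p>

```python
from datetime import timedelta

# Illustrative bounds; a real crawler would tune these.
MIN_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current, changed):
    """Return how long to wait before the next visit to a page.

    current is the wait used before this visit; changed says whether
    the page differed from the saved copy.
    """
    if changed:
        proposed = current / 2   # page is active: come back sooner
    else:
        proposed = current * 2   # page is quiet: back off
    # Clamp the result to the allowed range.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

<p>A blanket policy, by contrast, amounts to ignoring the history and always returning the same interval.</p>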
<p>Supporting <em>If-Modified-Since</em> is a very easy and inexpensive way to reduce the load that search engine crawlers put on your servers.  If you&#8217;re publishing static content, then most likely you&#8217;re already benefiting from this.  If your Web site is dynamically generated, be sure that your scripts recognize the <em>If-Modified-Since</em> header and respond accordingly.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/05/23/reducing-bandwidth-used-by-crawlers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A variation on the homegrown DOS attack</title>
		<link>http://blog.mischel.com/2008/05/15/a-variation-on-the-homegrown-dos-attack/</link>
		<comments>http://blog.mischel.com/2008/05/15/a-variation-on-the-homegrown-dos-attack/#comments</comments>
		<pubDate>Thu, 15 May 2008 15:31:43 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/2008/05/15/a-variation-on-the-homegrown-dos-attack/</guid>
		<description><![CDATA[Tuesday, in How to DOS yourself, I described how to erroneously configure an Apache server and cause what appears to be a denial of service attack. There&#8217;s another way to do it that is even more insidious. In Tuesday&#8217;s post &#8230; <a href="http://blog.mischel.com/2008/05/15/a-variation-on-the-homegrown-dos-attack/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Tuesday, in <a href="http://blog.mischel.com/2008/05/13/how-to-dos-yourself/" title="How to DOS yourself">How to DOS yourself,</a> I described how to erroneously configure an Apache server and cause what appears to be a denial of service attack.  There&#8217;s another way to do it that is even more insidious.</p>
<p>In Tuesday&#8217;s post I showed how to configure error documents.  There&#8217;s apparently another way to configure things so that, rather than returning an error status code (403 Forbidden, 404 Not Found, etc.), the server returns a 302 Redirect status code.  The redirect tells the client (i.e. the browser or crawler) that the page requested can be found at a new location.  That new location is returned along with the 302 Redirect status code.</p>
<p>When a browser sees the 302 status code, it issues a request for the new page.</p>
<p>Now, imagine what happens if you block an IP address from accessing your site (see Tuesday&#8217;s article) and you configure the server to return a redirect status code when somebody tries to access from that blocked IP address:</p>
<ol>
<li>Client tries to access <tt>http://yoursite.com/index.html</tt></li>
<li>Server notices the blocked IP address and says, &#8220;return 403 Forbidden.&#8221;</li>
<li>Custom error handling returns a 302 Redirect pointing to <tt>http://yoursite.com/forbidden.html</tt>.</li>
<li>Browser receives redirect status code and issues a request for <tt>http://yoursite.com/forbidden.html</tt></li>
<li>Go to step 2.</li>
</ol>
<p>The browser and server now enter a cooperative infinite loop, with the browser saying &#8220;Show me the <tt>forbidden.html</tt> page,&#8221; and the server saying, &#8220;View <tt>forbidden.html</tt> instead.&#8221;</p>
<p>This is more insidious because from the server&#8217;s point of view it looks like the client is perpetrating a denial of service attack by continually attempting to access the same document.  But the client is simply following the server&#8217;s directions.</p>
<p>Web crawlers won&#8217;t fall into this trap because they keep track of the pages they&#8217;ve visited or tried to visit.  A Web crawler will see the first redirect and attempt to access the <tt>forbidden.html</tt> page, but on the next redirect the crawler will see that it&#8217;s already attempted that page, and give up.</p>
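<p>The bookkeeping that keeps a crawler out of the loop can be sketched roughly as follows.  This is a simplified Python illustration, not our crawler&#8217;s actual code:  <tt>fetch</tt> is a hypothetical callable that returns a status code and a redirect location, and <tt>misconfigured_server</tt> is a stand-in for a server that redirects every request from a blocked address.</p>

```python
def follow_redirects(url, fetch):
    """Follow redirects from url, giving up when a URL repeats.

    fetch is a hypothetical callable returning (status_code, location);
    location is None unless the response is a redirect.
    """
    attempted = set()
    while url not in attempted:
        attempted.add(url)
        status, location = fetch(url)
        if status not in (301, 302, 303, 307, 308) or location is None:
            return url, status
        url = location
    # The redirect chain led back to a URL already tried: give up.
    return url, None


def misconfigured_server(url):
    """Stand-in for the blocked-IP server: every request is redirected."""
    return 302, "http://yoursite.com/forbidden.html"


result = follow_redirects("http://yoursite.com/index.html", misconfigured_server)
print(result)  # ('http://yoursite.com/forbidden.html', None)
```

<p>Against the misconfigured server, the sketch makes exactly two requests:  one for <tt>index.html</tt> and one for <tt>forbidden.html</tt>.  The second redirect points at a URL that has already been attempted, so the function stops.</p>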
<p>Not all browsers are that smart.  Firefox tries a few times and then stops, showing an error message that says:</p>
<blockquote><p><span>Firefox has detected that the server is redirecting the request for this address in a way that will never complete.</span></p></blockquote>
<p>Internet Explorer, on the other hand, appears to continue trying indefinitely.</p>
<p>I don&#8217;t know enough about Apache server configuration to give an example of redirecting on error.  I do know it&#8217;s possible, though, because I discovered such a redirect loop recently while investigating a problem report.  Unfortunately, the Webmaster in question was not willing to share with me the pertinent sections of his <tt>.<font>htaccess</font></tt> file.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/05/15/a-variation-on-the-homegrown-dos-attack/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to DOS yourself</title>
		<link>http://blog.mischel.com/2008/05/13/how-to-dos-yourself/</link>
		<comments>http://blog.mischel.com/2008/05/13/how-to-dos-yourself/#comments</comments>
		<pubDate>Wed, 14 May 2008 05:11:52 +0000</pubDate>
		<dc:creator>Jim</dc:creator>
				<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.mischel.com/2008/05/13/how-to-dos-yourself/</guid>
		<description><![CDATA[It&#8217;s surprising the things you&#8217;ll learn when you write a Web crawler. Today&#8217;s lesson: how to be both perpetrator and victim of your own denial of service attack. Not everybody likes crawlers accessing their sites. Most will modify their robots.txt &#8230; <a href="http://blog.mischel.com/2008/05/13/how-to-dos-yourself/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s surprising the things you&#8217;ll learn when you write a Web crawler. Today&#8217;s lesson: how to be both perpetrator and victim of your own denial of service attack.</p>
<p>Not everybody likes crawlers accessing their sites. Most will modify their <tt>robots.txt</tt> files first, which will prevent polite bots from crawling. But blocking impolite bots requires that you configure your server to deny access based on IP address or user-agent string. Some Web site operators, either because they don&#8217;t know any better or because they want to prevent bots from even accessing <tt>robots.txt</tt>, prefer to use the server configuration file for <em>all</em> bot-blocking. Doing so is easy enough, but you have to be careful or you can create a home-grown denial of service attack.</p>
<p>The discussion below covers Web sites running the Apache server. I don&#8217;t know how to effect IP blocks or custom error pages using IIS or any other Web server.</p>
<p>There are two ways (at least) to prevent access from a particular IP address to your Web site. The two ways I know of involve editing the <tt>.<span>htaccess</span></tt> file, which usually is stored in the root directory of your Web site. [Note: The filename really does start with a period. For some reason, WordPress doesn't like me putting that filename in a post without putting some HTML noise around it. So for the rest of this entry, I'll refer to the file as <tt>htaccess</tt>, without the leading period.] As this isn&#8217;t a tutorial on <tt>htaccess</tt> I suggest that you do a Web search for &#8220;htaccess tutorial&#8221;, or consult your hosting provider&#8217;s help section for full information on how to use this file.</p>
<p>The simple method of blocking a particular IP address, available on all versions of Apache that I know of, is to use the &lt;Files&gt; directive. This <tt>htaccess</tt> fragment will block an IP address:</p>
<p><code>&lt;Files *&gt;<br />
order deny,allow<br />
deny from abc.def.ghi.jkl<br />
&lt;/Files&gt;</code></p>
<p>Of course, you would replace <tt>abc.def.ghi.jkl</tt> in that example with the actual IP address you want to block. If you want to block multiple addresses, you can specify them in separate <tt>deny</tt> directives, one per line. Some sites say that you can put multiple IP addresses on a single line. I don&#8217;t know if that works. There also is a way to block ranges of IP addresses.</p>
<p>If you do this, then any attempted access from the specified IP address will result in a &#8220;403 Forbidden&#8221; error code being returned to the client. The Web page returned with the error code is the default error page, which is very plain (some would say ugly), and not very helpful. Many sites, in order to make the error pages more helpful or to make them have the same look and feel as the rest of the site, configure the server to return a custom error page. Again, there are <tt>htaccess</tt> directives that control the use of custom error pages.</p>
<p>If you want a custom page to display when a 403 Forbidden is returned, you create the error page and add a line telling Apache where the page is and when it should be returned. If your error page is stored on your site at <tt>/forbidden.html</tt>, then adding this directive to <tt>htaccess</tt> tells Apache to return that page along with the 403 error:</p>
<p><code>ErrorDocument 403 /forbidden.html</code></p>
<p>Now, if somebody visits your site from the denied IP address, the server will return the custom error page along with a 403 Forbidden status code. It really does work. As far as I&#8217;ve been able to determine, nothing can go wrong with this configuration.</p>
<p>I said before that there are at least two ways to prevent access from a particular IP address. The other way that I know of involves using an Apache add-on called <a href="http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html">mod_rewrite</a>, a very useful but also very complicated and difficult to master module with which you can do all manner of wondrous things. I don&#8217;t claim to be an expert in <tt>mod_rewrite</tt>. But it appears that you can block an IP address by adding this command:</p>
<p><code>RewriteCond %{REMOTE_ADDR} ^abc\.def\.ghi\.jkl$<br />
RewriteRule .* - [F]</code></p>
<p>Again, you would replace the <tt>abc</tt>, <tt>def</tt>, etc. with the actual IP address numbers. As I understand it, this rule (assuming that <tt>mod_rewrite</tt> is installed and working) will prevent all accesses to your site from the given IP address. But there&#8217;s a potential problem.</p>
<p>If you have a custom 403 error document, the above can put your server into an infinite loop. According to <a href="http://www.webmasterworld.com/forum92/353.htm">this forum post</a> at <a href="http://www.webmasterworld.com/">Webmaster World</a>:</p>
<blockquote><p>A blocked request is redirected to /forbidden.html, and the server tries to serve that instead, but since the user-agent or ip address is still blocked, it again redirects to the custom error page&#8230; it gets stuck in this loop.</p></blockquote>
<p>There you have it: you are the perpetrator and victim of your own denial of service attack.</p>
<p>The forum post linked above shows how to avoid that problem.</p>
<p>I&#8217;ve seen some posts indicating that the infinite loop also is possible if you use the simple way of doing the blocking and error redirects. I haven&#8217;t been able to verify that. If you&#8217;re interested, check out <a href="http://www.webmasterworld.com/robots_txt/3595652.htm">this post</a>, which also offers a solution if the problem occurs.</p>
<p>How I came to learn about this is another story. Perhaps I can relate it one day.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mischel.com/2008/05/13/how-to-dos-yourself/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
