<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: Crawling Along</title>
	<atom:link href="http://blog.mischel.com/2007/03/05/crawling-along/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mischel.com/2007/03/05/crawling-along/</link>
	<description>Musings on technology and life</description>
	<pubDate>Tue, 06 Jan 2009 04:01:23 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Jim</title>
		<link>http://blog.mischel.com/2007/03/05/crawling-along/comment-page-1/#comment-4</link>
		<dc:creator>Jim</dc:creator>
		<pubDate>Tue, 06 Mar 2007 13:07:23 +0000</pubDate>
		<guid isPermaLink="false">http://blog.mischel.com/2007/03/05/crawling-along/#comment-4</guid>
		<description>Roy

Thanks for the note.  You might be right about the database.  When I 'dismissed' databases, I did so in the context of performing a single query for each harvested link.  I know from experience that such a thing does not perform well.

I hear you on the colors.  I rather like the layout of this theme, but I definitely need to change the colors.

Odd about the "no comments" thing.  When I looked at the entry I didn't see the comments.  Not until after I moderated them.  But thanks for letting me know.  I'm kinda new at this WordPress thing. . .</description>
		<content:encoded><![CDATA[<p>Roy</p>
<p>Thanks for the note.  You might be right about the database.  When I &#8216;dismissed&#8217; databases, I did so in the context of performing a single query for each harvested link.  I know from experience that such a thing does not perform well.</p>
<p>I hear you on the colors.  I rather like the layout of this theme, but I definitely need to change the colors.</p>
<p>Odd about the &#8220;no comments&#8221; thing.  When I looked at the entry I didn&#8217;t see the comments.  Not until after I moderated them.  But thanks for letting me know.  I&#8217;m kinda new at this WordPress thing. . .</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Roy Harvey</title>
		<link>http://blog.mischel.com/2007/03/05/crawling-along/comment-page-1/#comment-3</link>
		<dc:creator>Roy Harvey</dc:creator>
		<pubDate>Tue, 06 Mar 2007 02:44:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.mischel.com/2007/03/05/crawling-along/#comment-3</guid>
		<description>I am typing this while looking at the comment I posted a couple of hours ago.  However, the main page says there are no comments, AND THE PAGE WITH MY COMMENT SAYS THERE ARE NO COMMENTS!!!  The first is understandable perhaps, but the second is passing strange.

Roy</description>
		<content:encoded><![CDATA[<p>I am typing this while looking at the comment I posted a couple of hours ago.  However, the main page says there are no comments, AND THE PAGE WITH MY COMMENT SAYS THERE ARE NO COMMENTS!!!  The first is understandable perhaps, but the second is passing strange.</p>
<p>Roy</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Roy Harvey</title>
		<link>http://blog.mischel.com/2007/03/05/crawling-along/comment-page-1/#comment-2</link>
		<dc:creator>Roy Harvey</dc:creator>
		<pubDate>Tue, 06 Mar 2007 01:19:54 +0000</pubDate>
		<guid isPermaLink="false">http://blog.mischel.com/2007/03/05/crawling-along/#comment-2</guid>
		<description>This sure would be easier on a white board!   8-)

I think you dismissed using a database a bit too quickly.  (I'm a database guy, of course I think that!)  Suppose.....

URLs to be read are extracted from the db in batches, size to be determined by experimentation but at least hundreds and probably thousands of URLs at a time.  As each web page is crawled the links found are written to a flat file.  When the batch of URLs is done being processed the corresponding batch of linked URLs is closed and picked up by a background process that applies them to the database, inserting new rows.  The insertion process can be written a variety of ways, but being a SQL guy I would import into a staging table using some bulk load utility, then use an INSERT... SELECT FROM STAGING WHERE NOT EXISTS () to apply the changes to the master table.
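
To make the staging idea concrete, here is a rough sketch in Python, with SQLite standing in for a real database server and executemany standing in for a bulk load utility.  The table names (urls, staging) and the function name apply_batch are made up for illustration; treat it as one possible shape, not a tested design.

```python
import sqlite3


def apply_batch(conn, harvested_urls):
    """Apply one batch of harvested URLs to the master table.

    Returns the number of URLs that were new to the master table.
    Table names 'urls' and 'staging' are hypothetical.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
    cur.execute("CREATE TABLE IF NOT EXISTS staging (url TEXT)")

    # Start each batch with an empty staging table.
    cur.execute("DELETE FROM staging")

    # Stand-in for the bulk load step: copy the batch into staging.
    cur.executemany(
        "INSERT INTO staging (url) VALUES (?)",
        [(u,) for u in harvested_urls],
    )

    before = cur.execute("SELECT COUNT(*) FROM urls").fetchone()[0]

    # Insert only the staged URLs that the master table does not hold yet.
    cur.execute(
        "INSERT INTO urls (url) "
        "SELECT DISTINCT s.url FROM staging AS s "
        "WHERE NOT EXISTS (SELECT 1 FROM urls AS m WHERE m.url = s.url)"
    )

    after = cur.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
    conn.commit()
    return after - before
```

Calling apply_batch a second time with overlapping URLs adds only the ones not already in the master table, which is the whole point of the NOT EXISTS filter.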

The database server would be a different machine from the crawler(s), spreading the workload.  The crawler(s) would not have to pause for the sort/merge, as the database would be dealing with that in the background tasks.  The process of coming up with a new batch of URLs for crawling and the process of updating with the results of a crawl need not have significant blocking.

Another thought.  Harvested URLs fall into two categories - the same domain and a foreign domain.  I suggest that the file of harvested URLs be two files, one each for local and foreign links.  Local links would be applied to the database on a higher priority (if the priority is to pick a specific domain clean) or a lower priority (if the idea is to spread the crawl across the web as quickly as possible).  In fact there could be multiple, tunable rules for how the URLs are picked from the DB for processing, at least allowing for a bias toward either a deep or a broad crawl.
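
The local/foreign split could look something like the Python sketch below.  It assumes "same domain" simply means "same host name as the page being crawled", which is a simplification, and the function name split_links is made up for illustration.

```python
from urllib.parse import urlsplit


def split_links(page_url, links):
    """Separate harvested links into local and foreign lists.

    A link counts as local when its host name matches the host name of
    the page it was found on (an assumption; subdomains count as foreign).
    """
    page_host = urlsplit(page_url).hostname
    local, foreign = [], []
    for link in links:
        if urlsplit(link).hostname == page_host:
            local.append(link)
        else:
            foreign.append(link)
    return local, foreign
```

Each list would then be written to its own file, so the background process can apply them at different priorities.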

Not that any of the above is of any real use, but you got me thinking.

By the way, typing into a white box against a black background is a bit hard on the eyes.

Have fun!

Roy Harvey
Beacon Falls, CT</description>
		<content:encoded><![CDATA[<p>This sure would be easier on a white board!   8-)</p>
<p>I think you dismissed using a database a bit too quickly.  (I&#8217;m a database guy, of course I think that!)  Suppose&#8230;..</p>
<p>URLs to be read are extracted from the db in batches, size to be determined by experimentation but at least hundreds and probably thousands of URLs at a time.  As each web page is crawled the links found are written to a flat file.  When the batch of URLs is done being processed the corresponding batch of linked URLs is closed and picked up by a background process that applies them to the database, inserting new rows.  The insertion process can be written a variety of ways, but being a SQL guy I would import into a staging table using some bulk load utility, then use an INSERT&#8230; SELECT FROM STAGING WHERE NOT EXISTS () to apply the changes to the master table.</p>
<p>The database server would be a different machine from the crawler(s), spreading the workload.  The crawler(s) would not have to pause for the sort/merge, as the database would be dealing with that in the background tasks.  The process of coming up with a new batch of URLs for crawling and the process of updating with the results of a crawl need not have significant blocking.</p>
<p>Another thought.  Harvested URLs fall into two categories - the same domain and a foreign domain.  I suggest that the file of harvested URLs be two files, one each for local and foreign links.  Local links would be applied to the database on a higher priority (if the priority is to pick a specific domain clean) or a lower priority (if the idea is to spread the crawl across the web as quickly as possible).  In fact there could be multiple, tunable rules for how the URLs are picked from the DB for processing, at least allowing for a bias toward either a deep or a broad crawl.</p>
<p>Not that any of the above is of any real use, but you got me thinking.</p>
<p>By the way, typing into a white box against a black background is a bit hard on the eyes.</p>
<p>Have fun!</p>
<p>Roy Harvey<br />
Beacon Falls, CT</p>
]]></content:encoded>
	</item>
</channel>
</rss>
