After you get your basic web crawler downloading pages and extracting links, you find yourself having to make a decision: how do you feed the harvested URLs back into the crawler? For instance, if I visit www.mischel.com and extract a link to blog.mischel.com, how do I feed that new link back to the crawler so that it will download and examine that page? At first glance, it doesn’t seem like a problem at all: simply add the new link to the queue. If only it were so easy.
There are several problems one must deal with. First, you have to make sure that the crawler hasn’t already visited that page, and that the new url isn’t already in the queue. If you just add every harvested link to the queue, you’ll soon find yourself in an infinite loop. Not a problem, right? Just keep a list of visited links. Except there are billions of URLs. If you’re going to crawl any significant portion of the web, you won’t be able to keep the queue or the visited URLs list in memory. Forget storing the things in a traditional database. SQL Server, mySQL, and the like won’t perform well enough to keep your bandwidth saturated.
You could devise a hashing scheme and keep a big array (on the order of four to eight gigabytes) of bits in memory. That would solve the problem at the risk of some small number of false positives–URLs that hash to the same key. If you have a distributed crawler, you can soup up one machine to do all the URL management tasks and have the individual crawler machines send their harvested links to it. This “URL server” makes sure that no link is visited twice, and also controls which URLs the individual crawlers visit. You can draw a pretty picture that shows how it all works together. It’s feasible, but seriously complicated.
There is a simpler method. The URL server tells each crawling process which URLs to visit. They report their harvested links back to the server, which simply writes them to disk. When all of the crawlers have exhausted the links that they’ve been given, they exit. The URL server then sorts the harvested links file, eliminates duplicates by merging with the existing visited links file, and then starts another crawler pass with the newly discovered links. Granted, that sort and merge is going to take a lot of time, but the job can be split up among many machines.
This approach also has the benefit of providing a stopping point where some other process can take the collected data up to this point and start processing. After all, the point of crawling the web is to collect information, not just have the crawlers out there harvesting links.
There is a major drawback to this approach. When the crawlers are idle, you’re not using any of that expensive bandwidth you’re paying for. It is possible to use a hybrid approach, where the URL server fires off a sort/merge (on a separate machine) after collecting some large number of links, and continues to keep the crawlers busy while the other machine is doing its thing. When it’s done, that other machine gives the URL server another batch of links to process. This keeps the bandwidth saturated while still simplifying the process of identifying duplicate URLs.
Of course, if you have something else to do with the bandwidth while the sort/merge is going on, then the crawl-process-crawl scheme comes in very handy indeed.
This sure would be easier on a white board! 8-)
I think you dismissed using a database a bit too quickly. (I’m a database guy, of course I think that!) Suppose…..
URLs to be read are extracted from the db in batches, size to be determined by experimentation but at least hundreds and probably thousands of URLs at a time. As each web page is crawled the links found are written to a flat file. When the batch of URLs is done being processed the corresponding batch of linked URLs is closed and picked up by a background process that applies them to the database, inserting new rows. The insertion process can be written a variety of ways, but being a SQL guy I would import into a staging table using some bulk load utility, then use an INSERT… SELECT FROM STAGING WHERE NOT EXISTS () to apply the changes to the master table.
The database server would be a different machine from the crawler(s), spreading the workload. The crawler(s) would not have to pause for the sort/merge, as the database would be dealing with that in the background tasks. The process of coming up with a new batch of URLs for crawling and the process of updating with the results of a crawl need not have significant blocking.
Another thought. Harvested URLs fall into two categories - the same domain and a foreign domain. I suggest that the file of harvested URLs be two files, one each for local and foreign links. Local links would be applied to the datbase on a higher priority (if the priority is to pick a specific domain clean) or a lower priority (if the idea is to spread the crawl across the web as quickly as possible). In fact there could be multiple, tuneable rules for how the URLs are picked from the DB for processing, at least allowing for bias toward either the deep vs broad choice.
Not that any of the above is of any real use, but you got me thinking.
By the way, typing into a white box against a black background is a bit hard on the eyes.
Have fun!
Roy Harvey
Beacon Falls, CT
I am typing this while looking at the comment I posted a couple of hours ago. However the main page says there are not comments, AND THE PAGE WITH MY COMMENT SAYS THERE ARE NO COMMENTS!!!. The first is understandable perhaps, but the second is passing strange.
Roy
Roy
Thanks for the note. You might be right about the database. When I ‘dismissed’ databases, I did so in the context of performing a single query for each harvested link. I know from experience that such a thing does not perform well.
I hear you on the colors. I rather like the layout of this theme, but I definitely need to change the colors.
Odd about the “no comments” thing. When I looked at the entry I didn’t see the comments. Not until after I moderated them. But thanks for letting me know. I’m kinda new at this WordPress thing. . .