Web search ramblings

I suspect that most people reading this blog understand conceptually how Google and other search engines work. In brief, they have a program called a Web crawler that goes from one Web site to the next, downloading and storing pages, and extracting links to other pages. A separate process reads the stored documents and creates an inverted index that is (conceptually) similar to the index at the back of a book except that it indexes every single word in the document. When a user does a “search,” the search engine need only look up the terms in the index and return a list of documents that contain those terms.
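To make that concrete, here is a minimal sketch in Python of the idea: build an inverted index over a handful of made-up pages (the file names and text are purely hypothetical), then answer a query by looking up each term and intersecting the results. Real search engines add ranking, stemming, index compression, and much more, but the core lookup really is this simple.

```python
from collections import defaultdict

# A tiny corpus standing in for crawled, stored pages (hypothetical examples).
documents = {
    "page1.html": "steam trains and narrow gauge railways",
    "page2.html": "history of steam power and early engines",
    "page3.html": "modern search engines index the web",
}

# Build the inverted index: word -> set of documents containing that word.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Return documents containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

print(search("steam engines"))  # -> {'page2.html'}
```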

I’ve waved my hand over significant technical detail, but the details of the implementation are not the point. The point is that many people–perhaps a majority of Internet users–do not have even this level of understanding. Many think that when they pose a query to a search engine, the search engine searches the Web in real time. Those of us who understand a little bit about the Internet and the inner workings of search engines might find that idea absurd, but the term “search engine” does imply that some kind of searching is going on. What we call a search engine is more correctly a Web index.

Except that it’s not an index of the entire Web. In fact, not even Google indexes a majority of the visible Web. The best estimates I’ve seen put the number of publicly visible Web pages at somewhere between 50 and 100 billion. Researchers estimate that nobody indexes even 20% of them. If you give it a little thought, you can understand why.

The data I’ve gathered in crawling more than 100 million Web pages over the last few months indicates that the average Web page size is about 30 kilobytes. A one-megabit Internet connection can pull down roughly 100 kilobytes per second after protocol overhead, or about 3.3 Web pages per second. Large search engines, of course, have much faster Internet connections–on the order of gigabits. But even a gigabit connection can pull down only about 3,300 pages per second. At that rate, it would take about 35 days to download 10 billion documents.
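If you want to check that arithmetic, here is the back-of-envelope version in Python, using the same rough figures (30 KB per page, about 100 KB/s of throughput per megabit of bandwidth). These are estimates, not measurements:

```python
# Back-of-envelope crawl-time estimate using the figures from the text.
avg_page_kb = 30                # average page size in kilobytes
gigabit_kb_per_sec = 100_000    # ~100 KB/s per megabit, times 1,000 megabits

pages_per_sec = gigabit_kb_per_sec / avg_page_kb   # ~3,333 pages per second
total_pages = 10_000_000_000                       # 10 billion documents

seconds = total_pages / pages_per_sec
days = seconds / 86_400
print(f"{pages_per_sec:,.0f} pages/sec, {days:,.0f} days")  # ~3,333 pages/sec, ~35 days
```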

Granted, some pages are updated less frequently than others, and search engines have been optimized to take that into account. Still, it’s not possible right now for any search engine to have a current index of the entire visible Web. At best, a general search engine can maintain a reasonably current index–say, no more than 24 hours old–of the most popular one percent of the Web. Everything else has to wait until the crawler gets around to it.
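For what it’s worth, one common way to “take that into account” is to schedule recrawls according to how often a page has been observed to change, with a boost for popular pages. The sketch below is a hypothetical scoring policy (the URLs and numbers are made up), not any particular engine’s algorithm:

```python
import heapq
import time

# Hypothetical recrawl scheduler: pages that change often (or are popular)
# get revisited sooner; everything else waits its turn.
def next_crawl_time(last_crawled, change_interval_secs, popularity):
    # Revisit roughly as often as the page changes, but give popular
    # pages a boost so they stay fresher in the index.
    delay = change_interval_secs / max(popularity, 1)
    return last_crawled + delay

now = time.time()
pages = [
    ("news-frontpage.example", 3600, 10),        # changes hourly, very popular
    ("hobby-blog.example", 7 * 86400, 2),        # changes weekly, modest traffic
    ("archived-paper.example", 365 * 86400, 1),  # rarely changes
]

queue = [(next_crawl_time(now, interval, pop), url) for url, interval, pop in pages]
heapq.heapify(queue)
print(heapq.heappop(queue)[1])  # the news front page comes up first
```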

Note that I said “general search engine.” Targeted search engines that index specific topics or particular subsets of the Web are becoming more popular because they can keep their indexes more up to date. Some of them can update their indexes several times per day. Not only is their information more current, it’s also more focused and more likely to give you higher-quality results in whatever narrow field the engine targets. The drawback, of course, is that you won’t find information about Twinkies on the Steam Train search site. (I made that up. I have no idea if there really is a Steam Train search site.)

The large search engine companies understand this, of course. Google, for example, has introduced Google Custom Search, which allows you to create what is, in effect, your own custom search engine. This is the first step in what I think will be a very large emerging market.