Imagine that you have a web site that, among other things, allows your users to search for media (audio and video) using a simple query language. So, if you want to find Britney Spears videos, you’d just type britney spears in the search box and click the Search button. Simple, right?
Disclaimer:
The examples below mention particular artists whose content appears legitimately on YouTube and other media sites, and can be legally obtained with the blessing of the copyright holders.
Although it’s possible that content from these artists can also be obtained illegally from other sites, I do not advocate that practice. I do not support the use of any Internet search technology to obtain music, video, or other electronic media illegally.
Companies that operate search engines do not knowingly index such illegal content. Reputable companies remove links to illegal content as required by the Digital Millennium Copyright Act (DMCA), when the existence of that content is made known in accordance with the DMCA’s notification procedures.
Except it turns out that britney and spears are pretty common spam terms in metadata (the keywords and description fields of YouTube videos, for example). People will upload all manner of stuff to YouTube and put bogus terms in the description in an attempt to get people to watch the video. To reduce the number of irrelevant or inappropriate results returned (it’s probably impossible to eliminate irrelevant content), you decide to index the metadata by field and allow the user to say which fields are searched. So, if they want just those videos that have “Britney” and “Spears” in the title field, they would type britney spears IN Title. That doesn’t eliminate all of the spam, but it reduces it quite a bit.
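To make the field idea concrete, here is a minimal Python sketch of indexing metadata by field and then searching a single field. The sample videos and the names build_field_index and search are made up for illustration; this is one possible shape for the idea, not a description of how any real search engine does it.

```python
import re
from collections import defaultdict

# Hypothetical sample data; the field names mirror the ones used in the text.
VIDEOS = [
    {"id": 1, "Title": "Britney Spears - Toxic", "Description": "Official video"},
    {"id": 2, "Title": "Cat falls off sofa", "Description": "britney spears funny cats"},
    {"id": 3, "Title": "Spears of ancient Rome", "Description": "History lecture"},
]


def build_field_index(videos, fields):
    """Build a mapping of field -> lowercase word -> set of video ids."""
    index = {field: defaultdict(set) for field in fields}
    for video in videos:
        for field in fields:
            for word in re.findall(r"\w+", video.get(field, "").lower()):
                index[field][word].add(video["id"])
    return index


def search(index, terms, field):
    """Return the ids of videos whose given field contains every term."""
    postings = [index[field].get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_field_index(VIDEOS, ["Title", "Description"])
# Equivalent of typing: britney spears IN Title
title_hits = search(index, ["britney", "spears"], "Title")
```

With this sample data, the spam video that only mentions the terms in its description is not among the Title hits, although it would still match a search of the Description field.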
It turns out that you have to make IN case sensitive. Otherwise you’d never be able to search for the word “in” in any metadata. The same is true for any word that you use in your query language. For example, if we wanted all the videos that contain “Britney” or “Spears”, we’d write britney OR spears IN Title.
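A rough sketch of what case-sensitive keywords look like in a tokenizer, assuming the only reserved words are IN and OR. The tokenize function and the token labels are hypothetical names chosen for this example.

```python
# Reserved words are matched exactly, so "IN" is an operator but "in" is a word.
KEYWORDS = {"IN", "OR"}


def tokenize(query):
    """Split a query on whitespace and label each piece as an operator or a word."""
    tokens = []
    for piece in query.split():
        if piece in KEYWORDS:  # exact, case-sensitive comparison
            tokens.append(("OP", piece))
        else:
            tokens.append(("WORD", piece))
    return tokens


tokens = tokenize("britney OR spears IN Title")
```

Because the comparison is exact, a user can still search for the lowercase words “in” and “or”; they come back labeled as ordinary words.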
Still, not too hard, right? But what if you want to search the Title field and the Description field? At first you’d think you could write: britney spears IN Title OR Description. You could make that work until you take into account the possibility of more complex query expressions. For example, let’s say you wanted a list of all videos that claim to be a Led Zeppelin song, or some version of Stairway to Heaven. One possible query would be:
led zeppelin IN Artist OR Description OR stairway heaven IN Title
That query might look reasonable to a non-programmer, but writing a computer program to properly handle the general case of queries like that is non-trivial. The query can be parsed in several different ways. Here are three of them:
(led zeppelin IN Artist OR Description) OR (stairway heaven IN Title)
(led zeppelin IN Artist) OR ((Description OR stairway) heaven IN Title)
(led zeppelin IN Artist) OR (Description OR (stairway heaven) IN Title)
All three of those interpretations are perfectly valid. Applying rules of operator precedence can disambiguate some of the cases, but if you go through the exercise you’ll find out that IN has to have lower precedence than OR, and if you do that, then you end up with:
(led zeppelin IN Artist OR (Description OR stairway heaven)) IN Title
You end up having to either decorate the field names (e.g. “@Artist”) or group them with brackets or parentheses (e.g. IN [Artist OR Description]).
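Here is a small recursive-descent sketch of the bracketed form, to show that the parsing really isn’t heavy lifting. The grammar is my own assumption for illustration: a query is one or more clauses joined by OR, each clause is a run of search terms followed by IN and either a single field name or a bracketed list of field names joined by OR. For brevity it leaves out OR between search terms. The parse function and the shape of its result are hypothetical.

```python
import re

RESERVED = (None, "IN", "OR", "[", "]")


def parse(query):
    """Parse 'terms IN field' clauses joined by OR into a list of dicts."""
    # Brackets are their own tokens; everything else splits on whitespace.
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected or 'a token'}, got {token!r}")
        pos += 1
        return token

    def take_name():
        if peek() in RESERVED:
            raise ValueError(f"expected a field name, got {peek()!r}")
        return take()

    def parse_fields():
        if peek() != "[":
            return [take_name()]
        take("[")
        fields = [take_name()]
        while peek() == "OR":
            take("OR")
            fields.append(take_name())
        take("]")
        return fields

    def parse_clause():
        terms = []
        while peek() not in RESERVED:
            terms.append(take().lower())
        if not terms:
            raise ValueError("a clause needs at least one search term")
        take("IN")
        return {"terms": terms, "fields": parse_fields()}

    clauses = [parse_clause()]
    while peek() == "OR":
        take("OR")
        clauses.append(parse_clause())
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return clauses


clauses = parse("led zeppelin IN [Artist OR Description] OR stairway heaven IN Title")
```

With the brackets in place there is only one way to read the Led Zeppelin query: two clauses, the first searching two fields and the second searching one. The unbracketed version from earlier is rejected by this sketch rather than guessed at.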
All of this is doable, and not especially heavy lifting as far as parsing is concerned. But then you have to explain it to a non-technical user and make it easy for the non-technical user to use. Otherwise, only programmers will want to (or even be able to) use it.
I’ve heard many a programmer (myself included, come to think of it) complain about a search facility that doesn’t allow complex queries. We look at it from a programmer’s perspective and think it’d be trivial to implement a comprehensive query facility. And in most cases we’re probably right. You could develop a query system that anybody with a couple of years of programming experience could use without trouble and get exact results. And when you flipped the switch to turn it on, you’d hear crickets. Most users don’t understand Boolean algebra or the difference in precedence between AND and OR. Trust me, people will go somewhere else to get their information rather than have to think of how to ask for it.
What users really want is a DWIM mode: Do What I Mean. They want to type word soup into the search box and get back exactly what they were looking for, with no false hits (e.g. asking for beatles, the music group, and getting back something about dung beetles because somebody misspelled “beetle”).
But DWIM doesn’t exist. Not today, and not for a long time (perhaps ever) in the future. As a result, we have to restrict what the user can type and very carefully specify how things will be interpreted. We have to make the most common cases easy while still allowing moderately complex and powerful queries. That balance is difficult to achieve, and no matter what you come up with, somebody will complain. You can only hope that the number of users you delight will vastly outweigh those whom you annoy.