We got a rather strongly worded message the other day from a Webmaster who was threatening legal action because our crawler deleted a bunch of files from his site. The news that our crawler is capable of deleting files was quite a surprise to us. Like other crawlers, ours just downloads HTML files, extracts links, and then visits those links. There is no “delete a file” logic in there. But if the crawler stumbles upon a link whose action is to delete a file, then visiting that link will indeed delete the file.
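To make the "download, extract links, visit" loop concrete, here is a minimal sketch of that kind of crawler in Python. It is not our actual crawler; the names `LinkExtractor`, `extract_links`, `fetch`, and `crawl` are hypothetical and exist only for illustration. Nothing in it knows or cares what a URL does on the server. It simply requests every link it finds.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all links found in the HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch(url):
    """Download a page with a plain GET request (hypothetical default fetcher)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, queue its unseen links, repeat."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)  # the only thing the crawler ever does to a URL
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing a stand-in for `fetch_page` lets the loop run without touching the network. Note that a link such as `?delete=filename.txt` is resolved and queued exactly like any other link.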
Further investigation in this particular case revealed a file management page that includes, among other things, links that have the form: www.example.com/files/?delete=filename.txt. Surprisingly enough, clicking on that link deletes the file. The file management page is not protected by a password, nor is there any kind of confirmation displayed before the file is permanently deleted.
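For contrast, here is a rough sketch of what a less dangerous delete handler might look like. This is an illustration, not the Webmaster's code: the function `handle_delete`, its parameters, and the in-memory `files` set are all assumptions. The idea is that a plain GET, which is all a crawler or a clicked link sends, never deletes anything, and that even a POST is refused unless the caller is logged in.

```python
def handle_delete(method, filename, authenticated, files):
    """Return an HTTP status code for a request to delete `filename`.

    `files` is a set of file names standing in for the real file store
    (an assumption made for this sketch).
    """
    # GET is meant to be a safe method, so it must never delete anything.
    if method != "POST":
        return 405  # Method Not Allowed
    # Destructive actions require a logged-in user.
    if not authenticated:
        return 401  # Unauthorized
    if filename not in files:
        return 404  # Not Found
    files.remove(filename)
    return 200  # OK
```

With a handler shaped like this, a crawler following `www.example.com/files/?delete=filename.txt` would get an error response and the file would stay where it is.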
Examining the logs, we saw accesses from other search engine crawlers. We also learned from the Webmaster that some time back, a kid had “hacked in” to the site and deleted a bunch of files.
I’m a little surprised that anybody would create such a page and not provide any protection. I’m very surprised to find out that a supposedly professional Web developer would do such a thing and not learn the lesson when a random surfer came in and deleted files. And I’m shocked that, even after we explained this to the Webmaster, he insists that we can take this as an opportunity to learn from our “mistake” and “fix” the crawler so that it doesn’t happen again.
It’s unfortunate that our crawler visited those links, causing the files to be deleted. But the mistake was on the part of the person who posted those destructive links. The crawler was operating exactly as it should and, in fact, exactly as every major search engine crawler does. It’d be nice if we could imbue the crawler with enough intelligence to “understand” Web pages and know in advance what the effects of clicking a link will be. But that kind of machine intelligence is far, far in the future.
If you post something on the Web, it will be found, unless you take active measures to protect it. Posting a destructive link on an unprotected page and then blaming somebody else when the link is clicked by an “unauthorized” person is akin to running out into a busy street and then blaming your injuries on the driver of the bus that hits you.
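One of the simplest "active measures" is a robots.txt file, which asks well-behaved crawlers to stay out of part of a site. The sketch below uses Python's standard `urllib.robotparser` to show how a crawler might check such rules before visiting a URL. The rules and the user agent name `ExampleCrawler` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules that put the file management page off limits.
robots_lines = [
    "User-agent: *",
    "Disallow: /files/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# A polite crawler asks before fetching each URL.
delete_allowed = parser.can_fetch(
    "ExampleCrawler", "http://www.example.com/files/?delete=filename.txt"
)
home_allowed = parser.can_fetch("ExampleCrawler", "http://www.example.com/index.html")
```

Keep in mind that robots.txt is only a request. It does nothing to stop a person, or a crawler that ignores it, so a page that can delete files still needs a password.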