Today I got a big XML file full of yummy audio and video links that my Web crawler will just love to slurp up. Not thinking, I wrote a quick grep command to extract some of the links and send them to the crawler. Later it dawned on me that some of those links are broken because the XML is entity encoded. That is, this link:
http://www.example.com/videos/?id=23&format=hd
will be encoded so that “&” becomes “&amp;”. Any character that is “special” in XML will end up being entity encoded like that. Oops.
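To make the problem concrete, here is a small Python sketch of what entity encoding does to that link. It is only an illustration using the standard library's `xml.sax.saxutils` helpers, not what I actually ran. The variable names are mine.

```python
from xml.sax.saxutils import escape, unescape

# The link as the crawler needs to see it.
url = "http://www.example.com/videos/?id=23&format=hd"

# What the same link looks like inside the XML file: "&" becomes "&amp;".
# This is the text a plain grep over the raw file would hand back.
encoded = escape(url)

# Decoding the entities recovers the usable link.
decoded = unescape(encoded)

print(encoded)
print(decoded)
```

A grep over the raw file returns the `encoded` form, so every link containing a special character goes to the crawler in a form that no longer matches the real URL.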
A quick search for “xml grep” led me to XMLStarlet: a command line XML toolkit that lets you examine, query, fold, spindle, and mutilate XML files from the command line. I don’t know nearly as much as I should about XPath, XSLT, and XML in general, but after a few minutes of looking at examples and struggling with the syntax, I managed to pull those URLs out of the XML file and send them off to the crawler.
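For comparison, the same kind of extraction can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is not the XMLStarlet command I used. The sample document, the `link` element name, and the `extract_links` helper are all made up for illustration, since the real file's structure isn't shown here. The point is that a real XML parser decodes the entities for you.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file's layout and element names will differ.
SAMPLE = """<feed>
  <item><link>http://www.example.com/videos/?id=23&amp;format=hd</link></item>
  <item><link>http://www.example.com/audio/?id=7&amp;format=mp3</link></item>
</feed>"""


def extract_links(xml_text, tag="link"):
    """Return the text of every element named `tag`, with entities decoded."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth doesn't matter.
    return [el.text.strip() for el in root.iter(tag) if el.text]


links = extract_links(SAMPLE)
for link in links:
    print(link)
```

Because the parser handles the decoding, the links come out with a plain “&” and can go straight to the crawler.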
Granted, I spent a heck of a lot more time on this than I would have spent writing a quick C# program to extract that one element from the file in question. But my C# program would have worked for this situation only. I already have other plans for XMLStarlet.
Highly recommended. If you ever find yourself having to manipulate XML files outside of your application, you need this tool.