My latest crawler modifications require me to scrape Web pages that host videos so that I can obtain metadata (title, description, date posted, etc.) that we place in our index. Unfortunately, there’s no standard way for sites to present such information. ESPN and Vimeo have HTML <meta> tags that provide some info, but I have to go parsing through the body of the document to find the date. (And yes, I’m aware that Vimeo has an API that will make this a moot point. I’ll be investigating that soon.)
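For the sites that do expose <meta> tags, pulling them out is the easy part. Here is a minimal sketch in Python, using only the standard library's html.parser. The sample page, the class name MetaTagCollector, and the function extract_meta are made up for illustration. No real site's markup is assumed, and the posted date would still have to come from the document body.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name/property -> content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Some sites use name="...", others property="..." (e.g. og:title).
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content is not None:
            # Keep the first value seen for each key.
            self.meta.setdefault(key.lower(), content)


def extract_meta(html_text):
    """Return a dict of the <meta> values found in html_text."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.meta


# Hypothetical page used only to exercise the parser.
SAMPLE_PAGE = """
<html><head>
<title>Example video</title>
<meta name="description" content="A made-up clip used for illustration.">
<meta property="og:title" content="Example video">
</head><body><p>Posted on some date that is only in the body.</p></body></html>
"""

metadata = extract_meta(SAMPLE_PAGE)
print(metadata.get("og:title"), "|", metadata.get("description"))
```

Anything missing from the returned dictionary (the date, in this sample) is what forces the trip through the body of the document.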
Other sites are much worse in that they provide no metadata in the HTML. For example, one site’s video page is very code-heavy. Reloading the whole page every time you request a new video would generate a lot of network traffic, so their design instead uses JavaScript to request a particular video’s metadata from a server. Loading a new video involves downloading just a few kilobytes of data.
I spent some time this afternoon searching through a video page’s HTML and the associated JavaScript, looking for the magic incantation that would get me the data I’m looking for. The amount of code involved is staggering, and I quickly went cross-eyed trying to decipher it before I hit on the idea of hooking up a sniffer to see if I could identify the HTTP request that gets the data.
It took me all of five minutes to download and install Free Http Sniffer, request a video from the site in question, and locate the magic line among the 230 or so requests that the page makes when it loads. Problem solved. Now all I have to do is write code that’ll transform a video page URL into a request for the metadata, and I’m set.
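The transformation itself should be small. What follows is only a sketch of its shape in Python. The host videos.example.com, the /watch/<id>/ page layout, the /api/video-metadata endpoint, and the id parameter are all invented for illustration, since the real values are whatever the sniffer shows for a given site.

```python
import re
from urllib.parse import urlencode, urlsplit, urlunsplit

# Hypothetical page layout: /watch/<numeric id>/<optional slug>
VIDEO_PATH = re.compile(r"^/watch/(?P<video_id>\d+)(?:/|$)")

# Hypothetical endpoint; substitute the request found with the sniffer.
METADATA_PATH = "/api/video-metadata"


def metadata_url_for(page_url):
    """Map a video page URL to its metadata request URL.

    Returns None when the URL does not look like a video page.
    """
    parts = urlsplit(page_url)
    match = VIDEO_PATH.match(parts.path)
    if match is None:
        return None
    query = urlencode({"id": match.group("video_id")})
    # Same scheme and host as the page, new path and query, no fragment.
    return urlunsplit((parts.scheme, parts.netloc, METADATA_PATH, query, ""))


print(metadata_url_for("https://videos.example.com/watch/12345/some-title"))
```

Returning None for URLs that don't match lets the crawler skip non-video pages instead of firing off requests that are bound to fail.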
I have no idea why I didn’t think of the sniffer earlier. I’d used one before for a similar purpose. I suspect I’ll be making heavy use of it in the near future as I expand the number of sites that we crawl for media.