Sniffing network traffic

My latest crawler modifications require me to scrape Web pages that host videos so that I can obtain metadata (title, description, date posted, etc.) that we place in our index.  Unfortunately, there’s no standard way for sites to present such information.  ESPN and Vimeo have HTML <meta> tags that provide some info, but I have to go parsing through the body of the document to find the date.  (And yes, I’m aware that Vimeo has an API that will make this a moot point.  I’ll be investigating that soon.)
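As a rough illustration of the &lt;meta&gt;-tag case, here's a minimal sketch using only Python's standard library. The og:* property names are just examples of what a video page might expose, not a claim about ESPN's or Vimeo's actual markup, and this isn't necessarily how my crawler does it:

```python
# Minimal sketch: collect name/property -> content pairs from <meta> tags.
from html.parser import HTMLParser
from urllib.request import urlopen


class MetaTagParser(HTMLParser):
    """Record every <meta name=... content=...> or <meta property=... content=...>."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        if key and "content" in attrs:
            self.meta[key] = attrs["content"]


def scrape_meta(url):
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


# e.g. scrape_meta("https://vimeo.com/12345") might yield keys like
# "og:title" or "description" -- but the date usually isn't here,
# which is why I end up digging through the document body.
```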

Other sites are much worse in that they provide no metadata in the HTML.  For example, one site’s video page is very code-heavy.  Reloading the entire page every time you request a new video would generate a lot of network traffic, so their design instead uses JavaScript to request a particular video’s metadata from a server.  Loading a new video involves downloading just a few kilobytes of data.

I spent some time this afternoon searching through a video page’s HTML and the associated JavaScript, looking for the magic incantation that would get me the data I’m looking for.  The amount of code involved is staggering, and I quickly went cross-eyed trying to decipher it before I hit on the idea of hooking up a sniffer to see if I could identify the HTTP request that gets the data.

It took me all of five minutes to download and install Free Http Sniffer, request a video from the site in question, and locate the magic line in the 230 or so requests that the page makes when it loads.  Problem solved.  Now all I have to do is write code that’ll transform a video page URL into a request for the metadata, and I’m set.
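To give an idea of what that transformation looks like, here’s a minimal sketch.  The metadata endpoint, the video-id pattern, and the JSON shape are all made up for illustration; the real ones are whatever the sniffer turned up:

```python
# Sketch: turn a video page URL into the metadata request the page's
# JavaScript would have made.  Everything site-specific here is hypothetical.
import json
import re
from urllib.request import urlopen

# Hypothetical endpoint discovered via the sniffer.
METADATA_ENDPOINT = "http://example.com/videoMeta?id={video_id}"


def metadata_for(page_url):
    # Assume the video id is a numeric path component,
    # e.g. http://example.com/videos/12345 -> 12345.
    match = re.search(r"/(\d+)(?:[/?#]|$)", page_url)
    if not match:
        raise ValueError("no video id found in %r" % page_url)
    video_id = match.group(1)
    with urlopen(METADATA_ENDPOINT.format(video_id=video_id)) as resp:
        # The endpoint is assumed to return JSON with title,
        # description, date posted, etc.
        return json.loads(resp.read().decode("utf-8"))
```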

I have no idea why I didn’t think of the sniffer earlier.  I’d used one before for a similar purpose.  I suspect I’ll be making heavy use of it in the near future as I expand the number of sites that we crawl for media.