Friday, December 21, 2012

The GSA that watched the world

Everything seemed to be going great.  Point your GSA at youtube.com/user/[channel], add youtube.com/watch?.* to the follow-and-crawl patterns, and like magic you get just the videos on your channel.  "Related videos" are put on the page with JavaScript or Flash, and the GSA doesn't find them.

Life is good.

Then, suddenly, you run a search on your well-configured GSA and, lo and behold, you now have nothing but cat-related videos instead of the videos from your channel about proper cell-phone configuration.  And, to make matters worse, documents from your website are being pushed out by an ever-growing corpus of YouTube pages.  Cinema has taken over.

Where did it go so wrong?

As it turns out, Google recently made a change to the YouTube interface: instead of using JavaScript to construct the related videos section, the page now comes with links in good old-fashioned <a> tags, which the search appliance is more than happy to follow.  And, since all videos, from cats to cars to explosions, have the same URL format, you can't filter them by pattern; the only alternative is to list every single one manually in the crawl patterns.  That might work if you had a dozen videos, but you have hundreds--cell phone configuration is serious business, after all.

The solution to this is a long and difficult road, but it can be navigated with sufficient determination.  The problem can be solved with a page-rewriting crawl proxy, set between the GSA and YouTube.  When the GSA crawls any page matching the /watch?v= video pattern, the proxy looks at the contents of the page and checks whether it contains a link to "www.youtube.com/user/[channelname]".  Because every video page currently has only one link to /user/* with the www.youtube.com domain name attached, that single link is all the information you need to decide if the page is yours or not.

If the video links to your channel, let it through unchanged.  If it doesn't, replace the page with a blank document.  The GSA will have references to a small handful of blank video documents, but these will not show up in the search results.
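
The decision the proxy has to make can be sketched roughly as follows.  This is only an illustration of the filtering step, not a complete proxy: the function names, the blank-page markup, and the assumption that channel names consist of letters, digits, underscores and hyphens are all mine, and the check relies on the page layout described above staying the way it is.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical placeholder returned for videos that are not ours.
BLANK_PAGE = "<html><head><title></title></head><body></body></html>"

# Finds "www.youtube.com/user/<name>" anywhere in the page source.
# The character set allowed in <name> is an assumption.
USER_LINK = re.compile(r"www\.youtube\.com/user/([A-Za-z0-9_\-]+)", re.IGNORECASE)


def is_watch_url(url):
    """True if the URL looks like a /watch?v=... video page."""
    parsed = urlparse(url)
    return parsed.path == "/watch" and "v" in parse_qs(parsed.query)


def filter_page(url, html, channel):
    """Return the page unchanged if it belongs to `channel`, else a blank page.

    Pages that are not video pages (the channel page itself, for example)
    are always passed through untouched.
    """
    if not is_watch_url(url):
        return html
    owners = {name.lower() for name in USER_LINK.findall(html)}
    if channel.lower() in owners:
        return html
    return BLANK_PAGE
```

A proxy would call something like filter_page(requested_url, response_body, "yourchannel") on each response before handing it back to the GSA.  Comparing channel names case-insensitively is a design choice here, made on the guess that the link on the page may not match the capitalization you configured.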

Order has been restored!

After implementing this, it’s probably a good idea to force a re-crawl of youtube.com before kicking back with a victory coffee, to hasten the elimination of the unwanted videos which have accumulated.  An even better plan would be to remove youtube.com from follow-and-crawl, let the GSA discard it all, and then start crawling it fresh.  Sometimes heavy-handed tactics are required to bring a GSA that has become addicted to watching silly videos back into line.

