Here we go for the second installment.
The topic for today is online video.
- Online video
You may be familiar with certain sites that allow you to view your favorite TV episodes or watch a poor squirrel being launched into the woods off of some guy’s deck via a salad strainer and 20 feet of bungee cord. Well, we’ve been asked by one of our clients to scrape the source video URL, title, rating, description, etc. of thousands of online animal torture videos and general moving multi-colored malaise.
Those of you already familiar with screen-scraper are acquainted with the usual routine of starting off by proxying a site using screen-scraper’s proxy server. Well, it so happens that screen-scraper uses an HTTP proxy. It also so happens that most online videos are served over a protocol other than HTTP (e.g., mms, http to mms, rtsp, http to rtmp, rtmp, rtmpe, rtmps, rtmpt, etc.).
Those of you already familiar with online videos probably know that you view them via the Adobe Flash player. screen-scraper’s built-in client is not a Flash player. So, you wonder, how does screen-scraper scrape online videos?
Challenges:
Source video URL discovery is particularly challenging for the reasons described above and requires a new set of tools to make it happen. Over time our tool set has evolved to include different video stream recording software, Proxy/TCP revealers, and various multimedia players…
- Streamtransport: Very convenient URL grabber
- URLSnooper: Kind of clunky but reveals many URLs
- rtmpdump, rtmpsrv, rtmpsuck: The authoritative source for RTMP download and discovery
- screen-scraper Proxy: Our ever-improving HTTP-only proxy server
- Charles Proxy: Very nice HTTP-only Proxy
- Fiddler: Unusually good M$ TCP snooper
- VLC: Swiss Army knife of media players
- Applian FLVPlayer: If it’s RTMP and it’s playable…this will play it
- Akamai Flash support player: Testbed for one of the Web’s largest CDNs.
Once the source URL is discovered, we create a fairly typical scraping session that recurses over a site, scraping the visible title, description, etc., as well as the non-visible pieces that make up an online video source URL. For example…
- Proxy:
http://news.bbc.co.uk/sport2/hi/football/world_cup_2010/video/default.stm
- Note “connection” node:
http://news.bbc.co.uk/media/emp/8680000/8682600/8682671.xml
- Compile URL:
rtmp://72.246.119.70:80/ondemand?_fcs_vhost=cp45414.edgefcs.net&undefined/public/flash/sport/football/553...
- Test via Akamai
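The "note the connection node, then compile the URL" step can be sketched roughly as follows in Python. This is only an illustration: the sample XML, the attribute names (`server`, `port`, `application`, `identifier`) and the way they are joined are assumptions made for the example, not the actual layout of any particular site's XML, so the pieces will need adjusting per portal.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; a real site's XML will look different.
SAMPLE_XML = """\
<playlist>
  <item>
    <connection server="media.example.com" port="80"
                application="ondemand" identifier="public/flash/sample_clip"/>
  </item>
</playlist>
"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def find_connections(xml_text):
    """Return the attributes of every <connection> element as dicts."""
    root = ET.fromstring(xml_text)
    return [dict(el.attrib) for el in root.iter()
            if local_name(el.tag) == "connection"]


def compile_rtmp_url(conn):
    """Join the (assumed) attributes into a candidate RTMP URL."""
    return "rtmp://{server}:{port}/{application}/{identifier}".format(**conn)


urls = [compile_rtmp_url(c) for c in find_connections(SAMPLE_XML)]
print(urls)
```

The compiled URL is only a candidate; it still has to be tested in a player before it is stored.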
Extracting embedded video meta-data is required because a site will seldom state outright the format, codec, dimensions, length, etc. of its online videos. We use a combination of software to download a portion of the video in order to get to the meta-data.
- wget: Download a URL
- wpro: Download a non-HTTP, non-RTMP URL
- rtmpdump: Download an RTMP URL
- mediainfo: Reveal the meta-data
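As a rough sketch of how the tools above might be chosen per URL, the Python below picks a downloader from the URL scheme and builds the command line without running it. The function names and the ten-second default are assumptions for illustration. The wget command as written fetches the whole file rather than a portion, and wpro is left out because its command-line options are not described here.

```python
from urllib.parse import urlparse

# RTMP variants handled by rtmpdump in this sketch.
RTMP_SCHEMES = {"rtmp", "rtmpe", "rtmps", "rtmpt"}


def build_download_command(url, out_path, seconds=10):
    """Return an argv list for the tool suited to the URL's scheme."""
    scheme = urlparse(url).scheme.lower()
    if scheme in ("http", "https"):
        # -O names the output file.
        return ["wget", "-O", out_path, url]
    if scheme in RTMP_SCHEMES:
        # --stop ends the download after the given number of seconds.
        return ["rtmpdump", "-r", url, "-o", out_path,
                "--stop", str(seconds)]
    raise ValueError("no downloader configured for scheme: " + scheme)


def build_mediainfo_command(path):
    """Return an argv list that prints the file's meta-data."""
    return ["mediainfo", path]


if __name__ == "__main__":
    print(build_download_command("rtmp://media.example.com/app/clip", "part.flv"))
    print(build_mediainfo_command("part.flv"))
```

Each list can be handed to `subprocess.run` once the tools are installed; keeping command construction separate from execution makes the selection logic easy to check on its own.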
The ability to easily manage multiple scraping sessions is key because we are currently scraping from around 26 online video portals. To do this we have built a web-based Tomcat controller to coordinate across multiple servers located anywhere in the world. You can manually, or by way of a scheduler, start each scraping session, add additional screen-scraper instances and point to multiple MySQL databases.
Once the data is in a database the video information can be accessed by way of PHP, Java, ColdFusion or a number of other technologies, making Blondstar only a click or two away.