Google Summer of Code 2009 Berkman Center at Harvard University

Media Cloud

by Srivani Narra for Berkman Center at Harvard University

Crawling for media content on the web is a special case of web crawling. The semantics of a feed (RSS, Atom, etc.) allow smarter pagination. Story text can be extracted from a web page in several ways, including word-density analysis (identify the section of the page that contains the highest density of words) and page-layout analysis (identify the section of the page that contains unique content). Pig Latin can be used to provide rich APIs to the users of Media Cloud.
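
The word-density idea can be illustrated with a minimal Python sketch. It is not the project's actual extractor: the names BlockCollector and extract_story_text are hypothetical, the set of block-level tags is an assumption, and "density" is approximated here simply as the number of words in each block-level chunk of the page.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "body"}
# Text inside these tags is never story text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text chunks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text chunks
        self._buffer = []     # pieces of the chunk being built
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush(self):
        # Join the buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def extract_story_text(html):
    """Return the block with the most words, or "" if the page has no text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text after the last block tag
    if not parser.blocks:
        return ""
    return max(parser.blocks, key=lambda block: len(block.split()))


if __name__ == "__main__":
    page = (
        "<html><body><div><a href='/'>Home</a></div>"
        "<div><p>The quick brown fox jumps over the lazy dog "
        "near the river bank.</p></div>"
        "<p>Copyright 2009</p></body></html>"
    )
    print(extract_story_text(page))
```

A real extractor would likely normalize the count (for example, words per line of markup, or words outside links) so that long navigation menus do not win over the story; raw word count is used here only to keep the sketch short.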