GSoC/GCI Archive
Google Code-in 2014 Wikimedia Foundation

Citoid/html-metadata: Scrape metadata from html in your choice of 4 formats

completed by: m4tx

mentors: Andre Klapper, Mvolz

Citoid is a Node.js application (written in JavaScript) that retrieves information about a webpage, book, journal article, etc., given a URL to the webpage or some other identifier, such as a DOI (digital object identifier). Installation instructions and more information are available at https://www.mediawiki.org/wiki/Citoid; however, for the purposes of this project you don't need to install or use Citoid.

We get most of our metadata from another open source project, Zotero's translation-server. However, Citoid also has a native web scraper, lib/scrape.js, which currently has very limited functionality.

To add more functionality to scrape.js (which currently just gets the contents of <title></title> and a few other properties), we'd like to take advantage of several other metadata standards that exist. These are:

OpenGraph (currently supported; don't pick this one!): https://phabricator.wikimedia.org/T1069

HighWire: https://phabricator.wikimedia.org/T76225

Embedded RDF: https://phabricator.wikimedia.org/T7622

COinS: https://phabricator.wikimedia.org/T76223

Dublin Core: https://phabricator.wikimedia.org/T76224
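To give a feel for what two of these formats look like in a page, here is a minimal sketch that collects HighWire (`citation_*`) and Dublin Core (`DC.*`) values from `<meta>` tags. The helper names `scrapeMetaByPrefix` and `readAttribute` are hypothetical and are not part of html-metadata. The sketch uses regular expressions only to stay dependency-free, so it will miss unusual markup; a real implementation would more likely use a proper HTML parser.

```javascript
// Hypothetical helper (not part of html-metadata): read one quoted attribute
// value out of a single tag string, or return null if it is absent.
function readAttribute(tag, attribute) {
  const pattern = new RegExp(
    '\\s' + attribute + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\')', 'i');
  const match = pattern.exec(tag);
  if (!match) return null;
  return match[1] !== undefined ? match[1] : match[2];
}

// Hypothetical helper: collect the content of every <meta> tag whose name
// starts with the given prefix. Values are stored in arrays because a tag
// such as citation_author may legitimately appear more than once.
function scrapeMetaByPrefix(html, prefix) {
  const result = {};
  const tags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of tags) {
    const name = readAttribute(tag, 'name');
    const content = readAttribute(tag, 'content');
    if (name === null || content === null) continue;
    if (!name.toLowerCase().startsWith(prefix.toLowerCase())) continue;
    const key = name.slice(prefix.length);
    (result[key] = result[key] || []).push(content);
  }
  return result;
}

// Made-up sample page for illustration only.
const sampleHtml = `
<html><head>
  <title>Example article</title>
  <meta name="citation_title" content="An Example Article">
  <meta name="citation_author" content="Doe, Jane">
  <meta name="citation_author" content="Roe, Richard">
  <meta name="DC.title" content="An Example Article">
  <meta name="DC.creator" content="Doe, Jane">
</head></html>`;

const highWire = scrapeMetaByPrefix(sampleHtml, 'citation_');
const dublinCore = scrapeMetaByPrefix(sampleHtml, 'DC.');
console.log(highWire);
console.log(dublinCore);
```

Keeping every value in an array is a design choice for this sketch: it avoids silently dropping repeated tags such as multiple authors.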

As such, we're developing a Node.js library that will be able to scrape all of these different types of metadata from HTML: https://github.com/mvolz/html-metadata

You can see in the file https://github.com/mvolz/html-metadata/blob/master/index.js that scrapeCOinS, scrapeHighWire, scrapeEmbeddedRDF, and scrapeDublinCore are not yet implemented.
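As an illustration of the kind of work involved, here is a rough sketch of COinS extraction. COinS embeds a citation in the title attribute of a `<span class="Z3988">` element as an HTML-escaped, URL-encoded list of key/value pairs. The function name `parseCoinsSpans` is hypothetical, and the signature the library expects for scrapeCOinS is not shown here and may differ. The sketch assumes double-quoted attributes and only unescapes `&amp;`.

```javascript
// Hypothetical sketch, not the html-metadata API: find COinS spans in an
// HTML string and decode each title attribute into a key -> values map.
function parseCoinsSpans(html) {
  const spans = html.match(/<span\b[^>]*>/gi) || [];
  const records = [];
  for (const span of spans) {
    const classMatch = /\sclass\s*=\s*"([^"]*)"/i.exec(span);
    const titleMatch = /\stitle\s*=\s*"([^"]*)"/i.exec(span);
    if (!classMatch || !titleMatch) continue;
    // COinS spans are marked with the class Z3988.
    if (!classMatch[1].split(/\s+/).includes('Z3988')) continue;
    // The attribute value is HTML-escaped, so restore "&" before parsing.
    const query = titleMatch[1].replace(/&amp;/g, '&');
    const record = {};
    // URLSearchParams percent-decodes values and turns "+" into a space.
    for (const [key, value] of new URLSearchParams(query)) {
      (record[key] = record[key] || []).push(value);
    }
    records.push(record);
  }
  return records;
}

// Made-up sample markup for illustration only.
const coinsHtml =
  '<span class="Z3988" title="ctx_ver=Z39.88-2004' +
  '&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal' +
  '&amp;rft.atitle=An+Example+Article' +
  '&amp;rft.au=Doe%2C+Jane&amp;rft.au=Roe%2C+Richard"></span>';

const coinsRecords = parseCoinsSpans(coinsHtml);
console.log(JSON.stringify(coinsRecords, null, 2));
```

The sketch returns one record per span, since a page (for example a bibliography) can contain several COinS citations.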

Choose one of the functions to implement, and once you've chosen, leave a comment on this page saying which function you will implement. More details about each data format can be found in the Phabricator link next to each type listed above to help you choose.

(Please be advised that although the general Wikimedia directions tell you to use Gerrit for version control, this is a Node.js library rather than MediaWiki-specific software, and it is not on Gerrit, so you should commit your work on github.com instead. html-metadata is not currently published as a Node module but will be at some point.)