Citoid/html-metadata: Scrape metadata from HTML in your choice of 4 formats
completed by: m4tx
mentors: Andre Klapper, Mvolz
We get most of our metadata from another open source project, Zotero's translation-server. However, we also have a native web scraper in Citoid, lib/scrape.js, which currently has very limited functionality.
To add more functionality to scrape.js (which currently just gets the contents of <title></title> and a few other properties), we'd like to take advantage of several other metadata standards that exist. These are:
OpenGraph (already supported; don't pick this one!): https://phabricator.wikimedia.org/T1069
Embedded RDF: https://phabricator.wikimedia.org/T7622
Dublin Core: https://phabricator.wikimedia.org/T76224
As such, we're developing a Node.js library that will be able to scrape all of these types of metadata from HTML: https://github.com/mvolz/html-metadata
As you can see in the file https://github.com/mvolz/html-metadata/blob/master/index.js, the functions scrapeCOinS, scrapeHighWire, scrapeEmbeddedRDF, and scrapeDublinCore are not implemented.
Choose one of these functions to implement, and comment on this page to say which one you've chosen. More details about each data format can be found in the Phabricator link next to the format listed above, to help you choose.
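To give a feel for what implementing one of these functions might involve, here is a minimal, dependency-free sketch of a Dublin Core scraper. It assumes the common convention of embedding Dublin Core in <meta> tags whose name attribute starts with "DC." or "DCTERMS.". The function name scrapeDublinCoreSketch, the helper parseAttributes, the regular-expression parsing, and the shape of the returned object are all assumptions made for this example; they are not the signature or approach used in html-metadata's index.js, and a real implementation would more likely use a proper HTML parser.

```javascript
'use strict';

// Hypothetical helper (not part of html-metadata): extract the attributes of a
// single tag string such as '<meta name="DC.title" content="Example">'.
function parseAttributes(tag) {
  const attrs = {};
  // Matches name="value", name='value', or name=value.
  const attrPattern = /([^\s=\/<>"']+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/g;
  let match;
  while ((match = attrPattern.exec(tag)) !== null) {
    let value = match[4];
    if (match[2] !== undefined) {
      value = match[2];
    } else if (match[3] !== undefined) {
      value = match[3];
    }
    attrs[match[1].toLowerCase()] = value;
  }
  return attrs;
}

// Hypothetical sketch: collect Dublin Core properties from <meta> tags.
// Returns an object mapping each property name (prefix removed) to an array
// of values, because a property such as "creator" may appear more than once.
// Assumption: regex matching is good enough for a sketch; it will mis-handle
// attribute values that contain a '>' character.
function scrapeDublinCoreSketch(html) {
  const values = new Map();
  const metaPattern = /<meta\b[^>]*>/gi;
  let tag;
  while ((tag = metaPattern.exec(html)) !== null) {
    const attrs = parseAttributes(tag[0]);
    const prefixMatch = /^(?:dc|dcterms)\.(.+)$/i.exec(attrs.name || '');
    if (!prefixMatch || attrs.content === undefined) {
      continue; // Not a Dublin Core tag, or it has no content attribute.
    }
    const key = prefixMatch[1];
    if (!values.has(key)) {
      values.set(key, []);
    }
    values.get(key).push(attrs.content);
  }
  return Object.fromEntries(values);
}

// Usage example with a made-up HTML fragment.
const sampleHtml = [
  '<head>',
  '<meta name="DC.title" content="An Example Page">',
  '<meta name="DC.creator" content="First Author">',
  '<meta name="DC.creator" content="Second Author">',
  '</head>'
].join('\n');
console.log(scrapeDublinCoreSketch(sampleHtml));
```

Returning arrays keeps repeated properties (multiple creators, for instance) instead of silently overwriting them; whether the real library should do the same is a design decision for whoever implements the function.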
(Please be advised that although the general Wikimedia directions tell you to use Gerrit for version control, this is a Node.js library rather than MediaWiki-specific software, and it is not on Gerrit, so you should commit your work on github.com. html-metadata is not currently published as a Node module but will be at some point.)