Google Code-in 2014 Wikimedia Foundation

Citoid: Create dataset in JSON of different possibilities for user-entered IDs to use in testing

completed by: Anish V.

mentors: Mvolz

Citoid is a Node.js application, written in JavaScript, that retrieves information about a webpage, book, journal article, etc., given a URL or another identifier such as a DOI (digital object identifier). It relies on another open-source project, Zotero's translation-server, also written in JavaScript, for much of the work. Installation instructions and more information are available at https://www.mediawiki.org/wiki/Citoid

You don't need to get citoid running to do this work, but you should use Gerrit to add your file to the /test_files directory in the citoid repository using git; see https://www.mediawiki.org/wiki/Citoid#Get_the_code

Users may enter a URL, ISBN, ISSN, PMC, PMID, MID, DOI, or possibly other IDs not listed here to identify a magazine, article, book, or webpage. We need an array of possibilities to make sure that a) we correctly identify which kind of ID it is and b) we correctly extract the ID.
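To make the two goals concrete, here is a minimal sketch of the kind of check the dataset is meant to exercise. The function name identify and every pattern in it are assumptions made for illustration; they are not citoid's actual code and they are deliberately simplified.

```javascript
// Hypothetical helper (not part of citoid): guess which kind of ID the user
// entered and extract a normalised value. All patterns are simplified assumptions.
function identify(input) {
  const text = input.trim();
  let m;

  // DOI: "10." + 4-9 digits + "/" + suffix, possibly behind "doi:" or a resolver URL.
  // Checked before URL so that a DOI resolver link is reported as a DOI.
  if ((m = text.match(/\b(10\.\d{4,9}\/\S+)/))) {
    return { type: 'DOI', id: m[1] };
  }
  // PMC: the letters "PMC" followed by digits, any capitalisation, optional space.
  if ((m = text.match(/^PMC\s*(\d+)$/i))) {
    return { type: 'PMC', id: 'PMC' + m[1] };
  }
  // PMID: only recognised here when the "PMID" prefix is present.
  if ((m = text.match(/^PMID:?\s*(\d{1,8})$/i))) {
    return { type: 'PMID', id: m[1] };
  }
  // ISSN: eight characters, optional hyphen in the middle, last may be X.
  if ((m = text.match(/^(?:ISSN:?\s*)?(\d{4})-?(\d{3}[\dX])$/i))) {
    return { type: 'ISSN', id: m[1] + '-' + m[2].toUpperCase() };
  }
  // ISBN: 10 or 13 characters once hyphens and spaces are stripped.
  if ((m = text.match(/^(?:ISBN(?:-1[03])?:?\s*)?([\d\s-]{9,16}[\dX])$/i))) {
    const compact = m[1].replace(/[\s-]/g, '').toUpperCase();
    if (/^(?:\d{9}[\dX]|\d{13})$/.test(compact)) {
      return { type: 'ISBN', id: compact };
    }
  }
  if (/^https?:\/\//i.test(text)) {
    return { type: 'URL', id: text };
  }
  return { type: 'unknown', id: null };
}

console.log(identify('doi: 10.1000/182'));
console.log(identify(' pmc 1234567 '));
console.log(identify('ISBN 978-0-306-40615-7'));
```

This sketch checks shape only, not check digits, and it reports a bare eight-digit number as an ISSN even though it could equally be a PMID. Ambiguous inputs like that are good candidates for the dataset.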

Create a JSON file containing examples of the following IDs (ISBN, ISSN, PMC, PMID, DOI) as users might enter them, with different capitalisation, spacing, etc. (true positives). Think about where a user might be copying them from (e.g. amazon.com or sciencedirect.com) or how they might type them in manually (e.g. from a book). Also make an array of identifiers that look like your chosen ID but are not valid (true negatives). You can put each ID type in a separate file or put them all into one JSON object.
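One possible shape for the "all in one JSON object" option is sketched below. The key names truePositives and trueNegatives, the helper isValidIsbn13, and the overall layout are assumptions for illustration, not a required format. The PMC and PMID numbers are placeholders chosen for their format only. The helper applies the standard ISBN-13 check-digit rule to show why one of the negatives is invalid even though it looks right.

```javascript
// Hypothetical layout: one object keyed by ID type, each holding an array of
// inputs that should be accepted and an array of look-alikes that should not.
const dataset = {
  ISBN: {
    truePositives: ['978-0-306-40615-7', 'ISBN 9780306406157', 'isbn: 0-306-40615-2', ' 0306406152 '],
    trueNegatives: ['978-0-306-40615-8', '978-0-306-4061', 'ISBN 97803064061577']
  },
  ISSN: {
    truePositives: ['0028-0836', 'ISSN 0028-0836', 'issn:00280836'],
    trueNegatives: ['0028-0837', '0028-083', '0028_0836']
  },
  DOI: {
    truePositives: ['10.1000/182', 'doi:10.1000/182', 'DOI: 10.1000/182', 'http://dx.doi.org/10.1000/182'],
    trueNegatives: ['10.1000', '11.1000/182', 'doi:182']
  },
  PMC: {
    // Placeholder numbers, format only.
    truePositives: ['PMC1234567', 'pmc1234567', 'PMC 1234567'],
    trueNegatives: ['PMC', 'PCM1234567', 'PMC12a4567']
  },
  PMID: {
    // Placeholder numbers, format only.
    truePositives: ['PMID: 12345678', 'pmid 12345678', 'PMID:12345678'],
    trueNegatives: ['PMID:', 'PMID: 12a45678', 'PMID: 123456789012']
  }
};

// ISBN-13 check: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10.
function isValidIsbn13(raw) {
  const digits = raw.replace(/[\s-]/g, '');
  if (!/^\d{13}$/.test(digits)) {
    return false;
  }
  const sum = digits
    .split('')
    .reduce((acc, d, i) => acc + Number(d) * (i % 2 === 0 ? 1 : 3), 0);
  return sum % 10 === 0;
}

// Serialise to the text that would be saved as a .json file.
const json = JSON.stringify(dataset, null, 2);
console.log(json);
```

Keeping positives and negatives side by side under each ID type makes it easy for a test to loop over both arrays. Separate files per ID type would work just as well.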

Students are required to read Wikimedia's general instructions at https://www.mediawiki.org/wiki/Google_Code-in_2014#Instructions_for_GCI_students first.