GSoC/GCI Archive
Google Code-in 2014 Wikimedia Foundation

Citoid: Create dataset in JSON of different possibilities for user-entered IDs to use in testing

completed by: Anish V.

mentors: Mvolz

Citoid is a Node.js application (written in JavaScript) that retrieves information about a webpage, book, journal article, etc., given a URL to the webpage or some other identifier, such as a DOI (digital object identifier). It uses another open-source project, Zotero's translation-server, also written in JavaScript, to do much of the work. There are installation instructions and more information available at

You don't need to get citoid running in order to do this work, but you should use Gerrit to add your file to the /test_files directory in the citoid repository using Git; see

Users may try entering a URL, ISBN, ISSN, PMC, PMID, MID, DOI, and possibly other IDs not listed here to identify a magazine, article, book, or webpage. We need an array of possibilities to make sure that a) we correctly identify which kind of ID it is and b) we correctly extract the ID.
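To make the two goals concrete (identifying the kind of ID, then extracting it), here is a minimal Python sketch. It is not citoid's implementation: the function names, the regular expressions, and the requirement that a PMID carries a "PMID" prefix are all assumptions made for illustration, and the patterns only check the shape of an ID, not its check digit.

```python
import re

# Hypothetical matchers for illustration only; not citoid's actual rules.
# Each returns a normalised ID string, or None if the text does not match.

def match_doi(text):
    m = re.fullmatch(
        r"(?:doi:\s*|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)", text, re.I)
    return m.group(1) if m else None

def match_pmc(text):
    m = re.fullmatch(r"PMC\s*(\d+)", text, re.I)
    return "PMC" + m.group(1) if m else None

def match_pmid(text):
    # Assumption: a bare number is too ambiguous, so a "PMID" prefix is required.
    m = re.fullmatch(r"PMID:?\s*(\d{1,8})", text, re.I)
    return m.group(1) if m else None

def match_isbn(text):
    m = re.fullmatch(r"(?:ISBN(?:-1[03])?:?\s*)?([\dX\s-]+)", text, re.I)
    if not m:
        return None
    # Drop hyphens and spaces, then accept only ISBN-10 or ISBN-13 shapes.
    digits = re.sub(r"[\s-]", "", m.group(1)).upper()
    return digits if re.fullmatch(r"\d{9}[\dX]|\d{13}", digits) else None

def match_issn(text):
    m = re.fullmatch(r"(?:ISSN:?\s*)?(\d{4})-?(\d{3}[\dX])", text, re.I)
    return f"{m.group(1)}-{m.group(2).upper()}" if m else None

# Order matters: more specific shapes are tried first.
MATCHERS = [
    ("DOI", match_doi),
    ("PMC", match_pmc),
    ("PMID", match_pmid),
    ("ISBN", match_isbn),
    ("ISSN", match_issn),
]

def identify(raw):
    """Return (kind, normalised_id) for user-entered text, or None."""
    text = raw.strip()
    for kind, matcher in MATCHERS:
        value = matcher(text)
        if value is not None:
            return kind, value
    return None
```

A test dataset is useful precisely because a sketch like this has edge cases: for example, an unhyphenated 8-digit number could be read as either an ISSN or a PMID, and only a collection of realistic inputs will show which way the code decides.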

Create a JSON file containing examples of the following IDs (ISBN, ISSN, PMC, PMID, DOI) as users might enter them, e.g. with different capitalisation and spacing (true positives). You might try thinking about where a user might be copying them from, or how they might be typing them in manually (e.g. from a book). Also make an array of identifiers that look like your chosen ID but are not valid (true negatives). You might want each ID in a separate file, or you might choose to put them all into one JSON object.
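One possible layout for such a file, sketched in Python, is shown here. The key names ("positives", "negatives"), the helper functions, and the choice of a single combined object are assumptions rather than a required format, and the PMC and PMID numbers are made-up placeholders rather than real records.

```python
import json

def variants(prefix, value):
    """Return a few ways a user might type the same identifier."""
    return [
        value,
        f"{prefix}:{value}",
        f"{prefix} {value}",
        f"{prefix.lower()}: {value}",
        f"  {prefix}: {value}  ",
    ]

def build_dataset():
    """Build one object holding positives and negatives for each ID type."""
    return {
        "ISBN": {
            "positives": variants("ISBN", "978-0-306-40615-7") + ["9780306406157"],
            # Wrong length, or a check character in the wrong place.
            "negatives": ["978-0-306-40615", "ISBN 12345", "978-0-306-4061X-7"],
        },
        "ISSN": {
            "positives": variants("ISSN", "0317-8471") + ["03178471"],
            "negatives": ["0317-847", "ISSN 0317-84711"],
        },
        "PMC": {
            # Placeholder number, not a real article.
            "positives": ["PMC1234567", "pmc1234567", "PMC 1234567"],
            "negatives": ["PMC", "PMCX1234567"],
        },
        "PMID": {
            # Placeholder number, not a real article.
            "positives": variants("PMID", "12345678"),
            "negatives": ["PMID: 12a45", "PMID:"],
        },
        "DOI": {
            "positives": variants("DOI", "10.1000/182")
            + ["http://dx.doi.org/10.1000/182"],
            # Missing suffix, or a prefix that does not start with "10.".
            "negatives": ["10.1000", "11.1000/182"],
        },
    }

if __name__ == "__main__":
    # Print the dataset; redirect the output to a .json file if desired.
    print(json.dumps(build_dataset(), indent=2))
```

Keeping everything in one object makes it easy to loop over all ID types in a single test, while separate files per ID would keep each list shorter; either choice fits the task as described.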

Students are required to read Wikimedia's general instructions first.