GSoC/GCI Archive
Google Code-in 2012 Apertium

script to generate list of apertium pairs

completed by: conor-f

mentors: Francis Tyers, Jonathan

Write a script, preferably in Python 3, that generates a list of the Apertium language pairs that are in SVN.

The script should query SVN directly and store everything in one big dictionary of dictionaries, which can then be iterated over for various purposes. For each language pair it should record the following information:

  • the ISO code of each of the two languages involved;
  • whether in trunk, staging, nursery, or incubator;
  • when the language pair was started;
  • the directionality of the pair (this may be difficult and could be broken off into another task); and
  • the number of stems (this may be difficult and could be broken off into another task).
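The first two items in the list can be sketched roughly as follows. This is only a minimal sketch: it assumes that pair directories are named apertium-XX-YY, that running `svn list` once per section is enough to find them, and that the repository URL is supplied by the caller. The names SECTIONS, PAIR_RE, parse_listing and list_pairs are hypothetical, and the start date, directionality and stem count are not handled here.

```python
import re
import subprocess

SECTIONS = ("trunk", "staging", "nursery", "incubator")

# Assumption: pair directories appear as "apertium-af-nl/" in `svn list` output.
PAIR_RE = re.compile(r"^apertium-([a-z]{2,3})-([a-z]{2,3})/?$")


def parse_listing(listing):
    """Turn the text output of `svn list` into a list of pair dictionaries."""
    pairs = []
    for line in listing.splitlines():
        match = PAIR_RE.match(line.strip())
        if match:  # entries that are not language pairs are skipped
            pairs.append({"from": match.group(1), "to": match.group(2)})
    return pairs


def list_pairs(repo_url):
    """Run `svn list` on each section under repo_url (supplied by the caller)."""
    result = {}
    for section in SECTIONS:
        output = subprocess.run(
            ["svn", "list", f"{repo_url}/{section}"],
            capture_output=True, text=True, check=True,
        ).stdout
        result[section] = parse_listing(output)
    return result
```

Keeping the parsing separate from the svn call means the parser can be checked against a saved listing without network access.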

An example of the output would be something like the following dictionary (this is just a few examples, and some of the data is probably wrong; e.g., the stem counts are guesses):

from datetime import date

allLanguages = {
    # each section maps to a list of language-pair dictionaries
    "trunk": [
        {"from": "af", "to": "nl", "direction": "<>", "updated": date(2012, 10, 10), "stems": 8500},
        {"from": "bf", "to": "fr", "direction": ">", "updated": date(2012, 12, 4), "stems": 8500},
        ...
    ],
    "staging": [
        {"from": "ca", "to": "sc", "direction": "<>", "updated": date(2011, 8, 30), "stems": 8500},
        ...
    ],
    "nursery": [
        ...
    ],
}

etc.
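As an illustration of iterating over such a structure, the sketch below collects every pair marked as bidirectional. It assumes each section maps to a list of pair dictionaries with the keys used in the example; the name bidirectional_pairs and the small `sample` structure are hypothetical, and the values are the same placeholder guesses as above.

```python
from datetime import date

# Tiny stand-in for the generated structure; the values are placeholders.
sample = {
    "trunk": [
        {"from": "af", "to": "nl", "direction": "<>", "updated": date(2012, 10, 10), "stems": 8500},
        {"from": "bf", "to": "fr", "direction": ">", "updated": date(2012, 12, 4), "stems": 8500},
    ],
    "staging": [
        {"from": "ca", "to": "sc", "direction": "<>", "updated": date(2011, 8, 30), "stems": 8500},
    ],
}


def bidirectional_pairs(languages):
    """Return (section, "from-to") for every pair whose direction is "<>"."""
    found = []
    for section, pairs in languages.items():
        for pair in pairs:
            if pair["direction"] == "<>":
                found.append((section, f'{pair["from"]}-{pair["to"]}'))
    return found
```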

There should also be an option to normalise all language codes to a single standard, either ISO 639-2 or ISO 639-3. That is, it should be possible to set a parameter so that every 639-2 code encountered is converted to 639-3, or vice versa.
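One possible shape for this option is sketched below. It assumes the codes to convert are ISO 639-2 bibliographic codes, and it uses a tiny illustrative table rather than the full standard; a real script would need a complete mapping. The names B_TO_3 and convert_codes and the `target` parameter are hypothetical. Codes missing from the table are left unchanged.

```python
# Illustrative excerpt only: ISO 639-2/B codes that differ from ISO 639-3.
B_TO_3 = {"fre": "fra", "ger": "deu", "dut": "nld", "wel": "cym", "baq": "eus"}


def convert_codes(pairs, target="639-3", table=B_TO_3):
    """Return copies of the pair dictionaries with codes converted to `target`."""
    if target == "639-3":
        mapping = table
    elif target == "639-2":
        # invert the table to go from 639-3 back to 639-2/B
        mapping = {three: two for two, three in table.items()}
    else:
        raise ValueError(f"unknown target standard: {target}")

    converted = []
    for pair in pairs:
        new_pair = dict(pair)  # do not modify the caller's dictionaries
        for key in ("from", "to"):
            new_pair[key] = mapping.get(pair[key], pair[key])
        converted.append(new_pair)
    return converted
```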