GSoC/GCI Archive
Google Summer of Code 2011

Apertium

Web Page: http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code

Mailing List: mailto:apertium-stuff@lists.sourceforge.net

The Apertium project develops a free/open-source platform for machine translation and language technology. We try to focus our efforts on lesser-resourced and marginalised languages, but also work with larger languages. The platform, including data for a large number of language pairs, a translation engine and auxiliary tools, is being developed around the world, largely in universities and companies (e.g. Prompsit Language Engineering), though independent free-software developers also play a huge role. There are currently 27 published language pairs within the project (including a number of "firsts", for example Spanish–Occitan, Breton–French, and Basque–Spanish, among others), and several more in development.

Projects

  • Adopting New Language Pair : Bengali - English The goal of this project is to bring the existing Bengali - English language pair up to release quality. Specifically, this means completing both the monolingual and bilingual dictionaries and writing the necessary transfer rules. Finally, testvoc is also necessary to ensure correctness. This project would be the continuation of the Apertium 2009 project on "conversion of Anubadok to apertium platform".
  • Apertium-sl-es: machine translation between Slovene and Spanish Currently Apertium does not have a release-quality translation system for the Slovenian and Spanish language pair. I will expand Apertium's functionality by adding one.
  • Apertium-tr-az: machine translation between Turkish and Azerbaijani Apertium tr-az is a new language pair using the Apertium platform to translate from Turkish to Azerbaijani. The project will make use of a morphological analyzer for Turkish (already available as TRmorph), a morphological analyzer for Azerbaijani (which I am developing starting from TRmorph) and a set of rules that I am implementing.
  • Apertium-tr-ky: New Turkish-Kyrgyz language pair. A new Apertium Turkish-Kyrgyz language pair is going to be developed. As part of this project, a Turkish-Kyrgyz bilingual dictionary will be extracted from a StarDict dictionary. A morphological analyzer/generator for the Kyrgyz language will be developed from scratch.
  • Implementation of a new language pair apertium-sh-mk I plan to reimplement the sh half of the apertium-sh-mk language pair, the bilingual dictionary and transfer rules towards Macedonian. There is some previous work done on the SC part, including some handy methods for handling the difference between the standards. However, some linguistic paradigms are missing, and the documentation on the entire implementation is quite scarce. To be more productive, I will reimplement the entire dictionary from scratch, using the old code as a rough guide.
  • Improvements to postedition interface Improve Apertium's web environment for pre- and post-editing translations, in which the user has a range of tools available to modify the text before sending it to Apertium and after getting its translation.
  • New Maltese-Hebrew language pair A new Apertium Maltese-Hebrew language pair, providing unidirectional translation from Maltese to Hebrew.
  • Quality control framework Implementation of a quality control framework using Python, with modules for: statistics, graphing, regression testing, corpus generation, corpus testing, coverage testing and average ambiguity.
  • VM for the transfer module The objective of this project is to compile the transfer rule files into a pseudo-assembly language defined for this task and to write a lightweight interpreter for the generated pseudo-assembly. The scope of the project is therefore to build three main components: the instruction set, the compiler for the transfer files and the interpreter for the generated pseudo-assembly.
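
The three components named in the "VM for the transfer module" project (instruction set, compiler, interpreter) can be illustrated with a toy example. This is only a minimal sketch and not the project's actual design: the opcodes PUSH, LOAD, CONCAT and OUT, the rule format and the function names compile_rule and run are all hypothetical, and real Apertium transfer rule files are XML and far richer than the toy rules used here.

```python
# Toy illustration only: the opcodes, the rule format and the function names
# are assumptions made for this sketch, not part of Apertium.

def compile_rule(rule):
    """Compile a toy rule into a list of (opcode, argument) instructions.

    A rule is a list of output parts, each either ("lit", text) for literal
    text or ("word", index) for the input word at that position.
    """
    code = []
    for kind, value in rule:
        if kind == "lit":
            code.append(("PUSH", value))   # push literal text
        elif kind == "word":
            code.append(("LOAD", value))   # push the input word at this index
        else:
            raise ValueError(f"unknown rule part: {kind!r}")
    code.append(("CONCAT", len(rule)))     # join the parts pushed above
    code.append(("OUT", None))             # emit the joined string
    return code


def run(code, words):
    """Interpret the instructions on a small stack machine."""
    stack, output = [], []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(words[arg])
        elif op == "CONCAT":
            start = len(stack) - arg
            parts = stack[start:]
            del stack[start:]
            stack.append("".join(parts))
        elif op == "OUT":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown opcode: {op!r}")
    return "".join(output)


# A toy reordering rule: emit the second word, a space, then the first word.
swap_rule = [("word", 1), ("lit", " "), ("word", 0)]
program = compile_rule(swap_rule)
print(run(program, ["red", "house"]))
```

Keeping the compiler and the interpreter as separate functions mirrors the split described in the project: rules are translated once into a flat instruction list, and the interpreter only has to loop over that list.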