GSoC/GCI Archive
Google Summer of Code 2014

Apertium

License: GNU General Public License (GPL)

Web Page: http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code

Mailing List: apertium-stuff@lists.sourceforge.net

The Apertium project develops a free/open-source platform for machine translation and language technology. We try to focus our efforts on lesser-resourced and marginalised languages, but also work with more widely-spoken languages. The platform, including data for a large number of language pairs, a translation engine and auxiliary tools, is being developed around the world, largely in universities and companies (e.g. Prompsit Language Engineering), but independent free-software developers also play a huge role. There are currently 33 published language pairs within the project (including a number of "firsts", for example Aragonese-Spanish, Spanish-Occitan, Breton-French, and Basque-Spanish, among others), and several more in development. Apertium has a special focus on lowering the barrier to the creation of linguistic resources for any language, ideally to be used for MT, but also reusable for other purposes (e.g. grammar checking, morphological analysis, PoS tagging, etc.).

Projects

  • Adopt the Urdu-Hindi Language Pair The idea is to develop a machine translation system for the Urdu-Hindi language pair using the Apertium framework, by writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
  • Adopting an unreleased English-Kazakh language pair Nowadays, machine translation is a very common and fast tool, which we use for understanding foreign languages. My idea is focused on developing English-Kazakh machine translation on Apertium, a free/open-source machine translation platform. I plan to continue working on this project and to reach a good WER (word error rate) and coverage. I believe that my work will be useful for society and for other developments.
  • Adopting an unreleased language pair -- Serbo-Croatian<->English The goal of this project is to adopt the Serbo-Croatian<->English pair from the incubator and make it a working language pair.
  • Adopting an unreleased language pair of Kazakh <-> Karakalpak languages. The project aims to complete a bidirectional machine translation system between the Kazakh and Karakalpak languages on the Apertium platform.
  • Apertium on Pidgin & XChat The objective is to develop a way to interface Apertium with the chat clients Pidgin and XChat (IRC), so that users of those clients can translate both their own messages before sending them and the messages they receive.
  • Apertium-tat-rus – machine translation system from Tatar to Russian The goal of this project is to develop a Tatar-to-Russian machine translator.
  • Assimilation evaluation toolkit for Apertium language pairs Assimilation (understanding the gist) of texts is believed to be the main application of online machine translation. Language pairs may be evaluated to see how well they perform in this task. The proposed toolkit will automatically generate tasks for human evaluation given parallel texts and a corresponding language pair. The toolkit will include different types of tasks and will be available through different interfaces.
  • Bring a Hindi-English language pair up to state-of-the-art quality The project aims at drastically improving the performance of the Hindi-English language pair, in terms of both coverage and translation quality. Throughout the project I will be working with dictionaries, transfer grammar rules, corpora, etc. The objective is to make this language pair state-of-the-art and release it. Currently the Hindi-English MT system is in the nursery; I plan to make this language pair ready for release by the end of this project.
  • Bringing tur-kir, kaz-kir, and tur-uzb pairs out of nursery The goal of this project is to bring three Turkic-Turkic translation systems (Turkish-Kyrgyz, Kazakh-Kyrgyz, and Turkish-Uzbek) to production quality. These pairs are currently in Apertium's nursery, and need varying amounts of attention in order to function as intended.
  • Complex multiwords A project to enhance the current complex multiwords compiler. It will improve the way we deal with multiply inflected multiwords.
  • Fuzzy-match repair from Translation Memory For a given sentence S in a source language and its translation T in another language, the idea is to find the translation of another sentence S'. The condition is that S and S' must have a high fuzzy-match score (or low edit distance) between them. Then, depending on what changes from S to S', we apply a set of repair operations to T to get T'.
  • Improving support for non-standard text input Non-standard language use is increasingly common on platforms such as IRC, Twitter, YouTube and forums. We want to translate this data into the desired language and so spread the message. Sadly, even a small inconsistency in the data can make machine translation go wrong in various ways. This project aims to convert such data into standard text that can be translated accurately by Apertium's MT systems.
  • Make the English-Esperanto pair state-of-the-art The aim of this project is to enhance the translation quality of the English-Esperanto language pair, especially in the EN>EO direction.
  • Malayalam English Language pair An English-Malayalam language pair for the Apertium machine translator.
  • Optimise the VM for transfer The project aims to speed up the VM for transfer code (the transfer step being the slowest of the translation pipeline at the moment).
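
Two of the abstracts above rely on edit distance: the fuzzy-match repair project compares S and S' by fuzzy-match score, and the English-Kazakh project reports WER. The following is a minimal sketch of how those quantities are commonly computed at word level. It is not Apertium code: the function names are hypothetical, and the normalisation used for the fuzzy-match score (one minus the edit distance divided by the longer sentence length) is an assumption, since the abstracts do not define it.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(s, s_prime):
    """Similarity in [0, 1]; the normalisation is an assumed convention."""
    a, b = s.split(), s_prime.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def word_error_rate(reference, hypothesis):
    """Edit distance to the reference, divided by the reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


if __name__ == "__main__":
    s = "the cat sat on the mat"
    s_prime = "the dog sat on the mat"
    print(fuzzy_match_score(s, s_prime))
    print(word_error_rate("a b c d", "a x c"))
```

In a fuzzy-match repair setting, a sentence S' would only be considered for repair when its score against a stored S exceeds some threshold; the threshold value is a design choice that the abstract does not specify.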