GSoC/GCI Archive
Google Code-in 2013 Apertium

Write a dictionary-based tokeniser for Asian languages (Chinese) [2]

completed by: Zinc Sulfide

mentors: Mikel L. Forcada, Francis Tyers

The objective of this task is to write a tokeniser for Chinese. A tokeniser takes a sentence and splits it into words. One of the challenges of building a tokeniser for Chinese is that spaces are not used to separate words, so word boundaries have to be inferred, here with the help of a dictionary. The tokeniser will have some generic code for reading input and writing output, and will use one or more algorithms to determine how to segment the sentence. Read more about this task here: http://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
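
To make the idea of dictionary-based segmentation concrete, here is a minimal sketch in Python of one common approach, greedy left-to-right longest match (often called forward maximum matching). It is not the code submitted for this task, and the task description does not say which algorithm was used. The function name `tokenise`, the `max_len` parameter and the tiny `TOY_DICTIONARY` are assumptions made only for this illustration.

```python
def tokenise(sentence, dictionary, max_len=None):
    """Segment `sentence` by greedy left-to-right longest match.

    `dictionary` is a set of known words. Characters that do not start
    any dictionary word are emitted as single-character tokens.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)

    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"我", "喜欢", "学习", "学", "中文"}

if __name__ == "__main__":
    print(" ".join(tokenise("我喜欢学习中文", TOY_DICTIONARY)))
    # prints: 我 喜欢 学习 中文
```

Greedy matching is simple and fast, but it can pick the wrong segmentation when a long dictionary word overlaps the start of the next word, which is one reason a tokeniser might combine several algorithms.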