GSoC/GCI Archive
Google Code-in 2010 The Apertium project

Convert Java code for decomposing compound words into C++

completed by: Kristaba

mentors: Francis Tyers, Kevin Brubeck Unhammer

The task is to port the implementation of decompounding in lttoolbox-java to C++. The Java source is

 https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox-java/src/org/apertium/lttoolbox/process/FSTProcessor.java

and the method to port is: public String compoundAnalysis2(String input_word)

 

The corresponding C++ file is

 https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox/lttoolbox/fst_processor.cc

Your port of compoundAnalysis2 should replace the deprecated method wstring FSTProcessor::decompose(wstring w).

 

Decompounding means splitting an unknown word into parts, each of which could be a known word on its own; see http://wiki.apertium.org/wiki/Compounding. So that we don't try to build compounds out of just anything, a word is only considered as a possible compound part if it carries a certain tag, either compoundOnlyLSymbol or compoundRSymbol.
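As a rough illustration of the idea (not the lttoolbox algorithm, which walks a finite-state transducer), the sketch below splits a word against a small in-memory lexicon. Everything in it is an assumption made for the example: the names PartInfo, Lexicon and decompound_sketch are hypothetical, the two boolean flags stand in for the compoundOnlyLSymbol and compoundRSymbol tags, and it works on plain std::string bytes rather than wstring. A non-final part must carry either tag, and the final part must carry the R tag.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the tags on a lexicon entry.
struct PartInfo {
  bool only_l;  // entry carries the "compound only L" tag (assumed meaning)
  bool r;       // entry carries the "compound R" tag (assumed meaning)
};

using Lexicon = std::map<std::string, PartInfo>;

// Collect every split of word[pos..] into lexicon entries, depth-first.
static void split_from(const std::string& word, std::size_t pos,
                       const Lexicon& lex, std::size_t max_parts,
                       std::vector<std::string>& current,
                       std::vector<std::vector<std::string>>& out) {
  if (current.size() >= max_parts) return;  // no room for another part
  for (std::size_t len = 1; pos + len <= word.size(); ++len) {
    auto it = lex.find(word.substr(pos, len));
    if (it == lex.end()) continue;
    const bool is_last = (pos + len == word.size());
    if (is_last) {
      // The final part must carry the R tag, and a compound needs
      // at least one part before it.
      if (it->second.r && !current.empty()) {
        current.push_back(it->first);
        out.push_back(current);
        current.pop_back();
      }
    } else if (it->second.only_l || it->second.r) {
      // A non-final part may carry either tag.
      current.push_back(it->first);
      split_from(word, pos + len, lex, max_parts, current, out);
      current.pop_back();
    }
  }
}

// Return all candidate splits of `word` with at most `max_parts` parts.
std::vector<std::vector<std::string>> decompound_sketch(
    const std::string& word, const Lexicon& lex, std::size_t max_parts) {
  std::vector<std::vector<std::string>> out;
  std::vector<std::string> current;
  split_from(word, 0, lex, max_parts, current, out);
  return out;
}
```

With a lexicon in which "book" and "shop" both carry the R flag, decompound_sketch("bookshop", lex, 4) yields the single split book + shop, while "shop" on its own yields nothing because a compound needs at least two parts.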

The method pruneCompounds ensures that compounds have no more than compound_max_elements parts and always end in a part which contains compoundRSymbol.
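The two conditions can be sketched as a simple filter over candidate analyses. This is only an approximation for discussion with your mentor, not the Java method itself: the name prune_compounds_sketch, the Analysis type and the r_marker parameter are hypothetical, each part is assumed to be a string that contains its tags as text, and the real method may apply further rules that are not described here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One candidate analysis: the compound parts, left to right (assumed layout).
using Analysis = std::vector<std::string>;

// Keep only analyses with at most `max_elements` parts whose last part
// contains `r_marker`, a stand-in for the compoundRSymbol tag.
std::vector<Analysis> prune_compounds_sketch(std::vector<Analysis> candidates,
                                             const std::string& r_marker,
                                             std::size_t max_elements) {
  auto rejected = [&](const Analysis& a) {
    if (a.empty() || a.size() > max_elements) return true;  // too many parts
    return a.back().find(r_marker) == std::string::npos;    // bad final part
  };
  candidates.erase(
      std::remove_if(candidates.begin(), candidates.end(), rejected),
      candidates.end());
  return candidates;
}
```

The candidates are taken by value so the caller's list is left untouched, and the kept analyses stay in their original order.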

 

You will need to work with your mentor to make sure the port maintains equivalent functionality. There is a set of tests at http://apertium.codepad.org/aB0kcLMO