GSoC/GCI Archive
Google Code-in 2013 Apertium

Support non-ASCII characters in flex lexers

completed by: Dalimil Hájek

mentors: Mikel L. Forcada, Francis Tyers, Kirill Krylov

Currently, flex lexers generated by the 'create-lexer.py' script[1] do not support non-ASCII characters. The objective of this task is to adjust the regular expressions so that they do.

Some ideas may be gleaned from the regular expressions in the Apertium code:

  attr_items[L"lem"] = L"(([^<]|\"\\<\")+)";
  attr_items[L"lemq"] = L"\\#[- _][^<]+";
  attr_items[L"lemh"] = L"(([^<#]|\"\\<\"|\"\\#\")+)";
  attr_items[L"whole"] = L"(.+)";
  attr_items[L"tags"] = L"((<[^>]+>)+)";
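
As a rough illustration of why these expressions are a useful model, the patterns above can be tried out in Python. This is only a sketch: the patterns are transcribed by hand with the C++ string escaping removed, the `ATTR_ITEMS` dictionary, the `extract` helper and the sample lexical unit are made up for this example, and Python's `re` module works on Unicode code points whereas flex scans bytes, so behaviour in a generated lexer may differ in detail. The point is that negated classes such as `[^<]` accept any character except the delimiter, while an ASCII letter range stops at the first accented letter.

```python
import re

# Patterns transcribed from the Apertium snippet above (C++ escaping removed).
# The dictionary and helper names are hypothetical, chosen for this sketch.
ATTR_ITEMS = {
    "lem": r'(([^<]|"\<")+)',
    "lemq": r'\#[- _][^<]+',
    "lemh": r'(([^<#]|"\<"|"\#")+)',
    "whole": r'(.+)',
    "tags": r'((<[^>]+>)+)',
}


def extract(attr, lexical_unit):
    """Return the first match of the named pattern in lexical_unit, or None."""
    match = re.search(ATTR_ITEMS[attr], lexical_unit)
    return match.group(0) if match else None


# Sample input invented for the example: a lemma with a space and a non-ASCII letter.
unit = "Dalimil Hájek<np><ant><m><sg>"

lemma = extract("lem", unit)   # negated class: keeps the space and the accented letter
tags = extract("tags", unit)   # the run of <...> tags

# An ASCII-only letter class, by contrast, stops at the first non-ASCII character.
ascii_only = re.match(r"[A-Za-z ]+", unit).group(0)

print(lemma)       # Dalimil Hájek
print(tags)        # <np><ant><m><sg>
print(ascii_only)  # Dalimil H
```

The same idea, describing a lemma as "anything that is not a delimiter" rather than listing the allowed letters, is one possible way to approach the lexer patterns in this task.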

 

This task will also involve making the lexers and the format-parse.py script handle spaces in lemmas properly (e.g. allow them). It also involves making sure that lemmas are specified correctly, e.g. in ( ) only when they are optional and in " " only when they contain spaces.

 

1. https://svn.code.sf.net/p/apertium/svn/branches/transfer4/scripts/create-lexer.py