Google Code-in 2012. Organization: Apertium. Task: scrape Mongolian noun paradigms into yaml file

completed by: Richard Tynan

mentors: Francis Tyers, Jonathan

There are charts of Mongolian (=Khalkha) noun paradigms at the following url:

Your job is to write a script (preferably in Python 3) to scrape those paradigms into YAML files (for testing morphological transducers) like those at https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-cv-tr/tests/

The script should produce files according to the following guidelines:

each sub-paradigm type should be a separate file, named e.g. "normal nouns - ending with consonants.yaml" and "normal nouns - ending with vowels.yaml" (it would be good to case-convert to all-lowercase),
each word should be a section in the Tests section of the file, e.g. "гар = time:",
transcriptions (in []s) should be ignored,
empty case forms should be skipped (e.g., no "Pl" form for classroom),
case forms highlighted in blue should be skipped,
all formatting of individual letters should be ignored (e.g., bolded н is common; the '''s around the character should be removed),
variable forms should include all (and only) the forms given (no "—"s) (this will probably be the hardest part of designing this script),
the script should be able to deal with new sub-paradigms, but it may (though it need not) ignore the "to sort" section,
the entries for the forms should be tagged as <n> with other tags coming from the form given, and the base form should be the Nom form for each noun; e.g.:

note that the Pl form is actually <nom><pl>
the header of the yaml files should point to ../khk.autogen.hfst for Gen and ../khk.automorf.hfst for Morph.
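
The guidelines above can be sketched in Python 3 roughly as follows. This is only an illustration under stated assumptions, not the submitted script: the function names (paradigm_filename, clean_forms, tags_for, word_section), the Config/hfst header layout, the use of "/" or "," to separate variable forms, and the sample cells for ном are all assumptions made for the example. Fetching the page and detecting blue-highlighted cells are left out, since they depend on the markup of the source page.

```python
import re

# Assumed header layout for the test files; only the two paths come from the task.
HEADER = (
    "Config:\n"
    "  hfst:\n"
    "    Gen: ../khk.autogen.hfst\n"
    "    Morph: ../khk.automorf.hfst\n"
    "\n"
    "Tests:\n"
)

# Column headers whose tags are not just the lowercased header.
# The task notes that the Pl form is actually <nom><pl>.
SPECIAL_TAGS = {"Pl": "<nom><pl>"}


def paradigm_filename(title):
    """Turn a sub-paradigm heading into an all-lowercase .yaml file name."""
    return title.strip().lower() + ".yaml"


def clean_forms(cell):
    """Return the surface forms found in one raw table cell."""
    cell = re.sub(r"\[[^\]]*\]", "", cell)  # drop [transcriptions]
    cell = cell.replace("'''", "")          # drop wiki bold markers
    # Assumption: variable forms are separated by "/" or ",".
    forms = [form.strip() for form in re.split(r"[/,]", cell)]
    # Skip empty variants and the "—" placeholder.
    return [form for form in forms if form and form != "—"]


def tags_for(header):
    """Map a column header such as "Gen" to a tag string such as <n><gen>."""
    return "<n>" + SPECIAL_TAGS.get(header, "<" + header.lower() + ">")


def word_section(gloss, cells):
    """Build the lines for one word; cells maps column header -> raw cell."""
    nom_forms = clean_forms(cells.get("Nom", ""))
    if not nom_forms:
        return []  # no base form, so nothing to emit
    lemma = nom_forms[0]
    lines = ["  {} = {}:".format(lemma, gloss)]
    for header, cell in cells.items():
        forms = clean_forms(cell)
        if not forms:
            continue  # empty case form: skipped
        # Assumption: several variants are written as a YAML flow list.
        value = forms[0] if len(forms) == 1 else "[" + ", ".join(forms) + "]"
        lines.append("    {}{}: {}".format(lemma, tags_for(header), value))
    return lines


# Illustrative input only; not copied from the paradigm charts.
sample = {"Nom": "ном [nom]", "Gen": "ном'''ын''' [nomyn]", "Dat": "—", "Pl": "номууд"}
print(paradigm_filename("Normal nouns - ending with consonants"))
print(HEADER + "\n".join(word_section("book", sample)))
```

With the sample cells, the section contains the entries ном<n><nom>, ном<n><gen> and ном<n><nom><pl>, while the empty Dat cell produces no line. Keeping the cleaning step separate from the tagging step means a new sub-paradigm only needs a new heading and a new set of cells.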