Google Code-in 2012 Apertium

Create tagged corpus of Armenian from EANC

completed by: conor-f

mentors: Francis Tyers, Jonathan

http://eanc.net/EANC/library/library.php?interface_language=en

 

The page linked above has a series of texts, each word marked up in the following HTML format:

 

<span titles="գիշեր (N inanim)&#9;sg,dat,def&#9;night">Գիշերվան</span>	
<span titles="մութ (N inanim)&#9;sg,nom,def&#9;dark, obscure, vague">մութը</span>
<span titles="գետ (N inanim)&#9;sg,gen,nmlz,def&#9;river &#10;գետին (N inanim)&#9;sg,nom,def&#9;earth, soil">գետինն</span>
<span titles="առնել (V tr)&#9;cvb,pfv&#9;take, buy">առել</span>
<span titles="է (V intr)&#9;past,sg,3&#9;be">էր</span>:

The objective of this task is to convert these texts to the 'lttoolbox' analysis format, like this:

^Գիշերվան/գիշեր<n><nn><sg><dat><def>$

^մութը/մութ<n><nn><sg><nom><def>$

^գետինն/գետ<n><nn><sg><gen><nmlz><def>/գետին<n><nn><sg><nom><def>$

^առել/առնել<vblex><tv><cvb><pfv>$

^էր/է<vblex><iv><past><sg><3>$
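
The conversion shown above can be sketched in Python roughly as follows. This is only an illustration, not the script submitted for the task. It assumes that the fields inside each attribute are separated by a tab (&#9;) and that alternative analyses are separated by a newline (&#10;), as in the samples. The tag table (N to n, inanim to nn, V to vblex, tr to tv, intr to iv) is inferred from the five examples alone and would need extending for real EANC data. The names POS_MAP, convert_analysis and convert_span are made up for this sketch.

```python
import html
import re

# Tag mapping inferred only from the five sample lines (assumption);
# a real converter would need a much fuller table.
POS_MAP = {"N": "n", "inanim": "nn", "V": "vblex", "tr": "tv", "intr": "iv"}

# Accept both the "titles" attribute used in the samples and plain "title".
SPAN_RE = re.compile(r'<span titles?="([^"]*)">([^<]*)</span>')


def convert_analysis(analysis):
    """Turn one 'lemma (POS ...)<TAB>features<TAB>gloss' record into lemma<tag>..."""
    fields = analysis.strip().split("\t")
    head = fields[0]
    features = fields[1] if len(fields) > 1 else ""
    match = re.match(r"(.+?)\s*\((.*)\)\s*$", head)
    if match:
        lemma, pos_labels = match.group(1), match.group(2).split()
    else:
        # No parenthesised part of speech: keep the whole head as the lemma.
        lemma, pos_labels = head.strip(), []
    # Unknown labels are passed through lower-cased rather than dropped.
    tags = [POS_MAP.get(label, label.lower()) for label in pos_labels]
    tags += [feature for feature in features.split(",") if feature]
    return lemma + "".join("<%s>" % tag for tag in tags)


def convert_span(span_html):
    """Convert one <span> element into a ^surface/analysis[/analysis...]$ unit."""
    match = SPAN_RE.search(span_html)
    if match is None:
        raise ValueError("no <span> element found")
    # html.unescape turns &#9; into a tab and &#10; into a newline.
    title = html.unescape(match.group(1))
    surface = match.group(2)
    analyses = [convert_analysis(a) for a in title.split("\n") if a.strip()]
    return "^" + surface + "/" + "/".join(analyses) + "$"
```

Calling convert_span on each span of a text, one per line, would produce output in the shape shown above. The gloss (third field) is discarded, and punctuation outside the spans, such as the final ":", is ignored by this sketch.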