Google Code-in 2012 Apertium

figure out what's causing the newline issues in the RFE/RL scraper

completed by: Sushain Cherivirala

mentors: Francis Tyers, Jonathan

The scraper we use to build corpora for testing transducers has recently started mishandling newlines.

Specifically, it emits &#13; (the character reference for a carriage return) where \n (and sometimes <br />) used to appear, and it no longer adds newlines after the contents of <p>...</p> and <div>...</div> blocks. It used to handle all of this correctly.

Your task is to track down the cause of this problem and find a workaround that, ideally, doesn't involve string replaces or looping through elements. The cause could be a new "feature" in lxml, or it could be something introduced by recent modifications to the scraper classes.
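One direction worth checking, sketched here under an assumption: if the &#13; turns out to come from carriage returns (CRLF line endings) in the fetched pages, the line endings can be normalized while decoding, before the markup reaches the parser, without an explicit string replace. This is not the scraper's actual code; the function name decode_normalized and the sample input are hypothetical, and only the Python standard library is used.

```python
import io


def decode_normalized(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw page bytes, turning CRLF and bare CR into LF.

    Hypothetical helper for illustration only; not part of the scraper.
    """
    # newline=None switches on universal-newlines translation on read,
    # so no explicit string replace is needed.
    with io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline=None) as reader:
        return reader.read()


# Made-up sample markup with mixed line endings.
sample = b"<p>first line\r\nsecond line</p>\r<div>third</div>"
text = decode_normalized(sample)
print(repr(text))
```

If the carriage returns are the cause, markup decoded this way should no longer contain anything for the serializer to write out as &#13;. It would not by itself restore the missing newlines after <p> and <div> blocks, which may have a separate cause.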

If you haven't worked with the scraper before, talk to us about how to test it. Ideally, though, whoever takes this task already has experience working with or developing parts of the scraper.