Google Code-in 2013 Apertium

Better Wikipedia extractor script

completed by: Ben Stobaugh

mentors: Jonathan Washington, Francis Tyers

Write a single script that performs all the steps listed on the Wikipedia Extractor wiki page. It should take a Wikipedia dump file as input and produce output that is, for all intents and purposes, identical to the output of the last step listed on the wiki. The script should not store any intermediate files and should use no more memory than absolutely necessary; feel free to reuse as much of the existing code as you need. You may wish to consult guampa's much-improved fork of the WikiExtractor script at [6], though it does not do everything by itself either.
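The two constraints (no intermediate files, minimal memory) suggest a streaming design. What follows is only a minimal sketch of that idea, not the submitted solution and not the WikiExtractor code: it assumes the dump is a MediaWiki XML export, optionally bz2-compressed, and the names `iter_pages`, `strip_markup`, `extract` and `open_dump` are hypothetical. In particular, `strip_markup` is a crude stand-in for the real cleaning steps listed on the wiki.

```python
import bz2
import re
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for one page at a time from a dump stream."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = "", ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            root.clear()  # drop the finished page so memory stays bounded


def strip_markup(wikitext):
    """Hypothetical placeholder for the real cleaning steps (assumption)."""
    out = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)  # simple, non-nested templates
    out = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", out)  # [[a|b]] -> b
    out = re.sub(r"'{2,}", "", out)  # bold and italic quote runs
    return out.strip()


def extract(stream, out):
    """Read pages from `stream` and write cleaned text straight to `out`."""
    for _title, wikitext in iter_pages(stream):
        body = strip_markup(wikitext)
        if body:
            out.write(body + "\n")


def open_dump(path):
    """Open a dump in binary mode, decompressing on the fly if it is .bz2."""
    return bz2.open(path, "rb") if path.endswith(".bz2") else open(path, "rb")


if __name__ == "__main__" and len(sys.argv) > 1:
    with open_dump(sys.argv[1]) as dump:
        extract(dump, sys.stdout)
```

Because each page is parsed, cleaned, written and then discarded, the only state held in memory at any time is roughly one page, and the output goes directly to standard output rather than to a temporary file.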