Research existing MS Office text extractors
completed by: qxcv
mentors: Bastian Blank, Reimar Bauer, Thomas Waldmann, Prashant Kumar, Eugene Syromyatnikov
Research existing solutions for extracting text from proprietary Microsoft file formats.
For moin2, we already have a number of converters (including one for Open Document Format [OpenOffice / LibreOffice]), but nothing for Microsoft Office formats. We now need a survey of GPL2+ compatible code that can extract text from these proprietary file formats.
We need to know:
- is the license compatible with GPL2+?
  - for Python libraries, e.g.: GPL, BSD, MIT, ... (not: Apache License 2)
  - in general: a free software license, not any proprietary license
- the programming language used
  - library code in Python is strongly preferred (we can just call it)
  - a command-line tool that we can call as a subprocess might also work (which platforms does it support?)
  - Windows-only solutions are not wanted
- compatibility with different file formats (mainly Word, but also Excel and PowerPoint)
- compatibility with different versions (e.g. .DOC and .DOCX)
- reliability (is the code well maintained, has it been updated recently?)
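To make the "command-line tool as a subprocess" option concrete, here is a minimal sketch of how moin could wrap such a tool. The function name `extract_text_via_cli` and the assumption that the tool takes the file path as its last argument and writes plain text to stdout are illustrative only; a real tool found during the survey may behave differently.

```python
import shutil
import subprocess


def extract_text_via_cli(command, path, timeout=30):
    """Run an external extractor and return the text it writes to stdout.

    `command` is a list such as ["some-extractor", "--some-flag"] (a
    hypothetical tool); the file path is appended as the last argument.
    Returns None if the tool is missing, fails, or times out.
    """
    if shutil.which(command[0]) is None:
        return None  # tool is not installed on this platform
    try:
        result = subprocess.run(
            command + [path],  # list form: no shell, no quoting problems
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,  # non-zero exit status raises CalledProcessError
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return None
    # Assumption: the tool emits UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Returning None when the tool is missing lets indexing skip the file instead of failing, which matters if the tool is not available on every platform.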
Deliverable: wiki page
Many Moin users would like a platform-independent, pure-Python way to extract text for indexing.
Researching existing code is a first step in this direction.
You'll need to do a lot of searching on the Web. Discuss your findings with the moin devs on IRC.
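As a rough illustration of what "pure Python" extraction can look like, the sketch below handles the newer .DOCX format only, using nothing but the standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`. The function name `docx_to_text` is made up for this example, and the sketch ignores headers, footers, footnotes and the legacy binary .DOC format entirely, so it is no substitute for the libraries the survey should evaluate.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(file_or_path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(file_or_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; each run holds <w:t> nodes.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The result is plain text, which is all an indexer needs; formatting is deliberately thrown away.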
This task refers to moin2 (http://moinmo.in/MoinMoin2.0)!
http://moinmo.in/MoinMoinChat - please join us on IRC in #moin-dev
You can discuss this issue in the MoinMoin wiki: http://moinmo.in/EasyToDo/TextExtractors