Google Summer of Code 2015 / Organization: CCExtractor development / Project: Multi-language Forced Alignment in a Heterogenous Corpus

Multi-language Forced Alignment in a Heterogenous Corpus

by Sai Krishna for CCExtractor development

The current transcripts corresponding to the videos are imperfect in two ways: they contain out-of-vocabulary (OOV) words, and their timing lags the audio. This project seeks to correct the transcripts by first developing techniques to detect alignment errors and then developing correction algorithms that reduce the frequency of these errors. Combining these techniques will yield an accurate forced alignment package that can operate under the adverse conditions found in both the transcripts and the audio.