Google Summer of Code 2014 DBpedia & DBpedia Spotlight

Distributed extraction of Wikipedia data dumps for DBpedia

by Nilesh Chakraborty for DBpedia & DBpedia Spotlight

The DBpedia project “extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies”. Large-scale data processing can run considerably faster when it is distributed over a cluster of computers. This project aims to parallelize the download of Wikipedia dumps using different tools, and to distribute their extraction over multiple machines using Apache Spark, for speed and scalability.