Google Summer of Code 2012

DBpedia Spotlight

Linked Data has been revolutionizing the way applications interact with the Web. While Web 2.0 technologies opened up much of the “guts” of websites so that third parties can reuse and repurpose data on the Web, they still require developers to create one client per target API. With Linked Data technologies, all APIs are interconnected via standard Web protocols and languages.

DBpedia is a project that aims to expose knowledge from Wikipedia as Linked Data. One can navigate this Web of facts with standard Web browsers or automated crawlers, or select subsets of it with SQL-like query languages (e.g. SPARQL). DBpedia exists in 97 different languages and is interlinked with many other databases such as Freebase, New York Times, CIA Factbook, etc.

This new Web of interlinked databases provides useful knowledge that can complement the textual Web in many ways. See, for example, how bloggers tag their posts or assign them to categories in order to organize and interconnect them. Or see how the BBC created its World Cup 2010 website by interconnecting textual content and facts from its knowledge base. By the way, they use DBpedia.

DBpedia Spotlight is an open source (Apache license) text annotation tool that connects text to Linked Data by marking names of things in text (we call that Spotting) and selecting between multiple interpretations of these names (we call that Disambiguation). For example, “Washington” can be interpreted in more than 50 ways, including a state, a government or a person. You can already imagine that this is not a trivial task, especially when we're talking about 3.64 million “things” of 320 different “types” with over half a billion “facts” (July 2011).
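
To make the two steps concrete, here is a minimal sketch in Python. It is not how DBpedia Spotlight is implemented: the tiny `CANDIDATES` lexicon, its context words, and the function names `spot`, `disambiguate` and `annotate` are all made up for illustration, and the scoring is a plain word-overlap count.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (an assumption for this sketch):
# surface form -> candidate resources, each with a few typical context words.
CANDIDATES = {
    "Washington": {
        "Washington_(state)": ["seattle", "state", "pacific", "olympia"],
        "George_Washington": ["president", "general", "army", "revolution"],
        "Washington,_D.C.": ["capital", "congress", "government"],
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def spot(text):
    """Spotting: return the known names that occur in the text."""
    return [
        name
        for name in CANDIDATES
        if re.search(r"\b" + re.escape(name) + r"\b", text)
    ]


def disambiguate(name, text):
    """Disambiguation: pick the candidate whose context words
    occur most often in the text (ties go to the first candidate)."""
    words = Counter(tokenize(text))
    scores = {
        resource: sum(words[w] for w in context)
        for resource, context in CANDIDATES[name].items()
    }
    return max(scores, key=scores.get)


def annotate(text):
    """Spot names, then choose one interpretation for each."""
    return {name: disambiguate(name, text) for name in spot(text)}
```

Calling `annotate("Washington led the army as a general during the revolution.")` picks the person, while `annotate("Seattle is the largest city in Washington state.")` picks the state. Real text, with millions of candidate things instead of three, is of course much harder than this toy case.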

But we think we're doing quite well. And we could use your help to do even better! See our ideas:


  • A database-backed core system with an improved model for estimation of annotation probability: With my proposal, I want to focus on two issues: the general performance of DBpedia Spotlight and the annotation accuracy. To address the performance issue, I suggest changes to the core architecture that allow more control over the implementation and storage of the required data. To improve annotation accuracy, I propose implementing an entity mention model with topical information.
  • DBpedia Spotlight for collective linking of entities in HTML pages: DBpedia Spotlight is a tool that can automatically annotate mentions of DBpedia resources in text documents. In the information age, more and more content is published on the Internet, so it is valuable to offer DBpedia Spotlight functionality that lets users conveniently annotate web pages while browsing. In addition, recent research indicates that collective disambiguation (considering the disambiguation decisions of related mentions in a context as a whole) performs better than purely context-based disambiguation. Introducing collective disambiguation techniques to DBpedia Spotlight may help enhance the overall annotation quality.
  • Hadoop Indexing and Concept-Space Disambiguation Models for DBpedia Spotlight: My project proposal is divided into two sections: (1) creating a Hadoop indexing system for DBpedia Spotlight and (2) implementing three novel approaches to disambiguation: Latent Semantic Analysis (LSA), Explicit Semantic Analysis (ESA), and Salient Semantic Analysis (SSA). These concept-space disambiguation modules will be used to rank the possible URIs for spotted entities based on context.
  • Topical Classification: The proposed project focuses on building an incrementally learning topical classifier for DBpedia Spotlight.
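
The collective disambiguation idea mentioned in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any of the proposals: the candidate lists, the relatedness scores in `RELATED`, and the function names are invented, and the exhaustive search over all combinations is only feasible for a handful of mentions.

```python
from itertools import product

# Hypothetical candidates per mention (an assumption for this sketch).
CANDIDATES = {
    "Washington": ["Washington_(state)", "George_Washington"],
    "Lincoln": ["Abraham_Lincoln", "Lincoln,_Nebraska"],
    "Seattle": ["Seattle"],
}

# Hypothetical pairwise relatedness between resources, in [0, 1].
RELATED = {
    frozenset(["George_Washington", "Abraham_Lincoln"]): 0.9,
    frozenset(["George_Washington", "Lincoln,_Nebraska"]): 0.1,
    frozenset(["Washington_(state)", "Abraham_Lincoln"]): 0.1,
    frozenset(["Washington_(state)", "Lincoln,_Nebraska"]): 0.4,
    frozenset(["Washington_(state)", "Seattle"]): 0.8,
    frozenset(["George_Washington", "Seattle"]): 0.05,
}


def relatedness(a, b):
    """Look up how related two resources are (0.0 if unknown)."""
    return RELATED.get(frozenset([a, b]), 0.0)


def collective_disambiguate(mentions):
    """Choose one candidate per mention so that the sum of pairwise
    relatedness over the whole assignment is as large as possible."""
    best, best_score = None, float("-inf")
    for assignment in product(*(CANDIDATES[m] for m in mentions)):
        score = sum(
            relatedness(a, b)
            for i, a in enumerate(assignment)
            for b in assignment[i + 1:]
        )
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(mentions, best))
```

With these made-up scores, “Washington” resolves to the person when it appears together with “Lincoln”, and to the state when it appears together with “Seattle”: the decision for one mention depends on the decisions for the others.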