GSoC/GCI Archive
Google Summer of Code 2015

Apache Software Foundation

License: Apache License, 2.0

Web Page:

Mailing List: No central list; see the lists of the individual Apache projects. Students can approach the GSoC admins via

Established in 1999, the all-volunteer Foundation oversees nearly one hundred fifty leading Open Source projects, including Apache HTTP Server — the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 350 individual Members and 3,000 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) not-for-profit charity, funded by individual donations and corporate sponsors including Citrix, Facebook, Google, Yahoo!, Microsoft, AMD, Basis Technology, Cloudera, Go Daddy, Hortonworks, HP, Huawei, InMotion Hosting, IBM, Matt Mullenweg, PSW GROUP, SpringSource/VMWare, and WANDisco.

Our ideas page can be filtered by the labels documented at


  • [Apache Olingo] Implement OData JSON Metadocument Serializer/Parser The scope of this project is limited to implementing the OData JSON Format for CSDL specification [1] in the Apache Olingo client and server libraries. Apache Olingo is a protocol implementation of the OData V4 specification. [1]
  • [Apache Taverna] Databundle viewer for web Taverna is an open source, domain-independent Workflow Management System - a suite of tools used to design and execute scientific workflows. The Taverna suite includes the Taverna Engine, Taverna Workbench and Taverna Server. A Taverna databundle is a file format returned by the Taverna Server containing the results of a Taverna workflow run, along with its intermediate values and provenance metadata. This GSoC project proposes to create a web-based presentation of a workflow run.
  • Apache Flink: Asynchronous Iterations and Updates Apache Flink provides fast data processing capabilities. However, several of the Machine Learning algorithms to be incorporated in the ML library can only be approximated in a distributed setting, so an excellent iteration framework is essential. Furthermore, while processing large amounts of data, no resource should be wasted and no node should sit idle waiting for others to finish their work so it can synchronize with them. Instead, an asynchronous iteration framework is needed.
  • Apache Kafka Output Connector for ManifoldCF ManifoldCF is an effort to provide an open source framework for connecting source content repositories to target repositories or indexes. Kafka is a distributed, partitioned, replicated queue service. Apache Kafka is being used for a number of use cases; one of them is using Kafka as a feeding system for streaming BigData processes. A Kafka output connector for ManifoldCF could be used for streaming or dispatching crawled documents or metadata and putting them into a BigData processing pipeline.
  • Apache PIG - Move grunt from javacc to ANTLR The project is about moving the Grunt parser onto the same infrastructure as the query parser. Currently, the parser for queries is in ANTLR, but Grunt is still JavaCC. The JavaCC parser is very difficult to work with, and next to impossible to understand or modify. ANTLR provides a much cleaner, more standard way to generate parsers/lexers/ASTs/etc., and moving Grunt from JavaCC to ANTLR would be a huge win as we continue to add features to Pig.
  • Apache Spark: Enhance MLlib's Python API The Python API of MLlib is missing a few important features when compared to the Scala backend. My project involves adding these features, fixing related issues, and improving the Scala backend as well. The more important of these features include: 1. Support for save/load across all models. 2. Support for evaluation metrics. 3. Support for streaming ML algorithms. 4. Support for distributed linear algebra. 5. Simplifying the API using DataFrames.
  • Apache Taverna language command line tool Apache Taverna Language is a set of Java APIs for managing and converting Taverna workflow definitions and workflow run data bundles. It is a part of the Apache Taverna workflow system. This project is to develop a command line tool to use with Taverna language for managing and converting workflows and databundles.
  • Atlassian Confluence repository and Authority connector for Apache ManifoldCF Confluence is team collaboration software. ManifoldCF is an effort to provide an open source framework for connecting source content repositories while preserving their security policies. This project aims to build a connector for Confluence, which will be very interesting for organizations: they will be able to index all their Confluence content along with the rest of their source content repositories, keeping the security policies defined in each repository.
  • AWS Load Balancing Support for Stratos Stratos has a load balancer extension API which can be used to plug any third-party load balancer into Stratos. This project is to implement a load balancer extension to support the AWS load balancing service via its API.
  • Benchmarking Resource Usage of Airavata’s Applications Apache Airavata is a software framework for scheduling and executing scientific jobs and workflows on distributed computing resources. The primary goal of this project is to integrate a decision engine and benchmark framework into Airavata. By using these features, Airavata will be able to estimate wait and execution times for the various computing resources. Users will then be able to identify and select the optimal resource for a job.
  • Clustering [ODE-563] This project is about supporting a clustered deployment of ODE. Related JIRA issue:
  • Development of an Android application for Taverna workflow management An Android-based application for managing Taverna workflows. Workflows are formatted as .t2flow documents that can be interpreted and processed by a Taverna server; these documents represent projects created by users to monitor a certain phenomenon or activity. An Android-based application therefore helps users manage these workflows while on the move or away from the project work area.
  • Evaluate Apache Airavata Metadata storage and explore alternative solutions Airavata modules currently capture a large amount of metadata about applications and executions. This project will formally capture the requirements for metadata management, and conduct experiments to determine the appropriate data model, storage scheme and storage technology to efficiently and effectively provide a metadata management solution for Airavata.
  • Exact and Approximate Statistics for Data Streams and Windows in Flink Flink streaming provides flexible functions to work with windows of data streams. My project involves calculating statistics of windows, and also the entire data stream. This is a relatively low-hanging fruit, but it might attract many users to the library. The exact calculation of some statistics would require memory proportional to the number of elements in the input. However, there exist efficient algorithms using less memory for calculating the same statistics only approximately.
  • Extend CONSTRUCT to build quads (JENA-491) This project will add support for using GRAPH inside a CONSTRUCT template, so that a CONSTRUCT query can generate quads or RDF datasets, inspired by the Jena JIRA issue:
  • Extending visualization of Zeppelin with Rich GUI and Charting Manager Zeppelin is a collaborative data analytics and visualization tool for distributed, general-purpose data processing systems such as Apache Spark and Apache Flink. It has two main features: the data analytics phase and the data visualization phase. This project is an improvement or re-design of the data visualization component, aiming to eliminate the limitations and drawbacks of the existing charting visualization component.
  • GCE Load Balancing Support for Stratos The project is to create an extension for Apache Stratos that adds Google Compute Engine load balancing support. Using this extension, Stratos can manage web traffic to the instances it has deployed in Google Compute Engine by using the GCE load balancer provided by Google. This is the link to the JIRA issue [1]. [1]
  • GenApp Integration with Apache Airavata GenApp is a modular framework for multiscale science computations. Apache Airavata is a software framework for executing and managing computational jobs and workflows on distributed computing resources. The primary goal of this project is to extend GenApp for Java applications and integrate it with Airavata to be able to execute long-running computational jobs. We also need to fix the previously integrated HTML5 version to work with the new Airavata.
  • General bug fixing for Apache Derby Apache Derby, an Apache DB subproject, is a relational database implemented entirely in Java. Over the years, Derby has been an active Apache project with many contributors and committers, yet a significant number of challenging and interesting bugs remain in the Derby source code. The objective of this project is to resolve some of the challenging bugs currently present in the source code and contribute towards a more stable version of Apache Derby.
  • Improve PDFDebugger of PDFBox [PDFBOX-2530] PDFDebugger is a tool from PDFBox that helps the user inspect the structure of a PDF file. The current PDFDebugger provides some useful features but also lacks some features that are necessary for PDF file debugging, e.g. a hex viewer. This project aims to add these functionalities to PDFDebugger, along with other features that may make PDFDebugger more appealing to users, e.g. a styled content stream for bracketing operators.
  • Integrate Airavata Java Client SDK with GridChem Client GridChem is a science gateway that enables computational experiments on multiple supercomputer resources. Currently GridChem connects to an Axis2-based middleware to pass computational tasks. The idea of this project is to port this communication model to work with the Apache Airavata Java client SDK.
  • Integrating Apache Mesos with Science Gateways via Apache Airavata Science Gateways federate resources from multiple organizations. Most gateways solve the problem of scheduling across multiple organizations in an ad hoc, per-gateway fashion. HPC environments statically partition their resources to avoid interference between multiple applications. Static partitioning is known to lead to low utilization. This project will evaluate and provide recommendations for how Apache Mesos can be used with Apache Airavata for efficient scheduling of gateway jobs.
  • Integrating DataCat system with Apache Airavata DataCat is the software outcome of a final-year research and development project done by a group of undergraduate students (including myself) from the University of Moratuwa, Sri Lanka [1]. In brief, it is a data cataloging system that can catalog scientific data products, especially those generated by computer simulation applications running on a cyberinfrastructure. The goal of this project is to integrate the DataCat system with Apache Airavata and deploy it in the production GridChem system [2].
  • Integration Of GenApp with Apache Airavata GenApp is a modular framework for rapid generation of scientific applications. Apache Airavata is a software framework for executing computational jobs on distributed resources. The primary goal of this project is to enable Airavata submission from Android and make the new Airavata compatible with the older version, thus enabling submission from Qt 3, 4, and 5, which will be enhanced to include capabilities currently available only under HTML5. The modules also need to be integrated with Airavata's workflow.
  • Interactive web-based Aurora CLI tutorial Build an interactive CLI tutorial for understanding how Apache Aurora works; the project is inspired by. We will make a tutorial that explains step by step how to get started with Apache Aurora, where you can run commands without installing anything.
  • Introducing “curve fitting” for stat prediction algorithm of Autoscaler JIRA for the project: Currently, the Autoscaler component of Stratos takes scaling-up or scaling-down decisions based on the health statistics from the CEP, but the statistics sent from the CEP are not accurate. This project therefore aims to improve the prediction algorithm of the Autoscaler by introducing curve fitting.
  • JDBC blobstore abstraction for jclouds Currently jclouds supports storing local data on disk through the filesystem provider, but unfortunately it doesn't fully work on every operating system. A JDBC blobstore would be a truly portable alternative to the filesystem blobstore for local storage, and would also provide access to remote databases for remote storage, allowing jclouds users to store data and metadata in any database with a JDBC driver.
  • Microformats2 Support for Apache Any23 Anything To Triples (Any23) is a tool which can be used to extract structured data from web documents, supporting many different input formats. Currently Any23 only supports the original microformats, which have been superseded by microformats2. This project will implement support for the microformats2 specification [1] in the Any23 core. Original microformats support is retained separately. [1]
  • Native Android Client for Apache Wave Apache Wave is a tightly coupled client-server platform. This project brings Wave into the mobile domain through an Android app. This is not straightforward: first we need to decouple the client and server, then implement an Android client library to access the Wave server, and finally develop a simple Android application to demonstrate Apache Wave on Android.
  • PHOENIX-1660 Implement missing math built-in functions Apache Phoenix is a SQL skin over HBase, delivered as a client-embedded JDBC driver targeting low-latency queries over HBase data. The idea of this project is to add the missing math built-in functions to Phoenix. Firstly, check the typical math functions implemented in relational database systems for reference. Secondly, go through the code and summarize all the missing math built-in functions. Finally, implement them for Phoenix according to the guide, and finish the related test cases.
  • Phoenix/PHOENIX-1661 Implement built-in functions for JSON I'm interested in the project of implementing JSON built-in functions for Phoenix, which is to convert other types to JSON as well as expand the outermost JSON. To do this, I would implement the same functions for Phoenix in Java, based on the JSON built-in functions implemented in Postgres, and then test the accuracy of all of them. I also have some ideas for further optimizations in Phoenix, such as optimizing SELECT DISTINCT using the skip scan filter.
  • Phoenix: Implementing missing array built-in functions Apache Phoenix is a high performance relational database layer over HBase for low latency applications. The array data type is supported in Apache Phoenix, but only a few built-in functions are currently implemented for manipulating arrays, whereas other relational database systems such as PostgreSQL have a rich set of built-in functions for manipulating arrays. The purpose of this project is to implement the missing array built-in functions applicable to Phoenix arrays.
  • Proposal for Summer of Code 2015 - Open Meetings: Dmitry Bezheckov I am introducing WebRTC support for OpenMeetings. This includes adding a WebRTC data channel and binding it to the Red5 server.
  • Proposal to Implement GeoSPARQL in Marmotta I'm very interested in the GSoC 2015 project to implement GeoSPARQL in Apache Marmotta.
  • Provide a tool for Visualizing Phoenix Tracing Information Apache Phoenix is a high performance relational database layer on top of HBase. Tracing has been introduced to Apache Phoenix; my project is to take tracing to the next level and provide visualization of the tracing information. This project will select the best open source charting and visualization libraries available and integrate them with Phoenix to add the required functionality to the system.
  • Python based Command Line Tool (CLI) for Stratos Currently Apache Stratos has a CLI written in Java. This proposal is to implement a similar CLI in Python to make it much more lightweight. The CLI implements commands for all the REST API methods available in the 4.1.0 release.
  • Replace OODT's XML-RPC with Avro's RPC OODT is a framework that allows distributed storage and cataloging of objects. To achieve its scalability, OODT uses the concept of grid computing. Currently OODT uses XML-RPC to enable communication between computers in the grid: "Our version of this library currently sits at 2.0.1, which was released on 28th December 2005." Apache Avro is a more modern data serialization system which also has an RPC implementation.
  • Securing the Airavata API The goal of this project is to design and implement a solution for securing the Airavata API. In particular, this includes authenticating and authorizing end users of the Airavata API. One of the challenges in this project is to design a unified solution which can be easily adapted to the set of identified use cases, which are based on different identity management scenarios. The proposed solution addresses all such use cases and builds on open identity management standards.
  • Showing health statistics in GUI (STRATOS-1244) Apache Stratos is a highly extensible Platform-as-a-Service (PaaS) which includes multi-tenancy, multi-factored auto-scaling, cloud bursting, and scalable dynamic load balancing. It consumes a lot of processing power and memory, but the product currently doesn't have a UI to show health statistics of its clusters and instances.
  • Spark Backend Support for Gora (GORA-386) The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support. Apache Spark is a fast and general engine for large-scale data processing; it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Gora should support a backend for Apache Spark.
  • Support Sitemap Crawler in Nutch 2.x Currently, Nutch can discover URLs only from pages that have already been fetched, which is an expensive method. Also, the importance and change frequency of these URLs are not known, only guessed. However, all of a site's URLs can be found in an up-to-date sitemap file, so the sitemap files on a website should be crawled. With this development, Nutch will gain sitemap crawler support.
  • Supporting Hadoop data and cluster management for VXQuery My proposal is about the VXQuery issue of retrieving data from HDFS and adding a YARN cluster management system. I explain my ideas of how I will deal with these problems and how much time I believe I will need for each one. In addition I propose another feature for VXQuery that I believe can help make the project more useful. That is making VXQuery available as a module for Hadoop.
  • Unsupervised Word Sense Disambiguation The objective of Word Sense Disambiguation (WSD) is to determine which sense of a word is meant in a particular context. Apache OpenNLP currently lacks a generalized WSD module, therefore, the purpose of this project is to design and implement this module. Since techniques can be either unsupervised or supervised, another target would be to implement algorithms of common unsupervised techniques, which could serve as examples for any future contributor willing to add and compare other approaches.
  • WebM support in Apache OpenMeetings I want to add WebM support in the Apache OpenMeetings software. It would provide ability to use modern video streaming technologies in Apache OpenMeetings, such as WebRTC.
  • Wider spectrum of data consumers/producers for Apache Samza Apache Samza is a distributed stream-processing framework that can be deployed on top of Apache YARN and uses Kafka as its main messaging system. The motivation of this project is to give Samza the ability to consume/produce data from/to two very popular messaging systems, ActiveMQ and Amazon Kinesis. Although both systems are very different, Samza has a well-defined API, and Samza's requirements are concretely reflected in the Samza-Kafka module, which will be used as a reference for the project.
  • Word Sense Disambiguation - Supervised Techniques The objective of Word Sense Disambiguation is to determine which sense of a word is meant in a particular context. Apache OpenNLP currently lacks a WSD module, therefore, the purpose of this project is to design and build a WSD module that implements the algorithms of common supervised techniques. The implemented techniques could serve as examples for any future contributor who would like to add other approaches.
  • XMark Benchmark Support for VXQuery For this project, I plan to make the XMark Benchmark work on VXQuery. I plan to use test-driven development and split my project into four phases. The first phase involves building a test suite. The second phase involves determining and logging the errors by creating issues in the JIRA issue list. The third phase involves implementing a mentor-determined optimal solution for the failing queries. The last phase involves documenting the supported XMark queries on the VXQuery website.
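Several of the projects above revolve around a small number of well-known techniques, illustrated below with stdlib-only sketches. The Flink streaming-statistics proposal notes that exact statistics can require memory proportional to the input; for mean and variance, however, Welford's classic online algorithm is exact in O(1) memory. This sketch is plain Python, not the Flink API:

```python
class RunningStats:
    """Welford's online algorithm: exact mean and variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance; use m2 / (n - 1) for the sample variance.
        return self.m2 / self.n if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)
# mean is 5.0, population variance is 4.0
```

Statistics such as distinct counts or heavy hitters have no exact constant-memory form, which is where the approximate sketch algorithms mentioned in the proposal come in.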
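The Stratos Autoscaler "curve fitting" idea can be sketched in miniature: fit a least-squares line to recent load samples and extrapolate one interval ahead, smoothing out noisy statistics instead of reacting to the latest raw value. The data and the linear model here are illustrative assumptions, not the Stratos implementation:

```python
def fit_line(samples):
    """Least-squares fit of y = a*t + b over (t, y) samples."""
    n = len(samples)
    st = sum(t for t, _ in samples)
    sy = sum(y for _, y in samples)
    stt = sum(t * t for t, _ in samples)
    sty = sum(t * y for t, y in samples)
    a = (n * sty - st * sy) / (n * stt - st * st)
    b = (sy - a * st) / n
    return a, b

def predict(samples, t_next):
    """Extrapolate the fitted line to a future time t_next."""
    a, b = fit_line(samples)
    return a * t_next + b

# Hypothetical noisy-but-rising load measurements (time, load):
samples = [(0, 10.0), (1, 12.1), (2, 13.9), (3, 16.2)]
forecast = predict(samples, 4)  # forecast for the next interval
```

A real autoscaler would compare the forecast against scale-up/scale-down thresholds; higher-degree polynomials or other curve families can be fitted the same way when the load trend is nonlinear.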
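The Nutch sitemap-crawler proposal rests on the fact that a sitemap lists every URL together with its last modification date and change frequency, so a crawler need not guess them. A minimal stdlib sketch (plain Python, not Nutch code) that pulls those fields out of a sitemap document:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return (loc, lastmod, changefreq) tuples from a sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default=None, namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        changefreq = url.findtext("sm:changefreq", default=None, namespaces=NS)
        entries.append((loc, lastmod, changefreq))
    return entries

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2015-03-01</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>"""

entries = parse_sitemap(sitemap)
```

A full implementation would also handle sitemap index files (which point at further sitemaps) and feed `lastmod`/`changefreq` into the crawler's scheduling decisions.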
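Both OpenNLP word sense disambiguation proposals pick the sense of a word from its context. A standard baseline such modules are measured against is the simplified Lesk algorithm: choose the sense whose dictionary gloss shares the most words with the surrounding context. The glosses below are made up for illustration; this is plain Python, not the OpenNLP module:

```python
def simplified_lesk(context_words, senses):
    """Pick the sense whose gloss has the largest word overlap with the context.

    senses: dict mapping a sense id to its gloss (definition) string.
    """
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for the ambiguous word "bank":
senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}
context = "he sat on the bank of the river watching the water".split()
choice = simplified_lesk(context, senses)  # picks "bank/river"
```

Real systems improve on this baseline by removing stopwords, weighting overlap terms, or (as in the supervised proposal) training classifiers on sense-annotated corpora.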