GSoC/GCI Archive
Google Summer of Code 2010 Apache Software Foundation

Pig - binary comparator for secondary sort

by azaroth for Apache Software Foundation

When Hadoop sorts keys in the shuffle phase, it uses a binary (raw) comparator if one is available. A binary comparator does not deserialize the key into an object; it compares the serialized bytes directly, which is faster. Pig uses a binary comparator when the key is of a simple type, but not when the key is a tuple. This matters for secondary sort, because Pig relies on Hadoop to sort on both the main and the secondary key, which makes the sort key a tuple. Using a binary comparator for tuples should therefore produce a significant speedup.
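
To make the idea concrete, here is a minimal sketch of a byte-level comparison for a two-field key. It is not Pig's implementation: the class name, the `encode` helper, and the key layout (two big-endian 4-byte ints for the main and secondary key) are assumptions made for illustration, and Pig's real tuple serialization is more involved. Only the parameter shape of `compare` is borrowed from Hadoop's `RawComparator` interface; the sketch itself has no Hadoop dependency.

```java
import java.nio.ByteBuffer;

public class RawTupleComparatorSketch {

    // Hypothetical encoding: (mainKey, secondaryKey) as two big-endian 4-byte ints.
    static byte[] encode(int mainKey, int secondaryKey) {
        return ByteBuffer.allocate(8).putInt(mainKey).putInt(secondaryKey).array();
    }

    // Read a big-endian int straight out of the buffer, without creating any key object.
    static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24)
             | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8)
             |  (b[off + 3] & 0xff);
    }

    // Same parameter shape as Hadoop's RawComparator.compare(b1, s1, l1, b2, s2, l2).
    // The lengths l1 and l2 are unused here because the assumed encoding is fixed-width.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare the main key first.
        int c = Integer.compare(readInt(b1, s1), readInt(b2, s2));
        if (c != 0) {
            return c;
        }
        // Main keys are equal: fall back to the secondary key.
        return Integer.compare(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
    }

    public static void main(String[] args) {
        byte[] a = encode(1, 7);
        byte[] b = encode(1, 9);
        // Equal main keys, so the secondary key decides the order.
        System.out.println(compare(a, 0, a.length, b, 0, b.length) < 0); // prints "true"
    }
}
```

The point of the sketch is that both fields are ordered without ever materializing a tuple object, which is where the expected speedup comes from. A real comparator for Pig tuples would also have to handle variable-length fields, nulls, and mixed field types.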