• Secondary Sorting in Spark

    Secondary sorting is the technique that allows for ordering by value(s) (in addition to sorting by key) in the reduce phase of a Map-Reduce job. For example, you may want to analyze user logons to your application. Having results sorted by day and time as well as user-id (the natural...


  • Spark Corner Cases

    In the last two posts, we covered two alternatives to the groupByKey method: aggregateByKey and combineByKey. In this post we are going to consider methods and situations you might not encounter in your everyday Spark job, but that will come in handy when the need arises. Stripping The First Line (or First N...


  • Spark PairRDDFunctions: CombineByKey

    Last time we covered one of the alternatives to the groupByKey function: aggregateByKey. In this post we’ll cover another alternative from PairRDDFunctions: combineByKey. The combineByKey function is similar in functionality to the aggregateByKey function, but is more general. But before we go into details, let’s review why we’d even want...


  • Spark PairRDDFunctions - AggregateByKey

    One of the great things about the Spark Framework is the amount of functionality provided out of the box. There is a class aimed exclusively at working with key-value pairs, the PairRDDFunctions class. When working with data in the key-value format, one of the most common operations to perform is grouping...


  • Partially Applied Functions in Java

    Last year I completed an intro to functional programming course on edX. The language used in the course was Haskell. I found working in Haskell enjoyable. One of my favorite features is that functions taking more than one parameter can be partially applied automatically. For example, if you have a...