  • Spark And Guava Tables

    Last time we covered Secondary Sorting in Spark. We took airline performance data and sorted results by airline, destination airport, and the amount of delay. We used IDs for all our data. While that approach is good for performance, viewing results in that format loses meaning. Fortunately, the Bureau of...


  • Secondary Sorting In Spark

    Secondary sorting is the technique that allows for ordering by value(s) (in addition to sorting by key) in the reduce phase of a Map-Reduce job. For example, you may want to analyze user logons to your application. Having results sorted by day and time as well as user-id (the natural...
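
    As a rough illustration of the idea in Spark/Scala: build a composite key, partition on the natural key only, and sort within partitions. This is only a minimal sketch; the UserLogonKey class, UserPartitioner, and sample data are hypothetical and not taken from the post.

        import org.apache.spark.{Partitioner, SparkConf, SparkContext}

        // Hypothetical composite key: the natural key (userId) plus the value to sort by (logonTime).
        case class UserLogonKey(userId: String, logonTime: Long)

        object UserLogonKey {
          // Order by userId first, then logonTime, so each user's logons arrive in time order.
          implicit val ordering: Ordering[UserLogonKey] =
            Ordering.by((k: UserLogonKey) => (k.userId, k.logonTime))
        }

        // Partition on the natural key only, so every record for a user lands in the same partition.
        class UserPartitioner(partitions: Int) extends Partitioner {
          override def numPartitions: Int = partitions
          override def getPartition(key: Any): Int = key match {
            case UserLogonKey(userId, _) => (userId.hashCode & Integer.MAX_VALUE) % numPartitions
          }
        }

        object SecondarySortSketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("secondary-sort").setMaster("local[*]"))
            val logons = sc.parallelize(Seq(("bob", 1700L), ("alice", 900L), ("bob", 300L)))
            val sorted = logons
              .map { case (user, time) => (UserLogonKey(user, time), time) }
              .repartitionAndSortWithinPartitions(new UserPartitioner(2))
            sorted.collect().foreach(println)
            sc.stop()
          }
        }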


  • Spark Corner Cases

    In the last two posts, we covered alternatives to using the groupByKey method, aggregateByKey and combineByKey. In this post we are going to consider methods/situations you might not encounter in your everyday Spark job, but that will come in handy when the need arises. Stripping The First Line (or First N...
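
    For the first-line case, a minimal sketch of one common approach, assuming a header line sitting in the first partition of a text file; the "data.csv" path and the sample setup are placeholders, not details from the post.

        import org.apache.spark.{SparkConf, SparkContext}

        object DropHeaderSketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("drop-header").setMaster("local[*]"))
            // Placeholder path; the first line is assumed to be a header row.
            val lines = sc.textFile("data.csv")
            // Drop the first line of the first partition only, where the header lives.
            val withoutHeader = lines.mapPartitionsWithIndex { (index, iter) =>
              if (index == 0) iter.drop(1) else iter
            }
            withoutHeader.take(5).foreach(println)
            sc.stop()
          }
        }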


  • Spark PairRDDFunctions: CombineByKey

    Last time we covered one of the alternatives to the groupByKey function, aggregateByKey. In this post we’ll cover another alternative PairRDDFunction - combineByKey. The combineByKey function is similar in functionality to the aggregateByKey function, but is more general. But before we go into details let’s review why we’d even want...
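
    As a preview, a small Scala sketch of combineByKey computing a per-key average; the sample data and the (sum, count) combiner shape are illustrative assumptions, not the post's example.

        import org.apache.spark.{SparkConf, SparkContext}

        object CombineByKeySketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("combine-by-key").setMaster("local[*]"))
            val scores = sc.parallelize(Seq(("math", 90), ("math", 70), ("art", 85)))
            // The combiner type (sum, count) differs from the value type (Int), which is
            // the extra generality combineByKey offers over simpler per-key reductions.
            val averages = scores.combineByKey(
              (v: Int) => (v, 1),                                          // createCombiner
              (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
              (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
            ).mapValues { case (sum, count) => sum.toDouble / count }
            averages.collect().foreach(println)
            sc.stop()
          }
        }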


  • Spark PairRDDFunctions - AggregateByKey

    One of the great things about the Spark Framework is the amount of functionality provided out of the box. There is a class aimed exclusively at working with key-value pairs, the PairRDDFunctions class. When working with data in the key-value format, one of the most common operations to perform is grouping...
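
    A brief Scala sketch of aggregateByKey used in place of groupByKey for a per-key sum; the sample pairs and the sum aggregation are illustrative assumptions rather than the post's own example.

        import org.apache.spark.{SparkConf, SparkContext}

        object AggregateByKeySketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("aggregate-by-key").setMaster("local[*]"))
            val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
            // Sum values per key; the zero value and the two functions let Spark combine
            // map-side before the shuffle, unlike groupByKey which ships every value.
            val sums = pairs.aggregateByKey(0)(
              (acc, v) => acc + v, // merge a value into the per-partition accumulator
              (a, b) => a + b      // merge accumulators across partitions
            )
            sums.collect().foreach(println)
            sc.stop()
          }
        }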