• Learning Scala Implicits with Spark

    A while back I wrote two posts on avoiding the use of the groupBy function in Spark. While I won’t rehash both posts here, the bottom line was to take advantage of the combineByKey or aggregateByKey functions instead. While both functions hold the potential for improved performance and efficiency in...
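
    A minimal sketch of where those two threads meet, assuming illustrative names (RDDSyntax and averageByKey are not from the post): a Scala implicit class that enriches pair RDDs with an average-by-key method built on aggregateByKey.

    ```scala
    import org.apache.spark.rdd.RDD

    object RDDSyntax {
      // Implicit class: averageByKey becomes available wherever RDDSyntax._ is imported.
      implicit class PairRDDOps(val rdd: RDD[(String, Int)]) extends AnyVal {
        def averageByKey: RDD[(String, Double)] =
          rdd.aggregateByKey((0, 0))(
            (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold values within a partition
            (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp: merge per-partition results
          ).mapValues { case (sum, count) => sum.toDouble / count }
      }
    }
    ```

    With import RDDSyntax._ in scope, rdd.averageByKey reads like a built-in method while shuffling only compact (sum, count) pairs rather than every raw value.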


  • Spark And Guava Tables

    Last time we covered Secondary Sorting in Spark. We took airline performance data and sorted results by airline, destination airport, and the amount of delay. We used IDs for all our data. While that approach is good for performance, viewing results in that format loses meaning. Fortunately, the Bureau of...
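
    A minimal sketch of the idea with made-up lookup entries (the post itself builds the table from Bureau of Transportation reference data): broadcast a Guava Table keyed by category and id, then map the ids back to readable names.

    ```scala
    import com.google.common.collect.{HashBasedTable, Table}
    import org.apache.spark.SparkContext

    def translate(sc: SparkContext): Unit = {
      // A Guava Table is a two-dimensional map: (rowKey, columnKey) -> value.
      val lookup: Table[String, String, String] = HashBasedTable.create()
      lookup.put("airline", "UA", "United Air Lines")
      lookup.put("airport", "ORD", "Chicago O'Hare")

      // Broadcast once instead of serializing the table into every task closure.
      val names = sc.broadcast(lookup)

      val results = sc.parallelize(Seq(("UA", "ORD", 15)))
      results.map { case (airline, airport, delay) =>
        (names.value.get("airline", airline), names.value.get("airport", airport), delay)
      }.collect().foreach(println)
    }
    ```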


  • Secondary Sorting in Spark

    Secondary sorting is the technique that allows for ordering by value(s) (in addition to sorting by key) in the reduce phase of a Map-Reduce job. For example, you may want to analyze user logons to your application. Having results sorted by day and time as well as user-id (the natural...
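
    A minimal sketch of the technique with hypothetical (userId, timestamp) logon data: pack the sort value into a composite key, partition on the natural key alone, and let repartitionAndSortWithinPartitions order each partition by the full key.

    ```scala
    import org.apache.spark.Partitioner

    // Partition on userId only, so every logon for a user lands in the same partition.
    class UserPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key match {
        case (userId: String, _) =>
          val raw = userId.hashCode % numPartitions
          if (raw < 0) raw + numPartitions else raw
      }
    }

    // Assumes a SparkContext `sc` in scope.
    val logons = sc.parallelize(Seq(
      (("bob", 1453400000L), "login"),
      (("bob", 1453300000L), "login"),
      (("ann", 1453350000L), "login")
    ))

    // Tuple keys order by userId first, then timestamp: the secondary sort.
    val sorted = logons.repartitionAndSortWithinPartitions(new UserPartitioner(4))
    ```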


  • Spark Corner Cases

    In the last two posts, we covered alternatives to using the groupByKey method: aggregateByKey and combineByKey. In this post we are going to consider methods and situations you might not encounter in your everyday Spark job, but that will come in handy when the need arises. Stripping The First Line (or First N...
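
    One of those cases, sketched with a placeholder file name: dropping a CSV header via mapPartitionsWithIndex, which touches only the first partition instead of filtering every line against the header string.

    ```scala
    // Assumes a SparkContext `sc` in scope; "data.csv" is a placeholder path.
    val lines = sc.textFile("data.csv")

    // sc.textFile puts the header in partition 0, so skip one element only there.
    val noHeader = lines.mapPartitionsWithIndex { (index, iter) =>
      if (index == 0) iter.drop(1) else iter
    }
    ```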


  • Spark PairRDDFunctions: CombineByKey

    Last time we covered one of the alternatives to the groupByKey function, aggregateByKey. In this post we’ll cover another alternative PairRDDFunction: combineByKey. The combineByKey function is similar in functionality to the aggregateByKey function, but is more general. But before we go into details, let’s review why we’d even want...
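
    A minimal sketch of the three functions combineByKey takes (createCombiner, mergeValue, and mergeCombiners are the actual parameter names; the data is made up), computing per-key averages:

    ```scala
    // Assumes a SparkContext `sc` in scope.
    val delays = sc.parallelize(Seq(("UA", 10), ("UA", 20), ("DL", 5)))

    val sumCounts = delays.combineByKey(
      (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold a value into the accumulator
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge across partitions
    )

    val averages = sumCounts.mapValues { case (sum, count) => sum.toDouble / count }
    ```

    Unlike groupByKey, only the compact (sum, count) combiners cross the network during the shuffle.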