  • Spark And Guava Tables

    Last time we covered Secondary Sorting in Spark. We took airline performance data and sorted results by airline, destination airport, and the amount of delay. We used IDs for all our data. While that approach is good for performance, viewing results in that format loses meaning. Fortunately, the Bureau of...


  • Secondary Sorting In Spark

    Secondary sorting is the technique that allows for ordering by value(s) (in addition to sorting by key) in the reduce phase of a Map-Reduce job. For example, you may want to analyze user logons to your application. Having results sorted by day and time as well as user-id (the natural...
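
    As a rough illustration of the idea in Spark/Scala: build a composite key, partition on the natural key only, and sort within partitions. This is only a minimal sketch; the UserLogonKey class, UserPartitioner, and sample data are hypothetical and not taken from the post.

        import org.apache.spark.{Partitioner, SparkConf, SparkContext}

        // Hypothetical composite key: the natural key (userId) plus the value to sort by (logonTime).
        case class UserLogonKey(userId: String, logonTime: Long)

        object UserLogonKey {
          // Order by userId first, then logonTime, so each user's logons arrive in time order.
          implicit val ordering: Ordering[UserLogonKey] =
            Ordering.by((k: UserLogonKey) => (k.userId, k.logonTime))
        }

        // Partition on the natural key only, so every record for a user lands in the same partition.
        class UserPartitioner(partitions: Int) extends Partitioner {
          override def numPartitions: Int = partitions
          override def getPartition(key: Any): Int = key match {
            case UserLogonKey(userId, _) => (userId.hashCode & Integer.MAX_VALUE) % numPartitions
          }
        }

        object SecondarySortSketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("secondary-sort").setMaster("local[*]"))
            val logons = sc.parallelize(Seq(("bob", 1700L), ("alice", 900L), ("bob", 300L)))
            val sorted = logons
              .map { case (user, time) => (UserLogonKey(user, time), time) }
              .repartitionAndSortWithinPartitions(new UserPartitioner(2))
            sorted.collect().foreach(println)
            sc.stop()
          }
        }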


  • Spark Corner Cases

    In the last two posts, we covered alternatives to using the groupByKey method, aggregateByKey and combineByKey. In this post we are going to consider methods/situations you might not encounter in your everyday Spark job, but that will come in handy when the need arises. Stripping The First Line (or First N...
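
    For the first-line case, a minimal sketch of one common approach, assuming a header line sitting in the first partition of a text file; the "data.csv" path and the sample setup are placeholders, not details from the post.

        import org.apache.spark.{SparkConf, SparkContext}

        object DropHeaderSketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("drop-header").setMaster("local[*]"))
            // Placeholder path; the first line is assumed to be a header row.
            val lines = sc.textFile("data.csv")
            // Drop the first line of the first partition only, where the header lives.
            val withoutHeader = lines.mapPartitionsWithIndex { (index, iter) =>
              if (index == 0) iter.drop(1) else iter
            }
            withoutHeader.take(5).foreach(println)
            sc.stop()
          }
        }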


  • Spark PairRDDFunctions: CombineByKey

    Last time we covered one of the alternatives to the groupByKey function, aggregateByKey. In this post we’ll cover another alternative PairRDDFunction - combineByKey. The combineByKey function is similar in functionality to the aggregateByKey function, but is more general. But before we go into details let’s review why we’d even want...
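
    As a preview, a small Scala sketch of combineByKey computing a per-key average; the sample data and the (sum, count) combiner shape are illustrative assumptions, not the post's example.

        import org.apache.spark.{SparkConf, SparkContext}

        object CombineByKeySketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("combine-by-key").setMaster("local[*]"))
            val scores = sc.parallelize(Seq(("math", 90), ("math", 70), ("art", 85)))
            // The combiner type (sum, count) differs from the value type (Int), which is
            // the extra generality combineByKey offers over simpler per-key reductions.
            val averages = scores.combineByKey(
              (v: Int) => (v, 1),                                          // createCombiner
              (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
              (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
            ).mapValues { case (sum, count) => sum.toDouble / count }
            averages.collect().foreach(println)
            sc.stop()
          }
        }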


  • Spark PairRDDFunctions - AggregateByKey

    One of the great things about the Spark Framework is the amount of functionality provided out of the box. There is a class aimed exclusively at working with key-value pairs, the PairRDDFunctions class. When working with data in the key-value format, one of the most common operations to perform is grouping...
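
    A brief Scala sketch of aggregateByKey used in place of groupByKey for a per-key sum; the sample pairs and the sum aggregation are illustrative assumptions rather than the post's own example.

        import org.apache.spark.{SparkConf, SparkContext}

        object AggregateByKeySketch {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("aggregate-by-key").setMaster("local[*]"))
            val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
            // Sum values per key; the zero value and the two functions let Spark combine
            // map-side before the shuffle, unlike groupByKey which ships every value.
            val sums = pairs.aggregateByKey(0)(
              (acc, v) => acc + v, // merge a value into the per-partition accumulator
              (a, b) => a + b      // merge accumulators across partitions
            )
            sums.collect().foreach(println)
            sc.stop()
          }
        }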