• Learning Scala Implicits with Spark

    A while back I wrote two posts on avoiding the use of the groupBy function in Spark. While I won’t rehash both posts here, the bottom line was to take advantage of the combineByKey or aggregateByKey functions instead. While both functions hold the potential for improved performance and efficiency in...
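
    A minimal sketch of where those two threads meet, assuming illustrative names (RDDSyntax and averageByKey are not from the post): a Scala implicit class that enriches pair RDDs with an average-by-key method built on aggregateByKey.

    ```scala
    import org.apache.spark.rdd.RDD

    object RDDSyntax {
      // Implicit class: averageByKey becomes available wherever RDDSyntax._ is imported.
      implicit class PairRDDOps(val rdd: RDD[(String, Int)]) extends AnyVal {
        def averageByKey: RDD[(String, Double)] =
          rdd.aggregateByKey((0, 0))(
            (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold values within a partition
            (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp: merge per-partition results
          ).mapValues { case (sum, count) => sum.toDouble / count }
      }
    }
    ```

    With import RDDSyntax._ in scope, rdd.averageByKey reads like a built-in method while shuffling only compact (sum, count) pairs rather than every raw value.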


  • Spark And Guava Tables

    Last time we covered Secondary Sorting in Spark. We took airline performance data and sorted results by airline, destination airport, and the amount of delay. We used IDs for all our data. While that approach is good for performance, viewing results in that format loses meaning. Fortunately, the Bureau of...
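
    A minimal sketch of the idea with made-up lookup entries (the post itself builds the table from Bureau of Transportation reference data): broadcast a Guava Table keyed by category and id, then map the ids back to readable names.

    ```scala
    import com.google.common.collect.{HashBasedTable, Table}
    import org.apache.spark.SparkContext

    def translate(sc: SparkContext): Unit = {
      // A Guava Table is a two-dimensional map: (rowKey, columnKey) -> value.
      val lookup: Table[String, String, String] = HashBasedTable.create()
      lookup.put("airline", "UA", "United Air Lines")
      lookup.put("airport", "ORD", "Chicago O'Hare")

      // Broadcast once instead of serializing the table into every task closure.
      val names = sc.broadcast(lookup)

      val results = sc.parallelize(Seq(("UA", "ORD", 15)))
      results.map { case (airline, airport, delay) =>
        (names.value.get("airline", airline), names.value.get("airport", airport), delay)
      }.collect().foreach(println)
    }
    ```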


  • Secondary Sorting in Spark

    Secondary sorting is the technique that allows for ordering by value(s) (in addition to sorting by key) in the reduce phase of a Map-Reduce job. For example, you may want to analyze user logons to your application. Having results sorted by day and time as well as user-id (the natural...
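
    A minimal sketch of the technique with hypothetical (userId, timestamp) logon data: pack the sort value into a composite key, partition on the natural key alone, and let repartitionAndSortWithinPartitions order each partition by the full key.

    ```scala
    import org.apache.spark.Partitioner

    // Partition on userId only, so every logon for a user lands in the same partition.
    class UserPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key match {
        case (userId: String, _) =>
          val raw = userId.hashCode % numPartitions
          if (raw < 0) raw + numPartitions else raw
      }
    }

    // Assumes a SparkContext `sc` in scope.
    val logons = sc.parallelize(Seq(
      (("bob", 1453400000L), "login"),
      (("bob", 1453300000L), "login"),
      (("ann", 1453350000L), "login")
    ))

    // Tuple keys order by userId first, then timestamp: the secondary sort.
    val sorted = logons.repartitionAndSortWithinPartitions(new UserPartitioner(4))
    ```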


  • Spark Corner Cases

    In the last two posts, we covered alternatives to using the groupByKey method: aggregateByKey and combineByKey. In this post we are going to consider methods and situations you might not encounter in your everyday Spark job, but that will come in handy when the need arises. Stripping The First Line (or First N...
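
    One of those cases, sketched with a placeholder file name: dropping a CSV header via mapPartitionsWithIndex, which touches only the first partition instead of filtering every line against the header string.

    ```scala
    // Assumes a SparkContext `sc` in scope; "data.csv" is a placeholder path.
    val lines = sc.textFile("data.csv")

    // sc.textFile puts the header in partition 0, so skip one element only there.
    val noHeader = lines.mapPartitionsWithIndex { (index, iter) =>
      if (index == 0) iter.drop(1) else iter
    }
    ```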


  • Spark PairRDDFunctions: CombineByKey

    Last time we covered one of the alternatives to the groupByKey function, aggregateByKey. In this post we’ll cover another alternative PairRDDFunction: combineByKey. The combineByKey function is similar in functionality to the aggregateByKey function, but is more general. But before we go into details, let’s review why we’d even want...
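
    A minimal sketch of the three functions combineByKey takes (createCombiner, mergeValue, and mergeCombiners are the actual parameter names; the data is made up), computing per-key averages:

    ```scala
    // Assumes a SparkContext `sc` in scope.
    val delays = sc.parallelize(Seq(("UA", 10), ("UA", 20), ("DL", 5)))

    val sumCounts = delays.combineByKey(
      (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold a value into the accumulator
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge across partitions
    )

    val averages = sumCounts.mapValues { case (sum, count) => sum.toDouble / count }
    ```

    Unlike groupByKey, only the compact (sum, count) combiners cross the network during the shuffle.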