A while back I wrote two posts on avoiding the use of the groupBy function in Spark. While I won't re-hash both posts here, the bottom line was to take advantage of the combineByKey or aggregateByKey functions instead. While both functions hold the potential for improved performance and efficiency in our Spark jobs, creating the required arguments over and over for basic use cases can get tedious. That got me thinking: is there a way to provide some level of abstraction for basic use cases, such as grouping values into a list or set? At the same time, I've been trying to expand my knowledge of Scala's more advanced features, including implicits and type classes. What I came up with is the GroupingRDDFunctions class, which uses Scala's implicit class functionality to provide some syntactic sugar for basic use cases of the aggregateByKey function.
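To give a feel for the idea, here is a minimal sketch of what such an implicit class could look like. The class name GroupingRDDFunctions comes from the post, but the groupByKeyToList method shown here is only an assumed example of the kind of sugar it provides:

```scala
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer
import scala.reflect.ClassTag

object GroupingFunctions {
  // Implicit class that hides the aggregateByKey boilerplate for the common
  // "group all values for a key into a list" case.
  implicit class GroupingRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
    def groupByKeyToList: RDD[(K, List[V])] =
      self.aggregateByKey(ListBuffer.empty[V])(
        (buf, v) => buf += v,   // add a value to the buffer within a partition
        (b1, b2) => b1 ++= b2   // merge buffers across partitions
      ).mapValues(_.toList)
  }
}

// Usage: import GroupingFunctions._ and then call pairs.groupByKeyToList
// on any RDD of key-value pairs.
```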
Spark and Guava Tables
Last time we covered Secondary Sorting in Spark. We took airline performance data and sorted results by airline, destination airport and the amount of delay. We used ids for all our data. While that approach is good for performance, viewing results in that format loses meaning. Fortunately, the Bureau of Transportation site offers reference files to download. The reference files are in CSV format, with each line consisting of a key-value pair. Our goal is to store the reference data in hashmaps and leverage broadcast variables so all operations on different partitions will have easy access to the same data. We have five fields with codes: airline, origin airport, origin city, destination airport and destination city. Two of our code fields use the same reference file (airport id), so we'll need to download 3 files. But is there an easier approach than loading 3 files into hashmaps and having 3 separate broadcast variables? There is, by using Guava Tables.
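As a rough sketch of that idea (the reference-file names and the "id,description" row layout with a header line are assumptions here, not the post's actual code), the three CSVs can be loaded into one Guava Table keyed by (file name, id), so a single broadcast variable serves every lookup:

```scala
import com.google.common.collect.{HashBasedTable, Table}
import scala.io.Source

object ReferenceData {
  // Load each reference CSV into one Guava Table: row = file name,
  // column = code, value = description.
  def loadReferenceTable(files: Seq[String]): Table[String, String, String] = {
    val table: Table[String, String, String] = HashBasedTable.create()
    for (file <- files) {
      val source = Source.fromFile(file)
      try {
        for (line <- source.getLines().drop(1)) {          // assumes a header row
          val Array(id, description) = line.split(",", 2)  // assumes "id,description"
          table.put(file, id.replace("\"", ""), description.replace("\"", ""))
        }
      } finally {
        source.close()
      }
    }
    table
  }
}

// val refData = sc.broadcast(ReferenceData.loadReferenceTable(
//   Seq("airlines.csv", "airports.csv", "cities.csv")))   // hypothetical file names
// ...later, inside a transformation:
// refData.value.get("airports.csv", destAirportId)
```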
Secondary Sorting in Spark
Secondary sorting is the technique that allows for ordering by value(s) (in addition to sorting by key) in the reduce phase of a Map-Reduce job. For example, you may want to analyze user logons to your application. Having results sorted by day and time as well as user-id (the natural key) will help to spot user trends. The additional ordering by day and time is an example of secondary sorting. While I have written before on Secondary Sorting in Hadoop, this post is going to cover how we perform secondary sorting in Spark.
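The general pattern, sketched below with an illustrative logon example rather than the post's actual airline data, is to build a composite key, partition on the natural key only, and let Spark sort each partition by the full composite key:

```scala
import org.apache.spark.Partitioner

// Composite key: natural key (userId) plus the values we want to sort by.
case class LogonKey(userId: String, day: String, time: String)

object LogonKey {
  // Ordering used to sort records within each partition.
  implicit val ordering: Ordering[LogonKey] =
    Ordering.by((k: LogonKey) => (k.userId, k.day, k.time))
}

// Partition by the natural key only, so all records for a user land
// in the same partition regardless of day and time.
class UserPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[LogonKey]
    math.abs(k.userId.hashCode) % numPartitions
  }
}

// Given an RDD[(LogonKey, SomeRecord)] named keyedLogons:
// val sorted = keyedLogons.repartitionAndSortWithinPartitions(new UserPartitioner(10))
```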
Spark Corner Cases
In the last two posts, we covered alternatives to using the groupByKey method: aggregateByKey and combineByKey. In this post we are going to consider methods and situations you might not encounter in your everyday Spark job, but that will come in handy when the need arises.
Stripping The First Line (or First N Lines) of a File
While doing some testing, I wanted to strip off the first line of a file. Granted, I could have easily created a copy with the header removed, but curiosity got the better of me. Just how would I remove the first line of a file? The answer is to use the mapPartitionsWithIndex method.
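A minimal sketch of that approach, assuming an existing SparkContext named sc and a placeholder file path: drop the first line of partition 0 only, since the first line of the file always ends up in the first partition.

```scala
// Strip the header by skipping one element of the iterator for partition 0.
val withoutHeader = sc.textFile("data.csv").mapPartitionsWithIndex {
  (index, lines) => if (index == 0) lines.drop(1) else lines
}
```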
Spark PairRDDFunctions: CombineByKey
Last time we covered one of the alternatives to the groupByKey function: aggregateByKey. In this post we'll cover another alternative PairRDDFunction, combineByKey. The combineByKey function is similar in functionality to the aggregateByKey function, but is more general. Before we go into the details, let's review why we'd even want to avoid using groupByKey. When working in a map-reduce framework such as Spark or Hadoop, one of the steps we can take to ensure maximum performance is to limit the amount of data sent across the network during the shuffle phase. The best case is when all operations can be performed on the map side exclusively, meaning no data is sent to reducers at all. In most cases, though, it's not going to be realistic to do map-side operations only. If you need to do any sort of grouping, sorting or aggregation, you'll need to send data to reducers. But that doesn't mean we still can't attempt to make some optimizations.
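To make the comparison concrete, here is an illustrative sketch with made-up sample data, assuming an existing SparkContext named sc. The three functions passed to combineByKey create the initial list for a key, add a value within a partition, and merge lists across partitions:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

val grouped = pairs.combineByKey(
  (v: Int) => List(v),                      // createCombiner: start a list for a new key
  (acc: List[Int], v: Int) => v :: acc,     // mergeValue: add a value within a partition
  (a: List[Int], b: List[Int]) => a ::: b   // mergeCombiners: merge lists across partitions
)

grouped.collect().foreach(println)          // e.g. (a,List(3, 1)) and (b,List(4, 2))
```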