Secondary sorting is the technique that allows for ordering by value(s) (in addition to sorting by key) in the reduce phase of a Map-Reduce job. For example, you may want to anyalize user logons to your application. Having results sorted by day and time as well as user-id (the natural key) will help to spot user trends. The additional ordering by day and time is an example of secondary sorting. While I have written before on Secondary Sorting in Hadoop, this post is going to cover how we perform secondary sorting in Spark.
In the last two posts, we covered alternatives to using the
combineByKey. In this post we are going to consider methods/situtaions you might not encounter for your everyday Spark job, but will come in handy when the need arises.
Stripping The First Line (or First N Lines) of a File
While doing some testing, I wanted to strip off the first line of a file. Granted I could have easily created a copy with the header removed, but curiosity got the better of me. Just how would I remove the first line of a file? The answer is to use the
Last time we covered one of the alternatives to the
aggregateByKey. In this post we’ll cover another alternative PairRDDFunction –
combineByKey function is similar in functionality to the
aggregateByKey function, but is more general. But before we go into details let’s review why we’d even want to avoid using
groupByKey. When working in a map-reduce framework such Spark or Hadoop one of the steps we can take to ensure maximum performance is to limit the amount of data sent accross the network during the shuffle phase. The best option is when all operations can be performed on the map-side exclusively, meaning no data is sent at all to reducers. In most cases though, it’s not going to be realistic to do map-side operations only. If you need to do any sort of grouping, sorting or aggregation you’ll need to send data to reducers. But that doesn’t mean we still can’t attempt to make some optimizations.
One of the great things about the Spark Framework is the amout of functionality provided out of the box. There is a class aimed exclusively at working with key-value pairs, the PairRDDFunctions class. When working data in the key-value format one of the most common operations to perform is grouping values by key. The PairRDDFunctions class provides a
groupByKey function that makes grouping by key trivial. However,
groupByKey is very expensive and depending on the use case, better alternatives are available. In a
groupByKey call, all key-value pairs will be shuffled accross the network to a reducer where the values are collected together. In some cases the
groupByKey is merely a starting point to perform additional operations (sum, average) by key. In other cases, we need to collect the values together in order to return a different value type. Spark provides some alternatives for grouping that can provide either a performance improvement or ease the ability to combine values into a different type. The point of this post is to consider one of these alternate grouping functions.
Last year I completed an intro to functional progamming course on edX. The language used in the course was haskell. I found working in haskell enjoyable. One of my favorite features is functions taking more than one parameter can be partially applied functions automatically. For example, if you have a function expecting 3 parameters you can pass only the first parameter a function expecting the other two is returned. But you could supply only one more parameter and a function that accepts the final one will be returned (I believe this is the default behavior for all functions in haskell). I have used partially applied functions before from working in scala, but for some reason, this time the power and implications made more of an impression on me. For a better explaination of functions and partially applied functions in haskell, go to Learn You a Haskell. Now we need to move on to the point of this post, how can we achieve this behavior in java 8.