In the last two posts, we covered alternatives to the groupByKey method: aggregateByKey and combineByKey. In this post we are going to look at methods and situations you might not encounter in your everyday Spark job, but that will come in handy when the need arises.
Stripping The First Line (or First N Lines) of a File
While doing some testing, I wanted to strip off the first line of a file. Granted, I could have easily created a copy with the header removed, but curiosity got the better of me. Just how would I remove the first line of a file? The answer is to use the mapPartitionsWithIndex method.
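Here’s a minimal sketch of the approach (the file path and Spark setup are placeholders, not the original listing):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SkipLinesExample {

  // Drop the first `n` elements, but only in the partition that holds the start of the file
  def skipLines(index: Int, lines: Iterator[String], n: Int): Iterator[String] =
    if (index == 0) lines.drop(n) else lines

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("SkipLinesExample"))

    val rdd = sc.textFile("data/file-with-header.txt") // placeholder path
    val withoutHeader = rdd.mapPartitionsWithIndex((index, iter) => skipLines(index, iter, 1))

    withoutHeader.collect().foreach(println)
    sc.stop()
  }
}
```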
Printing the resulting RDD confirms the first line of the file has been removed while the remaining lines are untouched.
The mapPartitionsWithIndex method passes the index (zero based) of the partition and an iterator over the partition’s data to a user-provided function (here skipLines). Because the start of the file lives in partition 0, dropping lines only in that partition removes them from the head of the file. Using this example we could remove any number of lines, or work with the entire partition at one time (but that’s the subject of another post!).
Avoiding Lists of Iterators
Often when reading in a file, we want to work with the individual values contained in each line separated by some delimiter. Splitting a delimited line is a trivial operation:
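Something along these lines, assuming a comma-delimited file at a placeholder path:

```scala
val lines = sc.textFile("data/values.csv")     // placeholder path
val split = lines.map(line => line.split(",")) // RDD[Array[String]]
```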
But the issue here is that the returned RDD is a collection of collections: each element is itself an array of values. What we want are the individual values obtained after calling the split function. In other words, collecting the results should give us an Array[String], not an Array[Array[String]]. For this we use the flatMap function. For those with a functional programming background, using a flatMap operation is nothing new. But if you are new to functional programming it’s a great operation to become familiar with.
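Here’s a small self-contained sketch comparing the two; the sample data is made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FlatMapExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("FlatMapExample"))

    // Three delimited "lines" standing in for a file read with sc.textFile
    val lines = sc.parallelize(Seq("a,b,c", "d,e,f", "g,h,i"))

    val mapped = lines.map(_.split(","))         // RDD[Array[String]]
    val flatMapped = lines.flatMap(_.split(",")) // RDD[String]

    println("map:     " + mapped.collect().map(_.mkString("[", ",", "]")).mkString(" "))
    println("flatMap: " + flatMapped.collect().mkString("[", ",", "]"))

    sc.stop()
  }
}
```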
When we run the program, the map example returns an Array containing 3 Array[String] instances (one per line), while the flatMap call returns the individual values contained in one flat Array.
Maintaining An RDD’s Original Partitioning
When working in Spark, you quickly come up to speed with the map function. It’s used to take a collection of values and ‘map’ them into another type. Typically when working with key-value pairs, we don’t need the key to remain the same type, if we need to keep the key at all. However, there are cases where we need to keep the RDD’s original partitioning. This presents a problem: because calling the map function gives no guarantee that the key has not been modified, the resulting RDD loses its partitioner. What’s our option? We use the mapValues function (on the PairRDDFunctions class), which allows us to change the values while retaining the original keys and partitioning of the RDD.
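Here’s a sketch of what that looks like; the data and partitioner are made up for illustration:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MapValuesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("MapValuesExample"))

    // Key-value pairs with an explicit partitioner
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      .partitionBy(new HashPartitioner(2))

    // map rebuilds the whole pair, so Spark drops the partitioner
    val viaMap = pairs.map { case (k, v) => (k, v * 10) }
    // mapValues only touches the value, so the partitioner is kept
    val viaMapValues = pairs.mapValues(_ * 10)

    println(viaMap.collect().toList)       // same values...
    println(viaMapValues.collect().toList) // ...as here
    println(viaMap.partitioner)            // None
    println(viaMapValues.partitioner)      // Some(HashPartitioner@...)

    sc.stop()
  }
}
```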
Running the example shows that both transformations produce exactly the same values; the difference is that only the mapValues version retains the original partitioner.
Conclusion
This has been a short tour of some methods we can reach for when we encounter unfamiliar situations while writing Spark jobs. Thanks for your time.