A while back I wrote two posts on avoiding the use of the
groupBy function in Spark. While I won’t re-hash both posts here, the bottom line was to take advantage of the combineByKey or aggregateByKey functions instead. While both functions hold the potential for improved performance and efficiency in our Spark jobs, creating the required arguments over and over for basic use cases can get tedious. It got me thinking: is there a way of providing some level of abstraction for basic use cases, such as grouping values into a list or set? At the same time, I’ve been trying to expand my knowledge of Scala’s more advanced features, including implicits and type classes. What I came up with is the GroupingRDDFunctions class that provides some syntactic sugar for basic use cases of the
aggregateByKey function by using Scala’s implicit class functionality.
Scala Implicits in Brief
While a full explanation of Scala’s implicits is beyond the scope of this post, here’s a quick description. When the Scala compiler finds a variable or expression of the wrong type, it will look for an
implicit function, expression or class to provide the correct type. The
implicit function (or class) needs to be in the current scope for the compiler to do its work. This is typically accomplished by importing a Scala object that contains the implicit definition(s). In this case the
GroupingRDDFunctions class is wrapped in the
GroupingRDDUtils object. Here’s the class declaration:
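For a concrete picture, the following is a minimal sketch of an implicit class wrapped in an object, assuming Spark's RDD API is on the classpath. It is not the actual GroupingRDDFunctions source (that is linked at the end of this post); the method name valuesToList and the GroupingDemo object are hypothetical and exist only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.reflect.ClassTag

object GroupingRDDUtils {
  // The ClassTag context bounds are needed by Spark's own implicit conversion
  // from RDD[(K, V)] to PairRDDFunctions, which is where aggregateByKey lives.
  implicit class GroupingRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {

    // Hypothetical method name: collect every value for a key into a List.
    def valuesToList: RDD[(K, List[V])] =
      self.aggregateByKey(List.empty[V])(
        (acc, value) => value :: acc,   // seqOp: add one value within a partition
        (left, right) => left ::: right // combOp: merge partial lists across partitions
      )
  }
}

object GroupingDemo {
  import GroupingRDDUtils._ // brings the implicit class into scope

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("grouping-sketch")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))
      // valuesToList is not defined on RDD; the compiler wraps `pairs` for us.
      val grouped = pairs.valuesToList.collectAsMap()
      println(grouped.map { case (k, vs) => k -> vs.sorted })
    } finally sc.stop()
  }
}
```

The import inside GroupingDemo is the only thing the calling code has to add; without it the call to valuesToList would not compile.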
To use GroupingRDDFunctions just use the following import statement:
The methods defined on GroupingRDDFunctions include:
- sumWithTotal – provides a tuple with the summation of the numeric values as a Double, along with the total count of items used to create the sum
Here are some examples of using GroupingRDDFunctions:
There is really nothing special happening here. We are simply wrapping an RDD instance and providing the ability to use the methods listed above on that RDD instance. Within the
GroupingRDDFunctions class we are still leveraging the aggregateByKey function.
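To make the wrapping concrete without needing Spark, here is a small sketch that applies the same pattern to a plain Seq of pairs. The names SeqGroupingUtils, SeqGroupingFunctions and countPerKey are made up for illustration, and the body uses ordinary collection methods rather than aggregateByKey. The point is only that the sugared call and the explicit wrapper call are the same thing.

```scala
object SeqGroupingUtils {
  // Hypothetical stand-in for GroupingRDDFunctions: wraps a Seq instead of an RDD.
  implicit class SeqGroupingFunctions[K, V](self: Seq[(K, V)]) {
    // Count how many values appear for each key.
    def countPerKey: Map[K, Int] =
      self.groupBy(_._1).map { case (key, pairs) => key -> pairs.size }
  }
}

object WrappingDemo {
  import SeqGroupingUtils._

  val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3)

  // What we write: countPerKey looks like a method on Seq.
  val viaSugar: Map[String, Int] = pairs.countPerKey

  // Roughly what the compiler generates: an explicit wrapper around the Seq.
  val viaExplicitWrap: Map[String, Int] = new SeqGroupingFunctions(pairs).countPerKey
}
```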
Implicit Parameter Conversion
As another example of implicit usage, let’s take a look at the
averageByKey function. In the code below, we compute the average by key by applying the
averagingFunction to the results returned from the
sumWithTotal method. But if we look closely, our keys and values are generics of ‘K’ and ‘V’, but all of these functions work on doubles.
So what happens if the values provided are integers instead of doubles? Also take a look at the
incrementCountSumValue method: how can a value of type ‘V’ be added to the double value of the tuple? This is a good example of using an implicit function. The compiler will look for and find the
intToDouble function and apply it to the parameter of the
incrementCountSumValue method. If the value is an integer, it’s implicitly converted to a double; otherwise it is returned as a double.
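The conversion described above can be sketched in plain Scala, with a Seq standing in for the RDD so that it runs without Spark. This is a reconstruction under assumptions, not the original listing: the class name AveragingSketch is hypothetical, the tuple order (sum, count) is a guess based on the description of sumWithTotal, and a foldLeft replaces the aggregateByKey call.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for the RDD-backed class; wraps a Seq of key/value pairs.
class AveragingSketch[K, V](pairs: Seq[(K, V)]) {

  // Implicit view from V to Double. Only Int and Double are handled in this
  // sketch; any other value type would fail at runtime with a MatchError.
  implicit def intToDouble(value: V): Double = value match {
    case i: Int    => i.toDouble
    case d: Double => d
  }

  // Folds one value into the running (sum, count) for a key.
  private def incrementCountSumValue(acc: (Double, Int), value: V): (Double, Int) = {
    // A Double is expected but `value` is a V, so the compiler inserts
    // intToDouble(value) here.
    val asDouble: Double = value
    (acc._1 + asDouble, acc._2 + 1)
  }

  // Per key: the sum of the values as a Double and the number of values seen.
  def sumWithTotal: Map[K, (Double, Int)] =
    pairs.foldLeft(Map.empty[K, (Double, Int)]) { case (acc, (key, value)) =>
      acc.updated(key, incrementCountSumValue(acc.getOrElse(key, (0.0, 0)), value))
    }

  // Applies the averaging step to the results of sumWithTotal.
  def averageByKey: Map[K, Double] =
    sumWithTotal.map { case (key, (sum, count)) => key -> sum / count }
}
```

With integer values such as `new AveragingSketch(Seq("a" -> 1, "a" -> 2))`, V is inferred as Int and every value passes through intToDouble before being added to the running sum. A stricter alternative, not used in this post, would be to require a `Numeric[V]` instance so that unsupported value types are rejected at compile time.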
While the use of implicits in Scala needs to be judicious, the example presented here represents a good use-case in my opinion. We are adding some useful behavior to a class, just by adding an import statement. Plus it’s very easy to inspect the implicit class to see what’s going on under the covers.
- Scala Doc Description of Implicit Classes
- Good Explanation of Implicit Precedence
- Good Explanation of Implicits and Type Classes
- Neophyte’s Guide to Scala – Implicits, Type Classes
- tresata/spark-sorted uses implicits to add functionality.
- GroupingRDDFunctions source code
- GroupingRDDFunctions unit test