Random Thoughts on Coding

Whatever comes to mind at the moment.

Working With Java 8 Optionals

In this post we are going to cover working with the Optional class introduced in Java 8. Optional was new only to Java itself: Guava has had its own version of Optional, and Scala has had the Option type, for some time. Here’s a description of Optional from the Java API docs:

A container object which may or may not contain a non-null value.

Since so many others have done a good job of describing the Optional type, we won’t be doing so here. Rather, in this post we are going to cover how to use the Optional type without directly accessing the contained value or explicitly checking whether a value is present.
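As a rough illustration of that style, here is a minimal sketch that chains flatMap, map, filter and orElse instead of calling get() or isPresent(). The User and Address classes and the cityOf method are hypothetical names invented for this example; they are not taken from the post.

```java
import java.util.Optional;

class OptionalDemo {
    // Hypothetical domain types, invented for this illustration only.
    static class Address {
        private final String city;
        Address(String city) { this.city = city; }
        Optional<String> getCity() { return Optional.ofNullable(city); }
    }

    static class User {
        private final Address address;
        User(Address address) { this.address = address; }
        Optional<Address> getAddress() { return Optional.ofNullable(address); }
    }

    // Walks user -> address -> city without get() or isPresent().
    // Any missing link, or a blank city, falls through to the default.
    static String cityOf(User user) {
        return Optional.ofNullable(user)
                .flatMap(User::getAddress)
                .flatMap(Address::getCity)
                .map(String::trim)
                .filter(city -> !city.isEmpty())
                .orElse("unknown");
    }

    public static void main(String[] args) {
        System.out.println(cityOf(new User(new Address(" Boston "))));
        System.out.println(cityOf(new User(null)));
        System.out.println(cityOf(null));
    }
}
```

The point of the chain is that each step only runs when the previous one produced a value, so the null handling lives in one place (the final orElse) rather than in scattered checks.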

Guava ImmutableCollections, Multimaps and Java 8 Collectors

In this post we are going to discuss creating custom Collector instances. The Collector interface was introduced in the java.util.stream package when Java 8 was released. A Collector is used to “collect” the results of stream operations. Results are collected from a stream when the terminal operation Stream.collect is called. While default implementations are available, there are times when we’ll want to use a custom container. Our goal today will be to create Collector instances that produce Guava ImmutableCollections and Multimaps.
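To show the general shape of a custom Collector, here is a minimal sketch built with Collector.of. To stay free of external dependencies it finishes with Collections.unmodifiableList from the JDK as a stand-in for a Guava immutable collection; a Guava-based version would presumably accumulate into a Guava builder instead. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

class ImmutableCollectors {
    // Collector.of takes four parts: a supplier for the mutable container,
    // an accumulator that adds one element, a combiner that merges two
    // containers (used by parallel streams), and a finisher that converts
    // the container into the final result.
    static <T> Collector<T, List<T>, List<T>> toUnmodifiableList() {
        return Collector.of(
                ArrayList::new,
                List::add,
                (left, right) -> { left.addAll(right); return left; },
                Collections::unmodifiableList);
    }

    public static void main(String[] args) {
        List<String> letters = Stream.of("a", "b", "c")
                .collect(ImmutableCollectors.toUnmodifiableList());
        System.out.println(letters);
        // letters.add("d") would throw UnsupportedOperationException.
    }
}
```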

Learning Scala Implicits With Spark

A while back I wrote two posts on avoiding the use of the groupBy function in Spark. While I won’t rehash both posts here, the bottom line was to take advantage of the combineByKey or aggregateByKey functions instead. While both functions hold the potential for improved performance and efficiency in our Spark jobs, creating the required arguments over and over for basic use cases can get tedious. That got me thinking: is there a way of providing some level of abstraction for basic use cases, such as grouping values into a list or set? At the same time, I’ve been trying to expand my knowledge of Scala’s more advanced features, including implicits and type classes. What I came up with is the GroupingRDDFunctions class, which provides some syntactic sugar for basic use cases of the aggregateByKey function by using Scala’s implicit class functionality.
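For readers unfamiliar with the arguments that get repetitive, the following is a plain-Java imitation of the aggregateByKey contract: a zero value, a function that folds one value into a per-key accumulator, and a function that merges two accumulators. It does not use Spark or Scala implicits, and it is not the GroupingRDDFunctions class; partitions are simulated as nested lists, and every name here is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Supplier;

class AggregateByKeySketch {
    // Imitates the three-argument shape of aggregateByKey on in-memory lists.
    static <K, V, U> Map<K, U> aggregateByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Supplier<U> zero,
            BiFunction<U, V, U> seqOp,
            BinaryOperator<U> combOp) {
        Map<K, U> merged = new HashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            // Fold each partition locally, starting every key from the zero value.
            Map<K, U> local = new HashMap<>();
            for (Map.Entry<K, V> pair : partition) {
                K key = pair.getKey();
                U soFar = local.containsKey(key) ? local.get(key) : zero.get();
                local.put(key, seqOp.apply(soFar, pair.getValue()));
            }
            // Merge the per-partition accumulators key by key.
            local.forEach((key, value) -> merged.merge(key, value, combOp));
        }
        return merged;
    }

    // The "basic use case" from the text: group values into a set per key.
    static Map<String, Set<Integer>> groupIntoSets(
            List<List<Map.Entry<String, Integer>>> partitions) {
        return aggregateByKey(
                partitions,
                TreeSet::new,
                (set, value) -> { set.add(value); return set; },
                (left, right) -> { left.addAll(right); return left; });
    }

    // Small helper so the sample data below stays readable.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        System.out.println(groupIntoSets(Arrays.asList(
                Arrays.asList(pair("a", 1), pair("b", 2)),
                Arrays.asList(pair("a", 3), pair("a", 1)))));
    }
}
```

Wrapping the three arguments in a helper such as groupIntoSets is the same kind of convenience the post is after, only expressed here as an ordinary static method rather than as an implicit class.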

Spark and Guava Tables

Last time we covered Secondary Sorting in Spark. We took airline performance data and sorted results by airline, destination airport and the amount of delay. We used IDs for all our data. While that approach is good for performance, results in that format are hard to interpret. Fortunately, the Bureau of Transportation site offers reference files to download. The reference files are in CSV format, with each line consisting of a key-value pair. Our goal is to store the reference data in hashmaps and leverage broadcast variables so all operations on different partitions will have easy access to the same data. We have five fields with codes: airline, origin airport, origin city, destination airport and destination city. Fields of the same kind share a reference file (for example, both airport fields use the airport id file), so we’ll need to download 3 files. But is there an easier approach than loading 3 files into hashmaps and having 3 separate broadcast variables? There is, by using Guava Tables.
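The idea of replacing three separate maps with one two-key lookup can be sketched in plain Java as a map of maps, where the first key is the kind of reference data and the second is the code. This is only a JDK stand-in for the concept; it does not use Guava's Table or Spark broadcast variables. The class name, the sample codes and the naive comma split (which ignores quoted CSV fields) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ReferenceTable {
    // Outer key: kind of reference data (e.g. "airline"); inner key: the code.
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    void put(String referenceType, String code, String description) {
        rows.computeIfAbsent(referenceType, key -> new HashMap<>())
            .put(code, description);
    }

    // Loads "code,description" lines; assumes no quoting or embedded commas
    // in the code column, which real reference files may not guarantee.
    void load(String referenceType, List<String> csvLines) {
        for (String line : csvLines) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                put(referenceType, parts[0].trim(), parts[1].trim());
            }
        }
    }

    // A single lookup method covers every kind of reference data.
    Optional<String> get(String referenceType, String code) {
        return Optional.ofNullable(rows.get(referenceType))
                .map(row -> row.get(code));
    }

    public static void main(String[] args) {
        ReferenceTable table = new ReferenceTable();
        // Made-up sample rows, not real reference data.
        table.load("airline", Arrays.asList("1,Example Air", "2,Sample Airways"));
        table.load("airport", Arrays.asList("100,Example Field"));
        System.out.println(table.get("airline", "2").orElse("unknown airline"));
        System.out.println(table.get("airport", "999").orElse("unknown airport"));
    }
}
```

With one structure holding all the reference data, only a single object would need to be shared with the workers instead of three.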

Secondary Sorting in Spark

Secondary sorting is the technique that allows for ordering by value(s) (in addition to sorting by key) in the reduce phase of a Map-Reduce job. For example, you may want to analyze user logons to your application. Having results sorted by day and time as well as user-id (the natural key) will help to spot user trends. The additional ordering by day and time is an example of secondary sorting. While I have written before on Secondary Sorting in Hadoop, this post is going to cover how we perform secondary sorting in Spark.
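The ordering itself can be illustrated without a cluster. The sketch below sorts logon records by user-id first (the natural key) and then by timestamp (the secondary ordering) using a composed Comparator. It shows only the sort order, not how Spark or Hadoop partition and sort the data; the Logon class and the sample values are hypothetical.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {
    // Hypothetical record of a single logon event.
    static class Logon {
        private final String userId;
        private final LocalDateTime time;
        Logon(String userId, LocalDateTime time) {
            this.userId = userId;
            this.time = time;
        }
        String getUserId() { return userId; }
        LocalDateTime getTime() { return time; }
        @Override public String toString() { return userId + "@" + time; }
    }

    // Primary order: the natural key. Secondary order: the logon time.
    static final Comparator<Logon> BY_USER_THEN_TIME =
            Comparator.comparing(Logon::getUserId).thenComparing(Logon::getTime);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Logon> sort(List<Logon> logons) {
        List<Logon> sorted = new ArrayList<>(logons);
        sorted.sort(BY_USER_THEN_TIME);
        return sorted;
    }

    public static void main(String[] args) {
        List<Logon> logons = Arrays.asList(
                new Logon("bob", LocalDateTime.parse("2015-06-02T08:00:00")),
                new Logon("alice", LocalDateTime.parse("2015-06-03T10:15:00")),
                new Logon("alice", LocalDateTime.parse("2015-06-01T09:30:00")));
        sort(logons).forEach(System.out::println);
    }
}
```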