I’ve known about tmux for some time now, but I kept putting off working with it. Lately I’ve started looking at the Processor API available in Kafka. After opening up 4-5 terminal windows (Zookeeper, Kafka, sending messgaes, recieving messages) and toggling in between them, I quickly realized I needed a way to open up these terminal sessions at one time and have them centralized. I had the perfect use case to use tmux! This post is about scripting tmux to save time during devlopment, in this case working with kafka.
In this post we are going to cover working with the Optional class introduced in Java 8. The introduction of
Optional was new only to Java. Guava has had a version of Optional and Scala has had the Option type for some time. Here’s a description of Optional from the Java API docs:
A container object which may or may not contain a non-null value.
Since so many others have done a good job of describing the
Optional type, we won’t be doing so here. Rather for this post, we are going to cover how to use the
Optional type without resorting to directly accessing the value contained or doing explicit checks if a value is present.
In this post we are going to discuss creating custom Collector instances. The
Collector interface was introduced in the java.util.stream package when Java 8 was released. A
Collector is used to “collect” the results of stream operations. Results are collected from a stream when the terminal operation
Stream.collect method is called. While there are default implementations available, there are times we’ll want to use some sort of custom container. Our goal today will be to create
Collector instances that produce Guava ImmutableCollections and Multimaps.
A while back I wrote two posts on avoiding the use of the
groupBy function in Spark. While I won’t re-hash both posts here, the bottom line was to take advantage of the combineByKey or aggreagateByKey functions instead. While both functions hold the potential for improved performance and efficiency in our Spark jobs, at times creating the required arguments over and over for basic use cases could get tedious. It got me to thinking is there a way of providing some level of abstraction for basic use cases? For example grouping values into a list or set. Simultaneously, I’ve been trying to expand my knowlege of Scala’s more advanced features including implicits and TypeClasses. What I came up with is the GroupingRDDFunctions class that provides some syntactic sugar for basic use cases of the
aggregateByKey function by using Scala’s implicit class functionality.
Last time we covered Secondary Sorting in Spark. We took airline performance data and sorted results by airline, destination airport and the amount of delay. We used id’s for all our data. While that approach is good for performance, viewing results in that format loses meaning. Fortunately, the Bureau of Transportation site offers reference files to download. The reference files are in CSV format with each line consisting of key-value pair. Our goal is to store the refrence data in hashmaps and leverage broadcast variables so all operations on different partitions will have easy access to the same data. We have four fields with codes: airline, origin city airport, orgin city, destination airport and destination city. Two of our code fields use the same reference file (airport id), so we’ll need to download 3 files. But is there an easier approch to loading 3 files into hashmaps and having 3 separate broadcast variables? There is, by using Guava Tables.