Random Thoughts on Coding

Whatever comes to mind at the moment.

Spark PairRDDFunctions - AggregateByKey

One of the great things about the Spark framework is the amount of functionality provided out of the box. There is a class aimed exclusively at working with key-value pairs, the PairRDDFunctions class. When working with data in the key-value format, one of the most common operations to perform is grouping values by key. The PairRDDFunctions class provides a groupByKey function that makes grouping by key trivial. However, groupByKey is very expensive, and depending on the use case, better alternatives are available. In a groupByKey call, all key-value pairs are shuffled across the network to a reducer where the values are collected together. In some cases, grouping is merely a starting point for performing additional operations (sum, average) by key. In other cases, we need to collect the values together in order to return a different value type. Spark provides alternatives to grouping that can offer either a performance improvement or make it easier to combine values into a different type. The point of this post is to consider one of these alternative grouping functions.
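
Before looking at aggregateByKey itself, its idea can be sketched in plain Java (a local simulation, not Spark code): a seqOp folds each value into a per-key accumulator within a partition, and a combOp merges accumulators across partitions. In the sketch below, which is my own illustration rather than Spark's API, the accumulator is a (sum, count) pair suitable for computing per-key averages:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AggregateByKeySketch {

    // seqOp: fold each value of a "partition" into a per-key accumulator.
    // The accumulator is a long[2] holding {sum, count}.
    static Map<String, long[]> seqOp(List<Map.Entry<String, Long>> partition) {
        Map<String, long[]> acc = new HashMap<>();
        for (Map.Entry<String, Long> pair : partition) {
            long[] sc = acc.computeIfAbsent(pair.getKey(), k -> new long[]{0, 0});
            sc[0] += pair.getValue(); // running sum
            sc[1] += 1;               // running count
        }
        return acc;
    }

    // combOp: merge accumulators produced on different partitions.
    static Map<String, long[]> combOp(Map<String, long[]> left, Map<String, long[]> right) {
        for (Map.Entry<String, long[]> e : right.entrySet()) {
            long[] sc = left.computeIfAbsent(e.getKey(), k -> new long[]{0, 0});
            sc[0] += e.getValue()[0];
            sc[1] += e.getValue()[1];
        }
        return left;
    }
}
```

Because only the small accumulators cross the "partition" boundary, not every raw value, this is why aggregateByKey-style grouping shuffles far less data than groupByKey.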

Partially Applied Functions in Java

Last year I completed an intro to functional programming course on edX. The language used in the course was Haskell. I found working in Haskell enjoyable. One of my favorite features is that functions taking more than one parameter can be partially applied automatically. For example, if you have a function expecting three parameters, you can pass only the first parameter and a function expecting the other two is returned. You could then supply only one more parameter, and a function that accepts the final one would be returned (this is the default behavior for all functions in Haskell, since every function is curried). I had used partially applied functions before when working in Scala, but for some reason, this time the power and implications made more of an impression on me. For a better explanation of functions and partial application in Haskell, go to Learn You a Haskell. Now we can move on to the point of this post: how can we achieve this behavior in Java 8?
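
As a preview, the closest Java 8 gets out of the box is nesting Function types, so that applying one argument returns a function expecting the rest. A minimal sketch (the names here are my own):

```java
import java.util.function.Function;

public class CurryExample {

    // A curried three-argument function: each application returns a function
    // expecting the remaining arguments, mimicking Haskell's default behavior.
    static Function<Integer, Function<Integer, Function<Integer, Integer>>> addThree =
            a -> b -> c -> a + b + c;
}
```

Calling `addThree.apply(1)` yields a function of two arguments, `addThree.apply(1).apply(2)` a function of one, and so on. The verbosity of the nested types is the price Java pays for what Haskell gives you for free.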

FlatMap in Guava

This is a short post about a method I recently discovered in Guava.

The Issue

I had a situation at work where I was working with objects structured something like this:

Sample Object Structures
public class Outer {
    String outerId;
    List<Inner> innerList;
    .......
}

public class Inner {
    String innerId;
    Date timestamp;
}

public class Merged {
    String outerId;
    String innerId;
    Date timestamp;
}

My task was to flatten a list of Outer objects (along with their lists of Inner objects) into a list of Merged objects. Since I’m working with Java 7, using streams is not an option.
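
For reference, the straightforward Java 7 approach is a pair of nested loops. The sketch below assumes simple constructors on the classes above, which are my own additions for illustration:

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class Flattener {

    static class Outer {
        final String outerId;
        final List<Inner> innerList;
        Outer(String outerId, List<Inner> innerList) {
            this.outerId = outerId;
            this.innerList = innerList;
        }
    }

    static class Inner {
        final String innerId;
        final Date timestamp;
        Inner(String innerId, Date timestamp) {
            this.innerId = innerId;
            this.timestamp = timestamp;
        }
    }

    static class Merged {
        final String outerId;
        final String innerId;
        final Date timestamp;
        Merged(String outerId, String innerId, Date timestamp) {
            this.outerId = outerId;
            this.innerId = innerId;
            this.timestamp = timestamp;
        }
    }

    // Hand-rolled flattening: one Merged per (Outer, Inner) pair.
    static List<Merged> flatten(List<Outer> outers) {
        List<Merged> result = new ArrayList<Merged>();
        for (Outer outer : outers) {
            for (Inner inner : outer.innerList) {
                result.add(new Merged(outer.outerId, inner.innerId, inner.timestamp));
            }
        }
        return result;
    }
}
```

This works, but the nested loop is exactly the boilerplate the Guava method replaces.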

Sql for Lucene

A short time ago, I started a side project to learn the latest version of ANTLR. I decided to do something that has always interested me: a SQL parser for the Lucene search engine. Even though the parser is a learning exercise, I thought someone else might find it useful. This post will cover the functionality of the LuceneSqlParser. Building the parser with ANTLR 4 will be covered in later posts.

Introduction and Examples

The LuceneSqlParser supports a subset of standard SQL. Here are some examples:

Sample SQL queries handled
Select last_name from '/path/to/index/' where first_name='Foo' and age <=30 and city='Boston' limit 25

Select * from 'path/index/' where age in (31, 30, 50)

Select first_name, last_name from '/path/index/' where city in ('Cincinnati', 'New York', 'Boyds')

Select first_name from '/path/index/' where age between 35 and 50 and first_name like 'Br*'
-- Also takes paths from Windows OS
Select first_name from 'C:/path/index/' where first_name='John' and (age<=45 and city not in ('New York', 'Boston', 'Atlanta'))

The LuceneSqlParser returns a BooleanQuery. The BooleanQuery will contain different types of Lucene query objects depending on the predicates used. There is a Searcher class available for use with the LuceneSqlParser. The Searcher abstracts away opening a Lucene IndexSearcher, iterating over the ScoreDoc array, and extracting results. Next, we’ll take a look at the rules used to parse the SQL.

Java 8 Functional Interfaces and Checked Exceptions

The Java 8 lambda syntax and functional interfaces have been a productivity boost for Java developers. But there is one drawback to functional interfaces: none of them, as currently defined in Java 8, declare any checked exceptions. This leaves the developer at odds over how best to handle checked exceptions. This post will present one option for handling checked exceptions in functional interfaces. We will use Function in our example, but the pattern should apply to any of the functional interfaces.

Example of a Function with a Checked Exception

Here’s an example from a recent side project using a Function to open a directory in Lucene. As expected, opening a directory for writing/searching throws an IOException:

Create a Lucene Directory
private Function<Path, Directory> createDirectory = path -> {
    try {
        return FSDirectory.open(path);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
};
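
One way to avoid repeating the try/catch in every lambda is to define a throwing variant of Function together with a static adapter that does the wrapping once. This is a common pattern rather than anything in the JDK; the names below are my own:

```java
import java.util.function.Function;

public class Unchecked {

    // A Function variant whose apply method is allowed to throw a checked exception.
    @FunctionalInterface
    interface ThrowingFunction<T, R> {
        R apply(T t) throws Exception;
    }

    // Adapt a ThrowingFunction into a plain Function by wrapping any
    // checked exception in a RuntimeException; unchecked exceptions pass through.
    static <T, R> Function<T, R> wrap(ThrowingFunction<T, R> f) {
        return t -> {
            try {
                return f.apply(t);
            } catch (RuntimeException e) {
                throw e;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
    }
}
```

With an adapter like this, the directory example could shrink to a method reference: `Function<Path, Directory> createDirectory = Unchecked.wrap(FSDirectory::open);`.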