Random Thoughts on Coding

Whatever comes to mind at the moment.

I/O With Files That Aren’t Files

Recently at work I needed to search through our archived files and provide the results by the end of the day. Here are the parameters of the request:

  1. The archive files are encrypted and stored in HDFS (don’t ask why we store them in HDFS).
  2. The files vary in size from 3 to 9 GB.
  3. The total number of files to search was 300+.
  4. It takes between 1 and 2 minutes to decrypt each file.

In the past there have been requests to search one archived file. In those cases we would copy the file out of HDFS to a server, then run a shell script to decrypt the file and perform the search. The decrypting program requires 2 arguments: an encrypted file and a file to write the decrypted data to. This means the encrypted and decrypted files are both on disk at the same time.

At an average rate of 1.5 minutes to decrypt a single file, it was going to take 450 minutes (7.5 hours) for 300 files. To add to my dilemma, there wasn’t enough time to write a custom RecordReader. The only solution would be to stream the files in parallel. But there were 2 problems with that approach:

  1. The server does not have enough space for 20 files (10 encrypted and 10 decrypted) at a time.
  2. The decrypting code does not read from stdin or write to stdout.

What to do? Use named pipes of course!
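The excerpt stops before showing the actual solution, so what follows is only a minimal sketch of the general idea, not the script used at work. It assumes a Unix-like system with the `mkfifo` utility on the PATH. The class `NamedPipeSketch`, its method names, and the "fake decryptor" thread are all hypothetical: the thread stands in for a program that insists on writing to a file path, and the path it gets is a FIFO, so nothing decrypted ever lands on disk.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NamedPipeSketch {

    // Creates a FIFO with the POSIX `mkfifo` utility; Java has no built-in call for this.
    static Path makeFifo(Path dir, String name) throws IOException, InterruptedException {
        Path fifo = dir.resolve(name);
        Process mkfifo = new ProcessBuilder("mkfifo", fifo.toString()).inheritIO().start();
        if (mkfifo.waitFor() != 0) {
            throw new IOException("mkfifo failed for " + fifo);
        }
        return fifo;
    }

    // Hypothetical stand-in for the decrypting program: it only knows how to
    // write its output to a path, and here that path happens to be a FIFO.
    static Thread startFakeDecryptor(Path outputPath, List<String> records) {
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(outputPath.toFile()), StandardCharsets.UTF_8))) {
                for (String record : records) {
                    out.write(record);
                    out.newLine();
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        writer.start();
        return writer;
    }

    // Reads the FIFO like an ordinary file and keeps the lines containing the search term.
    static List<String> search(Path fifo, String term) throws IOException {
        List<String> hits = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(fifo.toFile()), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(term)) {
                    hits.add(line);
                }
            }
        }
        return hits;
    }

    // Wires the writer and the reader together through one FIFO, then cleans up.
    static List<String> runDemo() {
        try {
            Path dir = Files.createTempDirectory("fifo-sketch");
            Path fifo = makeFifo(dir, "decrypted.pipe");
            try {
                Thread writer = startFakeDecryptor(fifo,
                        Arrays.asList("alpha record", "needle in record 2", "omega record"));
                List<String> hits = search(fifo, "needle");
                writer.join();
                return hits;
            } finally {
                Files.delete(fifo);
                Files.delete(dir);
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

Opening a FIFO for writing blocks until a reader opens the other end, which is why the writer runs on its own thread. With one FIFO per file, several decrypt-and-search pairs could in principle run side by side, since only the encrypted copies take up disk space.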

What’s New in Java 8 - Date API Part II

This post continues our review of the Date API that came with the release of Java 8. We are going to keep concentrating on classes that make working with dates/times very easy. Working with date objects in previous releases of Java was very challenging with respect to adding time or getting the difference between dates. Hopefully after looking at the classes we present here, your opinion of working with dates and times in Java will change. Specifically, we are going to take a look at the following:

  • Other classes that represent dates/times: ZonedDateTime and OffsetDateTime
  • Getting the current snapshot in time with Instant
  • Using the Clock class to get system time but specify different time zones
  • Representing an arbitrary number of days with the Period class
  • Representing an arbitrary number of hours with the Duration class
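As a rough preview of the classes listed above, here is a small sketch using only standard `java.time` calls. The class name `DateApiPartTwoSketch`, the helper method names, and the fixed instant `2014-03-18T12:00:00Z` are arbitrary choices for illustration and are not taken from the full post. A fixed `Clock` is used so the results are repeatable; `Clock.systemUTC()` would give the real time.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.Period;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

class DateApiPartTwoSketch {

    // Arbitrary fixed point in time so every run produces the same values.
    static final Clock FIXED_UTC =
            Clock.fixed(Instant.parse("2014-03-18T12:00:00Z"), ZoneOffset.UTC);

    // Instant: a snapshot on the UTC timeline, with no time zone attached.
    static Instant snapshot() {
        return Instant.now(FIXED_UTC);
    }

    // Clock + ZonedDateTime: the same moment viewed in a named time zone.
    static ZonedDateTime nowIn(String zoneId) {
        return ZonedDateTime.now(FIXED_UTC.withZone(ZoneId.of(zoneId)));
    }

    // OffsetDateTime: the same moment with a plain UTC offset instead of zone rules.
    static OffsetDateTime atOffset(int hours) {
        return OffsetDateTime.ofInstant(snapshot(), ZoneOffset.ofHours(hours));
    }

    // Period: a date-based amount (years, months, days) between two dates.
    static Period dateGap(LocalDate start, LocalDate end) {
        return Period.between(start, end);
    }

    // Duration: a time-based amount (seconds, nanoseconds) between two instants.
    static Duration timeGap(Instant start, Instant end) {
        return Duration.between(start, end);
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
        System.out.println(nowIn("America/New_York"));
        System.out.println(atOffset(1));
        System.out.println(dateGap(LocalDate.of(2014, 1, 1), LocalDate.of(2014, 3, 18)));
        System.out.println(timeGap(snapshot(), snapshot().plus(Duration.ofHours(36))));
        // Adding a Period respects month lengths: Jan 31 plus one month is Feb 28 in 2014.
        System.out.println(LocalDate.of(2014, 1, 31).plus(Period.ofMonths(1)));
    }
}
```

Passing a `Clock` into `now(...)` rather than calling the no-argument version is also a convenient way to make time-dependent code testable.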

Blog Migrated to Octopress

After a long period of good intentions with no action, I finally moved the codingjunkie.net blog from Wordpress to Octopress. I have nothing bad to say about Wordpress; it’s been a great tool for me. It’s just that over time I’ve found myself wanting a simpler blogging platform. Aside from liking the overall look of Octopress, the big draw for me was the decreased load time from serving up static HTML pages. Once I sat down and decided to pull the plug, it was really pretty simple. It all boiled down to just a few steps:

  1. Export my posts into the Wordpress XML format.
  2. Run exitwp on the exported XML.
  3. Some basic regex work to fix image tags.
  4. Configure the permalinks to match the form I already use.
  5. Spend 5 minutes reconfiguring my blog on the WebFaction control panel.
  6. Deploy the converted posts using rsync.

Overall, it went much easier than I had anticipated. Only time will tell if switching to the new platform will help me write better content!

What’s New in Java 8 - Date API

With the final release of Java 8 around the corner, one of the new features I’m excited about is the new Date API, a result of the work on JSR 310. While Lambda expressions are certainly the big draw of Java 8, having a better way to work with dates is a decidedly welcome addition. This is a quick post (part 1 of 2 or 3) showing some highlights of the new Date functionality, this time mostly around the LocalDate class.
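To give a flavor of the kind of operations `LocalDate` makes easy, here is a brief sketch. The class name `LocalDateSketch`, the method names, and the sample date 2014-03-18 are picked purely for illustration and are not the examples from the full post; every call used is part of the standard `java.time` package.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;

class LocalDateSketch {

    // Arbitrary sample date; LocalDate holds a date only, with no time or zone.
    static final LocalDate SAMPLE = LocalDate.of(2014, 3, 18);

    // LocalDate is immutable, so plusWeeks returns a new object.
    static LocalDate twoWeeksLater(LocalDate date) {
        return date.plusWeeks(2);
    }

    // Adjusters express calendar-aware jumps such as "last day of this month".
    static LocalDate endOfMonth(LocalDate date) {
        return date.with(TemporalAdjusters.lastDayOfMonth());
    }

    // Whole days from start (inclusive) to end (exclusive).
    static long daysBetween(LocalDate start, LocalDate end) {
        return ChronoUnit.DAYS.between(start, end);
    }

    public static void main(String[] args) {
        DayOfWeek day = SAMPLE.getDayOfWeek();
        System.out.println(SAMPLE + " falls on a " + day);
        System.out.println("Two weeks later: " + twoWeeksLater(SAMPLE));
        System.out.println("End of month: " + endOfMonth(SAMPLE));
        System.out.println("Days since Jan 1: " + daysBetween(LocalDate.of(2014, 1, 1), SAMPLE));
        System.out.println("Parsed: " + LocalDate.parse("2014-03-18").equals(SAMPLE));
        System.out.println("Leap year: " + SAMPLE.isLeapYear());
    }
}
```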

MapReduce Algorithms - Understanding Data Joins Part II

It’s been a while since I last posted, and like the last time I took a big break, I was taking some classes on Coursera. This time it was Functional Programming Principles in Scala and Principles of Reactive Programming. I found both of them to be great courses and would recommend taking either one if you have the time. In this post we resume our series on implementing the algorithms found in Data-Intensive Text Processing with MapReduce, this time covering map-side joins. As we can guess from the name, map-side joins join data exclusively during the mapping phase and completely skip the reducing phase. In the last post on data joins we covered reduce-side joins. Reduce-side joins are easy to implement, but have the drawback that all data is sent across the network to the reducers. Map-side joins offer substantial gains in performance since we avoid the cost of sending data across the network. However, unlike reduce-side joins, map-side joins require that very specific criteria be met. Today we will discuss the requirements for map-side joins and how we can implement them.
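The excerpt does not list the requirements, so the following is only an illustrative sketch under an assumption: that both inputs arrive already sorted by the join key, with at most one record per key on each side. Under that assumption a join can be done in a single forward pass over both inputs, with no shuffle step. This is plain Java, not Hadoop code; the class `MergeJoinSketch`, the `{key, value}` array layout, and the comma-joined output are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeJoinSketch {

    // Joins two inputs that are ALREADY sorted by key, each with unique keys.
    // Every record is a two-element array: {key, value}.
    static List<String> mergeJoin(List<String[]> left, List<String[]> right) {
        List<String> joined = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String[] l = left.get(i);
            String[] r = right.get(j);
            int cmp = l[0].compareTo(r[0]);
            if (cmp == 0) {
                // Keys match: emit one joined record and advance both sides.
                joined.add(l[0] + "," + l[1] + "," + r[1]);
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // left key is smaller and has no partner on the right
            } else {
                j++; // right key is smaller and has no partner on the left
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[] {"a", "Alice"},
                new String[] {"b", "Bob"},
                new String[] {"d", "Dana"});
        List<String[]> cities = Arrays.asList(
                new String[] {"b", "Boston"},
                new String[] {"c", "Chicago"},
                new String[] {"d", "Denver"});
        System.out.println(mergeJoin(users, cities));
    }
}
```

Because neither input is ever rewound or buffered, the pass needs only the current record from each side in memory, which hints at why a join that skips the reduce phase depends on how the inputs are prepared.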