Book Review : Hadoop – Beginners Guide

Tweet This post I am going to review the book “Hadoop – Beginners Guide” by Gary Turkington. In the interest of full disclosure, while I received a free copy of the book for reviewing, I do not receive any compensation for books purchased from this blog. With that out of the way let’s get on [...]

MapReduce Algorithms – Secondary Sorting

Tweet We continue with our series on implementing MapReduce algorithms found in Data-Intensive Text Processing with MapReduce book. Other posts in this series: Working Through Data-Intensive Text Processing with MapReduce Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II Calculating A Co-Occurrence Matrix with Hadoop MapReduce Algorithms – Order Inversion This post [...]

MapReduce Algorithms – Order Inversion

Tweet This post is another segment in the series presenting MapReduce algorithms as found in the Data-Intensive Text Processing with MapReduce book. Previous installments are Local Aggregation, Local Aggregation PartII and Creating a Co-Occurrence Matrix. This time we will discuss the order inversion pattern. The order inversion pattern exploits the sorting phase of MapReduce to [...]

Calculating A Co-Occurrence Matrix with Hadoop

Tweet This post continues with our series of implementing the MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. This time we will be creating a word co-occurrence matrix from a corpus of text. Previous posts in this series are: Working Through Data-Intensive Text Processing with MapReduce Working Through Data-Intensive Text Processing with [...]

Testing Hadoop Programs with MRUnit

Tweet This post will take a slight detour from implementing the patterns found in Data-Intensive Processing with MapReduce to discuss something equally important, testing. I was inspired in part from a presentation by Tom Wheeler that I attended while at the 2012 Strata/Hadoop World conference in New York. When working with large data sets, unit [...]

Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II

Tweet This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. In the previous post, we discussed using the technique of local aggregation as a means of reducing the amount of data shuffled and transferred across the network. Reducing the amount [...]

Working Through Data-Intensive Text Processing with MapReduce

Tweet It has been a while since I last posted, as I’ve been busy with some of the classes offered by Coursera. There are some very interesting offerings and is worth a look. Some time ago, I purchased Data-Intensive Processing with MapReduce by Jimmy Lin and Chris Dyer. The book presents several key MapReduce algorithms, [...]

Google Guava BloomFIlter

Tweet When the Guava project released version 11.0, one of the new additions was the BloomFilter class. A BloomFilter is a unique data-structure used to indicate if an element is contained in a set. What makes a BloomFilter interesting is it will indicate if an element is absolutely not contained, or may be contained in [...]

Event Programming Example: Google Guava EventBus and Java 7 WatchService

Tweet This post is going to cover using the Guava EventBus to publish changes to a directory or sub-directories detected by the Java 7 WatchService. The Guava EventBus is a great way to add publish/subscribe communication to an application. The WatchService, new in the Java 7 java.nio.file package, is used to monitor a directory for [...]

What’s New in Java 7: WatchService

Tweet Of all the new features in Java 7, one of the more interesting is the WatchService, adding the capability to watch a directory for changes. The WatchService maps directly to the native file event notification mechanism, if available. If a native event notification mechanism is not available, then the default implementation will use polling. [...]