MapReduce Algorithms – Secondary Sorting

We continue with our series on implementing MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. Other posts in this series: Working Through Data-Intensive Text Processing with MapReduce; Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II; Calculating A Co-Occurrence Matrix with Hadoop; MapReduce Algorithms – Order Inversion. This post [...]

MapReduce Algorithms – Order Inversion

This post is another segment in the series presenting MapReduce algorithms as found in the Data-Intensive Text Processing with MapReduce book. Previous installments are Local Aggregation, Local Aggregation Part II and Creating a Co-Occurrence Matrix. This time we will discuss the order inversion pattern. The order inversion pattern exploits the sorting phase of MapReduce to [...]

Calculating A Co-Occurrence Matrix with Hadoop

This post continues our series on implementing the MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. This time we will be creating a word co-occurrence matrix from a corpus of text. Previous posts in this series are: Working Through Data-Intensive Text Processing with MapReduce; Working Through Data-Intensive Text Processing with [...]

Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II

This post continues the series on implementing algorithms found in the Data-Intensive Text Processing with MapReduce book. Part one can be found here. In the previous post, we discussed using the technique of local aggregation as a means of reducing the amount of data shuffled and transferred across the network. Reducing the amount [...]

Working Through Data-Intensive Text Processing with MapReduce

It has been a while since I last posted, as I’ve been busy with some of the classes offered by Coursera. There are some very interesting offerings, and they are worth a look. Some time ago, I purchased Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer. The book presents several key MapReduce algorithms, [...]

Google Guava BloomFilter

When the Guava project released version 11.0, one of the new additions was the BloomFilter class. A BloomFilter is a probabilistic data structure used to indicate whether an element is contained in a set. What makes a BloomFilter interesting is that it will indicate whether an element is definitely not contained, or may be contained in [...]

What’s New in Java 7: WatchService

Of all the new features in Java 7, one of the more interesting is the WatchService, adding the capability to watch a directory for changes. The WatchService maps directly to the native file event notification mechanism, if available. If a native event notification mechanism is not available, then the default implementation will use polling. [...]
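
As a rough illustration of the directory-watching idea in this teaser, here is a minimal sketch using only the standard java.nio.file API. The class name WatchServiceSketch and the helper firstCreatedEntry are hypothetical names chosen for this example, not taken from the post; the helper creates the file itself so the sketch is self-contained, and the timeout is generous because a polling-based implementation may report events slowly.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class for illustration only.
final class WatchServiceSketch {

    // Registers dir for ENTRY_CREATE events, creates toCreate, and returns the
    // name reported by the first create event (relative to dir), or null if no
    // event arrives before the timeout.
    static Path firstCreatedEntry(Path dir, Path toCreate, long timeoutSeconds)
            throws IOException, InterruptedException {
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            Files.createFile(toCreate);

            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
            while (System.nanoTime() < deadline) {
                WatchKey key = watcher.poll(1, TimeUnit.SECONDS);
                if (key == null) {
                    continue; // nothing yet, keep waiting
                }
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                        // For create events the context is the entry's relative path.
                        return (Path) event.context();
                    }
                }
                if (!key.reset()) {
                    break; // directory is no longer accessible
                }
            }
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("watch-sketch");
        System.out.println("created: " + firstCreatedEntry(dir, dir.resolve("hello.txt"), 30));
    }
}
```

Polling with a timeout instead of calling take() keeps the sketch from blocking forever if no event is delivered.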

Creating An Asynchronous, Recursive DirectoryStream in Java 7

Continuing with my series on the Java 7 java.nio.file package, this time covering the DirectoryStream interface. In this post we are going to implement our own DirectoryStream that will iterate over the files in an entire directory tree, not just a single directory. Our goal in the end is to have something that works similar [...]

What’s New In Java 7: Copy and Move Files and Directories

This post is a continuation of my series on the Java 7 java.nio.file package, this time covering the copying and moving of files and complete directory trees. If you have ever been frustrated by Java’s lack of copy and move methods, then read on, for relief is at hand. Included in the coverage is [...]
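
To make the copy-and-move idea concrete, the following is a minimal sketch built on the standard Files.copy, Files.move and Files.walkFileTree methods. The class CopyTreeSketch and the method copyTree are hypothetical names for this example; the post's own approach may differ, and this version overwrites existing target files without preserving attributes.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper class for illustration only.
final class CopyTreeSketch {

    // Recursively copies the directory tree rooted at source into target.
    static void copyTree(final Path source, final Path target) throws IOException {
        Files.walkFileTree(source, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                // Recreate each directory under target before its files are visited.
                Files.createDirectories(target.resolve(source.relativize(dir)));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.copy(file, target.resolve(source.relativize(file)),
                        StandardCopyOption.REPLACE_EXISTING);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("copy-src");
        Path dst = Files.createTempDirectory("copy-dst").resolve("out");
        Files.createDirectories(src.resolve("sub"));
        Files.write(src.resolve("sub").resolve("a.txt"), "hi".getBytes("UTF-8"));

        copyTree(src, dst);
        // A single file can then be moved (or renamed) with Files.move.
        Files.move(dst.resolve("sub").resolve("a.txt"), dst.resolve("b.txt"));
        System.out.println("moved file exists: " + Files.exists(dst.resolve("b.txt")));
    }
}
```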

What’s new in Java 7 – The (Quiet) NIO File Revolution

Java 7 (“Project Coin”) has been out since July of last year. The additions in this release are useful: for example, try-with-resources (having closeable resources handled automatically from try blocks), Strings in switch statements, multi-catch for exceptions and the ‘<>’ operator for working with generics. The addition that everyone was anticipating [...]
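
The language additions listed in this teaser can be sketched in a few lines. The class Java7FeaturesSketch and its methods are hypothetical examples written for this listing, not code from the post; daysIn assumes lower-case three-letter month names and ignores leap years.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only.
final class Java7FeaturesSketch {

    // Try-with-resources plus the diamond operator: the reader is closed
    // automatically, and <> infers the element type of the list.
    static List<String> readLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Strings in switch statements (assumed month format; leap years ignored).
    static int daysIn(String month) {
        switch (month) {
            case "feb":
                return 28;
            case "apr":
            case "jun":
            case "sep":
            case "nov":
                return 30;
            default:
                return 31;
        }
    }

    // Multi-catch: one handler for two unrelated exception types.
    static int parseOrDefault(String s, int fallback) {
        try {
            return Integer.parseInt(s.trim());
        } catch (NumberFormatException | NullPointerException e) {
            return fallback;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readLines("first\nsecond"));
        System.out.println(daysIn("feb"));
        System.out.println(parseOrDefault("not a number", -1));
    }
}
```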