<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Random Thoughts On Coding</title>
	<atom:link href="http://codingjunkie.net/feed/" rel="self" type="application/rss+xml" />
	<link>http://codingjunkie.net</link>
	<description>Practical HowTo&#039;s On Software Development</description>
	<lastBuildDate>Mon, 14 Jan 2013 14:03:13 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
	<atom:link rel='hub' href='http://codingjunkie.net/?pushpress=hub'/>
		<item>
		<title>MapReduce Algorithms &#8211; Secondary Sorting</title>
		<link>http://codingjunkie.net/secondary-sort/</link>
		<comments>http://codingjunkie.net/secondary-sort/#comments</comments>
		<pubDate>Mon, 14 Jan 2013 14:00:50 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=2599</guid>
		<description><![CDATA[We continue with our series on implementing MapReduce algorithms found in Data-Intensive Text Processing with MapReduce book. Other posts in this series: Working Through Data-Intensive Text Processing with MapReduce Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II Calculating A Co-Occurrence Matrix with Hadoop MapReduce Algorithms – Order Inversion This post covers [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg" alt="hadoop-logo" width="300" height="71" class="alignleft size-full wp-image-2180" />We continue with our series on implementing the MapReduce algorithms found in the book Data-Intensive Text Processing with MapReduce. Other posts in this series: </p>
<ol>
<li><a href="http://codingjunkie.net/text-processing-with-mapreduce-part1/" title="Working Through Data-Intensive Text Processing with MapReduce" target="_blank">Working Through Data-Intensive Text Processing with MapReduce</a></li>
<li><a href="http://codingjunkie.net/text-processing-with-mapreduce-part-2/" title="Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II" target="_blank">Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II</a></li>
<li><a href="http://codingjunkie.net/cooccurrence/" title="Calculating A Co-Occurrence Matrix with Hadoop" target="_blank">Calculating A Co-Occurrence Matrix with Hadoop</a></li>
<li><a href="http://codingjunkie.net/order-inversion/" title="MapReduce Algorithms – Order Inversion" target="_blank">MapReduce Algorithms – Order Inversion</a></li>
</ol>
<p>This post covers the pattern of secondary sorting, found in chapter 3 of <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" target="_blank">Data-Intensive Text Processing with MapReduce</a>.  Hadoop automatically sorts the data emitted by mappers by key before sending it to reducers, but what can you do if you also want to sort by value? You use secondary sorting, of course.  With a slight manipulation of the format of the key object, secondary sorting lets us take the value into account during the sort phase. There are two possible approaches.  The first is to have the reducer buffer all of the values for a given key and sort them in memory.  Since the reducer receives all values for a given key, this approach can cause the reducer to run out of memory.  The second is to create a composite key by adding part of, or the entire, value to the natural key.  The trade-off is that an explicit sort in the reducer will most likely be faster (at the risk of running out of memory), while the &#8220;value to key&#8221; conversion offloads the sorting to the MapReduce framework, which lies at the heart of what Hadoop/MapReduce is designed to do.  For the purposes of this post, we will consider the &#8220;value to key&#8221; approach.  We will need to write a custom partitioner to ensure all the data with the same natural key (the key without the value portion) is sent to the same reducer, and a custom Comparator so the data is grouped by the natural key once it arrives at the reducer.  </p>
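<p>For contrast, the first approach (buffering and sorting inside the reducer) can be sketched in plain Java without any Hadoop types.  This is only an illustration under assumptions: the class and method names are hypothetical, and the Iterable stands in for the values a reducer would receive for one key.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class InReducerSortSketch {

    // Stand-in for the body of a reduce() call: buffer every value for one key, then sort.
    static List<Integer> sortValuesInMemory(Iterable<Integer> valuesForOneKey) {
        List<Integer> buffer = new ArrayList<>();
        for (Integer value : valuesForOneKey) {
            buffer.add(value); // the whole group has to fit on the heap
        }
        Collections.sort(buffer); // ascending, so the coldest reading comes first
        return buffer;
    }

    public static void main(String[] args) {
        List<Integer> januaryReadings = Arrays.asList(-61, -206, 44, -150);
        System.out.println(sortValuesInMemory(januaryReadings));
    }
}
```

<p>The buffer grows with the number of values per key, which is exactly the memory risk described above.</p>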
<h3>Value to Key Conversion</h3>
<p>Creating a composite key is straightforward.  We need to analyze which part(s) of the value we want to account for during the sort and add the appropriate part(s) to the natural key.  Then we need to work on the compareTo method, either in the key class or in a comparator class, to make sure the added part of the composite key is taken into account. We will be re-visiting the weather data set and include the temperature as part of the composite key (the natural key being the year and month concatenated together).  The result will be a listing of the coldest temperature for each month and year.  This example was inspired by the secondary sorting example found in the book <a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=dp_ob_title_bk" target="_blank">Hadoop, The Definitive Guide</a>.  While there are probably better ways to achieve this objective, it will be good enough to demonstrate how secondary sorting works.</p>
<h3>Mapper Code</h3>
<p>Our mapper code already concatenates the year and month together, but we will also include the temperature as part of the key.  Since we have included the value in the key itself, the mapper will emit a NullWritable, where in other cases we would emit the temperature.</p>
<pre class="brush:java">
public class SecondarySortingTemperatureMapper extends Mapper&lt;LongWritable, Text, TemperaturePair, NullWritable&gt; {

    private TemperaturePair temperaturePair = new TemperaturePair();
    private NullWritable nullValue = NullWritable.get();
    private static final int MISSING = 9999;

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String yearMonth = line.substring(15, 21);

        int tempStartPosition = 87;

        if (line.charAt(tempStartPosition) == '+') {
            tempStartPosition += 1;
        }

        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));

        if (temp != MISSING) {
            temperaturePair.setYearMonth(yearMonth);
            temperaturePair.setTemperature(temp);
            context.write(temperaturePair, nullValue);
        }
    }
}
</pre>
<p>Now that we have added the temperature to the key, the stage is set for secondary sorting.  What&#8217;s left to do is write code that takes the temperature into account when necessary.  Here we have two choices: write a Comparator or adjust the compareTo method on the TemperaturePair class (TemperaturePair implements WritableComparable).  In most cases I would recommend writing a separate Comparator, but the TemperaturePair class was written specifically to demonstrate secondary sorting, so we will modify its compareTo method. </p>
<pre class="brush:java">
    @Override
    public int compareTo(TemperaturePair temperaturePair) {
        int compareValue = this.yearMonth.compareTo(temperaturePair.getYearMonth());
        if (compareValue == 0) {
            compareValue = temperature.compareTo(temperaturePair.getTemperature());
        }
        return compareValue;
    }
</pre>
<p>If we wanted to sort in descending order, we could simply multiply the result of the temperature comparison by -1.<br />
Now that we have completed the part necessary for sorting, we need to write a custom partitioner. </p>
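<p>To see the effect of this ordering outside of Hadoop, here is a minimal plain-Java sketch.  The Key class is a hypothetical stand-in for TemperaturePair (it is not the class from the post) and uses a String and an int instead of Writable types.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class CompositeKeySortSketch {

    // Hypothetical stand-in for TemperaturePair, without Hadoop types.
    static final class Key implements Comparable<Key> {
        final String yearMonth;  // natural key
        final int temperature;   // value promoted into the key

        Key(String yearMonth, int temperature) {
            this.yearMonth = yearMonth;
            this.temperature = temperature;
        }

        @Override
        public int compareTo(Key other) {
            int result = yearMonth.compareTo(other.yearMonth);
            if (result == 0) {
                // Ascending; swap the two arguments for descending order.
                result = Integer.compare(temperature, other.temperature);
            }
            return result;
        }

        @Override
        public String toString() {
            return yearMonth + ":" + temperature;
        }
    }

    // Returns a sorted copy, mimicking what the framework's sort phase does with the keys.
    static List<Key> sorted(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key("190102", -100), new Key("190101", 44),
                new Key("190102", -333), new Key("190101", -206));
        // prints [190101:-206, 190101:44, 190102:-333, 190102:-100]
        System.out.println(sorted(keys));
    }
}
```

<p>Within each year and month the coldest temperature sorts first, which is what the reducer will rely on.</p>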
<h3>Partitioner Code</h3>
<p>To ensure only the natural key is considered when determining which reducer to send the data to, we need to write a custom partitioner.  The code is straightforward and only considers the yearMonth value of the TemperaturePair class when calculating the reducer the data will be sent to.</p>
<pre class="brush:java">
public class TemperaturePartitioner extends Partitioner&lt;TemperaturePair, NullWritable&gt;{
    @Override
    public int getPartition(TemperaturePair temperaturePair, NullWritable nullWritable, int numPartitions) {
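        // Mask off the sign bit: hashCode() can be negative, and a negative partition is invalid.
        return (temperaturePair.getYearMonth().hashCode() &amp; Integer.MAX_VALUE) % numPartitions;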
    }
}
</pre>
<p>While the custom partitioner guarantees that all of the data for the year and month arrive at the same reducer, we still need to account for the fact the reducer will group records by key.  </p>
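<p>One detail worth checking in any hash-based partitioner: hashCode() can return a negative number, and Java&#8217;s % operator keeps the sign of the dividend.  Hadoop&#8217;s default HashPartitioner masks off the sign bit before taking the modulo.  The sketch below shows that idiom in plain Java; the class and method names are hypothetical and chosen for illustration.</p>

```java
class NaturalKeyPartitionSketch {

    // Clears the sign bit before the modulo, so the result is always in [0, numPartitions).
    static int partitionFor(String yearMonth, int numPartitions) {
        return (yearMonth.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Only the natural key is hashed, so every temperature recorded
        // for the same year and month maps to the same partition.
        System.out.println(partitionFor("190101", reducers));
        System.out.println(partitionFor("190102", reducers));
    }
}
```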
<h3>Grouping Comparator</h3>
<p>Once the data reaches a reducer, all data is grouped by key.  Since we have a composite key, we need to make sure records are grouped solely by the natural key.  This is accomplished by writing a custom grouping comparator: a Comparator that considers only the yearMonth field of the TemperaturePair class when grouping the records together.</p>
<pre class="brush:java">
public class YearMonthGroupingComparator extends WritableComparator {
    public YearMonthGroupingComparator() {
        super(TemperaturePair.class, true);
    }
    @Override
    public int compare(WritableComparable tp1, WritableComparable tp2) {
        TemperaturePair temperaturePair = (TemperaturePair) tp1;
        TemperaturePair temperaturePair2 = (TemperaturePair) tp2;
        return temperaturePair.getYearMonth().compareTo(temperaturePair2.getYearMonth());
    }
}
</pre>
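<p>The combined effect of sorting on the composite key and grouping on the natural key can be simulated in plain Java.  In this sketch (hypothetical names, String arrays standing in for TemperaturePair), the input is assumed to be already sorted by year-month and then temperature, and the first record of each group is kept, much as a reducer that writes the key once per reduce call would do.</p>

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupingSketch {

    // Input: composite keys as {yearMonth, temperature}, already sorted by both fields.
    // Output: the first temperature seen for each yearMonth group, in encounter order.
    static Map<String, Integer> firstPerGroup(List<String[]> sortedKeys) {
        Map<String, Integer> result = new LinkedHashMap<>();
        String currentGroup = null;
        for (String[] key : sortedKeys) {
            String yearMonth = key[0];
            if (!yearMonth.equals(currentGroup)) {
                // Grouping compares only the natural key, so a new yearMonth starts a new group.
                currentGroup = yearMonth;
                result.put(yearMonth, Integer.parseInt(key[1]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> sorted = Arrays.asList(
                new String[] {"190101", "-206"}, new String[] {"190101", "44"},
                new String[] {"190102", "-333"}, new String[] {"190102", "-100"});
        // prints {190101=-206, 190102=-333}
        System.out.println(firstPerGroup(sorted));
    }
}
```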
<h3>Results</h3>
<p>Here are the results of running our secondary sort job:</p>
<pre class="brush:text">
new-host-2:sbin bbejeck$ hdfs dfs -cat secondary-sort/part-r-00000
190101	-206
190102	-333
190103	-272
190104	-61
190105	-33
190106	44
190107	72
190108	44
190109	17
190110	-33
190111	-217
190112	-300
</pre>
<h3>Conclusion</h3>
<p>While sorting data by value may not be a common need, it&#8217;s a nice tool to have in your back pocket when needed.  Also, we have been able to take a deeper look at the inner workings of Hadoop by working with custom partitioners and grouping comparators.  Thank you for your time.</p>
<h3>Resources</h3>
<ul>
<li> <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" title="Data-Intensive Text Processing with MapReduce" target="_blank">Data-Intensive Text Processing with MapReduce</a> by Jimmy Lin and Chris Dyer</li>
<li><a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=tmm_pap_title_0?ie=UTF8&amp;qid=1347589052&amp;sr=1-1" target="_blank">Hadoop: The Definitive Guide</a> by Tom White</li>
<li><a href="https://github.com/bbejeck/hadoop-algorithms" target="_blank" title="Source Code">Source Code and Tests</a> from blog</li>
<li><a href="http://hadoop.apache.org/docs/r0.20.2/api/index.html">Hadoop API</a></li>
<li><a href="http://mrunit.apache.org/" target="_blank">MRUnit</a> for unit testing Apache Hadoop map reduce jobs</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/secondary-sort/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MapReduce Algorithms &#8211; Order Inversion</title>
		<link>http://codingjunkie.net/order-inversion/</link>
		<comments>http://codingjunkie.net/order-inversion/#comments</comments>
		<pubDate>Thu, 13 Dec 2012 14:00:18 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=2537</guid>
		<description><![CDATA[This post is another segment in the series presenting MapReduce algorithms as found in the Data-Intensive Text Processing with MapReduce book. Previous installments are Local Aggregation, Local Aggregation PartII and Creating a Co-Occurrence Matrix. This time we will discuss the order inversion pattern. The order inversion pattern exploits the sorting phase of MapReduce to push [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg"><img src="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg" alt="" title="hadoop-logo" width="300" height="71" class="alignleft size-full wp-image-2180" /></a>This post is another segment in the series presenting MapReduce algorithms as found in the <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" target="_blank">Data-Intensive Text Processing with MapReduce</a> book.  Previous installments are <a href="http://codingjunkie.net/text-processing-with-mapreduce-part1/" target="_blank" title="Working Through Data-Intensive Text Processing with MapReduce">Local Aggregation</a>, <a href="http://codingjunkie.net/text-processing-with-mapreduce-part-2/" target="_blank" title="Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II">Local Aggregation Part II</a> and <a href="http://codingjunkie.net/cooccurrence/" target="_blank" title="Calculating A Co-Occurrence Matrix with Hadoop">Creating a Co-Occurrence Matrix</a>.  This time we will discuss the order inversion pattern.  The order inversion pattern exploits the sorting phase of MapReduce to push data needed for calculations to the reducer <i>ahead of</i> the data that will be manipulated.  Before you dismiss this as an edge condition for MapReduce, I urge you to read on, as we will discuss how to use sorting to our advantage and cover using a custom partitioner, both of which are useful tools to have available.  Although many MapReduce programs are written at a higher level of abstraction, e.g. Hive or Pig, it&#8217;s still helpful to have an understanding of what&#8217;s going on at a lower level.  The order inversion pattern is found in chapter 3 of the book.  
To illustrate the order inversion pattern we will use the Pairs approach from the co-occurrence matrix pattern.  When creating the co-occurrence matrix, we track the total counts of when words appear together.  At a high level we take the Pairs approach and add a small twist: in addition to having the mapper emit a word pair such as (&#8220;foo&#8221;,&#8221;bar&#8221;), we will emit an additional word pair of (&#8220;foo&#8221;,&#8221;*&#8221;). We do so for every word, so we can easily obtain a total count of how often the leftmost word appears and use that count to calculate our relative frequencies.  This approach raises two specific problems.  First, we need to ensure that the (&#8220;foo&#8221;,&#8221;*&#8221;) word pairs arrive at the reducer first.  Second, we need to make sure all word pairs with the same left word arrive at the same reducer. Before we solve those problems, let&#8217;s take a look at our mapper code.</p>
<h3>Mapper Code</h3>
<p>First we need to modify our mapper from the Pairs approach.  At the bottom of each iteration of the outer loop, after we have emitted all the word pairs for a particular word, we will emit the special token WordPair(&#8220;word&#8221;,&#8221;*&#8221;) along with the number of pairs just emitted for the word on the left.</p>
<pre class="brush:java">
public class PairsRelativeOccurrenceMapper extends Mapper&lt;LongWritable, Text, WordPair, IntWritable&gt; {
    private WordPair wordPair = new WordPair();
    private IntWritable ONE = new IntWritable(1);
    private IntWritable totalCount = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length &gt; 1) {
            for (int i = 0; i &lt; tokens.length; i++) {
                    tokens[i] = tokens[i].replaceAll("\\W+","");

                    if(tokens[i].equals("")){
                        continue;
                    }

                    wordPair.setWord(tokens[i]);

                    int start = (i - neighbors &lt; 0) ? 0 : i - neighbors;
                    int end = (i + neighbors &gt;= tokens.length) ? tokens.length - 1 : i + neighbors;
                    for (int j = start; j &lt;= end; j++) {
                        if (j == i) continue;
                        wordPair.setNeighbor(tokens[j].replaceAll("\\W",""));
                        context.write(wordPair, ONE);
                    }
                    wordPair.setNeighbor("*");
                    totalCount.set(end - start);
                    context.write(wordPair, totalCount);
            }
        }
    }
}
</pre>
<p>Now that we&#8217;ve generated a way to track the total number of times a particular word has been encountered, we need to make sure those special pairs reach the reducer first so a total can be tallied to calculate the relative frequencies. We will have the sorting phase of the MapReduce process handle this for us by modifying the compareTo method on the WordPair object. </p>
<h3>Modified Sorting</h3>
<p>We modify the compareTo method on the WordPair class so that when a &#8220;*&#8221; character is encountered on the right, that particular object is pushed to the top.</p>
<pre class="brush:java">
    @Override
    public int compareTo(WordPair other) {
        int returnVal = this.word.compareTo(other.getWord());
        if(returnVal != 0){
            return returnVal;
        }
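        // Identical neighbors (including two "*" markers) must compare as equal
        // to honor the compareTo contract.
        if(this.neighbor.equals(other.getNeighbor())){
            return 0;
        }
        if(this.neighbor.toString().equals("*")){
            return -1;
        }else if(other.getNeighbor().toString().equals("*")){
            return 1;
        }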
        return this.neighbor.compareTo(other.getNeighbor());
    }

</pre>
<p>By modifying the compareTo method we are now guaranteed that any WordPair with the special character will be sorted to the top and arrive at the reducer first.  This leads to our second problem: how can we guarantee that all WordPair objects with a given left word will be sent to the same reducer? The answer is to create a custom partitioner.</p>
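<p>The resulting order can be checked with a small plain-Java sketch.  It uses String arrays of the form {word, neighbor} instead of the WordPair class, and the class and field names are hypothetical; the comparison follows the same idea of sorting by the left word first and then placing the &#8220;*&#8221; marker ahead of real neighbors.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class StarFirstOrderSketch {

    // Orders {word, neighbor} pairs by word, placing the "*" marker ahead of real neighbors.
    static final Comparator<String[]> STAR_FIRST = (a, b) -> {
        int byWord = a[0].compareTo(b[0]);
        if (byWord != 0) return byWord;
        if (a[1].equals(b[1])) return 0;
        if (a[1].equals("*")) return -1;
        if (b[1].equals("*")) return 1;
        return a[1].compareTo(b[1]);
    };

    // Returns the pairs in sorted order, rendered as "word,neighbor" strings.
    static List<String> sortedPairs(List<String[]> pairs) {
        List<String[]> copy = new ArrayList<>(pairs);
        copy.sort(STAR_FIRST);
        List<String> rendered = new ArrayList<>();
        for (String[] pair : copy) {
            rendered.add(pair[0] + "," + pair[1]);
        }
        return rendered;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"foo", "bar"}, new String[] {"bar", "foo"},
                new String[] {"foo", "*"}, new String[] {"bar", "*"},
                new String[] {"foo", "baz"});
        // prints [bar,*, bar,foo, foo,*, foo,bar, foo,baz]
        System.out.println(sortedPairs(pairs));
    }
}
```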
<h3>Custom Partitioner</h3>
<p>Intermediate keys are shuffled to reducers by calculating the hashcode of the key modulo the number of reducers.  But our WordPair objects contain two words, so taking the hashcode of the entire object clearly won&#8217;t work.  We need to write a custom Partitioner that only takes the left word into consideration when determining which reducer to send the output to.</p>
<pre class="brush:java">
public class WordPairPartitioner extends Partitioner&lt;WordPair,IntWritable&gt; {

    @Override
    public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
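        // Mask off the sign bit: hashCode() can be negative, and a negative partition is invalid.
        return (wordPair.getWord().hashCode() &amp; Integer.MAX_VALUE) % numPartitions;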
    }
}
</pre>
<p>Now we are guaranteed that all of the WordPair objects with the same left word are sent to the same reducer.  All that is left is to construct a reducer to take advantage of the format of the data being sent.</p>
<h3>Reducer</h3>
<p>Building the reducer for the order inversion pattern is straightforward. It involves keeping a running total and a &#8220;current&#8221; word variable. The reducer checks the input key WordPair for the special character &#8220;*&#8221; on the right. When it finds one and the word on the left differs from the &#8220;current&#8221; word, it resets the total and sums all of the values to obtain the number of times the new current word was observed; if the word is the same, the values are added to the existing total.  For every other WordPair, the reducer sums the counts and divides by the total to obtain a relative frequency. This continues until a special character with a new left word is encountered and the process starts over.</p>
<pre class="brush:java">
public class PairsRelativeOccurrenceReducer extends Reducer&lt;WordPair, IntWritable, WordPair, DoubleWritable&gt; {
    private DoubleWritable totalCount = new DoubleWritable();
    private DoubleWritable relativeCount = new DoubleWritable();
    private Text currentWord = new Text("NOT_SET");
    private Text flag = new Text("*");

    @Override
    protected void reduce(WordPair key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
        if (key.getNeighbor().equals(flag)) {
            if (key.getWord().equals(currentWord)) {
                totalCount.set(totalCount.get() + getTotalCount(values));
            } else {
                currentWord.set(key.getWord());
                totalCount.set(getTotalCount(values));
            }
        } else {
            int count = getTotalCount(values);
            relativeCount.set((double) count / totalCount.get());
            context.write(key, relativeCount);
        }
    }

    private int getTotalCount(Iterable&lt;IntWritable&gt; values) {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        return count;
    }
}
</pre>
<p>By manipulating the sort order and creating a custom partitioner, we have been able to send the data needed for a calculation to the reducer before the data that the calculation is applied to arrives. Although not shown here, a combiner was used when running the MapReduce job.  This approach is also a good candidate for the &#8220;in-mapper&#8221; combining pattern. </p>
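<p>The arithmetic the reducer performs can be traced with a plain-Java sketch.  The names are hypothetical, and records are given as {word, neighbor, count} String arrays that are assumed to arrive already sorted, with each word&#8217;s &#8220;*&#8221; record ahead of its other pairs.</p>

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RelativeFrequencySketch {

    // Each record is {word, neighbor, count}; a {word, "*", total} record precedes the word's pairs.
    static Map<String, Double> relativeFrequencies(List<String[]> sortedRecords) {
        Map<String, Double> result = new LinkedHashMap<>();
        String currentWord = null;
        double total = 0;
        for (String[] record : sortedRecords) {
            String word = record[0];
            String neighbor = record[1];
            int count = Integer.parseInt(record[2]);
            if (neighbor.equals("*")) {
                if (word.equals(currentWord)) {
                    total += count;      // another marker for the same word
                } else {
                    currentWord = word;  // a new left word: start a fresh total
                    total = count;
                }
            } else {
                // The total is already known because the marker sorted first.
                result.put(word + "," + neighbor, count / total);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"foo", "*", "4"}, new String[] {"foo", "bar", "1"},
                new String[] {"foo", "baz", "3"}, new String[] {"qux", "*", "2"},
                new String[] {"qux", "foo", "2"});
        // prints {foo,bar=0.25, foo,baz=0.75, qux,foo=1.0}
        System.out.println(relativeFrequencies(records));
    }
}
```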
<h3>Example &amp; Results</h3>
<p>Given that the holidays are upon us, I felt it was timely to run an example of the order inversion pattern against the novel &#8220;A Christmas Carol&#8221; by Charles Dickens.  I know it&#8217;s corny, but it serves the purpose.</p>
<pre class="brush:text">
new-host-2:sbin bbejeck$ hdfs dfs -cat relative/part* | grep Humbug
{word=[Humbug] neighbor=[Scrooge]}	0.2222222222222222
{word=[Humbug] neighbor=[creation]}	0.1111111111111111
{word=[Humbug] neighbor=[own]}	0.1111111111111111
{word=[Humbug] neighbor=[said]}	0.2222222222222222
{word=[Humbug] neighbor=[say]}	0.1111111111111111
{word=[Humbug] neighbor=[to]}	0.1111111111111111
{word=[Humbug] neighbor=[with]}	0.1111111111111111
{word=[Scrooge] neighbor=[Humbug]}	0.0020833333333333333
{word=[creation] neighbor=[Humbug]}	0.1
{word=[own] neighbor=[Humbug]}	0.006097560975609756
{word=[said] neighbor=[Humbug]}	0.0026246719160104987
{word=[say] neighbor=[Humbug]}	0.010526315789473684
{word=[to] neighbor=[Humbug]}	3.97456279809221E-4
{word=[with] neighbor=[Humbug]}	9.372071227741331E-4
</pre>
<h3>Conclusion</h3>
<p>While calculating relative word occurrence frequencies probably is not a common task, we have been able to demonstrate useful examples of sorting and using a custom partitioner, which are good tools to have at your disposal when building MapReduce programs.  As stated before, even if most of your MapReduce is written at a higher level of abstraction like Hive or Pig, it&#8217;s still instructive to have an understanding of what is going on under the hood.  Thanks for your time.</p>
<h3>Resources</h3>
<ul>
<li> <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" title="Data-Intensive Text Processing with MapReduce" target="_blank">Data-Intensive Text Processing with MapReduce</a> by Jimmy Lin and Chris Dyer</li>
<li><a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=tmm_pap_title_0?ie=UTF8&#038;qid=1347589052&#038;sr=1-1" target="_blank">Hadoop: The Definitive Guide</a> by Tom White</li>
<li><a href="https://github.com/bbejeck/hadoop-algorithms" target="_blank" title="Source Code">Source Code and Tests</a> from blog</li>
<li><a href="http://hadoop.apache.org/docs/r0.20.2/api/index.html">Hadoop API</a></li>
<li><a href="http://mrunit.apache.org/" target="_blank">MRUnit</a> for unit testing Apache Hadoop map reduce jobs</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/order-inversion/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Calculating A Co-Occurrence Matrix with Hadoop</title>
		<link>http://codingjunkie.net/cooccurrence/</link>
		<comments>http://codingjunkie.net/cooccurrence/#comments</comments>
		<pubDate>Fri, 30 Nov 2012 14:00:51 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=2476</guid>
		<description><![CDATA[This post continues with our series of implementing the MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. This time we will be creating a word co-occurrence matrix from a corpus of text. Previous posts in this series are: Working Through Data-Intensive Text Processing with MapReduce Working Through Data-Intensive Text Processing with MapReduce [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg"><img src="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg" alt="" title="hadoop-logo" width="300" height="71" class="alignleft size-full wp-image-2180" /></a>This post continues our series on implementing the MapReduce algorithms found in the <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" target="_blank">Data-Intensive Text Processing with MapReduce</a> book.  This time we will be creating a word co-occurrence matrix from a corpus of text. Previous posts in this series are:</p>
<ol>
<li><a href="http://codingjunkie.net/text-processing-with-mapreduce-part1/" title="Working Through Data-Intensive Text Processing with MapReduce" target="_blank">Working Through Data-Intensive Text Processing with MapReduce</a></li>
<li><a href="http://codingjunkie.net/text-processing-with-mapreduce-part-2/" title="Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II" target="_blank">Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II</a></li>
</ol>
<p> A <a href="http://en.wikipedia.org/wiki/Co-occurrence" target="_blank">co-occurrence</a> matrix tracks, for a given event, which other events occur within a certain window of time or space.  For the purposes of this post, our &#8220;events&#8221; are the individual words found in the text and we will track what other words occur within our &#8220;window&#8221;, a position relative to the target word.  For example, consider the phrase &#8220;The quick brown fox jumped over the lazy dog&#8221;. With a window value of 2, the co-occurrence for the word &#8220;jumped&#8221; would be [brown,fox,over,the].  A co-occurrence matrix can be applied to any area that asks: when &#8220;this&#8221; event occurs, what other events tend to happen at the same time? To build our text co-occurrence matrix, we will be implementing the Pairs and Stripes algorithms found in chapter 3 of Data-Intensive Text Processing with MapReduce. The body of text used to create our co-occurrence matrix is the complete works of <a href="http://www.gutenberg.org/ebooks/100" target="_blank">William Shakespeare</a>. </p>
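<p>The window calculation from the example sentence can be reproduced with a few lines of plain Java.  This is a minimal sketch with hypothetical class and method names, intended only to make the &#8220;window&#8221; idea concrete before we move to the MapReduce code.</p>

```java
import java.util.ArrayList;
import java.util.List;

class WindowSketch {

    // Returns the tokens within `window` positions of tokens[index], excluding the token itself.
    static List<String> neighbors(String[] tokens, int index, int window) {
        int start = Math.max(0, index - window);
        int end = Math.min(tokens.length - 1, index + window);
        List<String> result = new ArrayList<>();
        for (int j = start; j <= end; j++) {
            if (j != index) {
                result.add(tokens[j]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "The quick brown fox jumped over the lazy dog".split("\\s+");
        // "jumped" is at index 4; prints [brown, fox, over, the]
        System.out.println(neighbors(tokens, 4, 2));
    }
}
```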
<h3>Pairs</h3>
<p>Implementing the pairs approach is straightforward.  For each line passed in when the map function is called, we will split on spaces creating a String Array.  The next step would be to construct two loops.  The outer loop will iterate over each word in the array and the inner loop will iterate over the &#8220;neighbors&#8221; of the current word.  The number of iterations for the inner loop is dictated by the size of our &#8220;window&#8221; to capture neighbors of the current word.  At the bottom of each iteration in the inner loop, we will emit a WordPair object (consisting of the current word on the left and the neighbor word on the right) as the key, and a count of one as the value. Here is the code for the Pairs implementation:</p>
<pre class="brush:java">
public class PairsOccurrenceMapper extends Mapper&lt;LongWritable, Text, WordPair, IntWritable&gt; {
    private WordPair wordPair = new WordPair();
    private IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length &gt; 1) {
          for (int i = 0; i &lt; tokens.length; i++) {
              wordPair.setWord(tokens[i]);

             int start = (i - neighbors &lt; 0) ? 0 : i - neighbors;
             int end = (i + neighbors &gt;= tokens.length) ? tokens.length - 1 : i + neighbors;
              for (int j = start; j &lt;= end; j++) {
                  if (j == i) continue;
                   wordPair.setNeighbor(tokens[j]);
                   context.write(wordPair, ONE);
              }
          }
      }
  }
}
</pre>
<p>The Reducer for the Pairs implementation will simply sum all of the numbers for the given WordPair key:</p>
<pre class="brush:java">
public class PairsReducer extends Reducer&lt;WordPair,IntWritable,WordPair,IntWritable&gt; {
    private IntWritable totalCount = new IntWritable();
    @Override
    protected void reduce(WordPair key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
             count += value.get();
        }
        totalCount.set(count);
        context.write(key,totalCount);
    }
}

</pre>
<h3>Stripes</h3>
<p>Implementing the stripes approach to co-occurrence is equally straightforward.  The approach is the same, but all of the &#8220;neighbor&#8221; words are collected in a HashMap with the neighbor word as the key and an integer count as the value.  When all of the values have been collected for a given word (the bottom of the outer loop), the word and the hashmap are emitted.  Here is the code for our Stripes implementation:</p>
<pre class="brush:java">
public class StripesOccurrenceMapper extends Mapper&lt;LongWritable,Text,Text,MapWritable&gt; {
    private MapWritable occurrenceMap = new MapWritable();
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length &gt; 1) {
            for (int i = 0; i &lt; tokens.length; i++) {
                word.set(tokens[i]);
                occurrenceMap.clear();

                int start = (i - neighbors &lt; 0) ? 0 : i - neighbors;
                int end = (i + neighbors &gt;= tokens.length) ? tokens.length - 1 : i + neighbors;
                for (int j = start; j &lt;= end; j++) {
                    if (j == i) continue;
                    Text neighbor = new Text(tokens[j]);
                    if (occurrenceMap.containsKey(neighbor)) {
                        IntWritable count = (IntWritable) occurrenceMap.get(neighbor);
                        count.set(count.get() + 1);
                    } else {
                        occurrenceMap.put(neighbor, new IntWritable(1));
                    }
                }
                context.write(word, occurrenceMap);
            }
        }
    }
}
</pre>
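<p>To see the stripe construction without any Hadoop plumbing, here is a minimal plain-Java sketch of the same windowing logic. The names StripeSketch and stripeFor are made up for illustration and are not part of the source code for this post, and java.util.HashMap stands in for MapWritable.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: StripeSketch and stripeFor are hypothetical names,
// and HashMap stands in for Hadoop's MapWritable.
final class StripeSketch {

    // Builds the stripe (neighbor -> count) for the token at index i,
    // looking at most `neighbors` positions to the left and to the right.
    static Map<String, Integer> stripeFor(String[] tokens, int i, int neighbors) {
        Map<String, Integer> stripe = new HashMap<>();
        int start = Math.max(0, i - neighbors);
        int end = Math.min(tokens.length - 1, i + neighbors);
        for (int j = start; j <= end; j++) {
            if (j == i) {
                continue; // a word is not its own neighbor
            }
            stripe.merge(tokens[j], 1, Integer::sum);
        }
        return stripe;
    }

    public static void main(String[] args) {
        String[] tokens = "the quick brown fox the end".split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + stripeFor(tokens, i, 2));
        }
    }
}
```

<p>Each printed line corresponds to one record the mapper would emit: the word as the key and its stripe as the value.</p>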
<p>The Reducer for the Stripes approach is a little more involved because we need to iterate over a collection of maps and then, for each map, iterate over all of its entries:</p>
<pre class="brush:java">
public class StripesReducer extends Reducer&lt;Text, MapWritable, Text, MapWritable&gt; {
    private MapWritable incrementingMap = new MapWritable();

    @Override
    protected void reduce(Text key, Iterable&lt;MapWritable&gt; values, Context context) throws IOException, InterruptedException {
        incrementingMap.clear();
        for (MapWritable value : values) {
            addAll(value);
        }
        context.write(key, incrementingMap);
    }

    private void addAll(MapWritable mapWritable) {
        Set&lt;Writable&gt; keys = mapWritable.keySet();
        for (Writable key : keys) {
            IntWritable fromCount = (IntWritable) mapWritable.get(key);
            if (incrementingMap.containsKey(key)) {
                IntWritable count = (IntWritable) incrementingMap.get(key);
                count.set(count.get() + fromCount.get());
            } else {
                incrementingMap.put(key, fromCount);
            }
        }
    }
}
</pre>
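<p>The element-wise sum performed by the reducer can also be sketched in plain Java. This is only an illustration under simplifying assumptions: StripeMergeSketch and mergeAll are hypothetical names, and HashMap with Integer values replaces MapWritable with IntWritable values.</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: StripeMergeSketch is a hypothetical name;
// HashMap stands in for Hadoop's MapWritable.
final class StripeMergeSketch {

    // Element-wise sum of all stripes emitted for one word.
    static Map<String, Integer> mergeAll(List<Map<String, Integer>> stripes) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> entry : stripe.entrySet()) {
                total.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("quick", 1);
        first.put("brown", 1);

        Map<String, Integer> second = new HashMap<>();
        second.put("brown", 2);
        second.put("end", 1);

        // "brown" appears in both stripes, so its counts are added together.
        System.out.println(mergeAll(Arrays.asList(first, second)));
    }
}
```

<p>Because addition does not care about order or grouping, merging some of the stripes early and the rest later gives the same totals.</p>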
<h3>Conclusion</h3>
<p>When looking at the two approaches, we can see that the Pairs algorithm will generate more key-value pairs than the Stripes algorithm.  Also, the Pairs algorithm captures each individual co-occurrence event, while the Stripes algorithm captures all co-occurrences for a given word in a single map.  Both the Pairs and Stripes implementations would benefit from using a Combiner. Because both reducers perform operations that are commutative and associative, we can simply re-use each approach&#8217;s Reducer as its Combiner. As stated before, creating a co-occurrence matrix has applicability to fields beyond text processing, and both approaches are useful MapReduce algorithms to have in one&#8217;s arsenal.  Thanks for your time.</p>
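<p>The difference in the number of emitted records can be estimated with a small plain-Java sketch. It assumes the Pairs mapper emits one record per (word, neighbor) occurrence using the same window as the Stripes mapper; EmitCountSketch and its methods are hypothetical helpers, not part of the source code for this post.</p>

```java
// Illustrative only: EmitCountSketch is a hypothetical helper.
final class EmitCountSketch {

    // Records a Pairs-style mapper would emit for one line of `tokenCount` words:
    // one record for every (word, neighbor) occurrence inside the window.
    static int pairsRecords(int tokenCount, int neighbors) {
        int total = 0;
        for (int i = 0; i < tokenCount; i++) {
            int start = Math.max(0, i - neighbors);
            int end = Math.min(tokenCount - 1, i + neighbors);
            total += end - start; // window size minus the word itself
        }
        return total;
    }

    // Records the Stripes mapper would emit for the same line: one map per word,
    // and nothing at all for single-word lines.
    static int stripesRecords(int tokenCount) {
        return tokenCount > 1 ? tokenCount : 0;
    }

    public static void main(String[] args) {
        int tokens = 4;
        int neighbors = 2;
        System.out.println("pairs:   " + pairsRecords(tokens, neighbors));
        System.out.println("stripes: " + stripesRecords(tokens));
    }
}
```

<p>For a four-word line with a window of two, this gives 10 records for Pairs against 4 for Stripes, although each Stripes record is larger.</p>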
<h3>Resources</h3>
<ul>
<li><a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" title="Data-Intensive Text Processing with MapReduce" target="_blank">Data-Intensive Text Processing with MapReduce</a> by Jimmy Lin and Chris Dyer</li>
<li><a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=tmm_pap_title_0?ie=UTF8&#038;qid=1347589052&#038;sr=1-1" target="_blank">Hadoop: The Definitive Guide</a> by Tom White</li>
<li><a href="https://github.com/bbejeck/hadoop-algorithms" target="_blank" title="Source Code">Source Code and Tests</a> from blog</li>
<li><a href="http://hadoop.apache.org/docs/r0.20.2/api/index.html">Hadoop API</a></li>
<li><a href="http://mrunit.apache.org/" target="_blank">MRUnit</a> for unit testing Apache Hadoop map reduce jobs</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/cooccurrence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing Hadoop Programs with MRUnit</title>
		<link>http://codingjunkie.net/testing-hadoop-programs-with-mrunit/</link>
		<comments>http://codingjunkie.net/testing-hadoop-programs-with-mrunit/#comments</comments>
		<pubDate>Thu, 01 Nov 2012 12:45:51 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[Testing]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MRUnit]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=2393</guid>
		<description><![CDATA[This post will take a slight detour from implementing the patterns found in Data-Intensive Processing with MapReduce to discuss something equally important, testing. I was inspired in part from a presentation by Tom Wheeler that I attended while at the 2012 Strata/Hadoop World conference in New York. When working with large data sets, unit testing [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg"><img src="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg" alt="" title="hadoop-logo" width="300" height="71" class="alignleft size-full wp-image-2180" /></a> This post will take a slight detour from implementing the patterns found in <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" target="_blank">Data-Intensive Text Processing with MapReduce</a> to discuss something equally important: testing. I was inspired in part by a presentation by <a href="http://www.tomwheeler.com/" target="_blank">Tom Wheeler</a> that I attended at the 2012 Strata/Hadoop World conference in New York.  When working with large data sets, unit testing might not be the first thing that comes to mind.  However, no matter how large your cluster is or how much data you have, the same code is pushed out to all nodes to run the MapReduce job, so Hadoop mappers and reducers lend themselves very well to unit testing.  What is not easy about unit testing Hadoop code is the framework itself.  Luckily there is a library that makes testing Hadoop fairly easy &#8211; <a href="http://mrunit.apache.org/" target="_blank">MRUnit</a>.  MRUnit is based on JUnit and allows for the unit testing of mappers and reducers, plus some limited integration testing of the mapper &#8211; reducer interaction along with combiners, custom counters and partitioners.  We are using the latest release of MRUnit as of this writing, 0.9.0. All of the code under test comes from the <a href="http://codingjunkie.net/text-processing-with-mapreduce-part-2/" title="Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II" target="_blank">previous post</a> on computing averages using local aggregation.</p>
<h3>Setup</h3>
<p>To get started, download MRUnit from <a href="http://mrunit.apache.org/general/downloads.html" target="_blank">here</a>. After you have extracted the tar file, cd into the mrunit-0.9.0-incubating/lib directory.  In there you should see the following:</p>
<ol>
<li>mrunit-0.9.0-incubating-hadoop1.jar</li>
<li>mrunit-0.9.0-incubating-hadoop2.jar</li>
</ol>
<p>As I&#8217;m sure you can guess, the mrunit-0.9.0-incubating-hadoop1.jar is for MapReduce version 1 of Hadoop and mrunit-0.9.0-incubating-hadoop2.jar is for working with the new version of Hadoop&#8217;s MapReduce.  For this post, and all others going forward, we will be using the hadoop-2.0 version from Cloudera&#8217;s <a href="https://ccp.cloudera.com/display/SUPPORT/CDH4+Downloadable+Tarballs" target="_blank">CDH4.1.1 release</a>, so we will need the mrunit-0.9.0-incubating-hadoop2.jar file.  I added MRUnit, JUnit and Mockito as libraries in IntelliJ (JUnit and Mockito are found in the same directory as the MRUnit jar files).  Now that we have set up our dependencies, let&#8217;s start testing.</p>
<h3>Testing Mappers</h3>
<p>Setting up to test a mapper is straightforward and is best explained by looking at some code first.  We will use the in-mapper combining example from the <a href="http://codingjunkie.net/text-processing-with-mapreduce-part-2/" title="Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II" target="_blank">previous post</a>:</p>
<pre class="brush:java">
@Test
public void testCombiningMapper() throws Exception {
   new MapDriver&lt;LongWritable,Text,Text,TemperatureAveragingPair&gt;()
           .withMapper(new AverageTemperatureCombiningMapper())
           .withInput(new LongWritable(4),new Text(temps[3]))
           .withOutput(new Text("190101"),new TemperatureAveragingPair(-61,1))
           .runTest();
 }
</pre>
<p>Notice the fluent API style, which makes the test easy to write.  To write your test you would:</p>
<ol>
<li>Instantiate an instance of the MapDriver class parameterized exactly as the mapper under test.</li>
<li>Add an instance of the Mapper you are testing in the withMapper call.</li>
<li>In the withInput call, pass in your key and input value: in this case a LongWritable with an arbitrary value and a Text object that contains a line from the NCDC weather dataset. The line comes from a String array called &#8216;temps&#8217; that was set up earlier in the test (not displayed here as it would take away from the presentation).</li>
<li>Specify the expected output in the withOutput call. Here we are expecting a Text object with the value of &#8220;190101&#8221; and a TemperatureAveragingPair object containing the values -61 (temperature) and 1 (count).</li>
<li>The last call, runTest, feeds the specified input values into the mapper and compares the actual output against the expected output set in the &#8216;withOutput&#8217; method.</li>
</ol>
<p>One thing to note is the MapDriver only allows one input and output per test.  You can call withInput and withOutput multiple times if you want, but the MapDriver will overwrite the existing values with the new ones, so you will only ever be testing with one input/output at any time. To specify multiple inputs we would use the MapReduceDriver, covered a couple of sections later, but next up is testing the reducer. </p>
<h3>Testing Reducers</h3>
<p>Testing the reducer follows the same pattern as the mapper test. Again, let&#8217;s start by looking at a code example:</p>
<pre class="brush:java">
@Test
public void testReducerCold(){
  List&lt;TemperatureAveragingPair&gt; pairList = new ArrayList&lt;TemperatureAveragingPair&gt;();
      pairList.add(new TemperatureAveragingPair(-78,1));
      pairList.add(new TemperatureAveragingPair(-84,1));
      pairList.add(new TemperatureAveragingPair(-28,1));
      pairList.add(new TemperatureAveragingPair(-56,1));

      new ReduceDriver&lt;Text,TemperatureAveragingPair,Text,IntWritable&gt;()
                .withReducer(new AverageTemperatureReducer())
                .withInput(new Text("190101"), pairList)
                .withOutput(new Text("190101"),new IntWritable(-61))
                .runTest();
    }
</pre>
<ol>
<li>The test starts by creating a list of TemperatureAveragingPair objects to be used as the input to the reducer.</li>
<li>A ReduceDriver is instantiated and, like the MapDriver, is parameterized exactly as the reducer under test.</li>
<li>Next we pass in an instance of the reducer we want to test in the withReducer call.</li>
<li>In the withInput call we pass in the key of &#8220;190101&#8221; and the pairList object created at the start of the test.</li>
<li>Next we specify the output that we expect our reducer to emit: the same key of &#8220;190101&#8221; and an IntWritable representing the average of the temperatures in the list.</li>
<li>Finally runTest is called, which feeds our reducer the inputs specified and compares the output from the reducer against the expected output.</li>
</ol>
<p>The ReduceDriver has the same limitation as the MapDriver of not accepting more than one input/output pair. So far we have tested the Mapper and Reducer in isolation, but we would also like to test them together in an integration test.  Integration testing can be accomplished by using the MapReduceDriver class.  The MapReduceDriver is also the class to use for testing the use of combiners, custom counters or custom partitioners.</p>
<h3>Integration Testing</h3>
<p>To test your mapper and reducer working together, MRUnit provides the MapReduceDriver class.  The MapReduceDriver class works as you would expect by now, with two main differences.  First, you parameterize the input and output types of the mapper and the input and output types of the reducer.  Since the mapper output types need to match the reducer input types, you end up with three pairs of parameterized types.  Second, you can provide multiple inputs and specify multiple expected outputs.  Here is our sample code:</p>
<pre class="brush:java">
@Test
public void testMapReduce(){

new MapReduceDriver&lt;LongWritable,Text,
                      Text,TemperatureAveragingPair,
                      Text,IntWritable&gt;()
                .withMapper(new AverageTemperatureMapper())
                .withInput(new LongWritable(1),new Text(temps[0]))
                .withInput(new LongWritable(2),new Text(temps[1]))
                .withInput(new LongWritable(3),new Text(temps[2]))
                .withInput(new LongWritable(4),new Text(temps[3]))
                .withInput(new LongWritable(5),new Text(temps[6]))
                .withInput(new LongWritable(6),new Text(temps[7]))
                .withInput(new LongWritable(7),new Text(temps[8]))
                .withInput(new LongWritable(8),new Text(temps[9]))
                .withCombiner(new AverageTemperatureCombiner())
                .withReducer(new AverageTemperatureReducer())
                .withOutput(new Text("190101"),new IntWritable(-22))
                .withOutput(new Text("190102"),new IntWritable(-40))
                .runTest();
    }

</pre>
<p>As you can see from the example above, the setup is the same as for the MapDriver and ReduceDriver classes. You pass in instances of the mapper, the reducer and, optionally, a combiner to test.  The MapReduceDriver allows us to pass in multiple inputs that have different keys.  Here the &#8216;temps&#8217; array is the same one referenced in the mapper sample and contains a few lines from the NCDC weather dataset. The keys in those sample lines are the months of January and February of the year 1901, represented as &#8220;190101&#8221; and &#8220;190102&#8221; respectively.  This test is successful, so we gain a little more confidence in the correctness of our mapper and reducer working together.</p>
<h3>Conclusion</h3>
<p>Hopefully, we have made the case for how useful MRUnit can be for testing Hadoop programs. I&#8217;d like to wrap this post up with some of my own observations. Although MRUnit makes unit testing easy for mapper and reducer code, the mapper and reducer examples presented here are fairly simple.  If your map and/or reduce code starts to become more complex, it&#8217;s probably better to decouple the code from the Hadoop framework and test the new classes on their own. Also, as useful as the MapReduceDriver class is for integration testing, it&#8217;s very easy to get to a point where you are no longer testing your code, but the Hadoop framework itself, which has already been tested.  I&#8217;ve come up with my own testing strategy that I intend to use going forward:</p>
<ol>
<li>Unit test the map/reduce code.</li>
<li>Possibly write one integration test with the MapReduceDriver class.</li>
<li>As a sanity check, run a MapReduce job on a single node install (on my laptop) to ensure it runs on the Hadoop framework.</li>
<li>Then run my code on a test cluster, on EC2 using Apache Whirr in my case.</li>
</ol>
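<p>As a rough illustration of the decoupling idea mentioned above, the record parsing used by the temperature mappers could live in a plain class that is testable with ordinary JUnit and no Hadoop types. This is only a sketch: TemperatureLineParser is a hypothetical class that does not exist in the source code for this post, and the column offsets are the ones the mappers use.</p>

```java
// Illustrative only: TemperatureLineParser is a hypothetical class showing how
// parsing logic could be pulled out of a mapper and tested without Hadoop.
final class TemperatureLineParser {
    static final int MISSING = 9999;

    // Sample NCDC record, copied from the comment in the temperature mappers.
    static final String SAMPLE =
        "0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF10899199999999999";

    // Year and month (yyyyMM) are read from columns 15 to 20.
    static String yearMonth(String line) {
        return line.substring(15, 21);
    }

    // The signed air temperature, in tenths of a degree Celsius, is read from columns 87 to 91.
    static int temperature(String line) {
        int start = (line.charAt(87) == '+') ? 88 : 87; // skip a leading plus sign
        return Integer.parseInt(line.substring(start, 92));
    }

    static boolean isMissing(int temperature) {
        return temperature == MISSING;
    }

    public static void main(String[] args) {
        System.out.println(yearMonth(SAMPLE) + " " + temperature(SAMPLE)); // prints "190101 -78"
    }
}
```

<p>A mapper would then shrink to a few lines that call the parser and write the result to the context, leaving very little Hadoop-specific code to test.</p>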
<p>Covering how to set up a single node install on my laptop (OSX Lion) and standing up a cluster on EC2 using Whirr would make this post too long, so I&#8217;ll cover those topics in the next one.  Thanks for your time.</p>
<h3>Resources</h3>
<ul>
<li><a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" title="Data-Intensive Text Processing with MapReduce" target="_blank">Data-Intensive Text Processing with MapReduce</a> by Jimmy Lin and Chris Dyer</li>
<li><a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=tmm_pap_title_0?ie=UTF8&#038;qid=1347589052&#038;sr=1-1" target="_blank">Hadoop: The Definitive Guide</a> by Tom White</li>
<li><a href="https://github.com/bbejeck/hadoop-algorithms" target="_blank" title="Source Code">Source Code</a> from blog</li>
<li><a href="http://hadoop.apache.org/docs/r0.20.2/api/index.html">Hadoop API</a></li>
<li><a href="http://mrunit.apache.org/" target="_blank">MRUnit</a> for unit testing Apache Hadoop map reduce jobs</li>
<li><a href="http://www.gutenberg.org/" title="Project Gutenberg" target="_blank">Project Gutenberg</a> a great source of books in plain text format, great for testing Hadoop jobs locally.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/testing-hadoop-programs-with-mrunit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Working Through Data-Intensive Text Processing with MapReduce &#8211; Local Aggregation Part II</title>
		<link>http://codingjunkie.net/text-processing-with-mapreduce-part-2/</link>
		<comments>http://codingjunkie.net/text-processing-with-mapreduce-part-2/#comments</comments>
		<pubDate>Tue, 16 Oct 2012 03:50:04 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=2282</guid>
		<description><![CDATA[This post continues with the series on implementing algorithms found in the Data Intensive Processing with MapReduce book. Part one can be found here. In the previous post, we discussed using the technique of local aggregation as a means of reducing the amount of data shuffled and transferred across the network. Reducing the amount of [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg"><img src="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg" alt="" title="hadoop-logo" width="300" height="71" class="alignleft size-full wp-image-2180" /></a>This post continues with the series on implementing algorithms found in the <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" target="_blank" title="Working Through Data-Intensive Text Processing with MapReduce">Data-Intensive Text Processing with MapReduce</a> book. Part one can be found <a href="http://codingjunkie.net/text-processing-with-mapreduce-part1/" target="_blank" title="Working Through Data-Intensive Text Processing with MapReduce">here</a>. In the previous post, we discussed using the technique of local aggregation as a means of reducing the amount of data shuffled and transferred across the network.  Reducing the amount of data transferred is one of the top ways to improve the efficiency of a MapReduce job. A word-count MapReduce job was used to demonstrate local aggregation.  Since the results only require a total count, we could re-use the same reducer for our combiner, as changing the order or groupings of the addends will not affect the sum.  But what if you wanted an <i>average</i>? Then the same approach would not work, because an average of averages is not, in general, equal to the average of the original set of numbers. With a little bit of insight though, we can still use local aggregation.  For these examples we will be using a sample of the <a href="https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all" target="_blank">NCDC weather dataset</a> used in the <a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520" target="_blank">Hadoop: The Definitive Guide</a> book. We will calculate the average temperature for each month in the year 1901. 
The averages algorithm for the combiner and the in-mapper combining option can be found in section 3.1.3 of Data-Intensive Text Processing with MapReduce.</p>
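<p>A small worked example may help show why an average of averages goes wrong. The readings below are made-up numbers and AverageOfAveragesSketch is a hypothetical class used only for illustration.</p>

```java
// Illustrative only: made-up readings showing that averaging partial averages
// is not the same as averaging the original values.
final class AverageOfAveragesSketch {

    static double mean(int... values) {
        long sum = 0;
        for (int value : values) {
            sum += value;
        }
        return (double) sum / values.length;
    }

    public static void main(String[] args) {
        // Imagine one combiner saw two readings and another saw only one.
        double firstPartial = mean(10, 20);   // 15.0
        double secondPartial = mean(30);      // 30.0

        double averageOfAverages = (firstPartial + secondPartial) / 2; // 22.5
        double trueAverage = mean(10, 20, 30);                         // 20.0

        System.out.println("average of averages: " + averageOfAverages);
        System.out.println("true average:        " + trueAverage);
    }
}
```

<p>The partial averages lose the information about how many readings each one represents, which is exactly what needs to be carried along.</p>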
<h3>One Size Does Not Fit All</h3>
<p>Last time we described two approaches for reducing data in a MapReduce job: Hadoop Combiners and the in-mapper combining approach.  Combiners are considered an optimization by the Hadoop framework and there are no guarantees on how many times they will be called, if at all.  As a result, mappers must emit data in the form expected by the reducers, so that the final result is the same whether or not combiners run.  To adjust for calculating averages, we need to go back to the mapper and change its output. </p>
<h3>Mapper Changes</h3>
<p>In the word-count example, the non-optimized mapper simply emitted the word and the count of 1. The combiner and in-mapper combining mapper optimized this output by keeping each word as a key in a hash map with the total count as the value.  Each time a word was seen the count was incremented by 1.  With this setup, if the combiner was not called, the reducer would receive the word as a key and a long list of 1&#8217;s to add together, resulting in the same output (of course using the in-mapper combining mapper avoided this issue because it&#8217;s guaranteed to combine results as it&#8217;s part of the mapper code). To compute an average, we will have our base mapper emit a string key (the year and month of the weather observation concatenated together) and a <a href="http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html" target="_blank">custom writable</a> object, called TemperatureAveragingPair.  The <a href="https://github.com/bbejeck/hadoop-algorithms/blob/master/src/bbejeck/mapred/aggregation/TemperatureAveragingPair.java" target="_blank">TemperatureAveragingPair</a> object will contain two numbers (IntWritables), the temperature taken and a count of one.  We will take the MaximumTemperatureMapper from Hadoop: The Definitive Guide and use it as inspiration for creating an AverageTemperatureMapper:</p>
<pre class="brush:java">
public class AverageTemperatureMapper extends Mapper&lt;LongWritable, Text, Text, TemperatureAveragingPair&gt; {
 //sample line of weather data
 //0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF10899199999999999


    private Text outText = new Text();
    private TemperatureAveragingPair pair = new TemperatureAveragingPair();
    private static final int MISSING = 9999;

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String yearMonth = line.substring(15, 21);

        int tempStartPosition = 87;

        if (line.charAt(tempStartPosition) == '+') {
            tempStartPosition += 1;
        }

        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));

        if (temp != MISSING) {
            outText.set(yearMonth);
            pair.set(temp, 1);
            context.write(outText, pair);
        }
    }
}
</pre>
<p>By having the mapper output a key and a TemperatureAveragingPair object, our MapReduce program is guaranteed to produce the correct results regardless of whether the combiner is called.</p>
<h3>Combiner</h3>
<p>We need to reduce the amount of data sent, so we will sum the temperatures, and sum the counts and store them separately. By doing so we will reduce data sent, but preserve the format needed for calculating correct averages.  If/when the combiner is called, it will take all the TemperatureAveragingPair objects passed in and emit a single TemperatureAveragingPair object for the same key, containing the summed temperatures and counts.  Here is the code for the combiner: </p>
<pre class="brush:java">
public class AverageTemperatureCombiner extends Reducer&lt;Text,TemperatureAveragingPair,Text,TemperatureAveragingPair&gt; {
    private TemperatureAveragingPair pair = new TemperatureAveragingPair();

    @Override
    protected void reduce(Text key, Iterable&lt;TemperatureAveragingPair&gt; values, Context context) throws IOException, InterruptedException {
        int temp = 0;
        int count = 0;
        for (TemperatureAveragingPair value : values) {
             temp += value.getTemp().get();
             count += value.getCount().get();
        }
        pair.set(temp,count);
        context.write(key,pair);
    }
}
 </pre>
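<p>The reason this works can be sketched outside Hadoop: as long as sums and counts travel together, pairs can be combined in any grouping before the final division. SumCount below is a hypothetical stand-in for TemperatureAveragingPair, and the readings are made-up numbers.</p>

```java
// Illustrative only: SumCount is a hypothetical stand-in for TemperatureAveragingPair.
final class SumCount {
    final int sum;
    final int count;

    SumCount(int sum, int count) {
        this.sum = sum;
        this.count = count;
    }

    // Combining two pairs adds the sums and the counts separately.
    SumCount plus(SumCount other) {
        return new SumCount(sum + other.sum, count + other.count);
    }

    // The division happens only once, at the very end.
    int average() {
        return sum / count;
    }

    public static void main(String[] args) {
        SumCount a = new SumCount(10, 1);
        SumCount b = new SumCount(20, 1);
        SumCount c = new SumCount(30, 1);

        // A combiner merged a and b before the reducer saw them...
        int withCombiner = a.plus(b).plus(c).average();
        // ...or the reducer received them in a different grouping.
        int otherGrouping = a.plus(b.plus(c)).average();

        System.out.println(withCombiner + " " + otherGrouping); // prints "20 20"
    }
}
```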
<p>However, the combiner gives us no guarantee that less data is actually sent to the reducers, so next we&#8217;ll look at how to get that guarantee.</p>
<h3>In Mapper Combining Averages</h3>
<p>Similar to the word-count example, for calculating averages, the in-mapper-combining mapper will use a hash map with the concatenated year+month as a key and a TemperatureAveragingPair as the value. Each time we get the same year+month combination, we&#8217;ll take the pair object out of the map, add the temperature and increase the count by one.  Once the cleanup method is called, we&#8217;ll emit all pairs with their respective keys:</p>
<pre class="brush:java">
public class AverageTemperatureCombiningMapper extends Mapper&lt;LongWritable, Text, Text, TemperatureAveragingPair&gt; {
 //sample line of weather data
 //0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF10899199999999999


    private static final int MISSING = 9999;
    private Map&lt;String,TemperatureAveragingPair&gt; pairMap = new HashMap&lt;String,TemperatureAveragingPair&gt;();


    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String yearMonth = line.substring(15, 21);

        int tempStartPosition = 87;

        if (line.charAt(tempStartPosition) == '+') {
            tempStartPosition += 1;
        }

        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));

        if (temp != MISSING) {
            TemperatureAveragingPair pair = pairMap.get(yearMonth);
            if(pair == null){
                pair = new TemperatureAveragingPair();
                pairMap.put(yearMonth,pair);
            }
            int temps = pair.getTemp().get() + temp;
            int count = pair.getCount().get() + 1;
            pair.set(temps,count);
        }
    }


    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        Set&lt;String&gt; keys = pairMap.keySet();
        Text keyText = new Text();
        for (String key : keys) {
             keyText.set(key);
             context.write(keyText,pairMap.get(key));
        }
    }
}
</pre>
<p>By following the same pattern of keeping track of data between map calls, we can achieve reliable data reduction by implementing an in-mapper combining strategy.  The same caveats apply for keeping state across all calls to the mapper, but looking at the gains that can be made in processing efficiency, using this approach merits some consideration.</p>
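<p>Stripped of the Hadoop types, the in-mapper combining pattern boils down to accumulating state on every call and emitting it once at the end. The following is a minimal sketch under that simplification: InMapperCombiningSketch, observe and flush are hypothetical names that play the roles of the mapper, map() and cleanup().</p>

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a plain-Java stand-in for the in-mapper combining pattern.
final class InMapperCombiningSketch {

    // yearMonth -> {sum of temperatures, number of readings}
    private final Map<String, int[]> totals = new HashMap<>();

    // Plays the role of map(): update the running totals and emit nothing.
    void observe(String yearMonth, int temperature) {
        int[] pair = totals.computeIfAbsent(yearMonth, key -> new int[2]);
        pair[0] += temperature;
        pair[1] += 1;
    }

    // Plays the role of cleanup(): hand back one aggregated entry per key
    // and reset the state.
    Map<String, int[]> flush() {
        Map<String, int[]> emitted = new HashMap<>(totals);
        totals.clear();
        return emitted;
    }

    public static void main(String[] args) {
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        mapper.observe("190101", -78);
        mapper.observe("190101", -84);
        mapper.observe("190102", -40);

        // Three inputs, but only two records leave the "mapper".
        for (Map.Entry<String, int[]> entry : mapper.flush().entrySet()) {
            System.out.println(entry.getKey() + " sum=" + entry.getValue()[0]
                    + " count=" + entry.getValue()[1]);
        }
    }
}
```

<p>The map held between calls is the state the caveats refer to: it grows with the number of distinct keys, so it has to fit in the memory available to the task.</p>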
<h3>Reducer</h3>
<p>At this point, writing our reducer is easy: take the list of pairs for each key, sum all the temperatures and counts, then divide the sum of the temperatures by the sum of the counts.</p>
<pre class="brush:java">
public class AverageTemperatureReducer extends Reducer&lt;Text, TemperatureAveragingPair, Text, IntWritable&gt; {
    private IntWritable average = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable&lt;TemperatureAveragingPair&gt; values, Context context) throws IOException, InterruptedException {
        int temp = 0;
        int count = 0;
        for (TemperatureAveragingPair pair : values) {
            temp += pair.getTemp().get();
            count += pair.getCount().get();
        }
        average.set(temp / count);
        context.write(key, average);
    }
}
</pre>
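<p>One detail worth keeping in mind is that the reducer divides two ints, so the average is truncated toward zero rather than rounded. The sketch below illustrates this with four made-up readings that sum to -246; IntegerAverageSketch is a hypothetical class used only for illustration.</p>

```java
// Illustrative only: shows how Java's int division truncates an average.
final class IntegerAverageSketch {

    // Same arithmetic as the reducer: int / int truncates toward zero.
    static int truncatedAverage(int sum, int count) {
        return sum / count;
    }

    // The exact average, for comparison.
    static double exactAverage(int sum, int count) {
        return (double) sum / count;
    }

    public static void main(String[] args) {
        int sum = -78 + -84 + -28 + -56; // -246
        int count = 4;
        System.out.println(truncatedAverage(sum, count)); // prints "-61"
        System.out.println(exactAverage(sum, count));     // prints "-61.5"
    }
}
```

<p>If fractional averages matter, one option would be to emit a floating point type from the reducer instead of an IntWritable.</p>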
<h3>Results</h3>
<p>The results are predictable with the combiner and in-mapper-combining mapper options showing substantially reduced data output.<br />
Non-Optimized Mapper Option:</p>
<pre class="brush:text">
12/10/10 23:05:28 INFO mapred.JobClient:     Reduce input groups=12
12/10/10 23:05:28 INFO mapred.JobClient:     Combine output records=0
12/10/10 23:05:28 INFO mapred.JobClient:     Map input records=6565
12/10/10 23:05:28 INFO mapred.JobClient:     Reduce shuffle bytes=111594
12/10/10 23:05:28 INFO mapred.JobClient:     Reduce output records=12
12/10/10 23:05:28 INFO mapred.JobClient:     Spilled Records=13128
12/10/10 23:05:28 INFO mapred.JobClient:     Map output bytes=98460
12/10/10 23:05:28 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
12/10/10 23:05:28 INFO mapred.JobClient:     Combine input records=0
12/10/10 23:05:28 INFO mapred.JobClient:     Map output records=6564
12/10/10 23:05:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/10/10 23:05:28 INFO mapred.JobClient:     Reduce input records=6564
</pre>
<p>Combiner Option:</p>
<pre class="brush:text">
12/10/10 23:07:19 INFO mapred.JobClient:     Reduce input groups=12
12/10/10 23:07:19 INFO mapred.JobClient:     Combine output records=12
12/10/10 23:07:19 INFO mapred.JobClient:     Map input records=6565
12/10/10 23:07:19 INFO mapred.JobClient:     Reduce shuffle bytes=210
12/10/10 23:07:19 INFO mapred.JobClient:     Reduce output records=12
12/10/10 23:07:19 INFO mapred.JobClient:     Spilled Records=24
12/10/10 23:07:19 INFO mapred.JobClient:     Map output bytes=98460
12/10/10 23:07:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
12/10/10 23:07:19 INFO mapred.JobClient:     Combine input records=6564
12/10/10 23:07:19 INFO mapred.JobClient:     Map output records=6564
12/10/10 23:07:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/10/10 23:07:19 INFO mapred.JobClient:     Reduce input records=12
</pre>
<p>In-Mapper-Combining Option:</p>
<pre class="brush:text">
12/10/10 23:09:09 INFO mapred.JobClient:     Reduce input groups=12
12/10/10 23:09:09 INFO mapred.JobClient:     Combine output records=0
12/10/10 23:09:09 INFO mapred.JobClient:     Map input records=6565
12/10/10 23:09:09 INFO mapred.JobClient:     Reduce shuffle bytes=210
12/10/10 23:09:09 INFO mapred.JobClient:     Reduce output records=12
12/10/10 23:09:09 INFO mapred.JobClient:     Spilled Records=24
12/10/10 23:09:09 INFO mapred.JobClient:     Map output bytes=180
12/10/10 23:09:09 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
12/10/10 23:09:09 INFO mapred.JobClient:     Combine input records=0
12/10/10 23:09:09 INFO mapred.JobClient:     Map output records=12
12/10/10 23:09:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/10/10 23:09:09 INFO mapred.JobClient:     Reduce input records=12
</pre>
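<p>To put the counters above in perspective, the reduction factors can be computed directly from the logged values. ShuffleReductionSketch is a hypothetical helper; the numbers are copied from the job output shown here.</p>

```java
// Illustrative only: computes reduction factors from the counters logged above.
final class ShuffleReductionSketch {

    static double ratio(long before, long after) {
        return (double) before / after;
    }

    public static void main(String[] args) {
        // Reduce shuffle bytes: non-optimized run versus either optimized run.
        System.out.println("shuffle bytes reduced by a factor of " + ratio(111594, 210));
        // Reduce input records: non-optimized run versus either optimized run.
        System.out.println("reduce input records reduced by a factor of " + ratio(6564, 12));
    }
}
```

<p>Both optimized runs shuffle roughly 531 times fewer bytes and hand the reducers 547 times fewer records than the non-optimized run.</p>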
<p>Calculated Results:<br />
(NOTE: the temperatures in the sample file are in Celsius * 10)</p>
<table cellpadding="5" cellspacing="5">
<tr>
<td>Non-Optimized</td>
<td>Combiner&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</td>
<td>In-Mapper-Combining Mapper</td>
</tr>
<tr>
<td>
       190101	-25<br />
190102	-91<br />
190103	-49<br />
190104	22<br />
190105	76<br />
190106	146<br />
190107	192<br />
190108	170<br />
190109	114<br />
190110	86<br />
190111	-16<br />
190112	-77
    </td>
<td>
      190101	-25<br />
190102	-91<br />
190103	-49<br />
190104	22<br />
190105	76<br />
190106	146<br />
190107	192<br />
190108	170<br />
190109	114<br />
190110	86<br />
190111	-16<br />
190112	-77
    </td>
<td>
190101	-25<br />
190102	-91<br />
190103	-49<br />
190104	22<br />
190105	76<br />
190106	146<br />
190107	192<br />
190108	170<br />
190109	114<br />
190110	86<br />
190111	-16<br />
190112	-77
    </td>
</tr>
</table>
<h3>Conclusion</h3>
<p>We have covered local aggregation for both the simple case, where one can reuse the reducer as a combiner, and a more complicated case, where some insight into how to structure the data is needed to keep the results correct while still gaining the benefits of locally aggregating data for increased processing efficiency. </p>
<h3>Resources</h3>
<ul>
<li><a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" title="Data-Intensive Text Processing with MapReduce" target="_blank">Data-Intensive Text Processing with MapReduce</a> by Jimmy Lin and Chris Dyer</li>
<li><a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=tmm_pap_title_0?ie=UTF8&#038;qid=1347589052&#038;sr=1-1" target="_blank">Hadoop: The Definitive Guide</a> by Tom White</li>
<li><a href="https://github.com/bbejeck/hadoop-algorithms" target="_blank" title="Source Code">Source Code</a> from blog</li>
<li><a href="http://hadoop.apache.org/docs/r0.20.2/api/index.html">Hadoop API</a></li>
<li><a href="http://mrunit.apache.org/" target="_blank">MRUnit</a> for unit testing Apache Hadoop map reduce jobs</li>
<li><a href="http://www.gutenberg.org/" title="Project Gutenberg" target="_blank">Project Gutenberg</a>, a great source of books in plain text format and handy for testing Hadoop jobs locally.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/text-processing-with-mapreduce-part-2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Working Through Data-Intensive Text Processing with MapReduce</title>
		<link>http://codingjunkie.net/text-processing-with-mapreduce-part1/</link>
		<comments>http://codingjunkie.net/text-processing-with-mapreduce-part1/#comments</comments>
		<pubDate>Wed, 26 Sep 2012 04:04:13 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=2179</guid>
		<description><![CDATA[It has been a while since I last posted, as I&#8217;ve been busy with some of the classes offered by Coursera. There are some very interesting offerings, and it is worth a look. Some time ago, I purchased Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer. The book presents several key MapReduce algorithms, but [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg"><img src="http://codingjunkie.net/wp-content/uploads/2012/09/hadoop-logo.jpeg" alt="" title="hadoop-logo" width="300" height="71" class="alignleft size-full wp-image-2180" /></a>It has been a while since I last posted, as I&#8217;ve been busy with some of the classes offered by <a href="https://www.coursera.org/" title="Coursera" target="_blank">Coursera</a>.  There are some very interesting offerings, and it is worth a look. Some time ago, I purchased <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" title="Data-Intensive Text Processing with MapReduce" target="_blank">Data-Intensive Text Processing with MapReduce</a> by Jimmy Lin and Chris Dyer.  The book presents several key MapReduce algorithms, but only as pseudocode.  My goal is to take the algorithms presented in chapters 3-6 and implement them in Hadoop, using <a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=tmm_pap_title_0?ie=UTF8&#038;qid=1347589052&#038;sr=1-1" target="_blank">Hadoop: The Definitive Guide</a> by Tom White as a reference.  I&#8217;m going to assume familiarity with Hadoop and MapReduce and not cover any introductory material.  So let&#8217;s jump into chapter 3 &#8211; MapReduce Algorithm Design, starting with local aggregation.</p>
<h2>Local Aggregation</h2>
<p>At a very high level, when Mappers emit data, the intermediate results are written to disk and then sent across the network to Reducers for final processing.  Writing to disk and then transferring data across the network is an expensive part of a MapReduce job.  So it stands to reason that, whenever possible, reducing the amount of data sent from mappers would increase the speed of the MapReduce job.  Local aggregation is a technique used to reduce the amount of data and improve the efficiency of our MapReduce job.  Local aggregation cannot take the place of reducers, as we still need a way to gather results with the same key from different mappers.  We are going to consider 3 ways of achieving local aggregation:</p>
<ol>
<li>Using Hadoop Combiner functions.</li>
<li>Two approaches of &#8220;in-mapper&#8221; combining presented in the Data-Intensive Text Processing with MapReduce book.</li>
</ol>
<p>Of course, any optimization is going to have trade-offs, and we&#8217;ll discuss those as well.<br />
To demonstrate local aggregation, we will run the ubiquitous word count job on a plain text version of <a href="http://www.gutenberg.org/cache/epub/46/pg46.txt">A Christmas Carol</a> by Charles Dickens (downloaded from <a href="http://www.gutenberg.org/wiki/Main_Page" target="_blank">Project Gutenberg</a>) on a pseudo-distributed cluster installed on my MacBook Pro, using the hadoop-0.20.2-cdh3u3 distribution from <a href="http://www.cloudera.com/" target="_blank">Cloudera</a>.  In a future post I plan to run the same experiment on an EC2 cluster with more realistically sized data.</p>
<h3>Combiners</h3>
<p>A combiner function is an object that extends the Reducer class. In fact, for our examples here, we are going to re-use the same reducer used in the word count job.  A combiner function is specified when setting up the MapReduce job like so:</p>
<pre class="brush:java">
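 job.setCombinerClass(TokenCountReducer.class); //combiner: local aggregation on the map side
 job.setReducerClass(TokenCountReducer.class);  //the same class still runs as the reducer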
</pre>
<p>Here is the reducer code:</p>
<pre class="brush:java">
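import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TokenCountReducer extends Reducer&lt;Text,IntWritable,Text,IntWritable&gt;{
    @Override
    protected void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
        //Sum every count received for this token
        int count = 0;
        for (IntWritable value : values) {
              count+= value.get();
        }
        context.write(key,new IntWritable(count));
    }
}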
</pre>
<p>The job of a combiner is to do just what the name implies: aggregate data, with the net result of less data being shuffled across the network, which gives us gains in efficiency.  As stated before, keep in mind that reducers are still required to put together results with the same keys coming from different mappers.  Since combiner functions are an optimization, the Hadoop framework offers no guarantees on how many times a combiner will be called, if at all.</p>
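<p>To make the two roles concrete, here is a minimal plain-Java sketch with no Hadoop dependency.  The class and method names are hypothetical and exist only for illustration: combine stands in for the combiner running over one mapper&#8217;s output, and reduce stands in for the reducer merging the partial counts from every mapper.  It assumes Java 8 or later for Map.merge.</p>
<pre class="brush:java">
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {

    //Local aggregation: collapse one mapper's (token, 1) records into (token, count)
    public static Map&lt;String, Integer&gt; combine(List&lt;String&gt; mapperTokens) {
        Map&lt;String, Integer&gt; counts = new TreeMap&lt;&gt;();
        for (String token : mapperTokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    //Final aggregation: merge the partial counts coming from every mapper
    public static Map&lt;String, Integer&gt; reduce(List&lt;Map&lt;String, Integer&gt;&gt; partials) {
        Map&lt;String, Integer&gt; totals = new TreeMap&lt;&gt;();
        for (Map&lt;String, Integer&gt; partial : partials) {
            for (Map.Entry&lt;String, Integer&gt; entry : partial.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        //Two hypothetical mappers that emitted five tokens between them
        Map&lt;String, Integer&gt; mapperA = combine(Arrays.asList("bah", "humbug", "bah"));
        Map&lt;String, Integer&gt; mapperB = combine(Arrays.asList("bah", "scrooge"));
        System.out.println((mapperA.size() + mapperB.size()) + " records shuffled instead of 5");
        System.out.println(reduce(Arrays.asList(mapperA, mapperB)));
    }
}
</pre>
<p>With this sample input, four records are shuffled instead of five, and the reduce step still has to add the two partial counts for the token both mappers saw.</p>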
<h3>In Mapper Combining Option 1</h3>
<p>The first alternative to using Combiners (figure 3.2 page 41) is very straightforward and makes a slight modification to our original word count mapper:</p>
<pre class="brush:java">
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerDocumentMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        IntWritable writableCount = new IntWritable();
        Text text = new Text();
        //Token counts for this single call to map, i.e. one line of input
        Map&lt;String,Integer&gt; tokenMap = new HashMap&lt;String, Integer&gt;();
        StringTokenizer tokenizer = new StringTokenizer(value.toString());

        while(tokenizer.hasMoreTokens()){
            String token = tokenizer.nextToken();
            Integer count = tokenMap.get(token);
            if(count == null) count = 0;
            tokenMap.put(token,count + 1);
        }

        //Emit one record per distinct token in the line
        for (Map.Entry&lt;String,Integer&gt; entry : tokenMap.entrySet()) {
             text.set(entry.getKey());
             writableCount.set(entry.getValue());
             context.write(text,writableCount);
        }
    }
}
</pre>
<p>As we can see here, instead of emitting a count of 1 for each word encountered, we use a map to keep track of each word already processed. Then, when all of the tokens are processed, we loop through the map and emit the total count for each word encountered in that line.</p>
<h3>In Mapper Combining Option 2</h3>
<p>The second option of in-mapper combining (figure 3.3 page 41) is very similar to the above example, with two distinctions &#8211; when the hash map is created and when we emit the results contained in the map.  In the above example, a map is created and has its contents dumped over the wire for <i>each</i> invocation of the map method.  In this example we are going to make the map an instance variable and shift the instantiation of the map to the setup method in our mapper.  Likewise, the contents of the map will not be sent out to the reducers until all of the calls to the map method have completed and the cleanup method is called.</p>
<pre class="brush:java">
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AllDocumentMapper extends Mapper&lt;LongWritable,Text,Text,IntWritable&gt; {

    //Token counts kept across every call to map made by this mapper task
    private  Map&lt;String,Integer&gt; tokenMap;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
           tokenMap = new HashMap&lt;String, Integer&gt;();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while(tokenizer.hasMoreTokens()){
            String token = tokenizer.nextToken();
            Integer count = tokenMap.get(token);
            if(count == null) count = 0;
            tokenMap.put(token,count + 1);
        }
    }


    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        //Emit one record per distinct token only after all input has been mapped
        IntWritable writableCount = new IntWritable();
        Text text = new Text();
        for (Map.Entry&lt;String,Integer&gt; entry : tokenMap.entrySet()) {
            text.set(entry.getKey());
            writableCount.set(entry.getValue());
            context.write(text,writableCount);
        }
    }
}
</pre>
<p>As we can see from the above code example, the mapper is keeping track of unique word counts across all calls to the map method.  By keeping track of unique tokens and their counts, there should be a substantial reduction in the number of records sent to the reducers, which in turn should improve the running time of the MapReduce job.  This accomplishes the same effect as using the combiner function option provided by the MapReduce framework, but in this case you are guaranteed that the combining code will be called.  But there are some caveats with this approach also.  Keeping state across map calls could prove problematic and definitely is a violation of the functional spirit of a &#8220;map&#8221; function.  Also, because state is kept across all map calls in a task, memory could be another issue to contend with, depending on the data used in the job.  Ultimately, one would have to weigh all of the trade-offs to determine the best approach.</p>
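<p>The difference between the approaches comes down to how many records each one emits.  The following plain-Java sketch (hypothetical class and method names, no Hadoop involved, Java 7 or later) counts the records that a per-token mapper, option 1 and option 2 would emit for the same input, treating each string as one call to map.</p>
<pre class="brush:java">
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

public class EmittedRecordsSketch {

    //Adds every token of the line to the given set
    private static void collect(String line, Set&lt;String&gt; into) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            into.add(tokenizer.nextToken());
        }
    }

    //Per-token mapper: one record for every token
    public static int perToken(List&lt;String&gt; lines) {
        int records = 0;
        for (String line : lines) {
            records += new StringTokenizer(line).countTokens();
        }
        return records;
    }

    //Option 1: one record per distinct token within each call to map
    public static int perLine(List&lt;String&gt; lines) {
        int records = 0;
        for (String line : lines) {
            Set&lt;String&gt; distinct = new HashSet&lt;&gt;();
            collect(line, distinct);
            records += distinct.size();
        }
        return records;
    }

    //Option 2: one record per distinct token across all calls to map
    public static int perTask(List&lt;String&gt; lines) {
        Set&lt;String&gt; distinct = new HashSet&lt;&gt;();
        for (String line : lines) {
            collect(line, distinct);
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        List&lt;String&gt; lines = Arrays.asList(
                "marley was dead to begin with",
                "old marley was as dead as a door-nail");
        System.out.println(perToken(lines) + " / " + perLine(lines) + " / " + perTask(lines));
    }
}
</pre>
<p>For these two lines the counts are 14, 13 and 10.  Option 1 only helps when a word repeats within a single line, while option 2 also removes repeats across lines, at the cost of holding every distinct token in memory until cleanup.</p>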
<h4>Results</h4>
<p>Now let&#8217;s take a look at some results from the different mappers.  Since the job was run in pseudo-distributed mode, actual running times are irrelevant, but we can still infer how using local aggregation could impact the efficiency of a MapReduce job running on a real cluster.  </p>
<p>Per Token Mapper:</p>
<pre class="brush:text">
12/09/13 21:25:32 INFO mapred.JobClient:     Reduce shuffle bytes=366010
12/09/13 21:25:32 INFO mapred.JobClient:     Reduce output records=7657
12/09/13 21:25:32 INFO mapred.JobClient:     Spilled Records=63118
12/09/13 21:25:32 INFO mapred.JobClient:     Map output bytes=302886
</pre>
<p>In Mapper Combining Option 1:</p>
<pre class="brush:text">
12/09/13 21:28:15 INFO mapred.JobClient:     Reduce shuffle bytes=354112
12/09/13 21:28:15 INFO mapred.JobClient:     Reduce output records=7657
12/09/13 21:28:15 INFO mapred.JobClient:     Spilled Records=60704
12/09/13 21:28:15 INFO mapred.JobClient:     Map output bytes=293402
</pre>
<p>In Mapper Combining Option 2:</p>
<pre class="brush:text">
12/09/13 21:30:49 INFO mapred.JobClient:     Reduce shuffle bytes=105885
12/09/13 21:30:49 INFO mapred.JobClient:     Reduce output records=7657
12/09/13 21:30:49 INFO mapred.JobClient:     Spilled Records=15314
12/09/13 21:30:49 INFO mapred.JobClient:     Map output bytes=90565
</pre>
<p>Combiner Option:</p>
<pre class="brush:text">
12/09/13 21:22:18 INFO mapred.JobClient:     Reduce shuffle bytes=105885
12/09/13 21:22:18 INFO mapred.JobClient:     Reduce output records=7657
12/09/13 21:22:18 INFO mapred.JobClient:     Spilled Records=15314
12/09/13 21:22:18 INFO mapred.JobClient:     Map output bytes=302886
12/09/13 21:22:18 INFO mapred.JobClient:     Combine input records=31559
12/09/13 21:22:18 INFO mapred.JobClient:     Combine output records=7657
</pre>
<p>As expected, the Mapper that did no combining had the worst results, followed closely by the first in-mapper combining option (although these results could have been made better had the data been cleaned up before running the word count). The second in-mapper combining option and the combiner function had virtually identical results.  The significant fact is that both produced <b><i>about 70% fewer</i></b> reduce shuffle bytes than the first two options.  Reducing the amount of bytes sent over the network to the reducers by that amount would surely have a positive impact on the efficiency of a MapReduce job.  There is one point to keep in mind here: Combiners and in-mapper combining cannot be used in every MapReduce job. In this case the word count lends itself very nicely to such an enhancement, but that might not always be true.</p>
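<p>The size of the saving can be checked directly from the &#8220;Reduce shuffle bytes&#8221; counters in the logs above.  This is a small worked calculation in Java; the class and method names are made up for the example.</p>
<pre class="brush:java">
public class ShuffleSavings {

    //Percentage of shuffle bytes saved relative to a baseline run
    public static double percentSaved(long baselineBytes, long optimizedBytes) {
        return 100.0 * (baselineBytes - optimizedBytes) / baselineBytes;
    }

    public static void main(String[] args) {
        long perToken = 366010;  //Per Token Mapper
        long option1 = 354112;   //in-mapper combining, option 1
        long option2 = 105885;   //in-mapper combining, option 2 (the combiner run reported the same value)

        System.out.printf("option 1 vs per token: %.1f%%%n", percentSaved(perToken, option1));
        System.out.printf("option 2 vs per token: %.1f%%%n", percentSaved(perToken, option2));
        System.out.printf("option 2 vs option 1:  %.1f%%%n", percentSaved(option1, option2));
    }
}
</pre>
<p>Option 1 saves only about 3% over the per-token mapper, while option 2 and the combiner save roughly 70% against either of the first two runs.</p>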
<h3>Conclusion</h3>
<p>As you can see, the benefits of using either in-mapper combining or the Hadoop combiner function deserve serious consideration when looking to improve the performance of your MapReduce jobs. As for which approach to use, it is up to you to weigh the trade-offs of each.</p>
<h3>Resources</h3>
<ul>
<li> <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421" title="Data-Intensive Text Processing with MapReduce" target="_blank">Data-Intensive Text Processing with MapReduce</a> by Jimmy Lin and Chris Dyer</li>
<li><a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=tmm_pap_title_0?ie=UTF8&#038;qid=1347589052&#038;sr=1-1" target="_blank">Hadoop: The Definitive Guide</a> by Tom White</li>
<li><a href="https://github.com/bbejeck/hadoop-algorithms" target="_blank" title="Source Code">Source Code</a> from blog</li>
<li><a href="http://mrunit.apache.org/" target="_blank">MRUnit</a> for unit testing Apache Hadoop map reduce jobs</li>
<li><a href="http://www.gutenberg.org/" title="Project Gutenberg" target="_blank">Project Gutenberg</a>, a great source of books in plain text format and handy for testing Hadoop jobs locally.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/text-processing-with-mapreduce-part1/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Google Guava BloomFilter</title>
		<link>http://codingjunkie.net/guava-bloomfilter/</link>
		<comments>http://codingjunkie.net/guava-bloomfilter/#comments</comments>
		<pubDate>Thu, 22 Mar 2012 05:01:04 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Guava]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=2144</guid>
		<description><![CDATA[When the Guava project released version 11.0, one of the new additions was the BloomFilter class. A BloomFilter is a unique data-structure used to indicate if an element is contained in a set. What makes a BloomFilter interesting is it will indicate if an element is absolutely not contained, or may be contained in a [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://codingjunkie.net/wp-content/uploads/2011/11/google-guava.gif" alt="" title="google-guava" width="48" height="48" class="alignleft size-full wp-image-1133" />When the Guava project released version 11.0, one of the new additions was the BloomFilter class.  A BloomFilter is a unique data structure used to indicate if an element is contained in a set.  What makes a BloomFilter interesting is it will indicate if an element is <i>absolutely not</i> contained, or <i>may be</i> contained in a set. This property of never having a false negative makes the BloomFilter a great candidate for use as a guard condition to help prevent performing unnecessary and expensive operations.  While BloomFilters have received good exposure lately, using one meant rolling your own, or doing a Google search for code. The trouble with rolling your own BloomFilter is getting the correct hash function to make the filter effective.  Considering Guava uses the Murmur Hash for its implementation, we now have the usefulness of an effective BloomFilter just a library away.</p>
<h4>BloomFilter Crash Course</h4>
<p>BloomFilters are essentially bit vectors.  At a high level BloomFilters work in the following manner: </p>
<ol>
<li>Add the element to the filter.</li>
<li>Hash it a few times, then set the bits to 1 where the index matches the results of the hash.</li>
</ol>
<p>When testing if an element is in the set, you follow the same hashing procedure and check if the bits are set to 1 or 0. This process is how a BloomFilter can guarantee an element does not exist. If the bits aren&#8217;t set, it’s simply impossible for the element to be in the set.  However, a positive answer means either the element is in the set or a hashing collision occurred.  A more detailed description of a BloomFilter can be found <a href="http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html" target="_blank">here</a> and a good tutorial on BloomFilters <a href="http://llimllib.github.com/bloomfilter-tutorial/" target="_blank">here</a>. According to <a href="http://en.wikipedia.org/wiki/Bloom_filter" target="_blank">Wikipedia</a>, Google uses BloomFilters in BigTable to avoid disk lookups for non-existent items.  Another interesting usage is <a href="http://asemanfar.com/Using-a-bloom-filter-to-optimize-a-SQL-query" target="_blank">using a BloomFilter to optimize a SQL query</a>.</p>
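<p>The procedure described above can be sketched in a few lines of plain Java.  This is a toy for illustration, not how Guava implements its BloomFilter: the class name is hypothetical, and it derives its bit indexes from String.hashCode with simple double hashing, where Guava uses Murmur.  It assumes Java 8 or later for Math.floorMod.</p>
<pre class="brush:java">
import java.util.BitSet;

public class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public TinyBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    //Derives the i-th bit index from two base hashes (double hashing).
    //Assumption: String.hashCode is good enough for a demo.
    private int index(String element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.reverse(h1) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    //Hash the element a few times and set the bit at each resulting index
    public void put(String element) {
        for (int i = 0; i &lt; numHashes; i++) {
            bits.set(index(element, i));
        }
    }

    //Repeat the same hashing and check whether every bit is set
    public boolean mightContain(String element) {
        for (int i = 0; i &lt; numHashes; i++) {
            if (!bits.get(index(element, i))) {
                return false; //a cleared bit proves the element was never added
            }
        }
        return true; //every bit is set: the element is present, or a collision occurred
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1024, 3);
        filter.put("scrooge");
        System.out.println(filter.mightContain("scrooge")); //always true: no false negatives
        System.out.println(filter.mightContain("marley"));  //usually false, but a collision could make it true
    }
}
</pre>
<p>Note that nothing is ever removed and the elements themselves are never stored; only the bits are kept, which is why a positive answer can never be more than a &#8220;maybe&#8221;.</p>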
<h4>Using the Guava BloomFilter</h4>
<p>A Guava BloomFilter is created by calling the static method create on the BloomFilter class,<br />
passing in a <a href="http://docs.guava-libraries.googlecode.com/git-history/v11.0.2/javadoc/com/google/common/hash/Funnel.html" target="_blank">Funnel</a> object and an int representing the expected number of insertions. A Funnel, also new in Guava 11, is an object that can send data into a <a href="http://docs.guava-libraries.googlecode.com/git-history/v11.0.2/javadoc/com/google/common/hash/Sink.html" target="_blank">Sink</a>.  The following example uses the default false positive probability of 3%. Guava provides a <a href="http://docs.guava-libraries.googlecode.com/git-history/v11.0.2/javadoc/com/google/common/hash/Funnels.html">Funnels</a> class containing two static methods providing implementations of the Funnel interface for inserting a CharSequence or byte array into a filter.</p>
<pre class="brush:java">
//Creating the BloomFilter
BloomFilter&lt;byte[]&gt; bloomFilter = BloomFilter.create(Funnels.byteArrayFunnel(), 1000);

//Putting elements into the filter
//A BigInteger representing a key of some sort
bloomFilter.put(bigInteger.toByteArray());

//Testing for element in set
boolean mayBeContained = bloomFilter.mightContain(bigIntegerII.toByteArray());

</pre>
<p>UPDATE: based on the comment from Louis Wasserman, here&#8217;s how to create a BloomFilter for BigIntegers with a custom Funnel implementation:</p>
<pre class="brush:java">
//Create the custom Funnel
class BigIntegerFunnel implements Funnel&lt;BigInteger&gt; {
        @Override
        public void funnel(BigInteger from, Sink into) {
            into.putBytes(from.toByteArray());
        }
    }

//Creating the BloomFilter
BloomFilter&lt;BigInteger&gt; bloomFilter = BloomFilter.create(new BigIntegerFunnel(), 1000);

//Putting elements into the filter
//A BigInteger representing a key of some sort
bloomFilter.put(bigInteger);

//Testing for element in set
boolean mayBeContained = bloomFilter.mightContain(bigIntegerII);
</pre>
<h4>Considerations</h4>
<p>It&#8217;s critical to estimate the number of expected insertions correctly. As insertions into the filter approach or exceed the expected number, the BloomFilter begins to fill up and as a result will generate more false positives, to the point of being useless.  There is another version of the BloomFilter.create method taking an additional parameter, a double representing the desired false positive probability (must be greater than 0 and less than 1). The false positive probability affects the number of hashes used for storing or searching for elements.  The lower the desired probability, the higher the number of hashes performed.</p>
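<p>To get a feel for how the expected insertions and the false positive probability interact, here is a sketch of the standard textbook Bloom filter sizing formulas: bits m = -n ln(p) / (ln 2)^2 and hashes k = (m / n) ln 2.  The class and method names are hypothetical, and Guava&#8217;s internal calculation may round differently, so treat the numbers as estimates.</p>
<pre class="brush:java">
public class BloomSizing {

    //m = -n * ln(p) / (ln 2)^2, rounded up
    public static long optimalBits(long expectedInsertions, double falsePositiveProbability) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedInsertions * Math.log(falsePositiveProbability) / (ln2 * ln2));
    }

    //k = (m / n) * ln 2, rounded to the nearest whole hash and never less than 1
    public static int optimalHashes(long expectedInsertions, long bits) {
        return Math.max(1, (int) Math.round((double) bits / expectedInsertions * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1000;
        for (double p : new double[] {0.03, 0.01}) {
            long bits = optimalBits(n, p);
            System.out.println("p=" + p + " bits=" + bits + " hashes=" + optimalHashes(n, bits));
        }
    }
}
</pre>
<p>For 1000 expected insertions, moving from a 3% to a 1% false positive probability raises the estimate from 5 hashes to 7 and needs roughly a third more bits, which matches the point above that a lower probability costs more hashing.</p>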
<h4>Conclusion</h4>
<p>A BloomFilter is a useful item for a developer to have in his/her toolbox.  The Guava project now makes it very simple to begin using a BloomFilter when the need arises.  I hope you enjoyed this post.  Helpful comments and suggestions are welcomed.</p>
<h4>References</h4>
<ul>
<li><a href="https://github.com/bbejeck/guava-blog/blob/master/src/test/java/bbejeck/guava/hash/BloomFilterTest.java" target="_blank">Unit Test Demo of Guava BloomFilter</a>.</li>
<li><a href="http://docs.guava-libraries.googlecode.com/git-history/v11.0.2/javadoc/com/google/common/hash/BloomFilter.html">BloomFilter class</a></li>
<li><a href="http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html" target="_blank">All You Want to Know about BloomFilters</a>.</li>
<li><a href="http://llimllib.github.com/bloomfilter-tutorial/" target="_blank">BloomFilter Tutorial</a>.</li>
<li><a href="http://en.wikipedia.org/wiki/Bloom_filter" target="_blank">BloomFilter on Wikipedia</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/guava-bloomfilter/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Event Programming Example: Google Guava EventBus and Java 7 WatchService</title>
		<link>http://codingjunkie.net/eventbus-watchservice/</link>
		<comments>http://codingjunkie.net/eventbus-watchservice/#comments</comments>
		<pubDate>Fri, 24 Feb 2012 06:30:05 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[Guava]]></category>
		<category><![CDATA[Java 7]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=2004</guid>
		<description><![CDATA[This post is going to cover using the Guava EventBus to publish changes to a directory or sub-directories detected by the Java 7 WatchService. The Guava EventBus is a great way to add publish/subscribe communication to an application. The WatchService, new in the Java 7 java.nio.file package, is used to monitor a directory for changes. [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://codingjunkie.net/wp-content/uploads/2012/01/Java_Logo1-150x150.png" alt="" title="Java_Logo" width="150" height="150" class="alignleft size-thumbnail wp-image-1676" />This post is going to cover using the Guava EventBus to publish changes to a directory or sub-directories detected by the Java 7 WatchService.  The Guava EventBus is a great way to add publish/subscribe communication to an application.  The WatchService, new in the Java 7 java.nio.file package, is used to monitor a directory for changes. Since the EventBus and WatchService have been covered in previous posts, we will not be covering these topics in any depth here.  For more information, the reader is encouraged to view the <a href="http://codingjunkie.net/guava-eventbus/" title="Event Programming with Google Guava EventBus" target="_blank">EventBus</a> and <a href="http://codingjunkie.net/java-7-watchservice/" title="What’s New in Java 7: WatchService" target="_blank">WatchService</a> posts.  [NOTE: post updated on 02/28/2012 for clarity.]</p>
<h3>Why Use the EventBus</h3>
<p>There are two main reasons for using the EventBus with a WatchService.</p>
<ol>
<li>We don&#8217;t want to poll for events, but would rather receive asynchronous notification.</li>
<li>Once events are processed, the WatchKey.reset method needs to be called to enable any new changes to be queued.  While the WatchKey object is thread safe, it&#8217;s important that the reset method is called only after all threads have finished processing events, leading to somewhat of a coordination hassle.  Using a single thread to process the events, invoke the reset method, then publish the changes via the EventBus, eliminates this problem.</li>
</ol>
<p>Our plan to accomplish this is simple and will involve taking the following steps:</p>
<ol>
<li>Instantiate an instance of the WatchService.</li>
<li>Register every directory recursively, starting with a given Path object.</li>
<li>Take events off the WatchService queue, then process and publish those events.</li>
<li>Start up a separate thread for taking events off the queue and publishing.</li>
</ol>
<p>The code examples that follow are the more relevant highlights from the <a href="https://github.com/bbejeck/Java-7/blob/master/src/main/java/bbejeck/nio/files/directory/event/DirectoryEventWatcherImpl.java" target="_blank">DirectoryEventWatcherImpl</a> class that is going to do all of this work.</p>
<h3>Registering Directories with the WatchService</h3>
<p>While adding or deleting a sub-directory will generate an event, any changes <i>inside</i> a sub-directory of a watched directory will not. We are going to compensate for this by recursively going through all sub-directories (via the Files.walkFileTree method) and register each one with the WatchService object (previously defined in the example here):</p>
<pre class="brush:java">

private void registerDirectories() throws IOException {
        Files.walkFileTree(startPath, new WatchServiceRegisteringVisitor());
}

private class WatchServiceRegisteringVisitor extends SimpleFileVisitor&lt;Path&gt;{
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
         dir.register(watchService,ENTRY_CREATE,ENTRY_DELETE,ENTRY_MODIFY);
         return FileVisitResult.CONTINUE;
    }
}
</pre>
<p>On line 2 the Files.walkFileTree method uses the WatchServiceRegisteringVisitor class defined on line 5 to register every directory with the WatchService.  The registered events are creation of files/directories, deletion of files/directories or updates to a file.  </p>
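<p>For readers who want to try the registration step in isolation, here is a self-contained sketch that uses only the JDK.  The class name, method names and the temporary directory layout are assumptions made for the example; the visitor logic mirrors the WatchServiceRegisteringVisitor above, but collects the returned WatchKeys so we can see how many directories were registered.</p>
<pre class="brush:java">
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_MODIFY;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class RecursiveRegistrationSketch {

    //Registers start and every directory beneath it, returning one WatchKey per directory
    public static List&lt;WatchKey&gt; registerAll(Path start, final WatchService watchService) throws IOException {
        final List&lt;WatchKey&gt; keys = new ArrayList&lt;&gt;();
        Files.walkFileTree(start, new SimpleFileVisitor&lt;Path&gt;() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                keys.add(dir.register(watchService, ENTRY_CREATE, ENTRY_DELETE, ENTRY_MODIFY));
                return FileVisitResult.CONTINUE;
            }
        });
        return keys;
    }

    //Builds a temporary tree root/a/b and reports how many directories were registered
    public static int countRegistered() {
        try (WatchService watchService = FileSystems.getDefault().newWatchService()) {
            Path root = Files.createTempDirectory("watch-demo");
            Files.createDirectories(root.resolve("a").resolve("b"));
            return registerAll(root, watchService).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countRegistered()); //root, a and a/b
    }
}
</pre>
<p>One design point to keep in mind: the walk only sees directories that exist at registration time, so a sub-directory created later would need to be registered when its ENTRY_CREATE event arrives.</p>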
<h3>Publishing Events</h3>
<p>The next step is to create a FutureTask that will do the work of checking the queue and publishing the events.</p>
<pre class="brush:java">
 private void createWatchTask() {
    watchTask = new FutureTask&lt;&gt;(new Callable&lt;Integer&gt;() {
       private int totalEventCount;
       @Override
       public Integer call() throws Exception {
           while (keepWatching) {
               WatchKey watchKey = watchService.poll(10, TimeUnit.SECONDS);
               if (watchKey != null) {
                  List&lt;WatchEvent&lt;?&gt;&gt; events = watchKey.pollEvents();
                  Path watched = (Path) watchKey.watchable();
                  PathEvents pathEvents = new PathEvents(watchKey.isValid(), watched);
                  for (WatchEvent event : events) {
                        pathEvents.add(new PathEvent((Path) event.context(), event.kind()));
                        totalEventCount++;
                  }
                  watchKey.reset();
                  eventBus.post(pathEvents);
                }
          }
           return totalEventCount;
        }
      });
    }

private void startWatching() {
  new Thread(watchTask).start();
}
</pre>
<p>On line 7, we poll the WatchService, waiting up to 10 seconds for a queued WatchKey. When a non-null WatchKey is returned, the first step is to retrieve the events (line 9), then get the directory where the events occurred (line 10). On line 11 a PathEvents object is created, taking a boolean and the watched directory as constructor arguments. Lines 12-15 loop over the events retrieved on line 9, using the target Path and event type as arguments to create a PathEvent object for each one.  The WatchKey.reset method is called on line 16, setting the WatchKey state back to ready, making it eligible to receive new events and be placed back into the queue. Finally, on line 17 the EventBus publishes the PathEvents object to all subscribers.  It&#8217;s important to note here that the PathEvents and PathEvent classes are immutable.  The totalEventCount that is returned from the Callable is never exposed in the API, but is used for testing purposes.  The startWatching method on line 25 starts the thread to run the watching/publishing task defined above.</p>
<h3>Conclusion</h3>
<p>By pairing the WatchService with the Guava EventBus we are able to manage the WatchKey and process events in a single thread and notify any number of subscribers asynchronously of the events. It is hoped the reader found this example useful.  As always comments and suggestions are welcomed.</p>
<h4>Resources</h4>
<ul>
<li><a href="https://github.com/bbejeck/Java-7/tree/master/src/main/java/bbejeck/nio/files/directory/event" target="new">Source code</a> and <a href="https://github.com/bbejeck/Java-7/blob/master/src/test/java/bbejeck/nio/files/directory/event/DirectoryEventWatcherImplTest.java" target="new">unit test</a> for this post</li>
<li><a href="http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/eventbus/package-summary.html" title="Guava EventBus" target="_blank">EventBus API</a></li>
<li><a href="http://docs.oracle.com/javase/7/docs/api/java/nio/file/WatchService.html" title="WatchService API" target="_blank">WatchService API</a></li>
<li>Previous post on the <a href="http://codingjunkie.net/java-7-watchservice/" target="new">WatchService</a>.</li>
<li>Previous post on the <a href="http://codingjunkie.net/guava-eventbus/" target="new">EventBus</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/eventbus-watchservice/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What&#8217;s New in Java 7: WatchService</title>
		<link>http://codingjunkie.net/java-7-watchservice/</link>
		<comments>http://codingjunkie.net/java-7-watchservice/#comments</comments>
		<pubDate>Fri, 24 Feb 2012 06:29:47 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[Java 7]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=1950</guid>
		<description><![CDATA[Of all the new features in Java 7, one of the more interesting is the WatchService, adding the capability to watch a directory for changes. The WatchService maps directly to the native file event notification mechanism, if available. If a native event notification mechanism is not available, then the default implementation will use polling. As [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://codingjunkie.net/wp-content/uploads/2012/01/Java_Logo1-150x150.png" alt="" title="Java_Logo" width="150" height="150" class="alignleft size-thumbnail wp-image-1676" />Of all the new features in Java 7, one of the more interesting is the WatchService, adding the capability to watch a directory for changes.  The WatchService maps directly to the native file event notification mechanism, if available.  If a native event notification mechanism is not available, then the default implementation will use polling.  As a result, the responsiveness, ordering of events and details available are implementation specific.</p>
<h3>Watching A Directory</h3>
<p>The Path interface provides a register method that takes a WatchService object and varargs of type WatchEvent.Kind as arguments.  There are 4 events to watch for:</p>
<ol>
<li>ENTRY_CREATE</li>
<li>ENTRY_DELETE</li>
<li>ENTRY_MODIFY</li>
<li>OVERFLOW</li>
</ol>
<p>While the first 3 types are self-explanatory, OVERFLOW indicates that events may have been lost or discarded. A WatchService is created by calling FileSystem.newWatchService().  Watching a directory is accomplished by registering a Path object with the WatchService:</p>
<pre class="brush:java">
import static java.nio.file.StandardWatchEventKinds.*;
import java.nio.file.*;

// Register /home for create, delete and modify events
Path path = Paths.get("/home");
WatchService watchService = FileSystems.getDefault().newWatchService();
WatchKey watchKey = path.register(watchService, ENTRY_CREATE, ENTRY_DELETE, ENTRY_MODIFY);
</pre>
<p>As we can see from the example, the register method returns a WatchKey object. The WatchKey is a token that represents the registration of the Path with the WatchService. </p>
<h3>The WatchKey</h3>
<p>As a result of the registration process, the WatchKey is in a &#8216;ready&#8217; state and is considered valid.  A WatchKey remains valid until one of the following occurs:</p>
<ol>
<li>WatchKey.cancel() is called.</li>
<li>The directory being watched is no longer available.</li>
<li>The WatchService object is closed.</li>
</ol>
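<p>The validity rules above can be checked with a small sketch. The class name, the method name and the scratch directory are made up for illustration; only the java.nio.file calls come from the JDK:</p>
<pre class="brush:java">
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class WatchKeyValiditySketch {

    // Returns {valid after register, valid after cancel, valid after closing the service}
    static boolean[] validityStates() throws IOException {
        // hypothetical scratch directory, created only for this sketch
        Path dir = Files.createTempDirectory("watch-key-sketch");
        boolean[] states = new boolean[3];
        WatchKey second;
        try (WatchService watchService = FileSystems.getDefault().newWatchService()) {
            WatchKey first = dir.register(watchService, ENTRY_CREATE);
            states[0] = first.isValid();   // freshly registered
            first.cancel();
            states[1] = first.isValid();   // cancelled explicitly
            second = dir.register(watchService, ENTRY_CREATE);
        }
        states[2] = second.isValid();      // the service was closed by try-with-resources
        Files.delete(dir);
        return states;
    }

    public static void main(String[] args) throws IOException {
        for (boolean state : validityStates()) {
            System.out.println(state);
        }
    }
}
</pre>
<p>The second registration is only there so that closing the WatchService has a still-valid key to invalidate.</p>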
<h3>Checking For Changes</h3>
<p>When a change is detected, the WatchKey state is set to &#8216;signaled&#8217; and it is placed in a queue for processing. Getting WatchKeys off the queue involves calling WatchService.poll() or WatchService.take(). Here is a basic example:</p>
<pre class="brush:java">
boolean notDone = true;
while (notDone) {
    try {
        // poll returns null if no key is signaled before the timeout expires
        WatchKey watchKey = watchService.poll(60, TimeUnit.SECONDS);
        if (watchKey == null) {
            continue;
        }
        List&lt;WatchEvent&lt;?&gt;&gt; events = watchKey.pollEvents();
        for (WatchEvent&lt;?&gt; event : events) {
            // ...process the events
        }
        if (!watchKey.reset()) {
            // ...handle situation no longer valid
        }
    } catch (InterruptedException e) {
        // restore the interrupt flag and leave the loop
        Thread.currentThread().interrupt();
        notDone = false;
    }
}
</pre>
<p>The poll call on line 5 returns null when no WatchKey is signaled within the timeout, so we check for that before using the key. On line 9 we are calling the pollEvents method to retrieve all the events for this WatchKey object. On line 13 you&#8217;ll notice a call to the reset method.  The reset method sets the WatchKey state back to &#8216;ready&#8217; and returns a boolean indicating if the WatchKey is still valid.  If there are any pending events, then the WatchKey will be immediately re-queued, otherwise it will remain in the ready state until new events are detected.  Calling reset on a WatchKey that has been cancelled or is in the ready state will have no effect.  If a WatchKey is cancelled while it is queued, it will remain in the queue until retrieved. Cancellation could also happen automatically if the directory was deleted or is no longer available.</p>
<h3>Processing Events</h3>
<p>Now that we have detected an event, how do we determine:</p>
<ol>
<li>On which directory did the event happen? (assuming more than one directory registered)</li>
<li>What was the actual event? (assuming listening for more than one event)</li>
<li>What was the target of the event, i.e. which Path object was created, deleted or updated?</li>
</ol>
<p>Jumping in to line 11 in the previous example, we will parse the needed information from a WatchKey and a WatchEvent:</p>
<pre class="brush:java">
 //WatchKey watchable returns the calling Path object of Path.register
 Path watchedPath = (Path) watchKey.watchable();
 //returns the event type
 WatchEvent.Kind&lt;?&gt; eventKind = event.kind();
 //returns the context of the event
 Path target = (Path)event.context();
</pre>
<p>On line 6 we see the WatchEvent.context method being invoked.  The context method will return a Path object if the event was a create, delete or modify, and that Path is relative to the watched directory. For an OVERFLOW event the context is implementation specific and may be null. It&#8217;s important to know that when an event is received there is no guarantee that the program(s) performing the operation have completed, so some level of coordination may be required. </p>
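<p>To see these pieces working together, here is a minimal end-to-end sketch. The class name, the method name and the file name are my own inventions for illustration, and the 30 second timeout is just a generous guess to allow for polling implementations. It registers a directory, creates a file in it, and resolves the relative context against the watched directory to get the full path:</p>
<pre class="brush:java">
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.OVERFLOW;

import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

public class WatchEventSketch {

    // Watches dir, creates fileName inside it and returns the path reported
    // by the first non-OVERFLOW event, or null if nothing arrives in time.
    static Path firstCreatedPath(Path dir, String fileName) throws Exception {
        try (WatchService watchService = FileSystems.getDefault().newWatchService()) {
            dir.register(watchService, ENTRY_CREATE);
            Files.createFile(dir.resolve(fileName));

            // assumption: 30 seconds is long enough even when the service falls back to polling
            WatchKey key = watchService.poll(30, TimeUnit.SECONDS);
            if (key == null) {
                return null;
            }
            Path watched = (Path) key.watchable();
            for (WatchEvent&lt;?&gt; event : key.pollEvents()) {
                if (event.kind() == OVERFLOW) {
                    continue; // OVERFLOW carries no usable Path context
                }
                Path relative = (Path) event.context();
                return watched.resolve(relative);
            }
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("watch-event-sketch");
        System.out.println(firstCreatedPath(dir, "hello.txt"));
    }
}
</pre>
<p>Resolving the context against the watchable is the step that is easy to forget, since the context alone is only the file name.</p>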
<h3>Conclusion</h3>
<p>The WatchService is a very interesting feature of the new java.nio.file package in Java 7.  That being said, there are two things about the WatchService to keep in mind:</p>
<ol>
<li>The WatchService does not pick up events for sub-directories of a watched directory.</li>
<li>We still need to poll the WatchService for events, rather than receive asynchronous notification.</li>
</ol>
<p>To address the above issues, my next post will use the Guava EventBus to process the WatchService events. Thanks for your time.</p>
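<p>In the meantime, the first limitation can be worked around by registering every directory in a tree individually. The following is only a rough sketch, not the approach from the follow-up post; the class and method names are hypothetical:</p>
<pre class="brush:java">
import static java.nio.file.StandardWatchEventKinds.*;

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;

public class RecursiveRegistrationSketch {

    // Walks the tree under root and registers every directory with the service.
    // The returned map lets a caller translate a signaled WatchKey back to its directory.
    static Map&lt;WatchKey, Path&gt; registerAll(Path root, final WatchService watchService) throws IOException {
        final Map&lt;WatchKey, Path&gt; keys = new HashMap&lt;&gt;();
        Files.walkFileTree(root, new SimpleFileVisitor&lt;Path&gt;() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                WatchKey key = dir.register(watchService, ENTRY_CREATE, ENTRY_DELETE, ENTRY_MODIFY);
                keys.put(key, dir);
                return FileVisitResult.CONTINUE;
            }
        });
        return keys;
    }

    public static void main(String[] args) throws IOException {
        // hypothetical scratch tree: root, root/a and root/a/b
        Path root = Files.createTempDirectory("watch-tree-sketch");
        Files.createDirectories(root.resolve("a").resolve("b"));
        try (WatchService watchService = FileSystems.getDefault().newWatchService()) {
            System.out.println(registerAll(root, watchService).size() + " directories registered");
        }
    }
}
</pre>
<p>Directories created after the walk are not covered, so a complete solution would also need to register new directories as their ENTRY_CREATE events arrive.</p>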
<h4>Resources</h4>
<ol>
<li><a href="http://docs.oracle.com/javase/7/docs/api/java/nio/file/package-summary.html" target="new">java.nio.file</a> package that contains the WatchService, WatchKey and WatchEvent objects discussed here.</li>
<li>A <a href="https://github.com/bbejeck/Java-7/blob/master/src/test/java/bbejeck/nio/files/watch/WatchDirectoryTest.java" target="new">unit test</a> demonstrating the WatchService</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/java-7-watchservice/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Creating An Asynchronous, Recursive DirectoryStream in Java 7</title>
		<link>http://codingjunkie.net/globbing-directories-in-java/</link>
		<comments>http://codingjunkie.net/globbing-directories-in-java/#comments</comments>
		<pubDate>Fri, 10 Feb 2012 07:13:04 +0000</pubDate>
		<dc:creator>Bill B</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[Concurrency]]></category>
		<category><![CDATA[Java 7]]></category>

		<guid isPermaLink="false">http://codingjunkie.net/?p=1865</guid>
		<description><![CDATA[Continuing with my series on the Java 7 java.nio.file package, this time covering the DirectoryStream interface. In this post we are going to implement our own DirectoryStream that will iterate over the files in an entire directory tree, not just a single directory. Our goal in the end is to have something that works similarly to [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://codingjunkie.net/wp-content/uploads/2012/01/Java_Logo1.png"><img src="http://codingjunkie.net/wp-content/uploads/2012/01/Java_Logo1-150x150.png" alt="" title="Java_Logo" width="150" height="150" class="alignleft size-thumbnail wp-image-1676" /></a>Continuing with my series on the Java 7 java.nio.file package, this time covering the DirectoryStream interface.  In this post we are going to implement our own DirectoryStream that will iterate over the files in an entire directory tree, not just a single directory.  Our goal in the end is to have something that works similarly to Ruby&#8217;s <a href="http://ruby-doc.org/core-1.9.3/Dir.html#method-c-glob" target="new">Dir.glob(&#8220;**&#8221;)</a> method. </p>
<h3>Requirements</h3>
<p>Here are the requirements:</p>
<ol>
<li>Starting from a given Path object, it should recursively go through each directory looking for files that match a given pattern.</li>
<li>It needs to be a single method call, something like DirUtils.glob(Path path, String pattern).</li>
<li>We want the search to be asynchronous, so we can iterate over files as they are found.</li>
</ol>
<p>The plan to meet our objective is straightforward:</p>
<ol>
<li>Create a DirectoryStream.Filter object from the given pattern.</li>
<li>Use the Files.walkFileTree method to recursively search for files that match our pattern.</li>
<li>Run the search in a separate thread, and when a matching file is found, place it in a queue.</li>
<li>The DirectoryStream iterator will pull the files out of the queue as they become available.</li>
</ol>
<p><b>Disclosure</b>: This post is merely an attempt to see if making an asynchronous, recursive DirectoryStream will work. I make no guarantees about the code presented here, YMMV.</p>
<h3>The Filter</h3>
<p>DirectoryStream has a nested interface, Filter, that is used to determine if the path object should be accepted.  The filter object is constructed by first creating a <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/file/FileSystem.html#getPathMatcher(java.lang.String)" target="new">PathMatcher</a> object then using the PathMatcher in the Filter&#8217;s accept method:</p>
<pre class="brush:java">
public static DirectoryStream.Filter&lt;Path&gt; buildGlobFilter(String pattern) {
        final PathMatcher pathMatcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return new DirectoryStream.Filter&lt;Path&gt;() {
            @Override
            public boolean accept(Path entry) throws IOException {
                return pathMatcher.matches(entry);
            }
        };
    }
</pre>
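<p>One detail of glob matching is worth a quick illustration: a single * does not cross directory boundaries, while ** does. The sketch below uses made-up class and method names and a made-up path to compare matching on the file name alone with matching on the whole path:</p>
<pre class="brush:java">
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobMatchSketch {

    // Matches the pattern against the last name element only
    static boolean matchesFileName(String pattern, Path path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(path.getFileName());
    }

    // Matches the pattern against the path exactly as given
    static boolean matchesFullPath(String pattern, Path path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(path);
    }

    public static void main(String[] args) {
        // hypothetical relative path, it does not need to exist on disk
        Path path = Paths.get("src", "main", "Example.java");
        System.out.println(matchesFileName("*.java", path));
        System.out.println(matchesFullPath("*.java", path));
        System.out.println(matchesFullPath("**/*.java", path));
    }
}
</pre>
<p>This is why a simple pattern such as *.java is best applied to the file name rather than to the full path of each visited file.</p>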
<h3>The Recursive Search</h3>
<p>To perform the search we are going to use the Files.walkFileTree method with a <a href="https://github.com/bbejeck/Java-7/blob/master/src/main/java/bbejeck/nio/files/visitor/FunctionVisitor.java" target="new">FunctionVisitor</a> object. FunctionVisitor is a subclass of SimpleFileVisitor that takes a Guava Function as a constructor argument. The provided function is called as each file is visited and is checked by the <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/file/DirectoryStream.Filter.html" target="new">DirectoryStream.Filter</a> object for a match.  We wrap the creation of the Function object in a method call:</p>
<pre class="brush:java">
 private Function&lt;Path, FileVisitResult&gt; getFunction(final Filter&lt;Path&gt; filter) {
        return new Function&lt;Path, FileVisitResult&gt;() {
            @Override
            public FileVisitResult apply(Path input) {
                try {
                    if (filter.accept(input.getFileName())) {
                        pathsBlockingQueue.offer(input);
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
                return (pathTask.isCancelled()) ? FileVisitResult.TERMINATE : FileVisitResult.CONTINUE;
            }
        };
    }
</pre>
<p>On line 6 the filter is checking for a match on the filename. If a match is found, the path object is placed in a LinkedBlockingQueue, pathsBlockingQueue, on line 7. On line 12, we see the pathTask instance variable, which is a FutureTask handle to the search thread.  If pathTask has been cancelled we terminate the search, otherwise continue.</p>
<h3>The Search Thread</h3>
<p>To run the search, we wrap the Files.walkFileTree method in a Callable object, which is used as a constructor argument to pathTask, the FutureTask object.  By using a FutureTask, we have a hook into canceling the search as well as being able to check its running status.</p>
<pre class="brush:java">
 private void findFiles(final Path startPath, final Filter&lt;Path&gt; filter) {
        pathTask = new FutureTask&lt;Void&gt;(new Callable&lt;Void&gt;() {
            @Override
            public Void call() throws Exception {
                Files.walkFileTree(startPath, new FunctionVisitor(getFunction(filter)));
                return null;
            }
        });
        start(pathTask);
    }
</pre>
<p>On line 5 we see the call to the getFunction method from the previous example.  On line 9 the method call start is used to spin off the search in its own thread:</p>
<pre class="brush:java">
private void start(FutureTask&lt;Void&gt; futureTask) {
        new Thread(futureTask).start();
    }
</pre>
<p>We use a Thread object instead of an ExecutorService because we only need a single thread to execute once. Since we are not submitting any subsequent tasks, an ExecutorService really is not necessary.</p>
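<p>As a small aside, here is a sketch of what running a FutureTask on a plain Thread gives us: a result to wait on, plus the isDone and isCancelled checks. The class and method names are hypothetical and the Callable bodies are stand-ins for the real search:</p>
<pre class="brush:java">
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;

public class FutureTaskThreadSketch {

    // Runs a stand-in task on its own thread and waits for the result
    static int runToCompletion() throws Exception {
        FutureTask&lt;Integer&gt; task = new FutureTask&lt;&gt;(new Callable&lt;Integer&gt;() {
            @Override
            public Integer call() {
                return 42; // stand-in for the real work
            }
        });
        new Thread(task).start();
        return task.get(); // blocks until call() has returned
    }

    // Cancels a task before its thread starts; a cancelled FutureTask never runs call()
    static boolean cancelledBeforeStart() throws InterruptedException {
        FutureTask&lt;Void&gt; task = new FutureTask&lt;&gt;(new Callable&lt;Void&gt;() {
            @Override
            public Void call() {
                return null;
            }
        });
        task.cancel(true);
        Thread thread = new Thread(task);
        thread.start();
        thread.join();
        return task.isCancelled() &amp;&amp; task.isDone();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runToCompletion());
        System.out.println(cancelledBeforeStart());
    }
}
</pre>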
<h3>Implementing A DirectoryStream</h3>
<p>To implement a DirectoryStream there are two methods that need to be overridden &#8211; iterator and close. DirectoryStream extends Closeable, which in turn extends AutoCloseable, so when used with the try-with-resources construct, the close method is automatically invoked, releasing any underlying resources. The most interesting part of this code is overriding the iterator method:</p>
<pre class="brush:java">
public Iterator&lt;Path&gt; iterator() {
        confirmNotClosed();
        findFiles(startPath, filter);
        return new Iterator&lt;Path&gt;() {
            Path path;
            @Override
            public boolean hasNext() {
                try {
                    path = pathsBlockingQueue.poll();
                    while (!pathTask.isDone() &amp;&amp; path == null) {
                        path = pathsBlockingQueue.poll(5, TimeUnit.MILLISECONDS);
                    }
                    if (path == null) {
                        // the search may have finished just after queuing a final match
                        path = pathsBlockingQueue.poll();
                    }
                    return path != null;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return false;
            }

            @Override
            public Path next() {
                return path;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException("Removal not supported");
            }
        };
    }
</pre>
<p>On line 2 we check to see if the iterator was previously closed and throw an UnsupportedOperationException if that is the case.  Line 3 kicks off the search thread as we saw from the previous example.  The hasNext method is where the real &#8220;brains&#8221; of the class resides.  Since the search is asynchronous the DirectoryStream will attempt to iterate over the files immediately, but there needs to be some coordination with the search thread as it finds matching path objects and places them into the queue.  On line 9 we first call poll (a non blocking call) and set the result to the path variable defined on line 5.  If the path object is null we drop into a while loop that calls poll again, but this time it&#8217;s a blocking call with a timeout of 5 milliseconds.  We&#8217;ll stay in that loop until a non-null path is returned or the search thread has completed or been cancelled.  Because the search can finish just after queuing a final match, lines 13 to 16 poll the queue once more before giving up.  On line 17 we check for null to determine if we have a valid result or if there are no more path objects to return.  If hasNext returned true, the next method returns the path object that was retrieved by the previous hasNext call.  Note that hasNext removes an element from the queue each time it is called, so it should be called exactly once before each call to next, which is how the for-each loop uses it.  This process will continue until all results have been returned. </p>
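<p>For readers who want to experiment with the hand-off between the search thread and the consumer without the full class, here is a condensed, self-contained sketch. It is not the AsynchronousRecursiveDirectoryStream itself: the class and method names are hypothetical, a plain SimpleFileVisitor stands in for FunctionVisitor, and the results are collected into a list rather than exposed through an Iterator:</p>
<pre class="brush:java">
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class AsyncGlobSketch {

    // Walks the tree on a background thread and drains matches from a queue as they arrive
    static List&lt;Path&gt; collect(final Path start, String pattern) throws InterruptedException {
        final PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        final BlockingQueue&lt;Path&gt; queue = new LinkedBlockingQueue&lt;&gt;();
        FutureTask&lt;Void&gt; search = new FutureTask&lt;&gt;(new Callable&lt;Void&gt;() {
            @Override
            public Void call() throws IOException {
                Files.walkFileTree(start, new SimpleFileVisitor&lt;Path&gt;() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                        if (matcher.matches(file.getFileName())) {
                            queue.offer(file);
                        }
                        return FileVisitResult.CONTINUE;
                    }
                });
                return null;
            }
        });
        new Thread(search).start();

        List&lt;Path&gt; found = new ArrayList&lt;&gt;();
        while (true) {
            Path next = queue.poll(5, TimeUnit.MILLISECONDS);
            if (next != null) {
                found.add(next);
            } else if (search.isDone()) {
                // the walk has finished: pick up anything queued since the last poll
                queue.drainTo(found);
                break;
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        // hypothetical scratch tree with two .txt files and one .log file
        Path root = Files.createTempDirectory("async-glob-sketch");
        Files.createDirectories(root.resolve("sub"));
        Files.createFile(root.resolve("a.txt"));
        Files.createFile(root.resolve("sub").resolve("b.txt"));
        Files.createFile(root.resolve("sub").resolve("c.log"));
        System.out.println(collect(root, "*.txt"));
    }
}
</pre>
<p>The order of the results depends on how the file system lists directory entries, so callers should not rely on it.</p>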
<h3>Conclusion</h3>
<p>Putting this all together we now have the <a href="https://github.com/bbejeck/Java-7/blob/master/src/main/java/bbejeck/nio/files/directory/AsynchronousRecursiveDirectoryStream.java" target="new">AsynchronousRecursiveDirectoryStream</a> class.  By combining our new class with <a href="https://github.com/bbejeck/Java-7/blob/master/src/main/java/bbejeck/nio/util/DirUtils.java" target="new">DirUtils</a> we can iterate over an entire directory tree:</p>
<pre class="brush:java">
try(DirectoryStream&lt;Path&gt; directoryStream = DirUtils.glob(path,"*")){
   for(Path entry : directoryStream){
      ....
    }
}
</pre>
<p>I hope you found this useful.  As always, comments and suggestions are welcome. Thanks for your time.</p>
<h4>Resources</h4>
<ul>
<li><a href="https://github.com/bbejeck/Java-7/blob/master/src/main/java/bbejeck/nio/files/directory/AsynchronousRecursiveDirectoryStream.java" target="new">source code</a> and <a href="https://github.com/bbejeck/Java-7/blob/master/src/test/java/bbejeck/nio/files/directory/AsynchronousRecursiveDirectoryStreamTest.java" target="new">unit test</a> for material covered</li>
<li><a href="http://docs.oracle.com/javase/7/docs/api/java/nio/file/package-summary.html" target="new">Java 7 java.nio.file api</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://codingjunkie.net/globbing-directories-in-java/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
