Random Thoughts on Coding

Whatever comes to mind at the moment.

Machine Learning With Kafka Streams

The last two posts on Kafka Streams (Kafka Processor API, KStreams DSL) introduced Kafka Streams and described how to get started using the API. This post will demonstrate a use case that, prior to the development of Kafka Streams, would have required running a separate cluster with another framework. We are going to take a live stream of data from Twitter and perform language analysis to identify tweets in English, French, and Spanish. The library we are going to use for this is LingPipe. Here’s a description of LingPipe taken directly from its website:

LingPipe is a tool kit for processing text using computational linguistics. LingPipe is used to do tasks like:

  • Find the names of people, organizations or locations in news
  • Automatically classify Twitter search results into categories
  • Suggest correct spellings of queries

The Hosebird Client, Twitter’s client library for its Streaming API, will be used to create the stream of data that will be sent to Kafka as messages.
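To give a feel for where the post is headed, here is a minimal sketch of the classification step. It assumes a pre-trained LingPipe language model on disk; the file path, topic names (twitterData, english-tweets, and so on), and category labels are all placeholders that depend on how the classifier was trained, and the DSL shown is the original 0.10-era KStreamBuilder API, which differs in newer Kafka releases.

```java
import java.io.File;
import java.util.Properties;

import com.aliasi.classify.LMClassifier;
import com.aliasi.util.AbstractExternalizable;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.Predicate;

public class LanguageAnalysisStream {

    public static void main(String[] args) throws Exception {
        // Hypothetical pre-trained LingPipe language model; the path and the
        // category labels ("english", "french", "spanish") depend entirely on
        // how the classifier was trained.
        final LMClassifier classifier = (LMClassifier) AbstractExternalizable
                .readObject(new File("langid.classifier"));

        KStreamBuilder builder = new KStreamBuilder();

        // Topic names are placeholders for this sketch.
        KStream<String, String> tweets = builder.stream("twitterData");

        Predicate<String, String> isEnglish = (key, tweet) ->
                "english".equals(classifier.classify(tweet).bestCategory());
        Predicate<String, String> isFrench = (key, tweet) ->
                "french".equals(classifier.classify(tweet).bestCategory());
        Predicate<String, String> isSpanish = (key, tweet) ->
                "spanish".equals(classifier.classify(tweet).bestCategory());

        // Split the stream by detected language; each branch goes to its own topic.
        KStream<String, String>[] byLanguage = tweets.branch(isEnglish, isFrench, isSpanish);
        byLanguage[0].to("english-tweets");
        byLanguage[1].to("french-tweets");
        byLanguage[2].to("spanish-tweets");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "language-analysis");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        new KafkaStreams(builder, props).start();
    }
}
```

The nice part of this approach is that the classification work runs inside the same JVM as the Kafka client; there is no separate processing cluster to stand up.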

Kafka Streams - the KStreams API

The last post covered the new Kafka Streams library, specifically the “low-level” Processor API. This time we are going to cover the “high-level” API, the Kafka Streams DSL. While the Processor API gives you greater control over the details of building streaming applications, the trade-off is more verbose code. In most cases, however, the level of detail provided by the Processor API is not required, and the KStreams DSL will get the job done. Compared to the imperative style of the Processor API, the KStreams DSL is more functional and declarative: you’ll find building an application is more a matter of stating “what” you want to accomplish than “how”. Additionally, since many interfaces in the Kafka Streams API are Java 8 friendly (method references and lambda expressions can be substituted for concrete types), the KStreams DSL allows for building powerful applications quickly with minimal code, as the short example below shows. This post won’t be as detailed as the previous one, since the general description of Kafka Streams applies to both APIs. Here we’ll focus on how to implement the same functionality presented in the previous post with the KStreams DSL.
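As a taste of that style, here is a sketch of an entire filter-and-transform topology expressed in a few chained calls, using one lambda and one method reference. The topic names are made up, and the KStreamBuilder class is from the original release of the DSL; later Kafka versions replaced it with StreamsBuilder.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class DslStyleExample {

    public static void main(String[] args) {
        KStreamBuilder builder = new KStreamBuilder();

        // "What", not "how": each step declares a transformation and the DSL
        // builds the underlying processor topology for you.
        builder.stream(Serdes.String(), Serdes.String(), "source-topic")
               .filter((key, value) -> value != null && !value.isEmpty()) // lambda
               .mapValues(String::toUpperCase)                            // method reference
               .to(Serdes.String(), Serdes.String(), "sink-topic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kstreams-dsl-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        new KafkaStreams(builder, props).start();
    }
}
```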

Kafka Streams - the Processor API

If you work on systems delivering large quantities of data, you have probably heard of Kafka, if you aren’t using it already. At a very high level, Kafka is a fault-tolerant, distributed publish-subscribe messaging system designed for speed and the ability to handle hundreds of thousands of messages per second. Kafka has many applications, one of which is real-time processing. Real-time processing typically involves reading data from a topic (source), doing some analytic or transformation work, then writing the results to another topic (sink). Currently, to do this type of work your choices are:

  1. Write your own custom code, using a KafkaConsumer to read in the data and a KafkaProducer to write out the results.
  2. Use a full-fledged stream-processing framework such as Spark Streaming, Flink, or Storm.

While either of those approaches is great, in some cases it would be nice to have a solution that sits somewhere in the middle of the two. To that end, a Processor API was proposed via the KIP (Kafka Improvement Proposals) process. The aim of the Processor API is to introduce a client that enables processing data consumed from Kafka and writing the results back into Kafka. There are two components of the processor client:

  1. A “lower-level” processor that provides APIs for data processing, composable processing, and local state storage (a brief sketch follows this list).
  2. A “higher-level” stream DSL that would cover most processor implementation needs.
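To make the lower-level component concrete, here is a minimal, illustrative processor that upper-cases each value and forwards it downstream. The class is my own invention for this sketch, and the signatures match the original Processor API; later Kafka releases reworked this interface.

```java
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

// Illustrative only: a processor that upper-cases each value and forwards
// it to the next node in the topology.
public class UpperCaseProcessor implements Processor<String, String> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        // Keep the context around to forward records and access local state stores.
        this.context = context;
    }

    @Override
    public void process(String key, String value) {
        context.forward(key, value.toUpperCase());
        context.commit();
    }

    @Override
    public void punctuate(long timestamp) {
        // No periodic work needed for this example.
    }

    @Override
    public void close() {
        // No resources to release.
    }
}
```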

Java 8 CompletableFutures Part I

When Java 8 was released a while ago, a great concurrency tool was added: the CompletableFuture class. The CompletableFuture is a Future that can have its value explicitly set and, more interestingly, can be chained together to support dependent actions triggered by the CompletableFuture’s completion. CompletableFutures are analogous to the ListenableFuture class found in Guava. While the two offer similar functionality, there won’t be any comparisons done in this post. I have previously covered ListenableFutures in this post. While the coverage of ListenableFutures is a little dated, most of the information should still apply. The documentation for the CompletableFuture class is comprehensive, but lacks concrete examples of how to use it. My goal is to show how to use CompletableFutures through a series of simple examples in unit tests. Originally I was going to cover the CompletableFuture in one post, but there is so much information that it seems better to break up coverage into three parts:

  1. Creating/combining tasks and adding listeners for follow-on work (a small example follows this list).
  2. Handling errors and error recovery.
  3. Canceling and forcing completion.
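As a preview of part 1, here is a small self-contained sketch (a plain main method rather than a unit test) showing the basics: explicitly completing a future, chaining dependent actions, and combining two independent futures. The names and values are made up for illustration.

```java
import java.util.concurrent.CompletableFuture;

public class CompletableFutureBasics {

    public static void main(String[] args) throws Exception {
        // Create an async task and chain a dependent action onto its completion.
        CompletableFuture<String> greeting = CompletableFuture
                .supplyAsync(() -> "Hello")
                .thenApply(s -> s + ", CompletableFuture");

        // Explicitly set a future's value by hand.
        CompletableFuture<String> manual = new CompletableFuture<>();
        manual.complete("set explicitly");

        // Combine two independent futures once both have finished.
        CompletableFuture<String> combined =
                greeting.thenCombine(manual, (a, b) -> a + " / " + b);

        // Add a "listener" for follow-on work when the combined future completes,
        // then block so the program doesn't exit before the listener runs.
        combined.thenAccept(result -> System.out.println("Result: " + result))
                .get();
    }
}
```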

Scripting Tmux for Kafka

I’ve known about tmux for some time now, but I kept putting off working with it. Lately I’ve started looking at the Processor API available in Kafka. After opening up 4-5 terminal windows (ZooKeeper, Kafka, sending messages, receiving messages) and toggling between them, I quickly realized I needed a way to open all of these terminal sessions at once and have them centralized. I had the perfect use case for tmux! This post is about scripting tmux to save time during development, in this case while working with Kafka.