A short time ago, I started a side project to learn the latest version of Antlr. I decided to build something that has always interested me: a SQL parser for the Lucene search engine. Even though the parser is a learning exercise, I thought someone else might find it useful. This post covers the functionality of the LuceneQueryParser. Building the parser with Antlr4 will be covered in later posts.
Introduction and Examples
LuceneSqlParser supports a subset of standard SQL. Here are some examples:
LuceneSqlParser returns a BooleanQuery. The BooleanQuery will contain different types of Lucene query objects depending on the predicates used. There is also a Searcher class available for use with the parser. Searcher abstracts away opening a Lucene IndexSearcher, iterating over the ScoreDoc array, and extracting results. Next, we’ll take a look at the rules used to parse the SQL.
At a high level, a SQL statement is broken down and parsed in the following manner:
- The ‘Select’ statement contains a comma-separated list of fields stored in a Lucene index. The parser stores the fields in a Set<String> for use by the Searcher. To retrieve all fields we can specify the ‘*’ operator or omit the ‘select’ clause altogether.
- The ‘From’ clause takes a path in single quotes representing the location of a Lucene index.
- The ‘Where’ clause contains the predicates for searching the data.
- The parser analyzes values used for searching in a fashion similar to the StandardAnalyzer (lower-cased, with whitespace and special characters removed). There are exceptions to this rule: the PrefixQuery, RegexQuery and WildcardQuery are special cases, for which the parser removes only characters that are not special characters used by Lucene.
- Predicates can be nested to an arbitrary depth. For example:
where field='1' and (field2='2' and field3='3' and (field4='4' and (field5='5' and field='6')))
- The ‘Select’ and ‘From’ clauses are optional.
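The value analysis described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ValueNormalizer` and `normalize` are hypothetical names that are not part of the parser, the sketch keeps spaces so that multi-word values can still be split into terms later, and it preserves only the `*` and `?` pattern characters (a regular expression would need a wider set).

```java
import java.util.Locale;

public class ValueNormalizer {

    // Hypothetical helper, not part of the parser's actual API.
    // Lower-cases the value and keeps only letters, digits and spaces.
    // With keepPatternChars set, '*' and '?' are preserved as well.
    public static String normalize(String value, boolean keepPatternChars) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toLowerCase(Locale.ROOT).toCharArray()) {
            boolean plain = Character.isLetterOrDigit(c) || c == ' ';
            boolean pattern = keepPatternChars && (c == '*' || c == '?');
            if (plain || pattern) {
                out.append(c);
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Foo!", false)); // plain term value
        System.out.println(normalize("Fo*", true));   // pattern value, '*' kept
    }
}
```

The boolean flag stands in for whatever the real parser uses to tell a plain term apart from a prefix, wildcard or regex value.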
LuceneQueryParser defines two static methods, toParse and toParseBooleanQuery. The toParse method is intended to be used in conjunction with the Searcher. toParse returns a QueryParseResults object that contains the path for the index, the set of fields to retrieve and the BooleanQuery to execute. The toParseBooleanQuery method is intended to be used for parsing only and returns a BooleanQuery.
SQL to Lucene Query Functionality Mapping
We will now list the supported Lucene query objects and how they are mapped from the input SQL.
- <field name> = 'Foo' converts to a TermQuery.
- <field name> = 'Bar Baz' converts to a PhraseQuery.
- <field name> like 'Fo*' converts to a
- <field name> like 'B?l?' converts to a
- <field name> matches('[Bb].*[hH]?') converts to a
- <field name> in ('foo', 'bar', 'baz') converts to a BooleanQuery consisting of 3 BooleanClause objects. Each clause is a TermQuery with a BooleanClause.Occur of
TermQuery objects can be combined in the ‘in’ operator. For example:
city in ('New York', 'Boston', 'Los Angeles')
- between 'Foo' AND 'Bar' converts to a TermRangeQuery with both items being inclusive.
- between 25 and 40 converts to a NumericRangeQuery, again inclusive.
- The >, <, >= and <= operators are converted to either TermRangeQuery or NumericRangeQuery objects with one side of the range being unbounded. The < and > operators are exclusive. The <= and >= operators are inclusive.
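To make the range rules above concrete, here is a minimal sketch of how an operator and an int operand could be turned into range bounds. `RangeBounds`, `fromOperator` and `between` are hypothetical names, not the parser's real classes; a `null` bound stands for the unbounded side. The four values line up with the arguments that Lucene 4's `NumericRangeQuery.newIntRange` accepts, but the sketch itself uses plain Java only.

```java
public class RangeBounds {
    // null means "unbounded" on that side (an assumption of this sketch).
    public final Integer low;
    public final Integer high;
    public final boolean lowInclusive;
    public final boolean highInclusive;

    RangeBounds(Integer low, Integer high, boolean lowInclusive, boolean highInclusive) {
        this.low = low;
        this.high = high;
        this.lowInclusive = lowInclusive;
        this.highInclusive = highInclusive;
    }

    // Maps a comparison operator to a half-open range:
    // '<' and '>' are exclusive, '<=' and '>=' are inclusive.
    public static RangeBounds fromOperator(String op, int value) {
        switch (op) {
            case ">":  return new RangeBounds(value, null, false, false);
            case ">=": return new RangeBounds(value, null, true, false);
            case "<":  return new RangeBounds(null, value, false, false);
            case "<=": return new RangeBounds(null, value, false, true);
            default:   throw new IllegalArgumentException("Unsupported operator: " + op);
        }
    }

    // BETWEEN low AND high: both ends inclusive.
    public static RangeBounds between(int low, int high) {
        return new RangeBounds(low, high, true, true);
    }

    // Checks a candidate value against the bounds, for illustration only.
    public boolean matches(int candidate) {
        boolean aboveLow = low == null || (lowInclusive ? candidate >= low : candidate > low);
        boolean belowHigh = high == null || (highInclusive ? candidate <= high : candidate < high);
        return aboveLow && belowHigh;
    }

    public static void main(String[] args) {
        System.out.println(fromOperator(">", 25).matches(25)); // exclusive bound
        System.out.println(between(25, 40).matches(40));       // inclusive bound
    }
}
```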
The AND, OR and NOT operators are mapped in the following manner:
- AND converts to
- OR converts to
- NOT converts to
In two cases the query is converted to something different from the mappings shown previously. The first case is a query that contains a single predicate that must not match. For example:
A query submitted in this format will not work in Lucene. The fix for this query is simple. The parser takes the original BooleanQuery and adds an additional clause. The underlying query object in the new clause is a MatchAllDocsQuery. The MatchAllDocsQuery returns all documents in the index and the original predicate filters out the unwanted results. The second case is when searching for a numeric field in the TermQuery format. For example:
Normally a ‘field=value’ or ‘field != value’ predicate is converted to a TermQuery. But because of the way Lucene searches for values, it will not find a numeric field when the search value is treated as a string. In this case the parser constructs a NumericRangeQuery where the low and high values are equal and inclusive.
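The first special case above, a lone predicate that must not match, can be illustrated with a toy model. This is not Lucene code: `NegativeOnlyQueryDemo`, its `Occur` enum and its `Clause` class are hypothetical stand-ins written for this sketch. Documents are plain strings, and a prohibited clause only filters documents that a required clause has already selected, which mimics the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class NegativeOnlyQueryDemo {
    public enum Occur { MUST, MUST_NOT }

    public static final class Clause {
        final Occur occur;
        final Predicate<String> matcher;

        Clause(Occur occur, Predicate<String> matcher) {
            this.occur = occur;
            this.matcher = matcher;
        }
    }

    // Toy evaluation: without at least one required clause nothing is
    // selected, so prohibited clauses alone return no documents.
    public static List<String> search(List<String> docs, List<Clause> clauses) {
        List<String> hits = new ArrayList<>();
        boolean hasRequired = false;
        for (Clause c : clauses) {
            if (c.occur == Occur.MUST) {
                hasRequired = true;
            }
        }
        if (!hasRequired) {
            return hits;
        }
        for (String doc : docs) {
            boolean ok = true;
            for (Clause c : clauses) {
                boolean matched = c.matcher.test(doc);
                if (c.occur == Occur.MUST && !matched) ok = false;
                if (c.occur == Occur.MUST_NOT && matched) ok = false;
            }
            if (ok) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Builds "value must not match", optionally adding a match-all clause,
    // which plays the role of the MatchAllDocsQuery in this model.
    public static List<Clause> notEquals(String value, boolean addMatchAll) {
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new Clause(Occur.MUST_NOT, value::equals));
        if (addMatchAll) {
            clauses.add(new Clause(Occur.MUST, doc -> true));
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("boston", "new york", "chicago");
        System.out.println(search(docs, notEquals("boston", false)));
        System.out.println(search(docs, notEquals("boston", true)));
    }
}
```

With the match-all clause in place, the prohibited clause has a full result set to subtract from, which is the whole point of the rewrite.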
There are several limitations at this point:
- Converting to a PhraseQuery does not allow for specifying any slop. For a match with a PhraseQuery, all terms must be located adjacent to each other.
- Numeric type queries only support the Int type for now.
- Range queries are inclusive when both a high and low value are specified.
- If no limit clause is specified in the query, a default limit of 10,000 records is used.
The second component of SQL for Lucene is the Searcher. The Searcher could be thought of as a convenience method for performing a Lucene search and extracting the results. The Searcher has one method, search, that takes a SQL query and returns a List<Map<String,Object>> containing the search results. Each map in the list represents a document, with the keys being the field names and the values being the values stored in the retrieved fields.
It’s worth noting the list and map returned from the searcher are of type ImmutableList and ImmutableMap respectively. The Searcher has 4 constructors:
If the searcher is instantiated with the no-arg constructor, then the path for the index will be extracted from the query and used to open the IndexSearcher. All subsequent queries can safely omit the ‘from’ clause. If the searcher is instantiated with any of the other 3 constructors, the ‘from’ clause will be ignored and can be omitted from the query.
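As a rough picture of what a caller receives from the search method described above, the sketch below builds a stand-in result of the same shape and reads it back. Everything here is an assumption made for illustration: `ResultShapeDemo` is a hypothetical class, the field names and values are made up, and the JDK's unmodifiable wrappers stand in for the ImmutableList and ImmutableMap types that the real Searcher returns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ResultShapeDemo {

    // Builds one read-only "document" inside a read-only list,
    // mirroring the List<Map<String,Object>> result shape.
    public static List<Map<String, Object>> sampleResults() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("city", "Boston"); // made-up field and value
        doc.put("id", 1);          // made-up field and value
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(Collections.unmodifiableMap(doc));
        return Collections.unmodifiableList(rows);
    }

    // Returns true when the list rejects modification.
    public static boolean isReadOnly(List<Map<String, Object>> rows) {
        try {
            rows.add(Collections.<String, Object>emptyMap());
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Each map is one document: keys are field names, values are stored values.
        for (Map<String, Object> row : sampleResults()) {
            for (Map.Entry<String, Object> field : row.entrySet()) {
                System.out.println(field.getKey() + " = " + field.getValue());
            }
        }
    }
}
```

Because the collections are read-only, callers that need to modify results should copy them into their own list or map first.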
Features To Be Added
- A JDBC Driver.
- Insert, Update and Delete support.
- Support all the numeric types supported by Lucene.
- Syntax to support
- Support for filters.
- Ability to specify slop for PhraseQuery.
That’s all for now. This was a lot of fun to write and the hope is someone will find this useful. In the next few posts we’ll go into detail on how Antlr4 was used to build the parser.