Building a query parser over a weekend: Part II

time to read 3 min | 576 words

In the previous post I talked about what I wanted to get, and how I decided to use the GOLD parser to formally define the language grammar. However, I had no intention of actually generating the parser; I wanted to write it myself. This is because I have had a lot of bad experiences with the output of parser generators. They generate horrible code that manages to be both unreadable and to force you to dig into it frequently enough to be maddening.

In particular, I haven’t been able to find anything that is as good as a hand-rolled parser in terms of performance, readability of the code and the quality of the errors it generates. The latter might sound strange, but a key part of the usefulness of any language is the kind of errors you get when you feed it invalid input.

An important consideration is that I’m intending to accept input over HTTP, so I can assume that my input already arrives wrapped in a pretty System.String package, which saves a lot of the complications you usually have to deal with when your input is a streaming medium. Initially I tried to put the scanner and the parser in the same place, but that ended up being a Bad Idea; it was too complex to handle. Instead, I split the responsibility into separate scanner and parser classes.

The scanner is responsible for scanning the string and finding the next token. But I decided to make it reactive: you tell it what you expect, and it will check whether the next bit of input matches. Scanners typically scan the input and produce a stream of tokens. That works, but it means you need a lot more state in the scanner, and it can be more complex. I decided to simplify as much as I possibly could.
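To make that a bit more concrete, here is a minimal sketch of what such a pull-based scanner could look like. The class and member names (QueryScanner, TryScan and so on) are illustrative only, not the real thing:

using System;

public class QueryScanner
{
    private readonly string _q; // the full query text, already a System.String
    private int _pos;           // current position inside the query

    public int TokenStart;      // start index of the last token that matched
    public int TokenLength;     // length of the last token that matched

    public QueryScanner(string q)
    {
        _q = q;
    }

    // The parser asks: "is the next token this keyword?" If it is, consume it.
    public bool TryScan(string keyword)
    {
        SkipWhitespace();
        if (_pos + keyword.Length > _q.Length)
            return false;
        if (string.Compare(_q, _pos, keyword, 0, keyword.Length,
                StringComparison.OrdinalIgnoreCase) != 0)
            return false;
        // Make sure we matched a whole word, not a prefix of an identifier.
        if (_pos + keyword.Length < _q.Length &&
            (char.IsLetterOrDigit(_q[_pos + keyword.Length]) || _q[_pos + keyword.Length] == '_'))
            return false;

        TokenStart = _pos;
        TokenLength = keyword.Length;
        _pos += keyword.Length;
        return true;
    }

    private void SkipWhitespace()
    {
        while (_pos < _q.Length && char.IsWhiteSpace(_q[_pos]))
            _pos++;
    }
}

The important part is that nothing is consumed unless the caller’s expectation matched, so the parser is free to probe for several alternatives in turn.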

Here is how we find identifiers in the scanner:
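Roughly, and continuing the illustrative QueryScanner above, it could look something like this (the exact identifier rules in the real scanner differ; the string-literal counterpart that the grammar below also needs is sketched as well):

// Part of the illustrative QueryScanner: try to read an identifier.
public bool TryIdentifier()
{
    SkipWhitespace();
    if (_pos >= _q.Length)
        return false;

    // An identifier starts with a letter or an underscore...
    if (char.IsLetter(_q[_pos]) == false && _q[_pos] != '_')
        return false;

    var start = _pos;
    // ...and continues with letters, digits or underscores.
    while (_pos < _q.Length &&
           (char.IsLetterOrDigit(_q[_pos]) || _q[_pos] == '_'))
        _pos++;

    TokenStart = start;
    TokenLength = _pos - start;
    return true;
}

// Companion method: try to read a quoted string literal such as "Users/ByActiveMarker".
public bool TryStringLiteral()
{
    SkipWhitespace();
    if (_pos >= _q.Length || (_q[_pos] != '"' && _q[_pos] != '\''))
        return false;

    var quote = _q[_pos];
    var start = _pos + 1;               // skip the opening quote
    var end = _q.IndexOf(quote, start); // escaping is ignored in this sketch
    if (end == -1)
        return false;

    TokenStart = start;
    TokenLength = end - start;
    _pos = end + 1;                     // move past the closing quote
    return true;
}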

You might notice that the code is pretty simple: it runs over the input string (_q) and checks whether the next token is an identifier based on our rules. That makes the parser code easier to handle. Let us consider how we can parse a FROM clause. The formal definition is:

<From> ::= 'FROM INDEX' <From Source> | 'FROM' <From Source>

<From Source> ::= <Simple Field> 'AS' <Simple Field> | <Simple Field>

<Simple Field> ::= Identifier | StringLiteral

If you are familiar with BNF, this is quite readable. Now, how do we parse this? Here is the actual parser code:
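Something along these lines would do it. The sketch below only approximates the general shape, building on the illustrative scanner from before; FieldToken, FromClause and QueryParser are made-up names standing in for the real ones:

// Illustrative result types: fields are (start, length) pairs into the query string.
public struct FieldToken
{
    public int TokenStart;
    public int TokenLength;
    public bool IsQuoted;
}

public class FromClause
{
    public FieldToken From;
    public FieldToken? Alias;
    public bool Index;
}

public class QueryParser
{
    private readonly QueryScanner _scanner;

    public QueryParser(string q)
    {
        _scanner = new QueryScanner(q);
    }

    // <From> ::= 'FROM INDEX' <From Source> | 'FROM' <From Source>
    public FromClause ParseFrom()
    {
        if (_scanner.TryScan("FROM") == false)
            throw new InvalidOperationException("Expected query to start with FROM");

        var index = _scanner.TryScan("INDEX");
        var from = ReadSimpleField();

        // <From Source> ::= <Simple Field> 'AS' <Simple Field> | <Simple Field>
        FieldToken? alias = null;
        if (_scanner.TryScan("AS"))
            alias = ReadSimpleField();

        return new FromClause { From = from, Alias = alias, Index = index };
    }

    // <Simple Field> ::= Identifier | StringLiteral
    private FieldToken ReadSimpleField()
    {
        if (_scanner.TryIdentifier())
            return new FieldToken { TokenStart = _scanner.TokenStart, TokenLength = _scanner.TokenLength };

        if (_scanner.TryStringLiteral())
            return new FieldToken { TokenStart = _scanner.TokenStart, TokenLength = _scanner.TokenLength, IsQuoted = true };

        throw new InvalidOperationException("Expected an identifier or a string literal");
    }
}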

As you can see, we start by asking for a FROM token (the scanner is case insensitive, so we don’t need to worry about casing), then check whether it is followed by the INDEX term, then get the actual identifier or string literal to use, and finally a possible alias.

This code can parse the following:

  • from Users
  • from Users as user
  • from index "Users/ByActiveMarker" AS u

You can note that I’m taking full advantage of the possibility of asking the input several questions, because it makes my code simpler overall. I’m also not doing any substring operations. Instead, I’m passing indexes into the overall query string, which allows me to get at the information without paying the price of allocating all those strings.
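In the sketch above, that is what the FieldToken struct stands in for: it carries a start index and a length rather than a string, and the text is only sliced out when somebody actually needs it. A hypothetical usage would be:

var query = "from index \"Users/ByActiveMarker\" as u";
var parser = new QueryParser(query);
var fromClause = parser.ParseFrom();

// The result holds only offsets into 'query'; no per-token strings were allocated.
// We pay for a substring only when the text is actually needed (errors, query plan, etc.):
Console.WriteLine(query.Substring(fromClause.From.TokenStart, fromClause.From.TokenLength));
// prints: Users/ByActiveMarker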

In this manner we also get something that is pretty easy to work with, and we can compare it to the formal definition to guide the parsing. At the same time, we get code that is readable and has quite good performance.

More posts in "Building a query parser over a weekend" series:

  1. (25 Jul 2017) Part II
  2. (24 Jul 2017) Part I