I’ve been writing parsers for a very long time, the book came out in 2010 and I was playing with Boo since 2005. The ANTLR book was an interesting read and it thought me a lot about how to approach text parsing.
However, you might have noticed that I shifted my thinking about a lot of design problems. In particular, performance, number of allocations, exceptions thrown during parsing, the readability of errors when getting invalid input, etc.
In particular, the Lucene query parser is a really good example of a really crappy one. It fail on pretty much all points above. I worked with a bunch of parser generators, and I never found one whose output was something that I could really like. They typically generate unreadable code and customization of their behavior are non obvious, to say the least.
Martin Fowler (only slightly out of context):
…it's hard to write a parser.
My most recent parser is the RavenDB JSON parser, you can see the progression of the ideas around that in this series of posts. That isn’t something that you’ll really want to read without a cup of coffee and some writing instruments to write notes.
Most non trivial parsers are composed of at least two pieces, a tokenizer and an builder. The tokenizer goes over the input, break it into tokens that the builder use to build the final format. The JSON scanner in RavenDB is called UnmanagedJsonParser and the builder is BlittableJsonDocumentBuilder. Traditionally they would be called scanner and parser, for the roles they play, but we’ll leave the names as is because it doesn’t really matter. This code is not fun to go through, it has been through multiple performance reviews, each time making it uglier then before, but much faster.
JSON is also one of the simplest possible textual formats. The formal definition of JSON fits a post-it note. The JSON scanner I have for RavenDB is close to 900 lines of code and is only part of the parsing process.
But the parser I built over the weekend wasn’t for JSON. Instead, I wanted to play with a query language, so I naturally wanted something SQL like. And that is anything but trivial to do.
The first thing I needed was to actually sit down and figure out what the language is going to look like. In order to do that, you almost always want to use a BNF notation of some kind. This allow you to specify what your language should look like, not just as a few snippets in a notepad window but in a more structure manner.
More to the point, there are a lot of tools out there to use. I decided to use GOLD Parser, it was last updated in in 2012 and it shows, but it had the lowest friction of all the parser IDEs that I tried and it has great support for debugging and working with grammars. Why not use ANTLR, which is pretty much the default choice? Put simply, it is usually too much of a hassle to setup ANTLR properly and I didn’t want to get bogged down with the actual details of generating the parsers, I wanted to focus on the grammar.
I actually don’t know how to parse text using the GOLD Parser. It looks like it generate a binary file that you feed to some library that would do it for you, but I’m not sure and it doesn’t matter. What I care about is that I can develop a formal definition and debug it easily. I’m not actually going to use it to generate the parser.
Huh?! Why do all this work for no reason?
A formal definition of the language is incredibly helpful when you consider a new syntax, because you can verify that you aren’t creating holes and ambiguities in your language. It also give you a pretty clear guideline on how to implement the language.
I’m going to go into more details about the language itself and building a parser for it in my next post.