Working with ANTLR: HQL Grammar
I am currently trying to build an ANTLR grammar for HQL. There is already an existing one for Hibernate 3, but that one is based on ANLTR 2.x and supports quite a bit more than NHibernate does at the moment (DML statements, for instance). After several failed attempts to port the grammar to ANTLR 3 and generate C# code out of it, I gave up and started building my own.
I have read the ANTLR book not that long ago, so I ought to have a known what was in store for me. I didn't. I found out that this require a totally different mode of thinking. My recursion muscle is very tired at the moment, but I managed to create a simple grammar for:
select x.y, z.b from Entity as e join e.Children as c
I kept having false starts with the thing, until I went and read Boo's source, and figure out how Boo's parser works. Basically, instead of letting the tool to generate the parser and work of the generated tree, this approach calls for constructing our own tree while we parse the source.
As you can see, we constantly pass the Query instance down to lower rules, so we can operate on it and build our tree. This is much easier than trying to handle the CommonTreeAdaptor [sic] and derivatives.
I want to make the parser and the resulting AST as smart as possible, before trying to plug it into NHibernate's itself. That is going to be a significant undertaking, and I would like to have help, so feel free to contribute.
Now that I have the initial stuff going, I am going to refactor it a bit to match this BNF (http://www.hibernate.org/89.html), and yes, I know it is outdated.
There are about a dozen tests for the syntax yet, so it is possible to just grab it and start working on it.
You can grab the source from: https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/experiments/Hql
Comments
All LR(n) parsers are the same: they are basicly an engine which performs actions on the input stream of tokens by using a shift/reduce table and an action table.
All you do is create these 2 tables and that's most of the work. That's basicly what antlr does. You then simply have to write handler routines for each EBNF non-terminal reduction and you're done. This is straight forward.
These handlers are the places where you place your emit calls, e.g. emit SQL, build object trees for the sql engines etc. etc.
The biggest mistake made with LR(n) parsers is that people try to squeeze the handlers and the parser engine into 1 big pile of code. Don't do that. In the end, your head will hurt and the code will be unmaintainable, because what if a non-terminal changes?
The one book you should read is the one from Aho-Ullman, (the 'dragon' book). It contains the algo for an LR(n) parser and it's straight forward to implement.
Another mistake often made is that people try to combine lexical analyzer (the tokenizer) with the token parser engine.
If you want my implementation in C# of an LR(n) parser engine (commented!), shift/reduce table generator and lexical analyzer, feel free to drop me a line.
Ayende, if I can offer any advice it would be: try come up with a great error reporting infrastructure before getting the grammar right or completed. I think several projects made the mistake of not focusing much attention on it (including boo). Than it's a pain to fix later.
Comment preview