Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969


Rant: Compilers != Parsers

time to read 1 min | 121 words

This is just something that really annoys me. There is a common misconception that compilers are mostly about parsing the code. This couldn't be further from the truth. Parsing the text into some machine-readable format is about the easiest step in building a compiler; there are great tools to help you there, and a lot of information.

It is the next stage that is complex: taking the machine-readable format (the AST) and turning it into an executable. There is a lot less information about that, and it tends to be a pretty gnarly sort of problem.

As the title says, this post is here just to make sure that people distinguish between the two.
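The two stages can be sketched concretely. This is an illustration in Python rather than C# (the `emit` function and its stack-machine opcodes are invented for the example): parsing is a one-line call to an off-the-shelf tool, while even a toy code generator needs a case per node type and covers only a sliver of the language.

```python
import ast

# Stage 1: parsing -- trivial with an existing tool.
tree = ast.parse("x = 1 + 2 * 3")

# Stage 2: code generation -- even this toy emitter for an
# imaginary stack machine needs a case per node type, and it
# handles only a tiny slice of the language.
def emit(node):
    if isinstance(node, ast.Module):
        return [op for stmt in node.body for op in emit(stmt)]
    if isinstance(node, ast.Assign):
        return emit(node.value) + [("STORE", node.targets[0].id)]
    if isinstance(node, ast.BinOp):
        op = {ast.Add: "ADD", ast.Mult: "MUL"}[type(node.op)]
        return emit(node.left) + emit(node.right) + [(op,)]
    if isinstance(node, ast.Constant):
        return [("PUSH", node.value)]
    raise NotImplementedError(type(node).__name__)

print(emit(tree))
# [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), ('MUL',), ('ADD',), ('STORE', 'x')]
```

Note that operator precedence (the `2 * 3` grouping) was already handled by the parser for free; everything after that point is the compiler writer's problem.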

More posts in "Rant" series:

  1. (20 Jul 2012) SignalR, Crazyiness, Head Butting & Wall Crashing
  2. (08 Sep 2011) Who thought THIS was a good idea?!
  3. (19 Oct 2009) Compilers != Parsers
  4. (05 Dec 2006) CSC Is Stupid



I guess, it depends on many things, such as how unruly your grammar is and how deep (wide?) is the semantic gap between the input and the output.

E.g. if you are writing a syntax highlighter for C++ (which spits out HTML), I guess it would be mostly about the parser. If you are writing a highly optimizing Pascal compiler (for lack of better example), then it probably wouldn't.


"E.g. if you are writing a syntax highlighter for C++ (which spits out HTML), I guess it would be mostly about the parser"

So if it's not a compiler, then it's not a compiler?

Jeff LeBert

You are so right. It's even worse than that. A parser will produce an AST, but without any type/metadata information in it. For example, this line of code from a complete .cs file:

foo = 5;

will produce an AST of an assignment expression with a target of an Identifier and a constant. The Identifier may be a local variable, parameter, field, event, or property. The parser can't narrow it down completely because it would have to have all the other source files and go through all the referenced assemblies to get that information. It also can't tell in this situation whether foo is instance or static. All that, of course, is the compiler's job.
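The same point shows up in any parser-only toolchain. A quick analogue with Python's `ast` module (an illustration, not the C# tooling being discussed): the parser records only the identifier's name, nothing about what kind of symbol it is or its type.

```python
import ast

tree = ast.parse("foo = 5")
target = tree.body[0].targets[0]

# The parser only knows the identifier's name. Whether `foo`
# is a local, a parameter, a field, or a property -- and what
# its type is -- is left entirely to later semantic analysis.
print(type(target).__name__, target.id)  # Name foo
```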

I've been looking for something that is between a parser and a compiler. I want the AST where everything is resolved. I had hoped the Mono C# compiler would give me this without too many code changes, but I gave up on that. I am now looking forward to the "compiler as a service" we've heard a little bit about in C# 5.0.

I've written a number of FxCop rules and felt an AST would make a number of them easier to create. FxCop only uses the generated IL. You can see some of the FxCop rules I've written at http://jslfxcop.codeplex.com. I've written a bunch more, but they are specific to the application I'm working on.

I've also had projects where I wanted to migrate source code but the simple AST didn't give me enough information. For example the Click event handler in WinForms doesn't have the same parameter types as the Click event handler in WPF.

Sorry for the rant. I would just really like any professional language to make available ALL the parts of the compiler: code to AST, AST + resolution of Types/methods/etc, AST to assembly.

Ayende Rahien


Take a look at NRefactory; it does a GREAT job of giving you a valid AST, and it handles things like local vs. field vs. argument.

Jeff LeBert

I have been using NRefactory but haven't seen it do anything beyond parsing. In the example:

private void Method()
{
    foo = 5;
}

The foo would be represented in the AST as an IdentifierExpression whether it is a method parameter, local variable, field, event, or property. What I want is for foo to be represented by, for example, something like a VariableIdentifier, and to know that the type of foo is int. All IdentifierExpression tells you is that the identifier's name is "foo".

I did look at the code completion that is also part of SharpDevelop but it was much too hairy for me to make sense out of it. Maybe that is what you are talking about?

I am currently using NRefactory to parse the code, modify it and then spit it back out through "pretty printing". To get the method information, etc., I'm using source code line information from the .pdb and pulling MethodInfo, etc. from the IL. This could lead to ambiguous names in a single "line" of code but I haven't seen that yet.

Ayende Rahien


Then what you want isn't the parser, you want the output from somewhere in the middle of the compiler pipeline.

For that, I suggest taking a look at the mono C# compiler. That will give you that information

Jeff LeBert

Yep. I've looked at that a couple of times. I believe the Mono compiler ditches most of the AST as it moves through the different phases of compiling. I don't think there is a place to stop where I have a usable AST and all the type information.

My hopes are on C# 5.0 Compiler as a Service. I wish they would say more about that.

I also looked at Phoenix project from Microsoft Research but it is hugely overcomplicated. It looks like FxCop for VS 2010 is migrating towards using it. I'm just installing VS 2010 Beta 2 so will know more soon.

My reason for commenting in the first place was to agree that parsers are not compilers. I think we've made that point. :-)


Roger Alsing

I don't quite agree.

Emitting executable code from an AST is pretty simple.

Optimizations can be hard though.

IMO, the hardest part of it all is to create a valid grammar for your language.

Once you have the grammar, parsing and emitting code is fairly easy.

Eugene Burmako

@Roger. You overlook such stuff as type inference and various sorts of binding. If we talk about C#, the latter would include member lookup, generic argument inference, and overload resolution. I once wrote a subset of a C# compiler, and that was the most difficult part of the whole pipeline.

Frans Bouma

I agree with Roger, your post doesn't make sense. The hardest part is either to create an (LA)LR(n) valid grammar so an LR(n) grammar parser / lexer combo can understand all input (as parsers are universal for LR(n)) or an LL(n) variant grammar.

Both have pretty straightforward parse systems defined. For example LR(n) parsers reduce to non-terminal rules, and you know exactly what rule is reduced and which terminals/nonterminals the rule is made of, so all you have to do is interpret the rule handler input.

LL(n) grammar parsers are also straightforward, as you move along the non-terminal rule's terminal/nonterminal input in a handler to recognize a rule and act accordingly when you have understood it completely.

People often focus on AST's as the output of a parser, but frankly, parsers don't output ASTs you need to handle with visitors. Parsers do that for you, and they notify you along the way.

@Eugene: these kinds of problems were solved decades ago already. Have a look at symbol tables and multi-pass parsing.

Seriously, the hardest part is getting rid of ambiguity in the grammar so that automated parser tools can be used (so shift-reduce/reduce-reduce conflicts are not present in LR(n), and long lookaheads are not needed for LL(n) grammars).

Have a look at the Dragon book (Aho/Sethi/Ullman: 'Compilers'); it contains the complete algorithm for an LR(n) parser. Implementing it is fairly easy, and you then just have to fill in the rule handlers for every non-terminal, which is really a simple job.

@Jeff, you're not using the right approach. Parsing is actually one of the most researched topics in computer science, and many tools have been created to make things very easy, next to the mechanics to approach these problems. If you want, take a look at the 'Gold parser'. It's an LR(n) parser generator with a fairly easy-to-use editor, and it generates C# as well. With it, you can just write code that is called when a non-terminal is reduced; really easy to do.
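The symbol-table idea mentioned above can be sketched in a few lines. This is a minimal Python mock-up (the `Scope` class and the `kind`/`type` records are invented for the example, not from any real compiler): a chain of nested scopes that a declaration pass fills in and a later resolution pass walks outward through.

```python
class Scope:
    """Minimal nested symbol table, as filled by a declaration pass
    and consulted by a later resolution pass."""

    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def declare(self, name, info):
        self.symbols[name] = info

    def resolve(self, name):
        # Walk outward through enclosing scopes until the name is found.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise NameError(f"unresolved identifier: {name}")

# Pass 1 declares symbols; pass 2 resolves identifiers against them.
globals_ = Scope()
globals_.declare("foo", {"kind": "field", "type": "int"})
method = Scope(parent=globals_)
method.declare("bar", {"kind": "local", "type": "string"})

print(method.resolve("foo"))  # found in the enclosing scope
```

With such a table in hand, the ambiguity Jeff describes (is `foo` a local, a field, or a property?) becomes a lookup rather than something the parser has to guess.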

Ayende Rahien

Roger & Frans,

Emitting the code is actually pretty easy, if a bit obscure.

The usual problem is what Jeff is talking about, extracting enough semantic information from the AST to do something about it.

In comparison to the parser work, that is nothing.


Here is a simple test to show this.

Would you rather get a LINQ query as text and parse it yourself, or would you rather get the processed expression tree?

Roger Alsing

"The usual problem is what Jeff is talking about, extracting enough semantic information from the AST to do something about it."

Just run a couple of different visitors over your AST to add the relevant meta data to it.


pass1, find all declarations and add type and scope to a lookup

pass2, ensure that all identifiers are present in the lookup.


passX, resolve and reduce constant expressions.

Not that hard, but that doesn't mean there isn't a lot of work involved.

Tedious != Hard.

(There are of course tricky problems, but in general not as hard as some think it is.)
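The pass structure described above can be mocked up with Python's `ast` visitors (an analogy in Python, not C# tooling; the two classes are invented for the example): one visitor collects declarations, and a transformer reduces constant expressions.

```python
import ast

class DeclarationCollector(ast.NodeVisitor):
    """Pass 1: record every assigned name (a stand-in for
    collecting declarations into a lookup)."""
    def __init__(self):
        self.declared = set()
    def visit_Assign(self, node):
        for t in node.targets:
            if isinstance(t, ast.Name):
                self.declared.add(t.id)
        self.generic_visit(node)

class ConstantFolder(ast.NodeTransformer):
    """Pass X: reduce constant expressions, e.g. 2 * 3 to 6."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            if isinstance(node.op, ast.Mult):
                folded = ast.Constant(node.left.value * node.right.value)
                return ast.copy_location(folded, node)
        return node

tree = ast.parse("x = 2 * 3")
decls = DeclarationCollector()
decls.visit(tree)
folded = ConstantFolder().visit(tree)
print(decls.declared, ast.unparse(folded))  # {'x'} x = 6
```

Each pass is individually simple, which is Roger's point; the disagreement in the thread is over how many such passes a real language needs and how they interact.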

Ayende Rahien


Please try it on anything that is non-trivial; you'll find that it is decidedly not just tedious.

I wrote a few languages; the parser was the easiest piece of them all.

Roger Alsing

I wrote a few languages...

The parser was the easiest piece of them all

I'm not talking about the parser, I'm talking about the grammar.

So many times the grammar tool has told me that my grammar is ambiguous in some mysterious way that I'm unable to imagine.

And trying to correct it only results in more ambiguity.

Frustrating to the point where I want to give up and go into nerd rage.

Once you are past the grammar, it's only code ;-)

Ayende Rahien


Sorry, I wasn't precise.

From my point of view, parser == grammar.


Ayende, I'm not sure I get what you are trying to say. First you say that code generation is the easy part. Then you say that writing a parser was the easiest piece. So are you trying to convince us that compilers are easy, if not trivial?

Eugene Burmako

@Frans: Well, I'm aware of that theory after all we had it back then in the university.

In my project I used ANTLR, so circular dependencies were a pain. I did have a fight with the grammar, and yeah, at times it was SO frustrating and I was progressing soooo slowly that I wanted to cry. However, I managed to tame ANTLR within several days.

Though all the remaining things (specifying type- and binding-related rules and implementing the spec) took considerably more time and mental effort.

Maybe it's the C#-like syntax that made writing the parser relatively simple, but that's what I experienced.

Chris Wright

I'm working on a compiler that has such things as compile-time function execution, templates, type inference, and mixins. Parsing's straightforward, and it takes up 1/4th the amount of code that semantic analysis currently does.

Codegen is something that I find difficult. That's because it's verbose, strict, and uses an API that I don't know very well (LLVM's C API, with some wrappers for some things) and that I can't easily extract information from for debugging (like the size of each instance of a type).
