Implementing generic natural language DSL

I said that I would post about it, so here is the high-level design for a generic implementation of natural-language-looking parsing. Let us explore the problem scenario first. We want to be able to build this language without having to build a full-blown language from scratch:

open http://www.ayende.com/
click on link to Blog
click on link to first post
enter comment with name Ayende Rahien and email foo@example.org and url http://www.ayende.com/Blog/
enter comment text This is an awesome post.
click on submit
comment with This is an awesome post should appear on page

And to prove that we are not focusing on a single language, let us try this one as well:

when account balance is 500$ and withdrawal is made of 400$ we should get a low funds alert
when account balance is 500$ and withdrawal is made of 501$ we should deny the transaction
when weekly international charge is at 3,500$ and max weekly international charge is of 5,000$ and new charge arrives for amount 2,230$ we should deny the transaction

I think that those are divergent enough to show that the solution is a generic one.

And now, to the solution. Each type of language is going to have its own DSL engine, which knows how to deal with the particular dialect that we are using. The default parsing is a three-step solution. First, split the text into sentences. Then, split each sentence into tokens by whitespace. Finally, for each statement, search for the appropriate statement resolver, which is a class that knows how to deal with it. The statement resolver's methods are then called to process the statement.
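To make the first two steps concrete, here is a minimal C# sketch. It is not the code from the repository: `StatementSplitter` is a hypothetical name, and the sketch assumes that each line of the script is one sentence, which holds for both sample scripts above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the actual engine.
public static class StatementSplitter
{
    // Assumption: one sentence per line. Blank lines are skipped.
    public static List<string[]> Split(string script)
    {
        return script
            // Step 1: split the text into sentences (here: lines).
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            // Step 2: split each sentence into tokens on whitespace.
            // A null separator array tells Split to use whitespace.
            .Select(line => line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            // Drop lines that contained only whitespace.
            .Where(tokens => tokens.Length > 0)
            .ToList();
    }
}
```

Calling `StatementSplitter.Split("open http://www.ayende.com/\nclick on link to Blog")` gives two statements, the second one holding the five tokens `click`, `on`, `link`, `to` and `Blog`. Finding the resolver for those tokens is the third step and is kept separate on purpose.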

There are two key principles to the design. The first is turning something like 'click on link' into an invocation of the ClickOnLink statement resolver. The second is lazy parameter evaluation.
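One way to get from 'click on link' to ClickOnLink is to join the leading words in Pascal case and look the result up among the known resolvers, trying the longest prefix first. The sketch below assumes that approach; `ResolverLookup`, its methods and the dictionary of resolvers are illustrative names, not the ones used in the actual engine. Whatever tokens are left over stay as raw strings, so they can be converted only when a resolver method actually asks for a typed value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: maps the leading words of a statement to a resolver type.
public static class ResolverLookup
{
    // "click", "on", "link" -> "ClickOnLink"
    public static string ToPascalCase(IEnumerable<string> words)
    {
        return string.Concat(words
            .Where(w => w.Length > 0)
            .Select(w => char.ToUpperInvariant(w[0]) + w.Substring(1).ToLowerInvariant()));
    }

    // Tries the longest prefix of the tokens first, then shorter ones.
    // On success, 'rest' holds the unparsed tokens that follow the prefix.
    public static bool TryResolve(
        string[] tokens,
        IDictionary<string, Type> resolvers,
        out Type resolver,
        out string[] rest)
    {
        for (int length = tokens.Length; length > 0; length--)
        {
            string candidate = ToPascalCase(tokens.Take(length));
            if (resolvers.TryGetValue(candidate, out resolver))
            {
                rest = tokens.Skip(length).ToArray();
                return true;
            }
        }

        resolver = null;
        rest = tokens;
        return false;
    }
}
```

With a dictionary that contains a `ClickOnLink` entry, the tokens of `click on link to Blog` resolve to that entry and leave `to` and `Blog` as the remaining raw tokens.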

This is going to be interesting. The time right now is 19:38, and I am going to start implementing this.

It is now 22:04, and I finished the first language.

Working on the second now. It is 22:10 and I am done with the second one.

What did I do?

I took the text we had and turned it into executable commands. Now, this isn't really flexible. If you modify the way a statement is structured, it will fail, which comes back to why natural language is a bad choice here. Within those limits, though, it has quite a bit of flexibility in it.

You can get the code for this, including tests, here: https://rhino-tools.svn.sourceforge.net/svnroot/rhino-tools/experiments/natrual-language

But let us talk for a bit about how this is implemented. I'll show the bank example, because it is easier.

We start by defining the BankParser, which looks like this:

The bank parser merely defines what the statement resolvers are, and any special parsers that are needed (in this case, we need to handle dollar values).
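As an illustration of what a special parser for dollar values might involve, here is a small C# sketch. It is not the parser from the repository: `DollarArgumentParser` and its two methods are hypothetical names, and wrapping the result in `Lazy<decimal>` is just one way to express the lazy parameter evaluation mentioned earlier.

```csharp
using System;
using System.Globalization;

// Hypothetical argument parser for tokens such as "500$" or "3,500$".
public static class DollarArgumentParser
{
    // A token is treated as a dollar value when it ends with '$'.
    public static bool CanParse(string token)
    {
        return token.Length > 1 && token.EndsWith("$", StringComparison.Ordinal);
    }

    // The conversion is deferred until someone reads .Value,
    // so tokens that are never used as amounts are never parsed.
    public static Lazy<decimal> Parse(string token)
    {
        return new Lazy<decimal>(() => decimal.Parse(
            token.TrimEnd('$'),
            NumberStyles.AllowThousands | NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture));
    }
}
```

Reading `DollarArgumentParser.Parse("3,500$").Value` produces the decimal 3500, while a token such as `alert` is rejected by `CanParse` and left to other parsers.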

A statement parser is trivial:

And yes, those are pure POCO classes.

The whole idea here was that I can implement some smarts in the default engine about how it recognizes methods and resolves parameters. I will admit that overloading caused some issues, but I think that this is a pretty simple implementation.
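The convention-based invocation can be sketched with plain reflection. This is an assumption about how such an engine might work, not the engine from the repository: `Withdrawal` is a made-up POCO resolver, and `ConventionInvoker` matches a method by name and argument count, then converts the raw tokens to the parameter types. Matching on name and count alone is also where overloads become ambiguous, which the sketch reports instead of guessing.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Reflection;

// Hypothetical POCO resolver: no base class, no attributes, no parsing code.
public class Withdrawal
{
    public decimal Balance { get; private set; }

    public Withdrawal(decimal openingBalance)
    {
        Balance = openingBalance;
    }

    public void Of(decimal amount)
    {
        Balance -= amount;
    }
}

// Hypothetical engine piece: finds a method by convention and calls it.
public static class ConventionInvoker
{
    public static void Invoke(object resolver, string methodName, string[] rawArguments)
    {
        // Match by name (ignoring case) and by number of parameters.
        MethodInfo[] candidates = resolver.GetType()
            .GetMethods(BindingFlags.Public | BindingFlags.Instance)
            .Where(m => string.Equals(m.Name, methodName, StringComparison.OrdinalIgnoreCase)
                        && m.GetParameters().Length == rawArguments.Length)
            .ToArray();

        // Zero matches means an unknown statement; more than one means
        // overloads that this simple convention cannot tell apart.
        if (candidates.Length != 1)
            throw new InvalidOperationException(
                "Expected exactly one method named '" + methodName + "', found " + candidates.Length);

        // Convert each raw token to the declared parameter type.
        object[] converted = candidates[0].GetParameters()
            .Select((p, i) => Convert.ChangeType(rawArguments[i], p.ParameterType, CultureInfo.InvariantCulture))
            .ToArray();

        candidates[0].Invoke(resolver, converted);
    }
}
```

Given `new Withdrawal(500m)`, calling `ConventionInvoker.Invoke(withdrawal, "of", new[] { "400" })` runs `Of(400m)` and leaves a balance of 100. The resolver class itself never sees a string or a token.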

It also does a good job in demonstrating the problems in such a language. Go ahead and try to build operator precedence into it. Or implement an if statement. You really can't, not without introducing a lot more structure into it. And that would turn it into yet another programming language.

What about the tooling? Intellisense and syntax highlighting?

Well, since we have the structure of the code, and we know the conventions, you shouldn't have a problem taking my previous posts about this and translating them directly into supporting this.

And yes, I can create a language with this in a few minutes, as BankParser has proven.

Tags:

Domain Specific Languages

Comments

08 Sep 2008
22:18 PM

josh

between your book and your blog, I feel like I'm reading two books. I think I'm starting to understand DSL a little better now. Is the point of this experiment to determine/demonstrate that it's possible to create a DSL like this using Boo & Rhino-Dsl without too much trouble?

08 Sep 2008
22:21 PM

Ayende Rahien

You cannot create a DSL like this in Boo. Not one that uses natural language.

You can (and should) create a DSL which is just as readable, has more structure and is far easier to work with

08 Sep 2008
22:28 PM

josh

I'll have to look at the code. I was assuming that the POCO code was just a facade.

08 Sep 2008
23:12 PM

Ayende Rahien

Nope, the POCO code is all that there is there.

The magic is in the parser.

08 Sep 2008
23:51 PM

Kyle

Hm. I wonder if there is a relationship between making a good fluent interface and making a good domain-specific language. It seems like there should be, but it could be me artificially trying to add that in.

Sorry, I'm going to ramble for a second here.

Does it all come down simply to syntax: A fluent API (FLAPI, as I believe Chad Myers likes to call it) and a DSL should have easily understood commands/methods of achieving a specific programming task? Or is it more?

Would it be smart to have a DSL that maps directly to a FLAPI? This could result in an easier time debugging, but it could also result in the more arduous task of writing extra interfaces which may not be necessary, and of writing two languages: The fluent interface language and the DSL itself.

Maybe I've just lost my mind, though, and maybe there's no real connection at all!

09 Sep 2008
00:19 AM

Dave Foley

A couple months back I built a P.O.C. of a natural language -> TestFixture generation engine that is somewhat similar to this here:

http://code.google.com/p/storevil/

(Currently specific to NUnit, but it would be pretty easy to change the syntax of the generated output. Since writing it I've looked into mbUnit, which would probably be easier to extend.)

Various people had some pretty valid criticisms of the approach, but I think that, given an editor with intellisense or even just easy validation based on the syntax exposed by the context classes, it could be used by non-devs to specify behavior in a language that is more readable than FIT.

09 Sep 2008
01:33 AM

All I know is I have a hard time pronouncing the parentheses in English. I think eventually you will evolve into an object oriented syntax, i.e. C#, such as, let's say I have a bank account and I can withdraw and transfer. Eventually there is no worth to the DSL except to a programmer because someone has to structure that and you have to provide a UI and the business user wants something simple to deal with. So they need an expert to understand what they really mean and see future problems and limitations.

Fluent interface = good, the programmer can understand and get up to speed quickly, DSL that handles everything = terminator, not currently possible.

09 Sep 2008
05:50 AM

Ayende Rahien

Kyle,

That depends on the way you structure things.

A fluent interface doesn't have to read like an English statement to be readable. In fact, I find it cumbersome when they do.

A lot of the thought that goes into the design of the interface is the same, however.

09 Sep 2008
16:14 PM

Kyle

Oren,

Yeah, you're right. I decided to write my own little bank/ATM FLAPI last night after writing that post. Not only is the FLAPI a little TOO fluent, i.e. it's just too cluttered, but also it's quite cumbersome. I think I added too many words.

I was talking with my wife, who is also a bit of a programmer (more of a mathematician though), about different ways to determine the number of possible sentences and such available to a language. It's interesting stuff, for me anyway.

I just looked at your parser definition again and I noticed something that seems to me a bit odd. You have a CreateArgumentParser method which presumably returns an argument parser for a given command (whatever it was that the end-user-programmer wanted to call). Can you explain why that method is in the bank parser class, and not (say) on the command object that represents the command that the user wanted to use? I've always done it the latter way, and that allowed me to make the language extensible very easily, because when a person wanted to make a new function for the language, they could define it and the method to parse the arguments at the same time, and just plug it directly in without having to change anything else. My way could be more cluttered though, too, so I'm just interested to hear your thoughts.

09 Sep 2008
16:17 PM

Ayende Rahien

Because I don't want the dev to think about parsing.

The BankParser's CreateArgumentParser responsibility is to add any additional argument parsers.

In the actual statement resolver class, there is no parsing at all. It is a POCO.

09 Sep 2008
16:28 PM

Kyle

Hmm, interesting. I think we've approached the same problem fairly differently. What I meant was that the dev would have to think about parsing if and only if they wanted to actually add a new command to the language. Normally, they pay no attention to the parser standing behind the curtain.

In other words, when using the language directly, they do not see the parsing. When adding their own commands to the language, they will need to know about it, because there's no way for me as the language writer to guess what the argument format will be for the function they want to add.

Does that make any sense?

09 Sep 2008
16:34 PM

Ayende Rahien

But from a dev perspective, they are always going to do something with the language.

Notice how the code is structured: you only deal with creating the language.

10 Sep 2008
17:14 PM

Eber Irigoyen

I just can't believe you're wasting your time trying to process natural language as a DSL

10 Sep 2008
17:24 PM

Ayende Rahien

See previous post

11 Sep 2008
05:53 AM

Ivo Ramírez

Hi,

I'm very impressed by this post, because I'm working on something similar.

Basically, I have a fluent interface api decorated with some attributes indicating the text to be interpreted. For example:

[DslDeclaration( "between ${begin} and ${end}" )]
public MyClass Between( int begin, int end ){
    ...
}

will work for "between 3 and 4"

The parser I'm working on infers the types in the text, so you can use something like "property Name is described as Contact Name".

Also, this works with generics in this way:

[DslDeclaration( "is ${T}")]
public T Is<T>() where T : Constraint {
    ...
}

property Name is required

when class Required inherits from Constraint. (In the previous example, Described is a class that also inherits from Constraint and has a method "As" that receives a string "description" as its argument. This method is decorated with [DslDeclaration( "as ${description}")].)

In fact, the way I'm using it is to provide this syntax:

property Name {
  is required;
  is described as "Contact Name";
}

I have a "LanguageStyle" class that exposes how sentences are terminated, how blocks and comments begin and end, etc. I have implemented the CStyle class (with the ; {} // and stuff).

For now, it's working, but only as a draft. I want the interpreter to understand by default that "Is<T>()" should map to "is ${T}".

I don't know if it's useful, but I can assure you the implementation was a lot of fun :)


Oren Eini

CEO of RavenDB