Refactoring C CodeImplementing parsing

time to read 3 min | 549 words

So far, I did a whole lot of work around building the basic infrastructure of just building a trivial echo server with SSL. But the protocol I have in mind is a lot more complex, let’s get started with actually implementing the parsing of messages.

To start with, we need to implement parsing of lines. In C, this is actually a decidedly non trivial operation, because you need to read the data from the network into someplace and parse it. This area is rife with errors, so that is going to be fun.

Here is a simple raw message:

GET employees/1-A employees/2-B
Timeout: 30
Sequence: 293
Include: ReportsTo

The structure goes:

CMD args1 argN\r\n

And then header lines with:

Name: value\r\n

The final end of the message is \r\n\r\n.

To make things simple for myself, I’m going to define the maximum size of a message as 8KB (this is the common size in HTTP as well). Here is how I decided to represent it in memory:

image

The key here is that I want to minimize the amount of work and complexity that I need to do. That is why the entire message is limited to 8KB. I’m also simplifying how I’m going to be handling things from an API perspective. All the strings are actually C strings, null terminated, and I’m using the argv, argc convention for naming, just like in the main function.

This means that I can simply read from the network until I find a “\r\n\r\n” in there. Here is how I do this:

There is a bit of code here, but the gist of it is pretty simple. The problem is that I need to handle partial state. That is, a single message may come in two separate packets, or multiple messages may come in a single packet. I don’t have a way to control that, so I need to be careful about tracking past state. The connection has a buffer that is used to hold the state in memory, whose size is large enough to hold the largest possible message. I’m reading from the network to a buffer and then scanning to find the message separator.

If I couldn’t find it, I’m recording the last location where it could be starting, and then issuing another network read and will try searching for \r\n\r\n again. Once that is found, the code will call to the parse_commnad() method that operates over the entire command in memory (which is much easier). With that done, my message parsing is actually quite easy, from a conceptual point of view, although I’ll admit that C make it a bit long.

I’m copying the memory from the network buffer to my own location, this is important because the read_message() function will overwrite it in a bit, and it also allow me to modify the memory more easily, which is required for using strtok(). This basically allow me to tokenize the message into its component parts. First on a line by line basis (with splitting on space for the first line and then treating this as headers lines).

I added the ability to reply to a command, which means that we are pretty much almost done. You can see the current state of the code here.

More posts in "Refactoring C Code" series:

  1. (12 Dec 2018) Going to async I/O
  2. (10 Dec 2018) Do we need a security review?
  3. (07 Dec 2018) Implementing parsing
  4. (04 Dec 2018) Multi platform and Valgrind
  5. (03 Dec 2018) Choosing the next direction
  6. (30 Nov 2018) Giving good SSL errors to your client…
  7. (29 Nov 2018) Starting with an API
  8. (27 Nov 2018) Error handling is HARD, error REPORTING is much harder