Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I just spent a week trying to write a parser for C declarations. Recursive Descent didn't work, Top Down Operator Precedence didn't work, I finally found an simple implementation of cdecl in K&R2, but it printed out textual description instead of creating an AST. Converting it to create an AST was a whole different challenge. In the end, I got it working by adding the idea of a blank-type and doing some tree inversions (almost using each AST as a program that executes over the nested chunk of AST).

It was only about 200 lines of code, and yet never have I been happier to finish a solution and never having to think about it again.



Rec-descent absolutely does work (source: wrote a working parser), but it's a bit annoying to make it build a syntax tree. Making it build a prefix encoding of the declaration is much easier.

If you have to, you can make a syntax tree work, too, but you'll have to thread through the declarator parser a double (triple?) pointer for the hole in the tree where the next part goes, and at the end plug it with the specifiers-and-qualifiers part you parsed before that. Or at least that's what I had working before I gave up and switched to a prefix code. It'd probably pay to check the LCC book (Fraser & Hanson, A Retargetable C Compiler: Design and Implementation) to see what they do there, because their type representation is in fact a hash-consed tree. (The LuaJIT FFI parser needs some sort of magic fix-up pass at the end that I didn't much like, but I don't remember the specifics.)


I think the most straightforward recursive-descent approach is just to parse the expression part of the type as an expression and then turn it inside out to get the type. So if you have e.g. 'int (*(p[5]))(char)', then your parse tree is something like

    declaration
      int
      function call
        *
          []
            p
            5
        char
and you need to turn that inside out to get something like

    array
      pointer to
        function
          int
          (char)
This way you have a simple expression parser combined with a simple recursive tree transformation function. (The tree transformation took about 50 LOC when I implemented it for a toy parser.)


> Rec-descent absolutely does work (source: wrote a working parser), but it's a bit annoying to make it build a syntax tree.

Yeah, it required too much backtracking and state snap-shotting and resets, and I couldn't figure out a decent way of reporting good errors.

Thanks for the references. The code I have now is pretty elegant and functional, so I'm not in the mood of diving back into it. But if I ever need to change it, I'll take a look.


Reading this make me happy. People still spend time on fun & nerdy stuff like this!


How can recursive descent not work? Did you not do backtracking? It might be slow or annoying but it should work.


I mean, with enough hacking around, anything can work. But the code had gotten extremely complicated and brittle, and it was still not handling all the edge cases. Readability was important for this project, since I want to be able to modify the parser later to add new extensions to the language.


Ah gotcha, thanks! Out of curiosity do you recall examples of any of the hardest cases it couldn't properly handle?


Stuff like:

    char *(*(**foo[][8])())[]


And it would fail to parse at all, or parse as the wrong thing?


Recursive descent required too much backtracking and state resets, which made the code buggy and unreadable (and the error reporting was awful). Pratt parser simply parsed wrong (presumably because I couldn't figure out the right way to sort the precedences).


Gotcha! I've been thinking of trying my hand at a C parser but it always gave me the sense that it was more annoying than it should be haha. Interesting to near your experience, thanks for sharing!


What are you working on?


A C parser for a meta-programming layer (automatically generating serialisation routines, etc.)


Just my 2 cents:

The _external_ way of doing introspection (a parser like yours) is generally very limiting in real life, as you will have to interpret all the preprocessor directives that your code follows.

e.g.:

    struct Foo {
    #ifdef ...
        int a;
    #else
        float b;
    #endif
    }
Not only will you have to feed your parser the exact same inputs as your build system, but also some directives are built-in the compiler and may be hard to replicate.

The easiest way to do introspection is _intrusive_, even though it pollute the code a bit.

e.g.

    DECL_STRUCT(Foo)
    {
        DECL_ATTR(int, a);
    }
    END_DECL_STRUCT
A long time ago, Qt had a sort of hybrid intrusive parser like this (MOC?), that was 20y ago though things may have changed.


Regarding pre-processing, I feed the result of `clang -E` to the code-generator. Since I use a unity build (single translation unit), it works out fine. In fact, the parser treats pre-processor directives as comments (to work around #line and `-dD`, etc.)

Regarding external and intrusive, I used to do it in the intrusive way but found it too limiting. Here, I not only generate code but can also (potentially) add whole new extensions to the language. This was the reason I wrote a new parser instead of just using libclang's JSON AST dump. Well, that and the fact that libclang is a multi-MB dependency while my parser is ~3000 lines of C code.


I finally found an simple implementation of cdecl in K&R2

Perhaps you should've started learning C with K&R. ;-)


I did, 10 years ago. Didn't remember every piece of code.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: