I just spent a week trying to write a parser for C declarations. Recursive Descent didn't work, Top Down Operator Precedence didn't work, I finally found an simple implementation of cdecl in K&R2, but it printed out textual description instead of creating an AST. Converting it to create an AST was a whole different challenge. In the end, I got it working by adding the idea of a blank-type and doing some tree inversions (almost using each AST as a program that executes over the nested chunk of AST).
It was only about 200 lines of code, and yet never have I been happier to finish a solution and never having to think about it again.
Rec-descent absolutely does work (source: wrote a working parser), but it's a bit annoying to make it build a syntax tree. Making it build a prefix encoding of the declaration is much easier.
If you have to, you can make a syntax tree work, too, but you'll have to thread through the declarator parser a double (triple?) pointer for the hole in the tree where the next part goes, and at the end plug it with the specifiers-and-qualifiers part you parsed before that. Or at least that's what I had working before I gave up and switched to a prefix code. It'd probably pay to check the LCC book (Fraser & Hanson, A Retargetable C Compiler: Design and Implementation) to see what they do there, because their type representation is in fact a hash-consed tree. (The LuaJIT FFI parser needs some sort of magic fix-up pass at the end that I didn't much like, but I don't remember the specifics.)
I think the most straightforward recursive-descent approach is just to parse the expression part of the type as an expression and then turn it inside out to get the type. So if you have e.g. 'int (*(p[5]))(char)', then your parse tree is something like
declaration
int
function call
*
[]
p
5
char
and you need to turn that inside out to get something like
array
pointer to
function
int
(char)
This way you have a simple expression parser combined with a simple recursive tree transformation function. (The tree transformation took about 50 LOC when I implemented it for a toy parser.)
> Rec-descent absolutely does work (source: wrote a working parser), but it's a bit annoying to make it build a syntax tree.
Yeah, it required too much backtracking and state snap-shotting and resets, and I couldn't figure out a decent way of reporting good errors.
Thanks for the references. The code I have now is pretty elegant and functional, so I'm not in the mood of diving back into it. But if I ever need to change it, I'll take a look.
I mean, with enough hacking around, anything can work. But the code had gotten extremely complicated and brittle, and it was still not handling all the edge cases. Readability was important for this project, since I want to be able to modify the parser later to add new extensions to the language.
Recursive descent required too much backtracking and state resets, which made the code buggy and unreadable (and the error reporting was awful). Pratt parser simply parsed wrong (presumably because I couldn't figure out the right way to sort the precedences).
Gotcha! I've been thinking of trying my hand at a C parser but it always gave me the sense that it was more annoying than it should be haha. Interesting to near your experience, thanks for sharing!
The _external_ way of doing introspection (a parser like yours) is generally very limiting in real life, as you will have to interpret all the preprocessor directives that your code follows.
Not only will you have to feed your parser the exact same inputs as your build system, but also some directives are built-in the compiler and may be hard to replicate.
The easiest way to do introspection is _intrusive_, even though it pollute the code a bit.
Regarding pre-processing, I feed the result of `clang -E` to the code-generator. Since I use a unity build (single translation unit), it works out fine. In fact, the parser treats pre-processor directives as comments (to work around #line and `-dD`, etc.)
Regarding external and intrusive, I used to do it in the intrusive way but found it too limiting. Here, I not only generate code but can also (potentially) add whole new extensions to the language. This was the reason I wrote a new parser instead of just using libclang's JSON AST dump. Well, that and the fact that libclang is a multi-MB dependency while my parser is ~3000 lines of C code.
It was only about 200 lines of code, and yet never have I been happier to finish a solution and never having to think about it again.