
C's most infamous parsing difficulty is:

    A * B;
Is it A multiplied by B, or B declared as a pointer to A? It can't be resolved without a symbol table. But there's another way that works very well: it's a declaration. The reason is simple. The multiply has no purpose, and so nobody would write that. C doesn't have metaprogramming, so it won't be generating such code as an edge case (unless using the preprocessor for metaprogramming, in which case you deserve what you get).
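
To make the ambiguity concrete, here is a minimal C99 sketch. The function names `as_declaration` and `as_expression` are made up for illustration; the statement `A * B;` is token-for-token identical in both, and only what `A` names in scope differs.

```c
#include <assert.h>

/* Here A names a type, so "A * B;" declares B as a pointer to int. */
int as_declaration(void)
{
    typedef int A;
    int value = 42;
    A * B;              /* declaration: B has type int* */
    B = &value;
    return *B;
}

/* Here A names a variable, so "A * B;" is an expression statement. */
int as_expression(void)
{
    int A = 6, B = 7;
    A * B;              /* multiplies and discards the result;
                           compilers typically warn about this */
    return A * B;
}
```

Both functions compile as C99, which is exactly the point: the parser has to know what `A` is before it can build the tree for that one line.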

But there's a worse problem:

    (A) - B
Is it A minus B, or casting negative B to type A? There's just no way to know without a symbol table. One might ask: who would write code that has a vacuous set of parentheses around an identifier? It turns out people don't, but they do write macros that parenthesize their arguments, and the preprocessed result has those vacuous parentheses.
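
A small C sketch of how the macro case arises. The macro `DIFF` and both function names are hypothetical; the macro is just the usual "parenthesize every argument" style. The same invocation `DIFF(A, B)` expands to `((A) - (B))` in both functions, yet it means two different things.

```c
#include <assert.h>
#include <limits.h>

/* Defensive parenthesization, as commonly written in macros. */
#define DIFF(x, y) ((x) - (y))

/* A is a variable: ((A) - (B)) is a subtraction. */
int as_subtraction(void)
{
    int A = 10;
    int B = 3;
    return DIFF(A, B);      /* 10 - 3 */
}

/* A is a type: ((A) - (B)) casts the value -(B) to A. */
int as_cast(void)
{
    typedef unsigned char A;
    int B = 3;
    return DIFF(A, B);      /* (unsigned char)(-3), i.e. UCHAR_MAX - 2 */
}
```

Calling `as_subtraction()` yields 7, while `as_cast()` yields `UCHAR_MAX - 2` (253 on platforms with 8-bit `char`), because the conversion to an unsigned type wraps modulo `UCHAR_MAX + 1`.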

D resolves both issues with:

1. if it parses like a declaration, it's a declaration

2. a cast expression is preceded by the keyword `cast`

and D is easy to parse without a symbol table.



I don't see why resolution without a "symbol table" is a big deal. Knowing whether A is a type or not resolves these fairly easily. And in well-written code, it should generally be obvious whether A is a type or not, so it should not be a readability problem either.


Because a quite different AST is built for the second case depending on whether A is a type or not. And you can't tell whether A is a type or not without a symbol table.


I think the two of you are talking past each other. You're saying, "yes, you need a symbol table", and he's saying, "yes, but a symbol table isn't very hard to do".

Personally, I wonder how many of these things would go away if pointers were a suffix operator with a different character (maybe the @ sign), and if casts looked like function calls.

    A B@;   # B is a pointer to type A
    A(-B);  # A is either a function or a type for casting
            # Same AST regardless


As for symbol tables being hard to do, notice that C does not allow forward references. Supporting forward references while relying on the symbol table to drive the parse winds up with unresolvable problems.


I'm not sure I understand your point. I don't think we're disagreeing about anything.

C needs forward declarations for some things, and it needs a symbol table to resolve some parts of the grammar. All I was saying was that I think you could resolve both of them with minor changes. (I see that D has a "cast" keyword, and that's obviously one way to do it.)


> A B@; # B is a pointer to type A

IIRC that's how Pascal did it, using a caret ^ for denoting a pointer.


Kind of.

A pointer to type A is:

    var B: ^A
The parser knows it is a declaration because of the var, and it knows ^A is the type because of the colon. That it is a caret does not really matter here.


Yeah, now if only Pascal had used curly braces, all would be right in the world :-)


I don't know about Pascal, but AT&T allowed anyone to write a C compiler without needing a license. (C++, too, I know because I asked AT&T's lawyers.)


Arh.... But then it wouldn't be Pascal ><


Yes! As he said - all would be right with the world =)


It was more about the AT&T monopoly, I believe :)


It's a problem because it means your editor will have a very hard time parsing your file without analyzing a lot of other files first.

The grammar of your file depends on previously declared symbols. But which symbols have been declared depends on the header.h files you import. But which headers you import depends on the -I options you give to your compiler. Except there's no standardized way to express what -I options your project uses, and they might change depending on your build profile.

Any modern text editor can give you good syntax highlighting for a Rust file or a Go file basically as soon as it opens. When it opens a C/C++ file, it has to do a lot of guessing.

(This is not conjecture, by the way. I tried to integrate clang's implementation of the Language Server Protocol in Atom for my end-of-studies project, and it was not fun.)


It’s annoying when parsing C. Also, the situation is worse in C++, since * can be overloaded and so can have side effects!


You're right, and C++ is hopeless to parse without a symbol table in other ways. Later versions of C++ added keywords to disambiguate, but of course there's legacy code.


Because pretty much the whole science of parsing is built for context-free languages. Yes, of course, you can ignore science when you do your thing, but please don't call yourself an engineer, then ;).


And as an engineer you can slap on a symbol table, and it works like a context-free language after that.


Mainly because most parser generators don't have a feature for disambiguation based on the application providing hints based on symbol table lookups.


It seems like you maybe misread their comment as referring to human parsing of code, rather than writing a parser program? I can’t make much sense of this otherwise.


This is because extra memory would be involved, and that could make it hard to deploy in a low-memory environment. Consider the historical background of C, from the late 1970s through the 1990s, and you'll see why.

But thanks to Moore's law, offset by Wirth's law, you have significantly more powerful hardware yet software that is not much faster, because we started to deploy languages worse than C: C++. C++ templates are Turing-complete, which means template expansion is undecidable in general and could run forever; most compilers sidestep this by imposing a recursion depth limit. Even setting the recursion depth aside, templates alone make compilation consume more memory than C does, which is why we complain that C++ compilers are memory hogs, while C has (relatively) low memory overhead. It's just that times are different, and memory use is no longer as visible a concern.


The problem has nothing to do with extra memory, it’s just that C parsers are more hacky than others due to the C language not having a proper context-free grammar.

Also, I frankly don’t get why templates are “bad” according to you. They allow for proper metaprogramming not relying on ugly hacks like memory layout conventions, etc.


I'm not saying templates are bad. Maybe my wording was ambiguous, but I do find template metaprogramming too complicated to begin with. Yes, I'm fascinated by the many things TMP made possible (such as std::tuple, std::bitset, Boost.ASIO, Boost.Hana), but it is also because of this that, to this day, I still cannot commit to my idea of writing my own C++ compiler.


It’s both.


You need those symbol tables anyway, though.


> But there's another way that works very well: it's a declaration. The reason is simple. The multiply has no purpose, and so nobody would write that.

But there's another way that works very well: it's a multiplication. The reason is simple. A declaration of a variable that isn’t used has no purpose, and so nobody would write that.

As I think you know, C doesn’t handle this case by guessing that it must be a declaration. Its lexer looks in the tables that its parser creates to check whether a type called ‘A’ is in scope (https://stackoverflow.com/questions/41331871/how-c-c-parser-...)
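
A rough sketch of that feedback loop between parser and lexer, in C. Everything here is hypothetical and heavily simplified (a flat table with no scopes, made-up names like `type_table`, `declare_type`, and `classify`); it is not how any particular compiler does it, only the shape of the idea.

```c
#include <assert.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

#define MAX_TYPES 16

/* Flat list of typedef names seen so far; real compilers track scopes. */
struct type_table {
    const char *names[MAX_TYPES];
    int count;
};

/* The parser would call this after accepting a typedef declaration. */
int declare_type(struct type_table *t, const char *name)
{
    if (t->count >= MAX_TYPES)
        return 0;                       /* table full */
    t->names[t->count++] = name;
    return 1;
}

/* The lexer would call this for every identifier it scans. */
enum token_kind classify(const struct type_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return TOK_TYPE_NAME;
    return TOK_IDENTIFIER;
}

/* How "X * Y;" gets read, given what X currently names. */
const char *meaning_of_star_statement(const struct type_table *t,
                                      const char *x)
{
    return classify(t, x) == TOK_TYPE_NAME ? "declaration"
                                           : "multiplication";
}
```

With an empty table, `meaning_of_star_statement(&t, "A")` gives "multiplication"; after `declare_type(&t, "A")` the same call gives "declaration". That order dependence is why the parse cannot be done on a file in isolation.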


> The multiply has no purpose, and so nobody would write that.

A fellow named Bjarne Stroustrup fixed that bug, though. In the plus plus dialect of C, A * B could reboot your system, without any #define macros for A or B.



