I'm not sure I understand your argument. The optimizations here are:
>"-Ofast" same as "-O3 -ffast-math"
-ffast-math lets the compiler break strict IEEE 754 semantics, so it would make a very bad default.
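As a rough illustration (file name is just for the example, and the exact behavior depends on the GCC version and flags): -ffast-math implies -ffinite-math-only, so GCC is allowed to assume a value is never NaN and may fold a plain NaN check away entirely.

    /* fastmath_nan.c - sketch of one way -ffast-math breaks IEEE semantics.
       -ffast-math implies -ffinite-math-only, which lets GCC assume x is
       never NaN, so the x != x check below may be folded to "false".
       Compare: gcc -O2 fastmath_nan.c  vs  gcc -O2 -ffast-math fastmath_nan.c */
    #include <stdio.h>

    int main(void)
    {
        volatile double zero = 0.0;
        double x = zero / zero;   /* NaN, computed at run time */

        if (x != x)
            printf("NaN detected (IEEE-conforming behavior)\n");
        else
            printf("NaN check was (probably) optimized away\n");
        return 0;
    }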
-O3 used to be potentially buggy and somewhat experimental, although I doubt that it's much of a practical problem these days. It still makes the code harder to debug though. Most build systems I'm aware of default to debug builds, so GCC is not really unique in that regard.
>"-flto" enable link time optimizations
Those optimizations are typically quite expensive and can backfire in some scenarios. Link-time optimization is also relatively new, at least on GCC's timescale. I'm sure it will be enabled by default some day, but GCC has to be conservative.
>"-mfpmath=sse" enables use of XMM registers in floating point instructions (instead of stack in x87 mode)
So that means that you tell the compiler to assume that SSE is available on the target. By default GCC outputs code that's compatible with the baseline, which is perfectly reasonable IMO. Same reason why -march=native is also not the default, it is assumed that you may want to ship your binaries to other computers.
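One quick way to see what baseline you're actually building for is to check the feature macros GCC defines. As a sketch (assuming a multilib-capable GCC; distro defaults vary): on 32-bit x86 the default target usually doesn't define __SSE2__ unless you pass -msse2 or a suitable -march, while on x86-64 SSE2 is part of the baseline.

    /* baseline_check.c - prints whether the compiler was allowed to assume SSE2.
       Try: gcc -m32 baseline_check.c        (baseline 32-bit x86, usually no SSE2)
            gcc -m32 -msse2 baseline_check.c
            gcc baseline_check.c             (x86-64: SSE2 is part of the baseline) */
    #include <stdio.h>

    int main(void)
    {
    #ifdef __SSE2__
        puts("this build assumes SSE2 is available");
    #else
        puts("this build targets the baseline, no SSE2 assumed");
    #endif
        return 0;
    }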
>"-funroll-loops" enables loop unrolling
From GCC's own docs:
"This option makes code larger, and may or may not make it run faster."
Loop unrolling is tricky, because it gets rid of branches but also increases the size of the code and therefore the pressure on the icache. In some scenarios the rolled loop can actually run faster than the unrolled version if it saves on cache misses.
Given that there are tradeoffs involved, it's also reasonable to let the user decide to enable this optimization.
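For a concrete picture, this is roughly the transformation -funroll-loops applies, written out by hand (the compiler does it on its own intermediate representation; this is just the C-level equivalent, for illustration only):

    /* rolled: small code, one branch per element */
    void scale(float *a, int n, float k)
    {
        for (int i = 0; i < n; i++)
            a[i] *= k;
    }

    /* unrolled by 4: fewer branches per element, but more code */
    void scale_unrolled(float *a, int n, float k)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            a[i]     *= k;
            a[i + 1] *= k;
            a[i + 2] *= k;
            a[i + 3] *= k;
        }
        for (; i < n; i++)        /* remainder loop */
            a[i] *= k;
    }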
I work with C a lot and IMO the only truly bad default in GCC is that -Wall isn't enabled by default.
>That depends on the language and the GCC version.
I was quoting TFA verbatim. I actually never use -Ofast myself; beyond -O3 I tend to use the individual flags manually, checking with benchmarks that it makes a difference.
>Yes, but unfortunately it's the Intel default, which contributes to some of the compiler mythology.
I didn't know that. I guess Intel is extremely performance-oriented and doesn't really care for portability so it makes some sense for them to do that.
>Regarding unrolling, you usually do want it in numeric loops, and -O3 unrolls (and jams) as -fopt-info shows.
Sure, but that's my point: the default optimizations for -O3 are fairly aggressive already. Unless you're writing code where performance is absolutely critical, you'll probably do just fine remembering to pass "-Wall -O3" to GCC and leaving it at that. Actually, for most of my code where I want good performance but I'm not counting individual clock cycles, I tend to default to -O2, which gives you most of the performance benefits with more conservative and easier-to-debug optimizations.
And for cases where you need to go beyond -O3 you'll probably have to write some benchmarking code before you can decide which additional option to use. At least, in my experience.
> So that means that you tell the compiler to assume that SSE is available on the target. By default GCC outputs code that's compatible with the baseline, which is perfectly reasonable IMO. Same reason why -march=native is also not the default, it is assumed that you may want to ship your binaries to other computers.
If your x86 computer was made after the US invaded Iraq, it supports SSE2 instructions.
And I'm currently writing code for a 32-bit CPU that was first introduced in 2000. It's not x86 so SSE is irrelevant, but my point is that GCC is routinely used to build millions if not billions of lines of code; they can't just YOLO-deprecate things as if it were a JavaScript framework.
clang also doesn't assume -march=native - just like gcc it makes binaries that can run on machines of the same architecture rather than specialising to the current processor's features.
You can see the flags it's using by comparing
    clang -E -v - </dev/null 2>&1
and
    clang -march=native -E -v - </dev/null 2>&1
The second output will probably show a lot of extra target features enabled compared to the first.
Clang works much the same way by default. All compilers have to pick a baseline, and assuming it is arch_of(your_core) is a recipe for disaster if you are compiling on your high-end machine for release to a broad user base.
For many users of GCC the intended user base will be “the same as my current distribution” and so GCC can be configured using --with-{cpu,arch,schedule,tune} depending on the architecture to set a default in line with user/distro expectations.
Exactly. What you do not want to happen is that your program crashes with "illegal instruction" errors on older CPUs that do not have the newest fancy features yet, but are still in use by a large chunk of your (paying) users.
Performance-critical programs usually deal with that by providing multiple builds targeting different CPUs/CPU feature sets, by letting the user compile from source with the right flags, or by detecting the CPU at runtime and providing alternative versions of a few important performance-critical functions for different CPU feature sets (browsers, ffmpeg, glibc, and VMs/runtimes like Java HotSpot or .NET do that), while the majority of the code is still compiled for a lowest common subset of CPU features.
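For the runtime-detection approach, GCC itself has some support built in: on x86 with glibc ifunc support, the target_clones attribute makes the compiler emit several versions of a function and lets the dynamic loader pick one based on the CPU actually running the binary. A minimal sketch (availability depends on GCC version and libc; the function is just an example):

    /* Sketch of GCC function multi-versioning via target_clones.
       The compiler emits an AVX2 clone, an SSE4.2 clone and a baseline
       "default" clone; the right one is selected at load time based on
       the CPU the binary runs on. Requires a reasonably recent GCC on
       x86/x86-64 with glibc ifunc support. */
    #include <stddef.h>

    __attribute__((target_clones("avx2", "sse4.2", "default")))
    double dot(const double *a, const double *b, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }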
Of course, languages that run on (usually) JIT-ed VMs/runtimes have a bit of an advantage here, as the actual machine code is generated from source code or byte code only at runtime, at which point it is clear what kind of CPU the program is running on. They can, but do not always, implement optimized JITting depending on the CPU features. (Of course, every language/VM/runtime comes with its own set of pros and cons, and there is no silver bullet.)
To make matters even more complicated: compiling code to use the newest CPU features or newest optimization techniques does not mean it will actually run faster.
E.g. AVX512 may actually slow down your code (when multi-threaded) on many CPUs[1]. Or heavily "optimized" code may become larger in machine code, to the point where the "unoptimized" code runs faster because it fits properly in the CPU cache(s) while the "optimized" version does not. "-Os"-optimized code may run faster than "-Ofast"-optimized code for that reason. Or it may not, depending on the actual code.
I remember compiling ffmpeg and libx264 myself a bunch of years ago, with the "best" flags for my system, starting with "-march=" and "-Ofast" of course, thinking I am a tough skillful super geek now. Imagine my surprise when I tested the performance against a default ffmpeg build and my optimized build was 2-5% slower.
I mean, I have actually still never used Clang, but even before this list, I was aware that there were quite a few gcc defaults that were ridiculous.
But every gcc article is about how there are even more ways that gcc doesn't actually work sensibly unless you know the trick.
I mean, why doesn't it say somewhere in the help next to the optimization stuff that you might want to consider specifying the architecture?