The “over 1 TFLOPS” claim for the M1 appears to be for single-precision (FP32) floats, whereas FLOPS figures for supercomputers, including the one given for the CRAY-1, are almost always based on double-precision (FP64) floats. The M1's double-precision throughput would be lower, perhaps half of its single-precision figure.
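To put a rough number on that, here is a minimal sketch of the scaling. It assumes the FP64:FP32 ratio really is about one half, which is only the guess above and not a measured figure, and it uses 1.0 TFLOPS as a stand-in for the "over 1 TFLOPS" claim. The function name and parameters are made up for illustration.

```python
def estimate_fp64_tflops(fp32_tflops: float, fp64_ratio: float = 0.5) -> float:
    """Scale an FP32 throughput figure by an assumed FP64:FP32 ratio.

    The default ratio of 0.5 is the "perhaps half" guess, not a measurement.
    """
    if not 0.0 < fp64_ratio <= 1.0:
        raise ValueError("fp64_ratio must be in the range (0, 1]")
    return fp32_tflops * fp64_ratio


# Assumption: 1.0 TFLOPS is a lower-bound placeholder for "over 1 TFLOPS".
m1_fp32_tflops = 1.0
m1_fp64_estimate = estimate_fp64_tflops(m1_fp32_tflops)
print(f"Estimated M1 FP64 throughput: {m1_fp64_estimate:.2f} TFLOPS")
```

If the real ratio is lower than one half, pass it as `fp64_ratio` and the estimate drops accordingly.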
I had just gone down this rabbit hole for unrelated reasons (looking into yields). Nvidia's 5090 die is 750 mm^2, managing 419 TFLOPS on the FP16 benchmark.
Very quickly: