I am not sure you set it up right. Did you have a runnable WolframLanguage file so it can compare results? Did you give it H100 / H200 access to compile and then iterate?
My experience is that once you have these two, it does amazing kernel work (Codex-5.4).
> Did you have a runnable WolframLanguage file so it can compare results?
Yes.
> Did you give it H100 / H200 access to compile and then iterate?
Yes via Lambda.ai. Also, FWIW, I run claude with --dangerously-skip-permissions and codex with the equivalent flag.
> it does amazing kernel work (Codex-5.4)
Specifically with WGMMA + TMA?
---
Once TMA gets involved both Claude and Codex spin endlessly until they dump TMA for a slower fallback.
I've observed this with Claude-Code having Opus 4.6 reasoning set to medium, high, and max; "adaptive thinking" enabled and disabled; and I've made sure to max-out thinking tokens.
I've also observed this with Codex GPT-5.4 in addition to GPT-5.3-Codex with reasoning efforts from medium to xhigh.
---
I've also observed this on the web, as mentioned in my OP, with GPT-5.4pro (Extended Pro), Gemini3-DeepThink, and Opus 4.6.
That is informative, thanks! Yes, I observe the same thing: the model tends to give up (as you said, "dump TMA for a slower fallback") and needs active steering to get good results. But it does get further than one-shot prompting in the chat interface, and it knows much more about profiling / kernel coding than the chat models do.
Fair point. We only have clear evidence they're being more transparent about credit pricing and value, but it's unclear whether that'll make people burn through usage faster or slower.
The fuzziness is intentional. It gives them wiggle room and obscures how much "value" you actually get from $200, a 5-hour block, or a week. That keeps the tension manageable between subscription pricing and pay-per-token API pricing, especially for larger businesses on enterprise plans who want transparent $-per-MTok rates.
If they were fully transparent, like "your $200 sub gets you up to $2,000 of equivalent API usage," it would be a constant fight. People would track pennies and scream any time 5-hour blocks got throttled during peak hours. Businesses would push harder for pay-per-token discounts seeing that juicy $200 sub value.
Brand recognition. Since the "model is the service", various previously interesting companies become thin API resellers, and the moat comes down to either "selling a dollar for fifty cents" or brand awareness.
I am not saying this in bad faith. Model companies cannot penetrate every niche with the same brand recognition that some of the companies you would consider "API resellers" have.
prepare does the text measurement, so if you call it in a for loop, it won't be fast. This library is meant to do prepare once and then layout many times. layout calls should be sub-1 ms.
it is not clear from the API/docs how i would use prepare() once on one text and then use layout() for completely different text.
i think the intended purpose is that your text is maybe large but static and your layout just changes quickly. this is not the case for figuring out the height of 100k rows of different texts in a table, for example.
I think the way to use pretext for that is to join the rows with hard line breaks, call prepare once, and then walk each line. At least that will show the single-layout performance in the best light.
I am skeptical that getting the row heights of many items only once is the intended use, though. The intended use is probably to get the row heights of many items and then let you resize the width many times later (which is pretty useful on desktop).
Still have 4 brand new ones in my storage unit. Just in case, for moments like these.
Joking aside (I do have them tho!), I don't think Optane is of that much use (not to mention each of my units is only 256GiB). It is a useful crutch if you have legacy software that is not designed to issue multiple reads / writes in parallel. If your software does issue them in parallel, Optane is really not faster than NVMe, especially the modern drives.
It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert-layers immediately after routing), it's the wearout resistance which opens up the possibility of storing KV-cache (including the "linear" KV-cache of recent Qwen, which is not append-only as it was with the pure attention model) and maybe even per-layer activations - though this has the least use given how ephemeral these are.
I don't think this argument is wrong, but it is debatable. At the end of the day, we are talking about the manifold of reality (as compressed by the LLM through the abstraction of language). It remains to be seen whether supervised fine-tuning on the best that humans can produce would nudge the model enough to generate surprising findings.
We know pre-trained models do tend to revert to the mean, but I don't think that's enough to say SFT / RL models will do the same. Some might argue that RL only sharpens the distribution, but I am skeptical about that paper too.
You can also play tricks with inlining and constant propagation in C (especially on the malloc path, where the ground-truth allocation size is usually statically known).
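A minimal sketch of what I mean, assuming a made-up size-class allocator. The names size_class and my_alloc, and the class boundaries, are hypothetical and not taken from any real allocator. The point is only that when the size argument is a compile-time constant such as sizeof(struct point), an optimizing compiler can typically inline the lookup and fold the branches away at the call site:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical size classes; a real allocator would have many more. */
enum { CLASS_16 = 0, CLASS_64 = 1, CLASS_256 = 2, CLASS_LARGE = 3 };

/* Bytes actually reserved for each small class. */
static const size_t class_bytes[] = { 16, 64, 256 };

/* Pure, branch-only mapping from request size to size class.
 * Being static inline, it can be inlined into callers, and if `size`
 * is a constant there, the comparisons can be folded to a constant. */
static inline int size_class(size_t size) {
    if (size <= 16)  return CLASS_16;
    if (size <= 64)  return CLASS_64;
    if (size <= 256) return CLASS_256;
    return CLASS_LARGE;
}

/* Hypothetical allocation entry point (assumption, for illustration). */
static inline void *my_alloc(size_t size) {
    assert(size != 0);  /* debug-only check, compiled out with NDEBUG */
    int cls = size_class(size);
    if (cls == CLASS_LARGE)
        return malloc(size);
    /* Stand-in for popping a block off a per-class free list. */
    return malloc(class_bytes[cls]);
}

struct point { double x, y; };

struct point *new_point(void) {
    /* sizeof(struct point) is statically known, so after inlining the
     * class selection above does not need to happen at run time. */
    return my_alloc(sizeof(struct point));
}
```

Whether the folding actually happens depends on the compiler and optimization level, so it is worth checking the generated assembly rather than assuming it.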