liuliu · 2026-04-07T18:01:13 1775584873

I am not sure you set it up right. Did you have a runnable WolframLanguage file so it can compare results? Did you give it H100 / H200 access to compile and then iterate?

My experience is that once you have these two, it does amazing kernel work (Codex-5.4).

acuozzo · 2026-04-07T18:12:05 1775585525

> Did you have a runnable WolframLanguage file so it can compare results?

Yes.

> Did you give it H100 / H200 access to compile and then iterate?

Yes via Lambda.ai. Also, FWIW, I run claude with --dangerously-skip-permissions and codex with the equivalent flag.

> it does amazing kernel work (Codex-5.4)

Specifically with WGMMA + TMA?

---

Once TMA gets involved both Claude and Codex spin endlessly until they dump TMA for a slower fallback.

I've observed this with Claude-Code having Opus 4.6 reasoning set to medium, high, and max; "adaptive thinking" enabled and disabled; and I've made sure to max-out thinking tokens.

I've also observed this with Codex GPT-5.4 in addition to GPT-5.3-Codex with reasoning efforts from medium to xhigh.

---

I've also observed this on the web, as mentioned in my OP, with GPT-5.4pro (Extended Pro), Gemini3-DeepThink, and Opus 4.6.

liuliu · 2026-04-07T20:27:58 1775593678

That is informative, thanks! Yes, I observe the same thing: the model tends to give up (like you said, "dump TMA for a slower fallback") and needs active steering to get good results. But it does get further than a one-shot from the chat interface, and it knows much more about profiling / kernel coding than those.

liuliu · 2026-04-05T17:30:14 1775410214

It does feel like this also impacts the usage meter for subscription plans?

raincole · 2026-04-05T17:33:37 1775410417

Usage meter has always been completely opaque anyway. They could (and probably did) shrink the limit whenever they like.

mrtesthah · 2026-04-05T17:45:21 1775411121

Ostensibly this makes usage meter rate changes more transparent?

liuliu · 2026-04-05T18:16:08 1775412968

It is a bit insidious that the price hike coincides with the end of the 2x promotion, which makes the usage change a bit more obscure.

HumanOstrich · 2026-04-05T18:34:19 1775414059

It's not a price hike, it's actually making it easier to understand relative usage for different models/features.

thejazzman · 2026-04-05T19:34:52 1775417692

I have no idea what I’m getting for $200/mo at this point. Maybe that’s on me, idk.

HumanOstrich · 2026-04-05T21:55:49 1775426149

Fair point. We only have clear evidence they're being more transparent about credit pricing and value, but it's unclear whether that'll make people burn through usage faster or slower.

The fuzziness is intentional. It gives them wiggle room and obscures how much "value" you actually get from $200, a 5-hour block, or a week. That keeps the tension manageable between subscription pricing and pay-per-token API pricing, especially for larger businesses on enterprise plans who want transparent $-per-MTok rates.

If they were fully transparent, like "your $200 sub gets you up to $2,000 of equivalent API usage," it would be a constant fight. People would track pennies and scream any time 5-hour blocks got throttled during peak hours. Businesses would push harder for pay-per-token discounts seeing that juicy $200 sub value.

ssl-3 · 2026-04-05T21:53:56 1775426036

I have no idea what I'm getting for $20/mo, either. (But I do know that it's at least $180 less than what I could be spending, I suppose.)

liuliu · 2026-04-02T18:22:21 1775154141

Brand recognition. Since "model-is-the-service", various previously-interesting companies become thin API resellers and the moat is between "selling a dollar for fifty cents" and Brand awareness.

I am not saying this in bad faith. Model companies cannot penetrate every niche with the same brand recognition as some other companies you would consider as "API resellers" do.

liuliu · 2026-04-01T02:30:35 1775010635

It needs an mlx fork because the lowest bit width in mlx is currently 2 (for affine quantization).
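
For readers unfamiliar with the term, here is a minimal sketch of what affine quantization at a given bit width means. It is a generic textbook formulation (one scale and one bias per group of weights), not mlx's actual implementation; `QuantizedGroup`, `quantize_affine`, and `dequantize` are hypothetical names made up for this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one scale/bias pair shared by a group of weights.
struct QuantizedGroup {
    float scale;
    float bias;
    std::vector<std::uint8_t> codes;  // one integer code per weight, in [0, 2^bits - 1]
};

// Generic affine quantization: x is approximated by scale * code + bias.
// Assumes w is non-empty and 1 <= bits <= 8.
inline QuantizedGroup quantize_affine(const std::vector<float>& w, int bits) {
    const auto [lo_it, hi_it] = std::minmax_element(w.begin(), w.end());
    const float lo = *lo_it;
    const float hi = *hi_it;
    const float levels = static_cast<float>((1 << bits) - 1);
    // Guard against a constant group, where hi == lo would give a zero scale.
    const float scale = (hi > lo) ? (hi - lo) / levels : 1.0f;

    QuantizedGroup g{scale, lo, {}};
    g.codes.reserve(w.size());
    for (float x : w) {
        const float q = std::round((x - lo) / scale);
        g.codes.push_back(static_cast<std::uint8_t>(std::clamp(q, 0.0f, levels)));
    }
    return g;
}

// Reconstruct the approximate weight at index i.
inline float dequantize(const QuantizedGroup& g, std::size_t i) {
    return g.scale * static_cast<float>(g.codes[i]) + g.bias;
}
```

With the group `{0, 1, 2, 3}`, 2 bits gives four levels and the values round-trip exactly; 1 bit leaves only two levels, so the same group reconstructs as `{0, 0, 3, 3}`.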

riidom · 2026-04-01T15:22:10 1775056930

That mlx is for apple hardware only, though? Or did I misunderstand something.

dragonwriter · 2026-04-01T23:22:06 1775085726

It needs a llama.cpp fork, too; so the stock runtime (based on stock llama.cpp) used by LM Studio presumably won't work for it.

liuliu · 2026-03-30T05:50:59 1774849859

If it is close enough to a humanoid, it can move to and around the places that humans can.

liuliu · 2026-03-30T00:36:48 1774831008

prepare uses measure text; if you call it in a for loop, it won't be fast. This library is meant to do prepare once and then layout many times. layout calls should be sub-1 ms.

leeoniya · 2026-03-30T00:41:16 1774831276

it is not clear from the API/docs how i would use prepare() once on one text and then use layout() for completely different text.

i think the intended purpose is that your text is maybe large but static and your layout just changes quickly. this is not the case for figuring out the height of 100k rows of different texts in a table, for example.

liuliu · 2026-03-30T01:52:21 1774835541

I think the way to use pretext for that is to join the rows with hard line breaks, do prepare once, and then walk each line. At least that will put the single-layout performance in the best light.

I am skeptical that getting the row height of many items only once is the intended use, though. The intended use is probably to get the row height of many items once and then let you resize the width many times later (which is pretty useful on desktop).

leeoniya · 2026-03-30T02:59:48 1774839588

tried just doing a concat of the 100k sentences with line breaks, it wasnt much faster, ~1880ms.

liuliu · 2026-03-24T17:13:59 1774372439

Still have 4 brand new ones in my storage unit. Just in case of moments like these.

Joking aside (I do have them tho!), I don't think Optane is of much use (not to mention that mine are only 256GiB). It is a useful crutch if you have legacy software that is not designed to issue multiple reads / writes in parallel. If your software does issue them in parallel, Optane is really not faster than NVMe, especially the modern ones.

zozbot234 · 2026-03-24T17:23:29 1774373009

It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert-layers immediately after routing), it's the wearout resistance which opens up the possibility of storing KV-cache (including the "linear" KV-cache of recent Qwen, which is not append-only as it was with the pure attention model) and maybe even per-layer activations - though this has the least use given how ephemeral these are.

liuliu · 2026-03-23T22:22:45 1774304565

I don't think this argument is wrong, but it is also debatable. At the end of the day, we are talking about the manifold of reality (as compressed by the LLM through language abstraction). It remains to be seen whether supervised fine-tuning on the best that humans can produce would nudge the model enough to generate surprising findings.

We know that pre-trained models tend to revert to the mean, but I don't think that's enough to say SFT / RL models will do the same. Some might argue that RL only sharpens the distribution, but I am skeptical about that paper.

liuliu · 2026-03-18T17:39:16 1773855556

The problem is that it cannot access your credentials, hence it is useless.

liuliu · 2026-03-16T19:40:37 1773690037

There are; look no further than the jemalloc API surface itself:

https://jemalloc.net/jemalloc.3.html

One thing to call out: sdallocx integrates well with C++'s sized delete semantics: https://isocpp.org/files/papers/n3778.html
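
A minimal sketch of how a sized delete hands the allocation size back to the allocator. To stay self-contained it uses a class-specific `operator delete` and a stand-in `sized_free` built on `std::free`; `Packet`, `sized_free`, and `last_sized_free` are hypothetical names for this illustration. With jemalloc linked in, the body of `sized_free` would instead call `sdallocx(ptr, size, 0)`. N3778 applies the same idea to the global `operator delete`.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Records the size handed to the most recent sized free (for illustration only).
static std::size_t last_sized_free = 0;

// Stand-in for a sized deallocation entry point. Assumption: with jemalloc
// this would be sdallocx(ptr, size, 0), which lets the allocator skip the
// lookup of the allocation's size class.
inline void sized_free(void* ptr, std::size_t size) noexcept {
    last_sized_free = size;
    std::free(ptr);
}

struct Packet {
    char payload[48];

    static void* operator new(std::size_t size) {
        if (void* p = std::malloc(size)) {
            return p;
        }
        throw std::bad_alloc();
    }

    // Sized deallocation function: the compiler supplies the object size
    // at the delete site, so no per-allocation size lookup is needed.
    static void operator delete(void* ptr, std::size_t size) noexcept {
        sized_free(ptr, size);
    }
};
```

After `Packet* p = new Packet; delete p;`, `last_sized_free` equals `sizeof(Packet)`: the size was known statically at the delete site and never had to be recovered from allocator metadata.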

hedora · 2026-03-16T19:59:29 1773691169

You can also play tricks with inlining and constant propagation in C (especially on the malloc path, where the ground-truth allocation size is usually statically known).