I'm been occasionally glancing at PR/issue tracker to keep up to date with thing...

VagabundoP · 2026-03-31T13:58:46 1774965526

Occasionally Core.py will do some updates, higher level stuff:

https://open.spotify.com/show/1PGRfdrLEwgXjQbPBNk1pW

pablo and Łukasz

kenjin4096 · 2026-03-18T10:47:31 1773830851

> I've never seen where the high level discussions were happening

Thanks for your interest. This is something we could improve on. We were supposed to document the JIT better in 3.15, but right now we're crunching for the 3.15 release. I'll try to get to updating the docs soon if there's enough interest. PEP 744 does not document the new frontend.

I wrote a somewhat high-level overview here in a previous blog post https://fidget-spinner.github.io/posts/faster-jit-plan.html#...

> does this mean each opcode is possibly split into two (or more?) stencils, with and without removed increfs/decrefs?

This is a great question, the answer is not exactly! The key is to expose the refcount ops in the intermediate representation (IR) as one single op. For example, BINARY_OP becomes BINARY_OP, POP_TOP (DECREF), POP_TOP (DECREF). That way, instead of optimizing for n operations, we just need to expose refcounting of n operations and optimize only 1 op (POP_TOP). Thus, we just need to refactor the IR to expose refcounting (which was the work I divided up among the community).

If you have any more questions, I'm happy to answer them either in public or email.

adrian17 · 2026-03-20T23:02:26 1774047746

I saw your documentation PR, thank you!

I also did some reading and experiments, so quickly talking about things I've found out re: refcount elimination:

Previously given an expression `c = a + b`, the compiler generated a sequence of two LOADs (that increment the inputs' refcounts), then BINARY_OP that adds the inputs and decrements the refcounts afterwards (possibly deallocating the inputs).

But if the optimizer can prove that the inputs definitely will have existing references after the addition finishes (like when `a` and `b` are local variables, or if they are immortals like `a+5`), then the entire incref/decref pair could be ignored. So in the new version, the DECREFs part of the BINARY_OP was split into separate uops, which are then possibly transformed into POP_TOP_NOP by the optimizer.

And I'm assuming that although normally splitting an op this much would usually cost some performance (as the compiler can't optimize them as well anymore), in this case it's usually worth it as the optimization almost always succeeds, and even if it doesn't, the uops are still generated in several variants for various TOS cache (which is basically registers) states so they still often codegen into just 1-2 opcodes on x86.

One thing I don't entirely understand, but that's super specific from my experiment, not sure if it's a bug or special case: I looked at tier2 traces for `for i in lst: (-i) + (-i)`, where `i` is an object of custom int-like class with overloaded methods (to control which optimizations happen). When its __neg__ returns a number, then I see a nice sequence of

_POP_TOP_INT_r32, _r21, _r10.

But when __neg__ returns a new instance of the int-like class, then it emits

_SPILL_OR_RELOAD_r31, _POP_TOP_r10, _SPILL_OR_RELOAD_r01, _POP_TOP_r10, etc.

Is there some specific reason why the "basic" pop is not specialized for TOS cache? Is it because it's the same opcode as in tier1, and it's just not worth it as it's optimized into specialized uops most of the time; or is it that it can't be optimized the same way because of the decref possibly calling user code?

kenjin4096 · 2026-03-18T13:58:32 1773842312

Update: I put up a PR to document the trace recording interpreter https://github.com/python/cpython/pull/146110

flakes · 2026-03-17T21:01:28 1773781288

You’ll probably want to look to the PEPs. Havent dug into this topic myself but looks related https://peps.python.org/pep-0744/

adrian17 · 2026-03-17T21:28:17 1773782897

I think CPython already had tier2 and some tracing infrastructure when the copy-and-patch JIT backend was added; it's the "JIT frontend" that's more obscure to me.

rtpg · 2026-03-18T06:03:22 1773813802

discussions might be happening on the Python forums, which are pretty active.

https://discuss.python.org/t/pep-744-jit-compilation/50756/8... here's one thing

I do think you can also just outright ask questions about it on the forums and you'll get some answers.

At the end of the day there's only so many people working on this though.

saikia81 · 2026-03-17T21:39:26 1773783566

have you read the dev mailing list? There the developers of python discuss lots.

pansa2 · 2026-03-17T22:29:17 1773786557

There isn’t a dev mailing list any more, is there? Do you mean the Discord forum?

sheepscreek · 2026-03-17T21:39:01 1773783541

UPDATE: I misunderstood the question :-/ You can ignore this.

I love playing with compilers for fun, so maybe I can shed some light. I’ll explain it in a simplified way for everyone’s benefit (going to ignore the stack):

When an object is passed between functions in Python, it doesn’t get copied. Instead, a reference to the object’s memory address is sent. This reference acts as a pointer to the object’s data. Think of it like a sticky note with the object’s memory address written on it. Now, imagine throwing away one sticky note every time a function that used a reference returns.

When an object has zero references, it can be freed from memory and reused. Ensuring the number of references, or the “reference count” is always accurate is therefore a big deal. It is often the source of memory leaks, but I wouldn’t attribute it to a speed up (only if it replaces GC, then yes).

yuliyp · 2026-03-17T21:51:49 1773784309

what at all does this comment have to do with what it's replying to?

sheepscreek · 2026-03-17T23:41:04 1773790864

I misread the original comment, thinking it was a question about what is refcount elimination, than how it affects the JIT's performance(?).