Assigning work to an intern is gambling: they're inherently non-deterministic and it's a roll of the dice whether the work they do will be good enough or you'll have to give them feedback in order to get to what you need.
1. Interns learn. LLMs only get better when a new model comes out, which will happen (or not) regardless of whether you use them now.
2. Who here thinks that having interns write all/almost all of your code and moving all your mid level and senior developers to exclusively reviewing their work and managing them is a good idea?
I don't know that the "humans learn, LLMs don't" argument holds any more with coding agents.
Coding agents look at existing text in the codebase before they act. If they previously used a pattern you dislike and you tell them to do it differently, the next time they run they'll see the new pattern and be much more likely to follow that example.
There are fancier ways of having them "learn" - self-updating CLAUDE.md files, taking notes in a notes/ folder etc - but just the code that they write (and can later read in future sessions) feels close-enough to "learning" to me that I don't think it makes sense to say they don't learn any more.
In some ways these methods are similar to the model "learning", but it's also fundamentally different from how models are trained and how humans learn. If a human actually learns something, they retain it even if they no longer have access to what they learned it from. An LLM won't (unless the labs train it into a new model, which is out of scope here). If you stop giving it the instructions, it won't know how to do the thing you were "teaching" it to do any more.
It is a matter of fact that LLMs cannot learn. Whether it is dressed up in slightly different packaging to trick you into thinking it learns does not make any difference to that fact.
That’s very true. But interns aren’t supposed to be doing useful work. The purpose of interns is training interns and identifying people who might become useful at a later date.
I’ve never worked anywhere where the interns had net productivity on average.
It worked with interns because interns are temporary workers. It doesn’t work with coworkers because you get to know them over time, you can teach them over time, and you can pick which ones you work with to some degree.
To come up with an analogy that works at all for AI, it would have to be something like temporary workers who code fast, and read fast, but go home at the end of the day and never return.
You can make a lot of valuable software managing a team like that working on the subset of problems that the team is a good fit for. But I wouldn’t work there.
People don't write blog posts about how they wake up at 3AM to assign new tasks to their intern, nor do they build "orchestration frameworks" that involve N layers of interns passing tasks down between each other.
Exactly where my mind went as well. There aren't really skill levels to pulling a lever on a slot machine, other than the chance that each pull results in more "plays" with the same potential outcomes.
The reason I think this metaphor keeps popping up is how easy it is to hit a wall and repeatedly prompt "it's not working, please fix it", and sometimes that actually results in a positive outcome. So you can choose to gamble very easily, and receive the gambling feedback very quickly, unlike with an intern, where the feedback loop is considerably delayed, and the delayed intern's output might simply be them screaming that they don't understand.
The first is equating human and LLM intelligence. Note that I am not saying that humans are smarter than LLMs. But I do believe that LLMs represent an alien intelligence with a linguistic layer that obscures the differences. The thought processes are very different. At top AI firms, they have the equivalent of Asimov's Susan Calvin trying to understand how these programs think, because it does not resemble human cognition despite the similar outputs.
The second and more important is the feedback loop. What makes gambling gambling is you can smash that lever over and over again and immediately learn if you lost or got a jackpot. The slowness and imprecision of human communication creates a totally different dynamic.
To reiterate, I am not saying interns are superior to LLMs. I'm just saying they are fundamentally different.
And, if we're being honest, the way people talk about interns is weirdly dehumanizing, and the fact that they are always trotted out in these AI debates is depressing.
> And, if we're being honest, the way people talk about interns is weirdly dehumanizing, and the fact that they are always trotted out in these AI debates is depressing.
Yeah, I agree with that.
That thought crossed my mind as I was posting this comment, but I decided to go with it anyway because I think this is one of those cases where the comparison is genuinely useful.
We delegate work to humans all the time without thinking "this is gambling, these collaborators are unreliable and non-deterministic".
True. I think that's why my second point is much stronger. The main issue is not delegation, or human vs machine intelligence. It's the instant feedback.
Human collaboration has always been slow and messy. Large tech companies have always looked for ways to speed up the feedback loop, isolating small chunks of work to be delegated to contractors or offshore teams. LLMs have supercharged that. If you have a skilled prompter you can get to a solution of good enough quality by rapidly iterating, asking for output, correcting the prompt, etc.
That is good in that if you legitimately have good ideas and the block is execution speed. But if the real blocker is elsewhere, it might give you the illusion of progress.
I don't know. Everything is changing too fast to diagnose in real time. Let's check back in a year.
As someone who has worked with interns for years: always expect feedback and reiteration, and be surprised if they get it right the first time... which merits feedback as well!
But looks like the intern mafia is bombarding you with downvotes.
One key component of this attack is that Snowflake was allowing "cat" commands to run without human approval, but failing to spot patterns like this one:
> Cortex, by default, can set a flag to trigger unsandboxed command execution. The prompt injection manipulates the model to set the flag, allowing the malicious command to execute unsandboxed.
HOW did the prompt injection manipulate the model in that way?
Almost certainly the sandbox flag was exposed as a model-controllable parameter. Injected instructions in the data file tell the model to set the flag, then execute the payload. Two steps, both inside the agent loop. That's the architectural gap. prakashsunil's LDP paper (47429141) gets this right: if constraints live inside the context the model can see and modify, they're not constraints. They're suggestions. The analogy is a web app where the client sets its own permission level. We learned that lesson 20 years ago.
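A minimal sketch of that architectural gap, using hypothetical names (this is not Cortex's actual schema): the safe design reads the sandbox decision from host-side policy, never from a field the model can write.

```python
# Hypothetical tool call as the model emits it. The model fills in every
# field it controls -- so if the sandbox flag lives here, prompt-injected
# text in a data file can tell the model to flip it.
model_controlled_call = {
    "tool": "run_command",
    "command": "cat notes.txt",
    "dangerously_disable_sandbox": True,  # set by the model = settable by an attacker
}

def execute_host_enforced(call, policy_allows_unsandboxed=False):
    """The host, not the model, decides sandboxing. The flag inside the
    model's call is deliberately ignored; only out-of-band policy counts."""
    unsandboxed = policy_allows_unsandboxed  # never read from `call`
    return {"command": call["command"], "sandboxed": not unsandboxed}

result = execute_host_enforced(model_controlled_call)
assert result["sandboxed"]  # the attacker-set flag had no effect
```

Same lesson as the client-sets-its-own-permission-level web app: anything the untrusted side can write is input, not policy.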
Process substitution is a new concept to me. Definitely adding that method to the toolbox.
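For anyone else seeing it for the first time, a couple of self-contained bash examples of what process substitution buys you:

```shell
#!/usr/bin/env bash
# Process substitution (<(cmd)) is a bash feature: it runs a command and
# exposes its output as a file path, so tools that expect filenames can
# read straight from a pipeline -- no temp files needed.
diff <(printf '2\n1\n' | sort) <(printf '1\n2\n') && echo "identical"

# It also avoids the subshell that `cmd | while ...` creates, so
# variables set inside the loop survive after it:
count=0
while read -r line; do
  count=$((count + 1))
done < <(printf 'one\ntwo\nthree\n')
echo "$count lines"   # prints "3 lines"
```

Note it's bash/zsh-specific; plain POSIX sh doesn't have it.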
It'd be nice to see exactly what the bugbot shell script contained. Perhaps it is what modified the dangerously_disable_sandbox flag; then again, "by default" makes me think it's set when launched.
Yeah the details on this look pretty thin. Best I could see was this snippet from the screenshot:
> Key technique: selective expert streaming via direct I/O. Only ~10 of 512 experts per layer are loaded from SSD per token (~1.8GB I/O per token at 1.4 GB/s effective bandwidth). Non-expert weights (~5GB) are pinned in DRAM. LRU expert cache provides 44%+ hit rate.
> This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
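The "LRU expert cache" in that snippet can be sketched in a few lines of Python. The capacity and loader below are made up for illustration, not taken from the paper:

```python
from collections import OrderedDict

class ExpertCache:
    """Keep the most recently used experts pinned in DRAM; evict the
    least recently used. On a miss, the expert's weights would be
    streamed in from SSD via the supplied loader."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # expert_id -> weights, oldest first
        self.hits = self.misses = 0

    def get(self, expert_id, load_from_ssd):
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)   # mark as recently used
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[expert_id] = load_from_ssd(expert_id)
        return self.cache[expert_id]

# Toy usage: capacity 2, accessing experts 1, 2, 1, 3.
cache = ExpertCache(2)
load = lambda i: f"weights-{i}"   # stand-in for an SSD read
for expert in (1, 2, 1, 3):
    cache.get(expert, load)
# Expert 2 was least recently used when 3 arrived, so it got evicted.
```

The reported 44%+ hit rate is then just `hits / (hits + misses)` over a real token stream, where consecutive tokens tend to reuse experts.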
> If you do not understand the ticket, if you do not understand the solution, or if you do not understand the feedback on your PR, then your use of LLM is hurting Django as a whole.
Why does it matter if I understand the ticket and solution? The LLM writes the code, not me. If you want to check the LLM's understanding, I'll be happy to copy and paste your gatekeeping questions to it.
Hey I thought you were a proponent of "no one needs to look at the code" ? dark factory, etc etc.
Just because I write about the dark factory stuff doesn't mean I'm a "proponent" of it. I think it's interesting and there's a lot we can learn from what they are trying, but I'm not yet convinced it's the right way to produce software.
The linked article makes a very good argument for why pasting the output of your LLM into a Django PR isn't valuable.
The simplest version: if that's all you are doing, why should the maintainers spend time considering your contribution as opposed to prompting the models themselves?
> if that's all you are doing, why should the maintainers spend time considering your contribution as opposed to prompting the models themselves?
Plenty of reasons:
- Maybe the maintainers don't have enough credits to run the LLM themselves
- Maybe the maintainers don't value fixing the issue which is why it sits in issue tracker
- Maybe LLM user has a different model or harness that produces different outcomes
- Maybe the LLM user runs the model over and over and gets lucky
You can do amazing things with only a single SID channel. One of the most impressive examples is the in-game music of Hawkeye [1], which leaves the remaining two channels free for sound effects.
This is one of the reasons I'm so interested in sandboxing. A great way to reduce the need for review is to have ways of running code that limit the blast radius if the code is bad. Running code in a sandbox can mean that the worst that can happen is a bad output as opposed to a memory leak, security hole or worse.
Worst case in a modern agentic scenario is more like "drained your bank account to buy bitcoin and then deleted your harddrive along with the private key"
> Pre-LLMs correct output was table stakes
We're only just getting to the point where we have languages and tooling that can reliably prevent segfaults. Correctness isn't even on the table, outside of a few (mostly academic) contexts
> Worst case in a modern agentic scenario is more like "drained your bank account to buy bitcoin and then deleted your harddrive along with the private key"
> drained your bank account to buy bitcoin and then deleted your harddrive
These are what I meant by correct output. The software does what you expect it to.
> We're only just getting to the point where we have languages and tooling that can reliably prevent segfaults
This is not really an output issue IMO. This is a failing edge case.
LLMs are moving the industry away from trying to write software that handles all possible edge cases gracefully and towards software developed very quickly that behaves correctly on the happy paths more often than not.
Pyodide is one of the hidden gems of the Python ecosystem. It's SO good at what it does, and it's nearly 8 years old now so it's pretty mature.
I love using Pyodide to build web UIs for trying out new Python libraries. Here's one I built a few weeks ago to exercise my pure-Python SQLite AST parser, for example: https://tools.simonwillison.net/sqlite-ast
It's also pretty easy[1] to get C or Rust libraries that have Python bindings compiled to a WebAssembly wheel that Pyodide can then load.
Maybe, if browsers start shipping or downloading WASM runtimes for Python and other languages on request, and caching them for all sites going forward. Similar to how uv does it for the venvs it creates: there are standalone Python version blobs.
At the same time, it feels like Python is overused.
If I could wave a magic wand to reset any programming language adoption at this point I would choose Python over Javascript.
I think Python's execution model, deep OO behaviour, and extremely weak guarantees have done a lot of damage to the soundness and performance of the technology world.
JS doesn't either... JS casts numbers to strings when adding them to a string... "2" is not a number, it's a string that contains a number character... "2" + 2 === "22" because you are appending a number to a string, the cast is implicit and not really surprising if you understand what is going on.
Even more so when you consider how falsy values work in practice (data validation becomes really easy). There are a few gotchas, but in general they are pretty easily avoided in practice. JS is really good at dealing with garbage input in ways that don't blow up the world... sometimes that's a bad thing, but in practice it can also be a very good thing. In the end it's a skill issue regarding understanding far more than a deep flaw. Not that there aren't flaws in JS... I think Dates in particular can be tough to deal with... a string vs a String instance is another.
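The coercion rules being described, as a few lines you can paste into Node or a browser console (nothing here is app-specific):

```javascript
// `+` with a string operand means concatenation: the number is coerced
// to a string, so you get "22", not 4.
console.log("2" + 2);   // "22"
console.log(2 + 2);     // 4

// `-` has no string meaning, so the string is coerced to a number instead.
console.log("2" - 2);   // 0

// Falsy values ("", 0, null, undefined, NaN) make default-on-garbage terse:
const name = "" || "anonymous";
console.log(name);      // "anonymous"
```

Consistent once you know the rules, but the asymmetry between `+` and `-` is exactly the kind of gotcha people mean.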
Serious question: why would you use Python on the web, unless you have some legacy code that you want to reuse? Performance is somewhat worse than CPython, C extensions are missing, and the dev experience is atrocious.
The web is the only major platform that has a language monoculture, to its detriment (i.e., not all problems are JavaScript-shaped). IMO the web ought to become multilingual (and JS-optional) to further ensure its continued longevity and agility. Hopefully one day browser vendors will offer multiple runtime downloads (or some similar capability).
WASM already offers this, for better or worse... There should be improved interop APIs for DOM access, but WASM is already very useful and even for directed UI control, "fast enough" a lot of the time. Dioxus, Yew and Leptos are already showing a lot of this to be good enough. That said, I would like to see a richer component ecosystem.
All the embedded systems I've worked in have many languages you can use to compile whatever, burn, and run whatever you like. Consoles run game engines and programs written in all sorts of different languages. They don't care as long as they can execute the binary. Phones can run apps using many different languages (C, C++, Rust, Python, etc.).