I'm sure there are plenty of examples like this, but one thing I find really hard is integrating such tools into an existing codebase -- you can make all these things as standalone pages, but as a professional developer you have standards and conventions, and it often takes so much work to review and revise the code to fit the existing codebase that you end up using inline completion just for the obvious stuff and boilerplate. I would rather spend 20% more time writing the code myself and have confidence in it than spend that time tweaking the prompt or giving follow-up instructions.
With a sufficiently advanced type system, you could lay out high level building blocks, such as function definitions, and let the LLM go wild, and then if things compile there's a high chance it's correct. Or maybe even a formal proof things are correct.
I was blown away when I realized some Haskell functions have only one possible definition, for example. I think most people haven't worked with type systems like this, and there are type systems far more powerful than Haskell's, such as dependent types.
There's not much reason to worry about low level quality standards so long as you know it's correct from a high level. I don't think we've seen what a deep integration between a LLM and a programming language can do, where the type system helps validate the LLM output, and the LLM has a type checker integrated into its training process.
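The "only one possible definition" point is parametricity: a fully generic signature leaves the implementation almost no freedom. A rough illustration using Python type hints (Python won't enforce this the way Haskell does, and the function names here are mine):

```python
from typing import TypeVar

T = TypeVar("T")

def identity(x: T) -> T:
    # A total function of type T -> T knows nothing about T,
    # so the only thing it can do is hand its argument back.
    return x

def first(pair: tuple[T, T]) -> T:
    # The type alone cannot pin this one down: returning pair[1]
    # would type-check too, so a checker can't catch that bug.
    return pair[0]
```

The second function is the kind of case where, once more than one implementation inhabits the type, the compiler alone can no longer vouch for correctness.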
if there is only one valid translation of the type constraints into executable code then what you have is a slow, expensive, and occasionally argumentative compiler
it merely remains to build a debugger for your Turing-complete type system, and the toolchain will be ready for production
> Is this a brother or a cousin of the "sufficiently advanced compiler"? :-)
I believe the claim was that a sufficiently advanced compiler could do a lot of optimization and greatly improve performance. Maybe my claim here will turn out the same.
Because we're using languages that are flexible but offer no correctness guarantees; the vast performance advantage AI programming has over us could be used to manage the formal proofs for a verified toolchain.
We're not quite there yet, but while regular programming is quite tough for AI due to how fuzzy it is, formal proofs are something AI is already very good at.
The functions for which there's only one implementation are trivial examples. It's not going to work for anything even slightly more complicated, like a function that returns a float.
Even if you could, you probably wouldn't want to make any change a breaking change by exposing implementation details.
* built a codegen for Idris2 and a Rust RT (a parallel stack "typed" VM)
* a full application in Elm, while asking it to borrow from DT to have it "correct-by-construction", use zippers for some data structures… etc. And it worked!
* Whilst at it, I built Elm but in Idris2, while improving on the rendering part (this is WIP)
* data collators and iterators to handle some ML trainings with pausing features so that I can just Ctrl-C and continue if needed/possible/makes sense.
* etc.
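As a concrete illustration of the pausable-iterator bullet above, here is a minimal sketch of how such a resumable iterator might look (the names and design are illustrative, not the actual code):

```python
import itertools

class ResumableIterator:
    """Wraps an iterable and remembers how many items were consumed,
    so a training run interrupted with Ctrl-C can resume from a checkpoint."""

    def __init__(self, data, start=0):
        self.position = start
        self._data = data

    def __iter__(self):
        # Skip everything already consumed, then checkpoint as we go.
        for item in itertools.islice(self._data, self.position, None):
            self.position += 1
            yield item
```

On KeyboardInterrupt you would persist `position` somewhere and later reconstruct the iterator with `start=position`.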
In the end I had to completely rewrite some parts, but I would say 90% of the boring work was done correctly, and I only had to focus on the interesting bits.
However, it didn't deliver the kind of thorough prep work a painter would do before painting a house. It simply did exactly what I asked: it painted, and no more.
I keep seeing folks who say they’ve built a “full application” using or deeply collaborating with an LLM but, aside from code that is only purported to be LLM-generated, I’ve yet to see any evidence that I can consider non-trivial. Show me the chat sessions that produced these “full applications”.
An application powered by a single source file comprising only 400 lines of code is, by my definition, trivial. My needs are more complex than that, and I'd expect that the vast majority of folks trying to build or maintain production-quality, revenue-generating code have the same or similar needs.
Again, please don’t be offended; what you’re doing is great and I dearly appreciate you sharing your experience! Just be aware that the stuff you’re demonstrating isn’t (hasn’t been, for me at least) capable of producing the kind of complexity I need while using the languages and tooling required in my environment. In other words, while everything of yours I’ve seen has intellectual and perhaps even monetary value, that doesn’t mean your examples or strategies work for all use-cases.
LLMs are restricted by their output limits. Most LLMs can output 4096 tokens - the new Claude 3.5 Sonnet release this week brings that up to 8192.
As such, for one-shot apps like these there's a strict limit to how much you can get done purely through prompting in a single session.
I work on plenty of larger projects with lots of LLM assistance, but for those I'm using LLMs to write individual functions or classes or templates - not for larger chunks of functionality.
> As such, for one-shot apps like these there's a strict limit to how much you can get done purely through prompting in a single session.
That’s an important detail that is (intentionally?) overlooked by the marketing of these tools. With a human collaborator, I don’t have to worry much about keeping collab sessions short—and humans are dramatically better at remembering the context of our previous sessions.
> I work on plenty of larger projects with lots of LLM assistance, but for those I'm using LLMs to write individual functions or classes or templates - not for larger chunks of functionality.
Good to know. For the larger projects where you use the models as an assistant only, do the models “know” about the rest of the project’s code/design through some sort of RAG or do you just ask a model to write a given function and then manually (or through continued prompting in a given session) modify the resulting code to fit correctly within the project?
There are systems that can do RAG for you - GitHub Copilot and Cursor for example - but I mostly just paste exactly what I want the model to know into a prompt.
In my experience, most effective LLM usage comes down to carefully designing the contents of the context.
My experience with Copilot (which is admittedly a few months outdated; I never tried Cursor but will soon) shows that it’s really good at inline completion and producing boilerplate for me but pretty bad at understanding or even recognizing the existence of scaffolding and business logic already present in my projects.
> but I mostly just paste exactly what I want the model to know into a prompt.
Does this include the work you do on your larger projects? Do those larger projects fit entirely within the context window? If not, without RAG, how do you effectively prompt a model to recognize or know about the relevant context of larger projects?
For example, say I have a class file that includes dozens of imports from other parts of the project. If I ask the model to add a method that should rely upon other components of the project, how does the model know what’s important without RAG? Do I just enumerate every possible relevant import and include a summary of their purpose? That seems excessively burdensome given the purported capabilities of these models. It also seems unlikely to result in reasonable code unless I explicitly document each callable method’s signature and purpose.
For what it’s worth, I know I’ve been pretty skeptical during our conversations but I really appreciate your feedback and the work you’ve been doing; it’s helping me recognize both the limitations of my own knowledge and the limitations of what I should reasonably expect from the models. Thank you, again.
Yes, I paste stuff in from larger projects all the time.
I'm very selective about what I give them. For example, if I'm working on a Django project I'll paste in just the Django ORM models for the part of the codebase I'm working on - that's enough for it to spit out forms and views and templates, it doesn't need to know about other parts of the codebase.
Another trick I sometimes use is Claude Projects, which allow you to paste up to 200,000 tokens into persistent context for a model. That's enough to fit a LOT of code, so I've occasionally dumped my entire codebase (using my https://github.com/simonw/files-to-prompt/ tool) in there, or selected pieces that are important like the model and URL definitions.
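The pattern behind a files-to-prompt style dump is simple to sketch (this is an illustration of the idea, not the actual tool):

```python
from pathlib import Path

def files_to_prompt(root, extensions=(".py",)):
    """Concatenate matching files under `root` into one prompt string,
    with a path header before each file so the model knows what's what."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            chunks.append(f"{path}\n---\n{path.read_text()}\n---")
    return "\n\n".join(chunks)
```

The output of something like this is what gets pasted into a Claude Project's persistent context, subject to the token budget.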
That's the approach I'm taking with https://www.flyde.dev.
Build that higher-level abstraction that allows LLMs to shine and us humans to enjoy a more effective way to view and control their output.
I know it's not what you meant, but tbh I don't think going deeper than Haskell on type-system power is the route to achieve mass adoption.
> With a sufficiently advanced type system, you could lay out high level building blocks, such as function definitions, and let the LLM go wild, and then if things compile there's a high chance it's correct. Or maybe even a formal proof things are correct.
A formal proof is simply a type that allows only the correct implementation at the term level. And specifying this takes ages, as in hundreds of times longer than just writing it out. So you want to use an LLM to save this 1% of time after you've done all the hard work of specifying what the only correct implementation is?
I'm not sure how much things have changed, but I tried to use GPT-4 when it first came out to build a library on top of Scala + Shapeless and it utterly failed. Unless you can somehow wire in an LSP per language as an agent and have it work through the type errors as it tries to create code, I can't see how we'll ever get to a place where LLMs can work with strongly typed languages and produce compliant code.
Even with the aforementioned "have an LSP agent work through type errors", it may be faster to just do it yourself than to wait for an LLM to spit out something that may or may not be correct.
Oh, I’ve found Claude 3.5 to be better but still pointless. To be specific, it generates code that does what I ask (roughly) but…
- Has obvious bugs, many at runtime
- Has subtle bugs.
- Is inefficient.
All of which it will generally fix when asked. But how useful is it if I need to know all the problems with its code beforehand? Then it responds with the same-ish wrong answer the next time.
Claude 3.5 is good; however, its "type comprehension" is really basic. That was my experience with it when using Rust. It still can't build an internal mental model of the types and how to link them together. It'll also, at some points, start to heavily hallucinate functions and stuff.
Glad to know I’m not the only one who has had difficulties getting reasonable Scala code from an LLM. While I agree that LSP integration would likely help with generating or debugging production-quality code, I’d also suggest that the need for an LSP demonstrates these models’ inability to reason—at least about code. Perhaps they were trained more deeply and carefully on the marketing or “getting started” pages for the libraries and languages I need to use and not as much on examples of production-quality code repositories that make use of those languages/libs/tools.
I find it helps if you treat the code generated as a third-party package with a well defined API. Then your role becomes gluing things together.
It’s an approach similar to how I’ve dealt with junior devs in the past. You specify an interface for a class, provide examples as a spec, and you get what you want without colliding with the main project.
For sanity’s sake, I keep these AI generated modules in single files just so it’s an easy copy and paste into ChatGPT.
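One way to make that "well defined API" concrete is to pin the generated module to an interface. A hedged sketch in Python, with a made-up `Summarizer` contract (the names are mine, purely illustrative):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Summarizer(Protocol):
    """The contract the generated module must satisfy; this spec plus
    a couple of example calls is what actually goes into the prompt."""

    def summarize(self, text: str, max_words: int) -> str: ...

# In this scheme the LLM-generated module lives in one file and only
# has to structurally match the Protocol to be glued in unchanged.
class NaiveSummarizer:
    def summarize(self, text: str, max_words: int) -> str:
        return " ".join(text.split()[:max_words])
```

Anything that satisfies the contract can be swapped in, which is what makes the generated module feel like a third-party package.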
> It’s an approach similar to how I’ve dealt with junior devs in the past.
Your experience with and approach to juniors is different from my own. I don’t ask juniors to write a single class file; I give them a design document, an API spec, and documentation for standard practices, then I work as closely with them as needed to get the results I need and the experience they need. This approach works well for me and the vast majority of juniors with whom I’ve worked because we can pretty quickly identify gaps in their knowledge so we can then provide experience and education that benefits both parties. The same approach has failed miserably for me when pairing with an LLM for anything other than trivial code-generation tasks. The models don’t learn (sure, some providers offer “memory” but the limits of those features are pretty obvious once you try to use ‘em in practice for anything other than “don’t forget that I like tacos, the color blue, and sci-fi”).
> For sanity’s sake, I keep these AI generated modules in single files just so it’s an easy copy and paste into ChatGPT.
That’s not acceptable for production-quality code—at least not in my environment.
I ask juniors to do just as you do. There's a contract specified in the API spec and relevant documents. In kind, I build up an "AI" package one module at a time by specifying contracts for each module. The principle remains the same but at a smaller scale because, as you say, LLMs don't learn.
I know the organizational style doesn't fit a typical "production" set up, but the reality is the code produced is very good. I only set it up this way so I guarantee I can continue iterating on a module without too much pain.
Also, who cares if I have way more files if I'm still building features for my customers?
With something like the AI assistant in Zed you’d generally provide a few files the assistant can use as a reference. I’ve had good luck in having it follow the codebase’s style and standards this way.
+1 to the completions/inferences in Zed. It's the first editor that I feel (mostly) confident about just tabbing-through the AI completions with minimal prompting/re-editing.
I tried it yesterday and wasn't successful. I spent like 30 minutes trying to explain a simple change to it. Every time, it made that change plus several others I didn't ask for. When I asked it to undo those other changes, it either undid everything or did other unrelated things.
It works well until it doesn't.
It's definitely a useful tool and I'll continue to learn to use it. However, it is absolutely stupid at times. I feel there's a very high bar to using it, much higher than for traditional IDEs.
Which LLM were you using? I’ve had a great experience with Aider and Claude Sonnet 3.5 (which is not coincidentally at the top of the Aider leaderboard).
I've been using Claude dev VSCode extension (which just got renamed but I forget the new name), I think it's similar to Aider except that it works via a gui.
I do find it very useful, but I agree that one of the main issues is preventing it from making unnecessary changes. For example, this morning I asked it to help me fix a single specific type error, and it did so (on the third attempt, but to be fair it was a tricky error). However, it persistently deleted all of the comments, including the standard licensing info and explanation at the top of the file, even when I end my instructions with "DO NOT DELETE MY COMMENTS!!".
excerpt:
"""
Act as an expert software developer.
Always use best practices when coding.
Respect and use existing conventions, libraries, etc that are already present in the code base.
{lazy_prompt}
Take requests for changes to the supplied code.
If the request is ambiguous, ask questions.
Always reply to the user in the same language they are using.
Once you understand the request you MUST:
""" ... etc...
They use this thing called repo map[1]. I only used it for personal projects and it’s been great. You need to add the files you care about yourself, it’ll do its best and add additional files from the repo map if needed.
Since it's git based, it makes it very easy to keep track of the LLM's output. The agent is really well done too. I like to skip auto-commit so I can "git reset --hard HEAD^1" if needed, but aider has a built-in "undo" command too.
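The repo map idea (hand the model an outline of names instead of full file bodies) can be sketched in a few lines; this is an illustration of the concept, not aider's actual implementation:

```python
import ast
from pathlib import Path

def repo_map(root):
    """Return a compact outline of every Python file under `root`:
    just class and function names, not bodies."""
    outline = {}
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        names = [
            node.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        outline[str(path)] = names
    return outline
```

An outline like this is cheap enough in tokens that the model can "see" the whole repo and then ask for (or be given) full files only where needed.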
Can you work in a repo with fifty million files? Can Git? I just checked on my Windows machine using Everything and there are 15,960,619 files total including every executable, image, datafile, &c.
Out of curiosity what does your IDE do when you do a global symbol rename in a repository with fifty million files?
I'm absolutely a real human, and I think this just might be too much context for me! Perhaps I am not general enough.
I thought this was common knowledge but I guess not: Google's monorepo famously has over a billion files. No, Git cannot handle it. Their whole software stack is developed around this from the ground up. But they are one of the largest software employers in the world, so quite a few engineers evidently do make do with 20x more than 50 million files.
Having worked on a codebase like that, you need to use some extra plugins to get git to work. And even then, it’s very slow. Like 15-30 seconds for a git status to run even with caching. Global renames with an IDE are impossible but tools like sed and grep still work well. Usually there is a module system that maps to the org structure and you don’t venture outside of your modules or dependency modules very often.
I made a tool that allows you to use LLMs on large codebases. You can select which files are relevant and embed them into the prompt: https://prompt.16x.engineer/
Based on my personal experience it works well as long as each file is not too long.
I believe they can as long as you're able to identify a contained task that touches no more than a handful of files. Still very useful to automate some tedious work or refactoring if you ask me.
Or what "yes" looks like to you? It can do all the work itself, for a 50m-file monorepo, without a human guiding it which files to look at?
If that were true, human programmers would be considered obsolete today. There would be exactly zero human programmers making any money in 2025.
It doesn't take the whole repo as context, it tries to guess which files to look at. So if you prompt with that in mind, it works well. Haven't tried it on a very large codebase though. You can also explicitly add files, if you know where work should be done.
You can make it work. Just think through the many approaches and you'll see there are actually quite a few viable ways to work around pseudo-infinite context.
You are right, except on the part about tweaking the prompt to get your desired code styling.
The easier way to integrate into an existing code base is just to refactor the code yourself. The AI gives you a working version, you refactor, and you move on. For me this has been a huge productivity boost over writing everything from scratch.
Drag in a couple files and say "make a class that does X but use this format". I absolutely don't rely on it for lots of things, but it's absolutely capable of working with existing code. Claude is far better than OpenAI when dealing with sets of existing files. I also like that it's outside of my IDE so I make the final changes. LLMs love to just write tokens so I keep a fairly short leash.
This isn't for professional software developers. This is for managers, so they can shit out a one week test and then feel superior enough to pay software engineers less for the "easy" job they do.
I haven't spent enough time with cursor specifically, but with other similar coding assistants, adding context can take some time. And often, even if it does save time, the action of "adding context" itself gets tiring and tedious so that you don't want to bother but instead just write the code yourself. It's about mental affordability.