I'm sure there are plenty of examples like this, but one thing I find really hard is integrating such tools into an existing codebase -- you can make all these things as standalone pages, but as a professional developer you have standards and conventions, and it often takes so much work to review and revise the code to fit the existing codebase that you end up using inline completion just for the obvious stuff and boilerplate. I would rather spend 20% more time writing the code myself and have confidence in it than spend that time tweaking the prompt or giving follow-up instructions.
With a sufficiently advanced type system, you could lay out high level building blocks, such as function definitions, and let the LLM go wild, and then if things compile there's a high chance it's correct. Or maybe even a formal proof things are correct.
I was blown away when I realized some Haskell functions have only one possible definition, for example. I think most people haven't worked with type systems like this, and there are type systems far more powerful than Haskell's, such as dependent types.
There's not much reason to worry about low level quality standards so long as you know it's correct from a high level. I don't think we've seen what a deep integration between a LLM and a programming language can do, where the type system helps validate the LLM output, and the LLM has a type checker integrated into its training process.
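The "only one possible definition" point is parametricity: a fully generic signature leaves the implementation almost no freedom. A rough illustration using Python type hints (Python won't enforce this the way Haskell does, and the function names here are mine):

```python
from typing import TypeVar

T = TypeVar("T")

def identity(x: T) -> T:
    # A total function of type T -> T knows nothing about T,
    # so the only thing it can do is hand its argument back.
    return x

def first(pair: tuple[T, T]) -> T:
    # The type alone cannot pin this one down: returning pair[1]
    # would type-check too, so a checker can't catch that bug.
    return pair[0]
```

The second function is the kind of case where, once more than one implementation inhabits the type, the compiler alone can no longer vouch for correctness.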
if there is only one valid translation of the type constraints into executable code then what you have is a slow, expensive, and occasionally argumentative compiler
it merely remains to build a debugger for your Turing-complete type system, and the toolchain will be ready for production
> Is this a brother or a cousin of the "sufficiently advanced compiler"? :-)
I believe the claim was that a sufficiently advanced compiler could do a lot of optimization and greatly improve performance. Maybe my claim here will turn out the same.
Because we're using languages that are flexible but offer no correctness guarantees; the vast performance advantage AI programming has over us could be used to manage the formal proofs for a verified toolchain.
We're not quite there yet, but while regular programming is quite tough for AI due to how fuzzy it is, formal proofs are something AI is already very good at.
The functions for which there's only one implementation are trivial examples. It's not going to work for anything even slightly more complicated, like a function that returns a float.
Even if you could, you probably wouldn't want to make any change a breaking change by exposing implementation details.
* built a codegen for Idris2 and a Rust RT (a parallel stack "typed" VM)
* a full application in Elm, while asking it to borrow from DT to have it "correct-by-construction", use zippers for some data structures… etc. And it worked!
* Whilst at it, I built Elm but in Idris2, while improving on the rendering part (this is WIP)
* data collators and iterators to handle some ML trainings with pausing features so that I can just Ctrl-C and continue if needed/possible/makes sense.
* etc.
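As a concrete illustration of the pausable-iterator bullet above, here is a minimal sketch of how such a resumable iterator might look (the names and design are illustrative, not the actual code):

```python
import itertools

class ResumableIterator:
    """Wraps an iterable and remembers how many items were consumed,
    so a training run interrupted with Ctrl-C can resume from a checkpoint."""

    def __init__(self, data, start=0):
        self.position = start
        self._data = data

    def __iter__(self):
        # Skip everything already consumed, then checkpoint as we go.
        for item in itertools.islice(self._data, self.position, None):
            self.position += 1
            yield item
```

On KeyboardInterrupt you would persist `position` somewhere and later reconstruct the iterator with `start=position`.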
In the end I had to completely rewrite some parts, but I would say 90% of the boring work was done correctly, and I only had to focus on the interesting bits.
However, it didn't deliver the kind of thorough prep work a painter would do before painting a house. It simply did exactly what I asked: it painted, and no more.
I keep seeing folks who say they’ve built a “full application” using or deeply collaborating with an LLM but, aside from code that is only purported to be LLM-generated, I’ve yet to see any evidence that I can consider non-trivial. Show me the chat sessions that produced these “full applications”.
An application powered by a single source file comprising only 400 lines of code is, by my definition, trivial. My needs are more complex than that, and I'd expect that the vast majority of folks trying to build or maintain production-quality, revenue-generating code have the same or similar needs.
Again, please don’t be offended; what you’re doing is great and I dearly appreciate you sharing your experience! Just be aware that the stuff you’re demonstrating isn’t (hasn’t been, for me at least) capable of producing the kind of complexity I need while using the languages and tooling required in my environment. In other words, while everything of yours I’ve seen has intellectual and perhaps even monetary value, that doesn’t mean your examples or strategies work for all use-cases.
LLMs are restricted by their output limits. Most LLMs can output 4096 tokens - the new Claude 3.5 Sonnet release this week brings that up to 8192.
As such, for one-shot apps like these there's a strict limit to how much you can get done purely through prompting in a single session.
I work on plenty of larger projects with lots of LLM assistance, but for those I'm using LLMs to write individual functions or classes or templates - not for larger chunks of functionality.
> As such, for one-shot apps like these there's a strict limit to how much you can get done purely through prompting in a single session.
That’s an important detail that is (intentionally?) overlooked by the marketing of these tools. With a human collaborator, I don’t have to worry much about keeping collab sessions short—and humans are dramatically better at remembering the context of our previous sessions.
> I work on plenty of larger projects with lots of LLM assistance, but for those I'm using LLMs to write individual functions or classes or templates - not for larger chunks of functionality.
Good to know. For the larger projects where you use the models as an assistant only, do the models “know” about the rest of the project’s code/design through some sort of RAG or do you just ask a model to write a given function and then manually (or through continued prompting in a given session) modify the resulting code to fit correctly within the project?
There are systems that can do RAG for you - GitHub Copilot and Cursor for example - but I mostly just paste exactly what I want the model to know into a prompt.
In my experience, most effective LLM usage comes down to carefully designing the contents of the context.
My experience with Copilot (which is admittedly a few months outdated; I never tried Cursor but will soon) shows that it’s really good at inline completion and producing boilerplate for me but pretty bad at understanding or even recognizing the existence of scaffolding and business logic already present in my projects.
> but I mostly just paste exactly what I want the model to know into a prompt.
Does this include the work you do on your larger projects? Do those larger projects fit entirely within the context window? If not, without RAG, how do you effectively prompt a model to recognize or know about the relevant context of larger projects?
For example, say I have a class file that includes dozens of imports from other parts of the project. If I ask the model to add a method that should rely upon other components of the project, how does the model know what’s important without RAG? Do I just enumerate every possible relevant import and include a summary of their purpose? That seems excessively burdensome given the purported capabilities of these models. It also seems unlikely to result in reasonable code unless I explicitly document each callable method’s signature and purpose.
For what it’s worth, I know I’ve been pretty skeptical during our conversations but I really appreciate your feedback and the work you’ve been doing; it’s helping me recognize both the limitations of my own knowledge and the limitations of what I should reasonably expect from the models. Thank you, again.
Yes, I paste stuff in from larger projects all the time.
I'm very selective about what I give them. For example, if I'm working on a Django project I'll paste in just the Django ORM models for the part of the codebase I'm working on - that's enough for it to spit out forms and views and templates, it doesn't need to know about other parts of the codebase.
Another trick I sometimes use is Claude Projects, which allow you to paste up to 200,000 tokens into persistent context for a model. That's enough to fit a LOT of code, so I've occasionally dumped my entire codebase (using my https://github.com/simonw/files-to-prompt/ tool) in there, or selected pieces that are important like the model and URL definitions.
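The pattern behind a files-to-prompt style dump is simple to sketch (this is an illustration of the idea, not the actual tool):

```python
from pathlib import Path

def files_to_prompt(root, extensions=(".py",)):
    """Concatenate matching files under `root` into one prompt string,
    with a path header before each file so the model knows what's what."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            chunks.append(f"{path}\n---\n{path.read_text()}\n---")
    return "\n\n".join(chunks)
```

The output of something like this is what gets pasted into a Claude Project's persistent context, subject to the token budget.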
That's the approach I'm taking with https://www.flyde.dev.
Build that higher-level abstraction that allows LLMs to shine and us humans to enjoy a more effective way to view and control their output.
I know it's not what you meant, but tbh I don't think going deeper than Haskell on type-system power is the route to achieve mass adoption.
> With a sufficiently advanced type system, you could lay out high level building blocks, such as function definitions, and let the LLM go wild, and then if things compile there's a high chance it's correct. Or maybe even a formal proof things are correct.
A formal proof is simply a type that allows only the correct implementation at the term level. And specifying this takes ages, as in hundreds of times longer than just writing it out. So you want to use an LLM to save this 1% of time after you've done all the hard work of specifying what the only correct implementation is?
I'm not sure how much things have changed, but I tried to use GPT-4 when it first came out to build a library on top of Scala + Shapeless and it utterly failed. Unless you can somehow wire in an LSP per language as an agent and have it work through the type errors as it tries to create code, I can't see how we'll ever get to a place where LLMs can work with strongly typed languages and produce compliant code.
Even with the aforementioned "have an LSP agent work through type errors", it may be faster to just do it yourself than to wait for an LLM to spit out something that may or may not be correct.
Oh, I’ve found Claude 3.5 to be better but still pointless. To be specific, it generates code that does what I ask (roughly) but…
- Has obvious bugs, many at runtime
- Has subtle bugs.
- Is inefficient.
All of which it will generally fix when asked. But how useful is it if I need to know all the problems with its code beforehand? Then it responds with the same-ish wrong answer the next time.
Claude 3.5 is good; however, its "type comprehension" is really basic. That was my experience with it when using Rust. It still can't build an internal mental model of the types and how to link them together. It'll also, at some points, start to heavily hallucinate functions and stuff.
Glad to know I’m not the only one who has had difficulties getting reasonable Scala code from an LLM. While I agree that LSP integration would likely help with generating or debugging production-quality code, I’d also suggest that the need for an LSP demonstrates these models’ inability to reason—at least about code. Perhaps they were trained more deeply and carefully on the marketing or “getting started” pages for the libraries and languages I need to use and not as much on examples of production-quality code repositories that make use of those languages/libs/tools.
I find it helps if you treat the code generated as a third-party package with a well defined API. Then your role becomes gluing things together.
It’s an approach similar to how I’ve dealt with junior devs in the past. You specify an interface for a class, provide examples as a spec, and you get what you want without colliding with the main project.
For sanity’s sake, I keep these AI generated modules in single files just so it’s an easy copy and paste into ChatGPT.
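One way to make that "well defined API" concrete is to pin the generated module to an interface. A hedged sketch in Python, with a made-up `Summarizer` contract (the names are mine, purely illustrative):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Summarizer(Protocol):
    """The contract the generated module must satisfy; this spec plus
    a couple of example calls is what actually goes into the prompt."""

    def summarize(self, text: str, max_words: int) -> str: ...

# In this scheme the LLM-generated module lives in one file and only
# has to structurally match the Protocol to be glued in unchanged.
class NaiveSummarizer:
    def summarize(self, text: str, max_words: int) -> str:
        return " ".join(text.split()[:max_words])
```

Anything that satisfies the contract can be swapped in, which is what makes the generated module feel like a third-party package.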
> It’s an approach similar to how I’ve dealt with junior devs in the past.
Your experience with and approach to juniors is different from my own. I don’t ask juniors to write a single class file; I give them a design document, an API spec, and documentation for standard practices, then I work as closely with them as needed to get the results I need and the experience they need. This approach works well for me and the vast majority of juniors with whom I’ve worked because we can pretty quickly identify gaps in their knowledge so we can then provide experience and education that benefits both parties. The same approach has failed miserably for me when pairing with an LLM for anything other than trivial code-generation tasks. The models don’t learn (sure, some providers offer “memory” but the limits of those features are pretty obvious once you try to use ‘em in practice for anything other than “don’t forget that I like tacos, the color blue, and sci-fi”).
> For sanity’s sake, I keep these AI generated modules in single files just so it’s an easy copy and paste into ChatGPT.
That’s not acceptable for production-quality code—at least not in my environment.
I ask juniors to do just as you do. There's a contract specified in the API spec and relevant documents. In kind, I build up an "AI" package one module at a time by specifying contracts for each module. The principle remains the same but at a smaller scale because, as you say, LLMs don't learn.
I know the organizational style doesn't fit a typical "production" set up, but the reality is the code produced is very good. I only set it up this way so I guarantee I can continue iterating on a module without too much pain.
Also, who cares if I have way more files if I'm still building features for my customers?
With something like the AI assistant in Zed you’d generally provide a few files the assistant can use as a reference. I’ve had good luck in having it follow the codebase’s style and standards this way.
+1 to the completions/inferences in Zed. It's the first editor that I feel (mostly) confident about just tabbing-through the AI completions with minimal prompting/re-editing.
I tried it yesterday and wasn't successful. I spent like 30 minutes trying to explain a simple change to it. Every time, it made that change plus several others I didn't ask for. When I asked it to undo those other changes, it either undid everything or did other unrelated things.
It works well until it doesn't.
It's definitely a useful tool and I'll continue to learn to use it. However, it is absolutely stupid at times. I feel there's a very high bar to using it, much higher than for traditional IDEs.
Which LLM were you using? I’ve had a great experience with Aider and Claude Sonnet 3.5 (which is not coincidentally at the top of the Aider leaderboard).
I've been using Claude dev VSCode extension (which just got renamed but I forget the new name), I think it's similar to Aider except that it works via a gui.
I do find it very useful, but I agree that one of the main issues is preventing it from making unnecessary changes. For example, this morning I asked it to help me fix a single specific type error, and it did so (on the third attempt, but to be fair it was a tricky error). However, it persistently deleted all of the comments, including the standard licensing info and explanation at the top of the file, even when I end my instructions with "DO NOT DELETE MY COMMENTS!!".
excerpt:
"""
Act as an expert software developer.
Always use best practices when coding.
Respect and use existing conventions, libraries, etc that are already present in the code base.
{lazy_prompt}
Take requests for changes to the supplied code.
If the request is ambiguous, ask questions.
Always reply to the user in the same language they are using.
Once you understand the request you MUST:
""" ... etc...
They use this thing called repo map[1]. I only used it for personal projects and it’s been great. You need to add the files you care about yourself, it’ll do its best and add additional files from the repo map if needed.
Since it's git based, it makes it very easy to keep track of the LLM's output. The agent is really well done too. I like to skip auto-commit so I can "git reset --hard HEAD^1" if needed, but aider has a built-in "undo" command too.
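The repo map idea (hand the model an outline of names instead of full file bodies) can be sketched in a few lines; this is an illustration of the concept, not aider's actual implementation:

```python
import ast
from pathlib import Path

def repo_map(root):
    """Return a compact outline of every Python file under `root`:
    just class and function names, not bodies."""
    outline = {}
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        names = [
            node.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        outline[str(path)] = names
    return outline
```

An outline like this is cheap enough in tokens that the model can "see" the whole repo and then ask for (or be given) full files only where needed.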
Can you work in a repo with fifty million files? Can Git? I just checked on my Windows machine using Everything and there are 15,960,619 files total including every executable, image, datafile, &c.
Out of curiosity what does your IDE do when you do a global symbol rename in a repository with fifty million files?
I'm absolutely a real human, and I think this just might be too much context for me! Perhaps I am not general enough.
I thought this was common knowledge but I guess not: Google's monorepo famously has over a billion files. No, Git cannot handle it. Their whole software stack is developed around this from the ground up. But they are one of the largest software employers in the world, so quite a few engineers evidently do make do with 20x more than 50 million files.
Having worked on a codebase like that, you need to use some extra plugins to get git to work. And even then, it’s very slow. Like 15-30 seconds for a git status to run even with caching. Global renames with an IDE are impossible but tools like sed and grep still work well. Usually there is a module system that maps to the org structure and you don’t venture outside of your modules or dependency modules very often.
I made a tool that allows you to use LLMs on large codebases. You can select which files are relevant and embed them into the prompt: https://prompt.16x.engineer/
Based on my personal experience it works well as long as each file is not too long.
I believe they can as long as you're able to identify a contained task that touches no more than a handful of files. Still very useful to automate some tedious work or refactoring if you ask me.
Or what "yes" looks like to you? It can do all the work itself, for a 50m-file monorepo, without a human guiding it which files to look at?
If that were true, human programmers would be considered obsolete today. There would be exactly zero human programmers making any money in 2025.
It doesn't take the whole repo as context, it tries to guess which files to look at. So if you prompt with that in mind, it works well. Haven't tried it on a very large codebase though. You can also explicitly add files, if you know where work should be done.
You can make it work. Just think through the many approaches and you'll see there are actually quite a few viable ways to work around pseudo-infinite context.
You are right, except on the part about tweaking the prompt to get your desired code styling.
The easier way to integrate into an existing code base is just to refactor the code yourself. The AI gives you a working version, you refactor, and you move on. For me this has been a huge productivity boost over writing everything from scratch.
Drag in a couple files and say "make a class that does X but use this format". I absolutely don't rely on it for lots of things, but it's absolutely capable of working with existing code. Claude is far better than OpenAI when dealing with sets of existing files. I also like that it's outside of my IDE so I make the final changes. LLMs love to just write tokens so I keep a fairly short leash.
This isn't for professional software developers. This is for managers, so they can shit out a one week test and then feel superior enough to pay software engineers less for the "easy" job they do.
I haven't spent enough time with cursor specifically, but with other similar coding assistants, adding context can take some time. And often, even if it does save time, the action of "adding context" itself gets tiring and tedious so that you don't want to bother but instead just write the code yourself. It's about mental affordability.