I feel like I’ve been gaslit by the entire GenAI industry into thinking I’m just bad at prompt engineering. When I’m talking to an LLM about stuff unrelated to code generation, I can get sane and reasonable responses—engaging and useful, even. The same goes for image generation and even the bit of video generation I’ve tried. For me, however, getting any of these models to produce reasonably sane code has proven elusive. Claude is a bit better than the others IME, but I can’t even get it to describe a usable project template and directory structure for anything other than very simple Scala, Java, or Python projects. The code I’m able to generate always needs dramatic manual changes; even asking a model to refactor a method in code it wrote within the current context window results in bugs and broken business logic. I dearly wish I knew how others accomplish things like “it successfully ported critical in-house infrastructure from Python 3 to Go.” To date, I’ve seen no actual evidence (aside from what are purported to be LLM-generated artifacts) that anything beyond generating trivial code (or RAG-ing over existing code) is even possible. What am I missing? Is it unrealistic for me to assume that prompt engineering such a seemingly dramatic LLM-generated code rewrite is something that I could learn by example from others? If not, can somebody recommend resources related to learning how to accomplish non-trivial code generation?
> If not, can somebody recommend resources related to learning how to accomplish non-trivial code generation?
Learn how to think ontologically and break down your requests first by what you're TRULY looking for, and then understand what parts would need to be defined in order to build that system -- that "whole".
> Learn how to think ontologically and break down your requests first by what you're TRULY looking for, and then understand what parts would need to be defined in order to build that system -- that "whole".
Since I’m dealing with models rather than other engineers should I expect the process of breaking down the problem to be dramatically different from that of writing design documents or API specs? I rarely have difficulty prompting (or creating useful system prompts for) models when chatting or doing RAG work with plain English docs but once I try to get coherent code from a model things fall apart pretty quickly.
That's actually a solid question! You can probably ask GPT to AI-optimize a standard technical spec you have and to "ask clarifying questions in order to optimize for the best output". I've done that several times with past specs I've had and it was quite a fruitful process!
Great idea. I’ve used that tactic in the past for non-code related prompts; not sure why I didn’t think of trying it with my code-generation prompting. I’ll give it a shot.
The "ask me what info you're missing" strategy works very well. Without it, the AI will usually just start the task every time rather than risk asking an unnecessary question; with it, the AI asks very good questions, and I then realize the info it asked for really was necessary.
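The strategy can also be baked into the system prompt itself. Here's a minimal sketch in Python of the message list you'd hand to a chat API; the helper name and exact wording are my own invention, not any vendor's recommended phrasing:

```python
def clarifying_messages(spec: str) -> list[dict]:
    """Build a chat-API message list implementing the
    'ask me what info you're missing' strategy: the model is
    instructed to ask questions before writing any code."""
    system = (
        "Before writing any code, ask me every clarifying question "
        "you need, one batch at a time. Do not start the task until "
        "I have answered. When you have no more questions, say "
        "\"I'm ready\" and wait."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": spec},
    ]

messages = clarifying_messages("Port this Python 3 service to Go.")
```

You'd then loop: send the messages, answer whatever questions come back, and only when the model says "I'm ready" tell it to proceed.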
This caught my eye and I’m genuinely curious about what you mean by it. Part of our success with Claude is that we don’t do abstractions, “perfect architecture”, DRY, SOLID and other religions that were written by people who sell consulting in their principles. If we ask LLMs to do any form of “Clean Code” or give them input on how we want the structure, they tend to be bad at it.
Hell, if you want to “build from the bottom” you’re going to have to do it over several prompts. I had Claude build a blood bowl game for me, for the fun of it. It took maybe 50 prompts, each focusing on a different aspect. Like, I wanted it to draw the field and add mouse-clickable and movable objects with SDL2, and that was one prompt. Then you feed it your code in a new prompt and let it do the next step based on what you have. If the code it outputs is bad, you’ll need to abandon the conversation and try again.
It’s nothing like getting an actual developer to do things. Developers can think for themselves, and the probability engine won’t do any of that even if it pretends to. The models’ history of building things from scratch also seems to get quickly “tarnished” within the prompt context: once they’ve done the original tasks, I find it hard to get them to continue building on them.
> This caught my eye and I’m genuinely curious about what you mean by it. Part of our success with Claude is that we don’t do abstractions, “perfect architecture”, DRY, SOLID and other religions
Within my environment, some of those “religions” aren’t just a requirement; they’re critical to the long-term maintenance of a large collection of active repositories.
I think one of the problems folks tend to have with following or implementing a “religion” (by which I mean specific structural and/or stylistic patterns within a codebase) comes down to a fear of being stuck forever with a given pattern that may not fit future needs. There’s nothing wrong with iterating on your religion’s patterns as long as you have good documentation with thorough change logs; granted, that can be difficult or even out of reach for smaller shops.
My personal problem with them is that after decades in enterprise software I’ve never seen them be beneficial to long-term maintenance. People like Uncle Bob (who haven’t actually worked in software engineering since 20 years before Python was invented) will respond to that sort of criticism with a “they misunderstood the principles”. Which is completely correct in many cases, but if so many people around the world misunderstand the principles then maybe the principles simply aren’t good?
I don’t think any of them are inherently bad, but they lead to software engineering where people overcomplicate things, building abstractions they might never need. I’ve specialised in the field of taking startups into enterprise, and 90% of the work is removing the complexity which has made their software development teams incapable of delivering value in a timely manner. Some of this is because they build infrastructure as though they were Netflix or Google, but a lot of the time it’s because they’ve followed Clean Code principles religiously. Abstractions aren’t always bad, but you should never abstract until you can’t avoid it, because two years into your development you’ll end up with code bases so complex that they’re hard to work with.
Especially when you get the principles wrong, which many people do. Overall, though, we’ve had 20 years of Clean Code, SOLID, DRY and so on, and if you look at our industry today, there is no less of a mess in software engineering than there was before. In fact some systems still run on completely crazy Fortran or COBOL because nobody using “modern” software engineering has been capable of replacing them. At least that’s the story in Denmark, and it hasn’t been for a lack of trying.
I think the main reason many of these principles have become religions is because they’ve created an entire industry of pseudo-jobbers who manage them, work as consultants and what not. All people who are very good at marketing their bullshit, but also people who have almost no experience actually working with code.
Like I said, nothing about them is inherently bad if you know when to use which parts, but almost nobody does. So to me the only relevant principle is YAGNI. If you’re going to end up with a mess of a code base anyway, you might as well keep it simple and easy to change. I say this as someone who works as an external examiner for CS students, where we still teach all these things that so often never work. In fact, a lot of these principles were things I was taught when I took my degree, and many haven’t really undergone any meaningful changes with the lessons learned since their initial creation.
I appreciate your perspective and I don’t disagree with you entirely. I’ve worked in environments that struggle with putting religion before productivity and maintainability; the result is often painful. I’ve also worked in environments where religion, productivity, and maintainability are equals; it makes for a nice working environment. Perhaps there’s a bit more bureaucracy involved (forced documentation can be frustrating—especially when you realize your docs don’t match the spec or even the code) but, in my experience, the outcome is more pleasant. Scaling religious requirements while maintaining productivity can be tricky though; religion can be deeply expensive (and therefore bad business) for smaller orgs, but it can also be easily politicized in larger orgs, which often results in engineer dissatisfaction. Religion will always be controversial. :)
I think it's level of expertise. You are an expert in coding (10,000 hours and all that), so you know when the code is wrong. Everything else you put into it gets a plausible-sounding response that is just as incorrect as the plausible-sounding responses to coding questions; it's just that in coding you know enough to spot the errors.
LLMs are insidious; they feed into the "everything is simple" concept a lot of us have of the world. We ask an LLM for a project plan and it looks so good we're willing to fire our TPM, or a TPM asks the LLM for code and it gives them code that looks so good they question the value of an engineer. In reality, the LLM cannot do either role's job well.
> You are an expert in coding (10,000 hours and all that) so you know when the code is wrong.
While I appreciate the suggestion that I might be an expert, I am decidedly not. That said, I’ve been writing what the companies I’ve worked for would consider “mission critical” code (mostly Java/Scala, Python, and SQL) for about twenty years, I’ve been a Unix/Linux sysadmin for over thirty years, and I’ve been in IT for almost forty years.
Perhaps the modernity and/or popularity of the languages are my problem? Are the models going to produce better code if I target “modern” languages like Go/Rust, and the various HTML/JS/FE frameworks instead of “legacy” languages like Java or SQL?
Or maybe my experience is too close to bare metal and I need to focus on more trivial projects with higher-level or more modern languages? (FWIW, I don’t actually consider Go/Rust/JS/etc. to be higher-level or more “modern” languages than the JVM languages with which I’m experienced; I’m open to arguments though.)
> LLMs are insidious, it feeds into "everything is simple" concept a lot of us have of the world.
Yah, that’s what I mean when I say I feel gaslit.
> In reality, the LLM cannot do either role's job well.
I am aware of this. I’m not looking for an agent. That said, am I being too simplistic or unreasonable in expecting that I too could leverage these models (albeit perhaps after acquiring some missing piece of knowledge) as assistants capable of reasoning about my code or even the code they generate? If so, how are others able to get LLMs to generate what they claim are “deployable” non-trivial projects or refactorings of entire “critical” projects from the Python language to Go? Is someone lying or do I just need (seemingly dramatically) deeper knowledge of how to “correctly” prompt the models? Have I simply been victim of (again, seemingly dramatically) overly optimistic marketing hype?
We have a similar amount of IT experience, although I haven't been a daily engineer for a long time. I use aider.chat extensively for fun projects, preferring the Claude backend right now, and it definitely works. This site is 90% aider, give or take, the rest my hand edits: https://beta.personacollective.ai -- and it involves solidity, react, typescript and go.
Claude does benefit from some architectural direction. I think it's better at extending than creating from whole-cloth. My workflow looks like:
1) Rough out some code, say a smart contract with the key features
2) Tell Claude to finish it and write extensive testing.
3) Run abigen on the Solidity to get a Go library
4) Tell Claude to stub out Go server event handlers for every event in that library
5) Create a React TypeScript site myself with a basic page
6) Tell Claude to create an admin endpoint that pulls relevant data from the smart contracts into the React site.
6.5) Tell Claude to redesign the site in a preferred style.
7) Go through and inspect the code for bugs. There will be a bunch.
8) For bugs that are simple, prompt Claude to fix: "You forgot x, y, z in these files. Fix it."
9) For bugs that are a misunderstanding of my intent, either code up the needed core loop directly, or negotiate and explain. Coding is generally faster. Then say "I've fixed the code to work how it should; update the X, Y, Z interfaces / etc."
10) For really difficult bugs or places I'm stumped, tar the codebase up, go to the chat interfaces of Claude and GPT o1-preview, paste the codebase in (Claude can take a longer paste, but o1-preview is better at holistic bugfixing), and explain the problem. Wait a minute or two and read the comments. 95% of the time one of the two LLMs is correct.
This all pretty much works. For these definitions of works:
1) It needs handholding to maintain a codebase's style and naming.
2) It can be overeager: "While I was in that file, I ..."
3) If it's more familiar with an old version of a library you will be constantly fighting it to use a new API.
How I would describe my experience: a year ago, it was like working with a junior dev who didn't know much and would constantly get things wrong. Currently it's like working with a B+ senior-ish dev. It will still get things wrong, but things mostly compile, it can follow along, and it can generate new things to spec if those requests are reasonable.
All that to say, my coding projects went from "code with pair coder / puppy occasionally inserting helpful things" to "most of my time is spent at the architect level of the project, occasionally up to CTO, occasionally down to dev."
Is it worth it? If I had a day job writing mission-critical code, I think I'd be verrry cautious right now, but if that job involved a lot of repetition and boilerplate / API integration, I would use it in a HEARTBEAT. It's so good at that stuff. For someone like me who is like "please extend my capacity and speed me up", it's amazing. I'd say I'm roughly 5-8x more productive. I love it.
This is very good insight, the likes of which I’ve needed; thank you. Your workflow is moderately more complex and definitely less “agentic” than I’d expected/hoped but it’s absolutely not out of line with the kind of complexity I’m willing to tackle nor what I’d personally expect from pairing with or instructing a knowledgeable junior-to-mid level developer/engineer.
Totally. It’s actually an interesting philosophical question: how much can we expect at different levels of precision in requirements, and when is code itself the most efficient way to be precise? I definitely feel my communication limits more with this workflow, and often feel like “well, that’s a fair, totally wrong, but fair interpretation.”
Claude has the added benefit that you can yell at it, and it won’t hold it against you. You know, speaking of pairing with a junior dev.
You and me both, man. Either I'm speaking a different language or I'm simply really bad at explaining what I need. I'd love to see someone actually do this on video.
Indeed. I’ve yet to run across an actual demonstration of an LLM that can produce useful, non-trivial code. I’m not suggesting (yet) that the capabilities don’t exist or that everyone is lying—the web is a big place after all and finding things can be difficult—but I am slowly losing faith in the capability of what the industry is selling. It seems right now one must be deeply knowledgeable of and specialized in the ML/AI/NLP space before being capable of doing anything remotely useful with LLM-based code generation.
I think there is something deeper going on: “coding” is actually 2 activities: the act of implementing a solution, and the act of discovering the solution itself. Most programmers are used to doing both at once. But to code effectively with an LLM, you need to have already discovered the solution before you attempt to implement it!
I’ve found this to be the difference between writing 50+ prompts of back-and-forth to get something useful, and getting something useful in 1-3 prompts. If you look at Simon’s post, you’ll see that these are all self-contained tools whose entire scope has been constrained from the outset of the project.
When you go into a large codebase and have to change some behavior, 1) you usually don’t have the detailed solution articulated in your mind before looking at the codebase. 2) That “solution” likely consists of a large number of small decisions / judgements. It’s fundamentally difficult to encode a large number of nuanced details in a concise prompt, making it not worth it to use LLMs.
On the other hand, I built this tool: https://github.com/gr-b/jsonltui that I now use every day, almost entirely using Claude. “CLI tool to visualize JSONL with a textual interface, localizing parsing errors” almost fully specifies it. In contrast, my last 8-line PR at my company, while it would appear much simpler on the surface, contains many more decisions—not just my own, but reflecting team conversations and expectations that are not written down anywhere. Communicating that shared implicit context to Claude would be so much more difficult than performing the change myself.
You’re probably right, but I’m far more interested in seeing things like how you prompted the model to produce your audio tool’s code. Did you have a design doc, or did you collaborate with the model to come up with a design and its implementation ad hoc? How much manual rewriting did you do? How much worked with little to no editing? How much did you have to prompt the model to fix any bugs it created? How successful was it? Did you specify a style guide up front, or just use what it spat out and try to refactor later? How did that part go? You see where I’m going?
Oh, wow, it honestly just occurred to me that examples of how to prompt a model to produce a certain kind of content might be considered, more or less, some kind of trade secret vaguely akin to a secret recipe. That would be a bit depressing but I get it.
Assuming you mean the paradigm often known as FP (which makes use of concepts from the Lambda Calculus and Category Theory) and languages like Scala and Haskell that support pure FP, well… my experience in trying to get LLMs to generate non-trivial FP (regardless of purity) has been entirely useless. I’d love to see an example of how you’re able to get useful code that is non-trivial—by which I mean code that includes useful business logic instead of what’s found in your typical “Getting Started” tutorial.
Here’s my experience. Like some of the other responses here to your comment, nothing I’ve made that’s more than a few lines of code has worked after one prompt, or even two or three. An example of something I’m working at the moment is here: https://github.com/fivestones/family-organizer. That codebase is about 99% LLM generated. I’d say it’s 60% from chatgpt 4o, 30% Claude Sonnet 3.5, and the rest mostly chatgpt o1-preview. Just the last commit has a bit of Claude Sonnet 3.5-new.
I can send you my chat transcripts if it would be helpful but it would take some work since it’s scattered over lots of different conversations.
At the beginning I was trying to describe the whole project to the LLM and then ask it to implement one feature. After maybe 5-20 prompts and iterations back and forth, I’d have something I was happy with for that feature and would move on to the next. However, I found, like some others here, that the model would get bogged down in mistakes it had made previously, or would forget what I told it originally, or just wouldn’t work as well the longer my conversation went.

So what I switched to, which seems to work really well, is to just paste my entire current codebase (or at least all the relevant files) into a fresh chat, and then tell it about the one new feature I wanted. I try to focus on adding new features, or on fixing a specific problem. I’ll then sometimes (especially for a new feature) explain that this is my current code and here is the new thing I want it to do, and then tell it not to write any code for me but instead to ask me any questions it has. After this I’ll answer all its questions and tell it to ask me any follow-up questions: “If you don’t have any more questions, just say ‘I’m ready’.” When it gets to the point of saying “I’m ready”, if working with ChatGPT I would change the model from 4o to o1-preview and then just say, “OK, go ahead.” After it spits out its response, it usually takes some iteration in the same chat: me copying and pasting code into VS Code, running it, pasting any errors back to the LLM or describing what I didn’t like about the results, and repeating. I might go through that process 5-10 times for something small, or 20-25 times for something bigger. Once I’ve gotten something working, I’ll abandon that chat and start over in a new one with my next problem or desired feature.
I’ve basically told it nothing about how I want the code structured. For the project above I wanted to use instantdb, so I fed it some of the instantdb documentation and examples at the beginning. Later features just worked—it followed along successfully with what it saw in my codebase already. I’m also using TypeScript/Next.js, and those were pretty much the only constraints I’ve given it as to how to structure the code.
I’m not a programmer, and I think if you look at the code you’ll probably see lots of stuff that looks bad to you if you are one. But I don’t have plans to deploy this code at scale—it’s just something I’m making for my family to use, and for whoever finds it on GitHub to use as well. So as long as it works and I’m happy with the result, I’m not too concerned about the code. My main concern might be things like future features I want to add and whether or not the code I’m adding now will make that future code hard to add. Usually I’ll just tell the LLM something like, “keep in mind when making this db schema that later we’ll need to do x or y”, and leave it at that. The other thing is that I’ve never used React, let alone Next.js, and have only dabbled in JS here and there. But here I am, making something that works and that I’m happy with, thanks to the LLMs. That’s pretty amazing to me.
Sometimes I struggle to get it to do what I want, and usually then I just scrap the latest code changes back to the last commit and then start over, often with a different LLM model.
It sounds like your use case is a lot different than mine, as I’m just doing stuff in my spare time for fun and for me or my family to use. But maybe some of those ideas will help you. Let me know if you want some chat transcripts.
One other thing: I found a VS Code extension that lets me choose a file or set of files in the VS Code explorer, right-click, and export them for LLM consumption. This is really helpful. It just makes a tree of whichever files I had selected (like the output from the terminal `tree` command), follows that with the full text of each file, and copies all of that to the clipboard. So to start a new chat, I just select files, right-click, export for LLM, and then paste into the new chat window.
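If you can't find such an extension, the same export (a file listing followed by the full text of each file) is easy to approximate with a few lines of Python; the heading format below is my own invention, not what any particular extension actually emits:

```python
from pathlib import Path

def export_for_llm(paths: list[str]) -> str:
    """Concatenate the selected files into one paste-able block:
    a simple file listing first (a crude stand-in for `tree`),
    then each file's full contents under a heading."""
    files = [Path(p) for p in paths]
    out = ["Files:"]
    out += [f"  {f}" for f in files]
    for f in files:
        out.append(f"\n===== {f} =====")
        out.append(f.read_text())
    return "\n".join(out)
```

Pipe the result through `pbcopy`/`xclip` (or write it to a file) and paste it into a fresh chat.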