Why LLMs can't really build software
The author does not understand what LLMs and coding tools are capable of today.
> LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over. This is exactly the opposite of what I am looking for. Software engineers test their work as they go. When tests fail, they can check in with their mental model to decide whether to fix the code or the tests, or just to gather more data before making a decision. When they get frustrated, they can reach for help by talking things through. And although sometimes they do delete it all and start over, they do so with a clearer understanding of the problem.
My experience is based on using Cline with Anthropic's Sonnet 3.7, doing TDD on Rails, and it has been very different. I instruct the model to write tests before any code and it does. It works in small enough chunks that I can review each one. When tests fail, it tends to reason very well about why and fixes the appropriate place. It is very common for the LLM to consult more code as it goes to learn more.
It's certainly not perfect, but it works about as well as, if not better than, a human junior engineer. Sometimes it can't solve a bug, but human junior engineers get into the same situation too.
Most of this might be true for LLMs, but years of investing experience have given me a mental model: look for the tech or company that sucks and yet keeps growing.
People complained endlessly about the internet in the early to mid 90s: it was slow and static, most sites had under-construction signs on them, and your phone modem would just randomly disconnect. The internet did suck in a lot of ways, and yet people kept using it.
Twitter sucked in the mid 2000s; we saw the fail whale weekly, and yet people continued to use it for breaking news.
Electric cars sucked: no charging infrastructure, low range, expensive. And yet, no matter how much people complained about them, they kept getting better.
Phones sucked: pre-3G data was slow, there wasn't much you could use them for before app stores, and the cameras were potato quality. And yet people kept using them while they improved.
Always look for the technology that sucks and yet people keep using it because it provides value. LLMs aren't great at a lot of tasks, and yet no matter how much people complain about them, they keep getting used and keep improving through constant iteration.
LLMs may not be able to build software today, but they are 10x better than where they were in 2022, when we first started using ChatGPT. It's pretty reasonable to assume that in 5 years they will be able to do these types of development tasks.
LLMs can’t build software because we are expecting them to hear a few sentences, then immediately start coding until there’s a prototype. When they get something wrong, they have a huge amount of spaghetti to wade through. There’s little to no opportunity to iterate at a higher level before writing code.
If we put human engineering teams in the same situation, we’d expect them to do a terrible job, so why do we expect LLMs to do any better?
We can dramatically improve the output of LLM software development by using all those processes and tools that help engineering teams avoid these problems:
https://jim.dabell.name/articles/2025/08/08/autonomous-softw...
> what they cannot do is maintain clear mental models
The more I use claude code, the more frustrated I get with this aspect. I'm not sure that a generic text-based LLM can properly solve this.
Has anyone been able to separate creativity from hallucination in LLMs? As in: we can affect one without the other, or measure one without the other. In humans we understand when someone is doing one vs. the other. The biggest difference is that a creative person knows they will have to take action to bring about their vision, but a hallucinating person thinks their vision already exists.
It seems like the LLM phenomenon that we call hallucination is descriptively the same phenomenon that we call creativity in other contexts. If the LLM adds a new function or feature to the current project as part of work on another feature, that's creativity. But if it assumes a function or type in another project, which can't easily be changed, we call that hallucination. Even though it could just as easily add that feature if it had access to that code as well.
I have never tried seriously coding with AI. I just ask ChatGPT for snippets that I can verify, to save a few round trips to Google and API docs.
However the other day I gave ChatGPT a relatively simple assignment, and it kept ignoring the rules. Every time I corrected it, it broke a different rule. I was asking it for gender-neutral names, but it kept giving last names like Orlov (which becomes Orlova), or first names that are purely masculine.
Is it the same with vibe coding?
> When you watch someone who knows what they are doing, you'll see them looping over the following steps:
> Build a mental model of the requirements
> Write code that (hopefully?!) does that
> Build a mental model of what the code actually does
> Identify the differences, and update the code (or the requirements).
This is pretty right on, but I think it leaves out an aspect of writing code that is often underappreciated. Code does two things at once: it provides a set of instructions to a machine, and it communicates the authors' understanding of the program behavior those instructions are intended to express. I think this is a large part of what makes programming so fascinating and frustrating. It's what's behind the cliche that "naming things" is one of the hardest parts of programming. In growing software systems it's often not enough that a feature's implementation works. Ideally, that implementation should impose a minimum barrier to understanding for contributors who need to do something with it afterward. I'm not convinced this is an aspect of software development that LLMs will be able to meaningfully achieve.
I think I agree with the idea that LLMs are good at the junior level stuff.
What's happened for me recently is I've started to revisit the idea that typing speed doesn't matter.
This is an age-old thing; most people don't think it really matters how fast you can type. I suppose the steelman is that most people think it doesn't really matter how fast you can get the edits to your code that you want. With modern tools, you're not typing out all the code anyway, and there's all sorts of non-AI ways to get your code looking the way you want. And that doesn't matter: the real work of the engineer is the architecture of how the whole program functions. Typing things faster doesn't make you get to the goal faster, since finding the overall design is the limiting thing.
But I've been using Claude for a while now, and I'm starting to see the real benefit: you no longer need to concentrate to rework the code.
It used to be burdensome to do certain things. For instance, I decided to add an enum value, and now I have to address all the places where it matches on that enum. This wasn't intellectually hard in the old world, you just got the compiler to tell you where the problems were, and you added a little section for your new value to do whatever it needed, in all the places it appeared.
But you had to do this carefully, otherwise you would just cause more compile/error cycles. Little things like forgetting a semicolon will eat a cycle, and old tools would just tell you the error was there, not fix it for you.
LLMs fix it for you. Now you can just tell Claude to change all the code in a loop until it compiles. You can have multiple agents working on your code, fixing little things in many places, while you sit on HN and muse about it. Or perhaps spend the time considering what direction the code needs to go.
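To make the enum chore concrete, here is a minimal sketch, assuming Python 3.11+ and a type checker like mypy or pyright (the comment doesn't name a language, and PaymentState/describe are invented for illustration). The assert_never call is what makes the checker point at every match site that hasn't handled the new member yet:

```python
from enum import Enum, auto
from typing import assert_never  # Python 3.11+


class PaymentState(Enum):
    PENDING = auto()
    SETTLED = auto()
    REFUNDED = auto()  # the newly added member


def describe(state: PaymentState) -> str:
    match state:
        case PaymentState.PENDING:
            return "awaiting settlement"
        case PaymentState.SETTLED:
            return "settled"
        # Until a REFUNDED branch is added, the type checker flags this
        # function as non-exhaustive; at runtime assert_never would raise.
        case _:
            assert_never(state)
```

Each site the checker flags is exactly the kind of mechanical fix you can hand off and loop on until the build is quiet.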
The big thing however is that when you're no longer held up by little compile errors, you can do more things. I had a whole laundry list of things I wanted to change about my codebase, and Claude did them all. Nothing on the business level of "what does this system do" but plenty of little tasks that previously would take a junior guy all day to do. With the ability to change large amounts of code quickly, I'm able to develop the architecture a lot faster.
It's also a motivation thing: I feel bogged down when I'm just fixing compile errors, so I prioritize what to spend my time on if I am doing traditional programming. Now I can just do the whole laundry list, because I'm not the guy doing it.
I actually don't have doubts that LLMs are quite good at writing software.
The problem for me is one of practicality. If, after hundreds of lines of AI-written code, I notice some sort of issue (regarding scale, security, formatting, logic, etc.), I'm basically forced to start over.
We all know that reading code is way less pleasant than writing code. So, for me, LLMs can be very useful for writing code that I know is going to be correct without having to go back through it. For example, basic TRPC CRUD functions.
Only because most AI startups are doing it wrong.
I don't want a chat window.
I want AI workflows as part of my IDE, like Visual Studio, IntelliJ, and Android Studio are finally going after.
I want voice-controlled actions in my native language.
I want knowledge across everything in the project for code refactorings, static analysis with an AI feedback loop, generating UI from hand-drawn sketches, programming on the go using handwriting, source control commit messages generated from code changes, ...
Yeah, I think it's pretty clear to a lot of people that LLMs aren't at the "build me Facebook, but for dogs" stage yet. I've had relatively good success with more targeted tasks, like "Add a modal that does this, take this existing modal as an example for code style". I also break my problem down into smaller chunks, and give them one by one to the LLM. It seems to work much better that way.
LLMs can't reason: They're a statistical model that can only copy.
In many cases though, the copy that the LLM generates is either "good enough" or a great thing to start with.
The 4 step process outlined at the start of this article really reminds me of Deutsch's The Beginning of Infinity:
> The real source of our theories is conjecture, and the real source of our knowledge is conjecture alternating with criticism.
(This is rephrased Karl Popper, and Popper cites an intellectual lineage beginning somewhere around Parmenides.)
I see writing tests as a criticism of the code you wrote, which itself was a conjecture. Both are attempting to approach an explanation in your mind, some platonic idea that you think you are putting on paper. The code is an attempt to do so, the test is criticism from a different direction that you have done so.
> "when test fail, they are left guessing as to whether to fix the code or the tests"
One thing I've found that helps is using the "Red-Green-Refactor" language. We're in RED phase - the test should fail. We're in GREEN phase - make this test pass with minimal code. We're in REFACTOR phase - improve the code without breaking tests.
This helps the LLM understand the TDD mental model rather than just seeing "broken code" that needs fixing.
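For instance, a minimal red-green sketch with pytest (slugify and the test are invented for illustration, not a prescribed workflow):

```python
# RED: write the failing test first, and tell the agent a failure is expected here.
def test_slugify_collapses_spaces():
    assert slugify("Hello  World") == "hello-world"


# GREEN: the smallest implementation that makes the test pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

# REFACTOR: improve naming/structure with the test kept green,
# without touching the asserted behaviour.
```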
These LLM discussions really need everyone to mention what LLM they're actually using.
> AI is awesome for coding! [Opus 4]
> No, AI sucks for coding and it messed everything up! [4o]
Would really clear the air. People seem to be evaluating the dumbest models (apparently because they don't know any better?) and then deciding the whole AI thing just doesn't work.
> LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
I feel personally described by this statement. At least on a bad day, or if I'm phoning it in. Not sure if that says anything about AI - maybe just that the whole "mental models" part is quite hard.
This very afternoon I coded something with ChatGPT that would have taken me somewhat more time on my own, maybe 30-50% more. Granted, it was more-or-less independent code.
I think these tools are very effective at four things:
1. Initial scaffolding (and isolated scripts as well).
2. Reviewing code and spotting where an error might be, given a description and an area of code; we humans are very bad at this.
3. Helping when you are not an expert in something (for me, that's frontend).
4. Finding ideas on how to approach a problem through a conversation.
But if you are well-versed, you still have more context and better decision making. They are ok-ish but not expert level.
It also comes with downsides if you abuse them: you could end up not understanding your codebases well enough or adding bad quality code that seems to work.
Also, if you initially think you can solve a problem with AI and it leads you down the wrong path, you end up wasting more time than you save.
All in all, I find them good at scaffolding, asking ideas or solutions and spotting potential bugs.
As a whole they are accelerators but not replacements, IMHO.
Current LLMs look a lot like a very advanced 'old brain' to me, while context engineering looks like optimizing the working memory.
What's missing is a part with more plasticity that can work in parallel and bi-directionally interact with the current static models in real-time.
This would mean individually trained models based on their experience so that knowledge is not translated to context, but to weight adjustments.
3 months ago, I would have agreed with much of this article, however...
In the past week, I watched this video[1] from Welch Labs about how deep networks work, and it inspired an idea. I spent some time "vibe coding" with Visual Studio Code's ChatGPT5 preview and had it generate a python framework that can take an image, and teach a small network how to generate that one sample image.
The network was simple... 2 inputs (x,y), 3 outputs (r,g,b), and a number of hidden layers with a specified number of nodes per layer.
It's an agent, it writes code, tests it, fixes problems, and it pretty much just works. As I explored the space of image generation, I had it add options over time, and it all just worked. Unlike previous efforts, I didn't have to copy/paste error messages in and try to figure out how things broke. I was pleasantly surprised that the code just worked in a manner close to what I wanted.
The only real problem I had was getting .venv working right, and that's more of an install issue than the LLM's fault.
I've got to say, I'm quite impressed with Python's argparse library.
It's amazing how much detail you can get out of 4 hidden layers of 64 values and 3 output channels (RGB), if you're willing to throw a few days of CPU time at it. My goal is to see just how small of a network I can make to generate my favorite photo.
As it iterates through checkpoints, I have it output an image with the current values to compare against the original. It's quite fascinating to watch as it folds the latent space to capture major features of the photo, then folds some more to catch smaller details, over and over, as the signal-to-noise ratio very slowly increases over the hours.
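For anyone curious what that setup roughly looks like, here is a minimal sketch assuming PyTorch and Pillow; "photo.png", the layer sizes, and the training schedule are placeholders, not the commenter's actual code:

```python
import numpy as np
import torch
import torch.nn as nn
from PIL import Image

# Load the target image and normalise pixel values to [0, 1].
img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=np.float32) / 255.0
h, w, _ = img.shape

# A grid of (x, y) coordinates in [-1, 1] is the network's only input.
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
)
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)  # (h*w, 2)
targets = torch.tensor(img).reshape(-1, 3)             # (h*w, 3)

# 2 -> 64 -> 64 -> 64 -> 64 -> 3: roughly the "4 hidden layers of 64" setup.
layers, in_dim = [], 2
for _ in range(4):
    layers += [nn.Linear(in_dim, 64), nn.ReLU()]
    in_dim = 64
layers += [nn.Linear(in_dim, 3), nn.Sigmoid()]
model = nn.Sequential(*layers)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10_000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(coords), targets)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Render the current reconstruction as a checkpoint image.
        with torch.no_grad():
            out = (model(coords).reshape(h, w, 3).numpy() * 255).astype(np.uint8)
        Image.fromarray(out).save(f"checkpoint_{step:05d}.png")
        print(step, loss.item())
```

Full-batch MSE over every pixel is the simplest thing that works for small images; a larger photo would want minibatching.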
As for ChatGPT5, maybe I just haven't run out of context window yet, but for now, it all just seems like magic.
This is the post that the people at Anthropic and Cursor should read.
> But what they cannot do is maintain clear mental models.
The emphasis should be on maintain. At some point, the AI tends to develop a mental model, but over time, it changes in unexpected ways or becomes absent altogether. In addition, the quality of the mental models is often not that good to begin with.
Back around, I don't even know, 2013? A colleague and I were working on updating a system that scanned in letters with mail order forms. The workers would lay the items from the envelopes in order on a conveyor type scanner. They had to lay them down in order: order form, payment check, envelope. The system would scan each document and add two blank fake scanned pages after each envelope. The company that set it up billed by scanned page. We figured out that you didn't need the blank pages as a delimiter because the envelope could reliably serve as that. By the way, the OCR was so bad that they never got the order forms to scan automatically, but people had to examine the order form as a pdf doc and key in everything instead. By eliminating the fake, nonsensical blank scanned pages, we saved the company over $1M/year in costs. We never got a single accolade or pat on the back or anything for that. Can AI do that, though?
This was Peter Naur’s observation in his 1985 paper Programming as Theory Building. One of the best papers I’ve ever read. Programming IS mostly building castles in the air in your head, not the code on the screen.
If you do the thinking and let the LLM do the typing it works incredibly well. I can write code 10x faster with AI, but I’m maintaining the mental model in my head, the “theory” as Naur calls it. But if you try to outsource the theory to the LLM (build me an app that does X) you’re bound to fail in horrible ways. That’s why Claude Code is amazing but Replit can only do basic toy apps.
Bit of a clickbaity title, since they can definitely help in building software.
However, I agree with the main thesis (that they can’t do it on their own). Also related to this: the whole idea that “we will easily fix memory next” will turn out the same way “we can fix vision in one summer” did. It’s 30 years later, much improved, but still not fixed. Memory is hard.
I find that if you are great at pseudocode, and great at defining a program's logic in that pseudocode, along with defining all possible cases and adding user id10t-error handling to the logic, you can get a pretty good framework of a program to start working with. I also ask for comments on all logic, loops, if statements and functions that are instructional, yet easy enough that a 5th grader could understand them. This framework usually comes out as a version .8 for further manual development, or even at a .9 beta test/debugging level. Occasionally, I've seen version 1.0 release candidate 1 level work, where I need to verify the functionality AND try to find ways that users may break functionality. Whichever version I end up with, it still involves manual coding; there is no escaping that. Using an LLM just saves (a lot of) time on the initial framework and program logic.
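As a toy illustration of that pseudocode-first approach (an entirely invented example, not the commenter's actual prompt): spell out the logic and the dumb-input cases up front, then ask for code with comments a 5th grader could follow, and you get something like:

```python
# Pseudocode given to the model:
#   ask the user for their age
#   if the input is not a whole number, explain and ask again
#   if the age is over 130, treat it as a typo and ask again
#   otherwise, say whether they are old enough to vote (18+)

def read_age() -> int:
    """Keep asking until the user types a sensible whole-number age."""
    while True:
        raw = input("How old are you? ").strip()
        if not raw.isdigit():
            # Not a whole number at all (letters, blanks, negatives, etc.).
            print("Please type your age as a whole number, like 34.")
            continue
        age = int(raw)
        if age > 130:
            # Almost certainly a typo, so ask again instead of accepting it.
            print("That looks like a typo; please try again.")
            continue
        return age


if __name__ == "__main__":
    # 18 or older means old enough to vote in this toy example.
    print("You can vote!" if read_age() >= 18 else "Not old enough to vote yet.")
```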
Saying LLMs are not good at x or y is akin to saying a brain is useless without a body, which is obvious. The success of agentic coding solutions depends on not just the model but also the system that the developers built around the model. And the companies that will succeed in this area are going to be the companies that focus on building sophisticated and capable systems that utilize said models. We are still in very early days where most organizations are only coming to terms with this realization... Only a few of them utilize this concept to the fullest, Claude Code being the best example. The Claude models are specifically trained for tool calling and other capabilities, and the Claude Code CLI complements and takes advantage of those capabilities to the fullest; things like context management among other capabilities are extremely important...
I think we just need better AI coding tools: the space is still a blue ocean for the ones that can figure it out.
The article highlights the key problem with AI tools today - which is that there doesn't seem to be a high level planning step (aka a mental model) to start with. Every time you ask a new question, the LLM starts from scratch.
Turns out, English is pretty bad for creating deterministic software. If you are vibe coding, you either are happy with the randomness generated by the LLMs or you enter a loop to try to generate a deterministic output, in which case using a programming language could have been easier.
This is underrated and applies to almost all roles involved in product development.
> the distinguishing factor of effective engineers is their ability to build and maintain clear mental models.
I'm tired of people telling me that LLMs are bad at building software without sitting down to learn how to properly use Claude Code, when to use it, and when you shouldn't use it.
Cursor is a joke tho, windsurf is pretty okay.
Zed hasn't won me over from Vim (Neovim) yet.., but I will say I appreciate the keyboard navigation built into the website! KYC :)
> ...but the distinguishing factor of effective engineers is their ability to build and maintain clear mental models.
I wonder is this not just a proxy for intelligence?
60% of the complaints in this post can be solved by providing better requirements and context upfront
The 2 iOS apps that I published (mid level complexity and work well) say otherwise. I was blown away by what cursor + o3 could do.
Have faith in AI, one day it will do what we hallucinate it can do!
LLMs cannot build software on their own yet. They sure can build software with some help, though.
Maybe we should let it build a mental model in documentation markdown files?
When vibing, I often let it explain the implemented business logic (instead of reading the code directly) and judge that.
I wonder if some of this can be solved by removing wrongly set-up context in the LLM. Or by getting a short summary, restructuring it, and feeding it again to a fresh LLM context.
Side comment, I love the typography of the site. Easy to read.
> Context omission: Models are bad at finding omitted context.
> Recency bias: They suffer a strong recency bias in the context window.
> Hallucination: They commonly hallucinate details that should not be there.
To be fair, those are all issues that most human engineers I've worked with (including myself!) have struggled with to various degrees, even if we don't refer to them the same way. I don't know about the rest of you, but I've certainly had times where I found out that an important nuance of a design was overlooked until well into the process of developing something, forgotten a crucial detail that I learned months ago that would have helped me debug something much faster than if I had remembered it from the start, or accidentally made an assumption about how something worked (or misremembered it) and ended up with buggy code as a result. I've mostly gotten pretty positive feedback about my work over the course of my career, so if I "can't build software", I have to worry about the companies that have been employing me and my coworkers who have praised my work output over the years. Then again, I think "humans can't build software reliably" is probably a mostly correct statement, so maybe the lesson here is that software is hard in general.
Good diagnosis of the loop. But “LLMs can’t build software” is mostly a statement about interface, not capability. If we ask a stochastic parrot to behave like a staff engineer in its head, it fails. If we reshape the work so the mental model lives outside the model—executable specs, tight tests, ADRs as shared state, small DSLs, and an orchestrator that forces evidence before code—agents can move through the same loop on bounded problems. In other words, they don’t need an internal model if the environment gives them one.
The better question isn’t “can an LLM maintain two mental models?” but “how much of this problem can we make machine-checkable?” Where we can’t (socio-technical trade-offs, ambiguous requirements), a human owns the decisions. Where we can (migrations, glue, refactors guarded by tests), the agent owns the keystrokes.
Today’s failure modes (omission, recency bias, hallucination) are real, but mitigated by durable memory, runbooks, and mandated check-ins the tool can’t skip. So: not “can’t build software”, but “can’t be the tech lead”. Yet.
It's good at micro, but not macro. I think that will eventually change with smarter engineering around it, larger context windows, etc. Never underestimate how much code that engineers will write to avoid writing code.
I think the argument here is nonsense. LLMs clearly work differently to human cognition, so pointing out a difference between how LLMs and humans approach a problem and calling that the reason that they can't build software makes no sense. Plausibly there are many ways to build software that don't make sense to a human.
That said, I agree with the conclusion. They do seem to be missing coherent models of what they work on - perhaps part of the reason they do so poorly on benchmarks like ARC, which are designed to elicit that kind of skill?
I had an awful, terrible experience with GPT5 a few days ago, that made me remember why I don't use LLMs, and renewed my promise to not use them for at least a year more.
I am a relative newbie to GPU development, and was writing a simple 2D renderer with WebGPU and its rust implementation, wgpu. The goal is to draw a few textures to a buffer, and then draw that buffer to the screen with a CRT effect applied.
I got 99% of the way there on my own, reading the guide, but then got stumped on a runtime error message. Something like "Texture was destroyed while its semaphore wasn't released". Looking around my code, I see no textures ever being released. I decide to give the LLM a go, and ask it to help me, and it very enthusiastically gives a few things to try.
I try them, nothing works. It corrects itself with more things to try, more modifications to my code. Each time giving a plausible explanation as to what went wrong. Each time extra confident that it got the issue pinned down this time. After maybe two very frustrating hours, I tell it to go fuck itself, close the tab and switch my brain on again.
10 minutes later, I notice my buffer's format doesn't match the one used in the render pass that draws to it. Correct that, compile, and it works.
I genuinely don't understand what those pro-LLM-coding guys are doing that they find AIs helpful. I can manage the easy parts of my job on my own, and it fails miserably on the hard parts. Are those people only writing boilerplate all day long?
I typically design solutions to minimize the cognitive load on coders onboarding onto my projects. They should not need a mental model to add features. Same with AI.
LLMs are powerful assistants—as long as the user keeps a firm mental model of the problem. That’s why, for now, they complement software engineers rather than replace them.
When you already know exactly what needs to be built and simply want to skip the drudgery of boilerplate or repetitive tasks, a coding CLI is great: it handles the grunt work so you can stay focused on the high-level design and decision-making that truly matter (and also more fun).
> But, we firmly believe that (at least for now) you are in the drivers seat, and the LLM is just another tool to reach for.
So do Microsoft and GitHub. At least that's what they were telling us the whole time. Oh wait... they changed their minds, I think, a week ago.
I think most people touching on this topic don't consider this headline alongside other similar ones, like "Why LLMs can't recognize themselves looping", "Why LLMs can't express intent", or "Why LLMs can't recognize truth/falsity, or the confidence levels of what they know vs. what they don't know". With a little thought, these basically reduce to the halting problem of computer science, or the undecidable nature of mathematics.
Taken a step further, recognizing this makes the investment in such a moonshot pipe dream (overcoming these inherent problems in a deterministic way) look recklessly negligent.
> LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
That's actually an interesting point, and something I've noticed a lot myself. I find LLMs are very good at hacking around test failures, but unless the test is failing for a trivial reason often it's pointing at some more fundamental issue with the underlying logic of the application which LLMs don't seem to be able to pick up on, likely because they don't have a comprehensive mental model of how the system should work.
I don't want to point fingers, but I've been seeing this quite a bit in the code of colleagues who heavily use LLMs. On the surface the code looks fine, and they've produced tests which pass, but when you think about it for more than a minute you realise it doesn't really capture the nuance of the requirements, in a way that a human who had a mental model of how the system works probably wouldn't have done...
Sometimes humans miss things in the logic when they're writing code, but those look more like mistakes in a line rather than a fundamental failure to comprehend and model the problem. And I know that's not the case here, because when you talk to these developers they get the problem perfectly well.
To know when the code needs fixing or a test you need a very clear idea of what should be happening and LLMs just don't. I don't know why that is. Maybe it's just they're missing context from the hours of reading tickets and technical discussions, or maybe it's their failure to ask questions when they're unsure of what should be happening. I don't know if this a fundamental limitation of LLMs (I'd suspect not personally), but this is a problem when using LLMs to code today and one that more compute alone probably can't fix.
They can read the error, take it in, and figure out the best way to resolve it. That is the best part about LLMs; no human can do it better. But they are not mind readers, and that is where things fall apart.
I think they're another tool in the toolbox not a new workshop. You have to build a good strategy around LLM usage when developing software. I think people are naturally noticing that and adapting.
the mentioned "two similar mental models" is an interesting of looking at the problem. if that is the actual case, seems to me that a much better model plus an smart enough agent should be able to largely solve the problem.
interesting time, interesting issue.
Well, welcome to the club of awareness :)
..."(at least for now) you are in the drivers seat, and the LLM is just another tool to reach for."
Improvements in model performance seem to be approaching the peak rather than demonstrating exponential gains. Is the quote above where we land in the end?
Am I the only one continuously astounded at how well Opus 4 actually does build mental models when prompted correctly?
I find Sonnet frequently loses the plot, but Opus can usually handle it (with sufficient clarity in prompting).
I decided to jump into the deep end of the pool and complete two projects using Cursor with its default AI setup.
The first project is a C++ embedded device. The second is a sophisticated Django-based UI front end for a hardware device (so, python interacting with hardware and various JS libraries handling most of the front end).
So far I am deeper into the Django project than the C++ embedded project.
It's interesting.
I had already hand-coded a conceptual version of the UI just to play with UI and interaction ideas. I handed this to Cursor as well as a very detailed specification for the entire project, including directory structure, libraries, where to use what and why, etc. In other words, exactly what I would provide a contractor or company if I were to outsource this project. I also told it to take a first stab at the front end based on the hand-coded version I plopped into a temporary project directory.
And then I channeled Jean-Luc Picard and said "Engage!".
The first iteration took a few minutes. It was surprisingly functional and complete. Yet, of course, it had problems. For example, it failed to separate various screens into separate independent Django apps. It failed to separate the one big beautiful CSS and JS files into independent app-specific CSS and JS files. In general, it ignored separation of concerns and just made it all work. This is the kind of thing you might expect from a junior programmer/fresh grad.
Achieving separation of concerns and removing other undesirable cross-pollination of code took some effort. LLMs don't really understand. They simulate understanding very well, but, at the end of the day, I don't think we are there. They tend to get stuck and make dumb mistakes.
The process to get to something that is now close to a release candidate entailed an interesting combination of manual editing and "molding" of the code base with short, precise and scope-limited instructions for Cursor. For my workflow I am finding that limiting what I ask AI to do delivers better results. Go too wide and it can be in a range between unpredictable and frustrating.
Speaking of frustrations, one of the most mind-numbing things it does every so often is also in a range, between completely destroying prior work and selectively eliminating or modifying functionality that used to work. This is why limiting the scope, for me, has been a much better path. If I tell it to do something in app A, there's a reasonable probability that it isn't going to mess with and damage the work done in app B.
This issue means that testing becomes far more important in this workflow, because, on every iteration, you have no idea what functionality may have been altered or damaged. It will also go nuts and do things you never asked it to do. For example, I was in the process of redoing the UI for one of the apps. For some reason it decided it was a good idea to change the UI for one of the other apps, remove all controls and replace them with controls it thought were appropriate or relevant (which wasn't even remotely the case). And, no, I did not ask it to touch anything other than the app we were working on.
Note: For those not familiar with Django, think of an app as a page with mostly self-contained functionality. Apps (pages) can share data with each other through various means, but, for the most part, the idea is that they are designed as independent units that can be plucked out of a project and plugged into another (in theory).
The other thing I've been doing is using ChatGPT and Cursor simultaneously. While Cursor is working I work with ChatGPT on the browser to plan the next steps, evaluate options (libraries, implementation, etc.) and even create quick stand-alone single file HTML tests I can run without having to plug into the Django project to test ideas. I like this very much. It works well for me. It allows me to explore ideas and options in the context of an OpenAI project and test things without the potential to confuse Cursor. I have been trying to limit Cursor to being a programmer, rather than having long exploratory conversations.
Based on this experience, one thing is very clear to me: If you don't know what you are doing, you are screwed. While the OpenAI demo where they have v5 develop a French language teaching app is cool and great, I cannot see people who don't know how to code producing anything that would be safe to bet the farm on. The code can be great and it can also be horrific. It can be well designed and it can be something that would cause you to fail your final exams in a software engineering course. There's great variability and you have to get your hands in there, understand and edit code by hand as part of the process.
Overall, I do like what I am seeing. Anyone who has done non-trivial projects in Django knows that there's a lot of busy boilerplate typing that is just a pain in the ass. With Cursor, that evaporates and you can focus on where the real value lies: The problem you are trying to solve.
I jump into the embedded C++ project next week. I've already done some of it, but I'm in that mental space 100% next week. Looking forward to new discoveries.
The other reality is simple: this is the worst this will ever be. And it is already pretty good.
Great, concise article. Nothing important to add, except that AI snake-oil salesmen will continue spreading their exaggerations far and wide; at least we who are truly in this business agree on the facts.
I am not a fan of today's concept of "AI", but to be fair, building today's software is not for the faint of heart; very few people get it right on try 1.
Years ago I gave up compiling these large applications all together. I compiled Firefox via FreeBSD's (v8.x) ports system, that alone was a nightmare.
I cannot imagine what it would be like to compile GNOME3 or KDE or Libreoffice. Emacs is the largest thing I compile now.
On the contrary, Kiro (https://kiro.dev) is showing that it can be done by breaking down software engineering into multiple stages (requirements, design, and tasks) and then breaking the tasks down into discrete subtasks. Each of those can then be customized and refined as much as you like. It will even sketch out initial documents for all three.
It’s still early days, but we are learning that as with software written exclusively by humans, the more specific the specifications are, the more likely the result will be as you intended.
This is a low-information-density blog post. I’ve really liked Zed’s blog posts in the past (especially about the editor internals!) so I hope this doesn’t come across the wrong way, but this seems to be a loose restatement of what many people are empirically finding out by using LLM agents.
Perhaps good for someone just getting their feet wet with these computational objects, but not resolving or explaining things in a clear way, or highlighting trends in research and engineering that might point towards ways forward.
You also have a technical writing no-no where you cite a rather precise and specific study with a paraphrase to support your claims … analogous to saying “Godel’s incompleteness theorem means _something something_ about the nature of consciousness”.
A phrase like: “Unfortunately, for now, they cannot (beyond a certain complexity) actually understand what is going on” referencing a precise study … is ambiguous and shoddy technical writing — what exactly does the author mean here? It’s vague.
I think it is even worse here because _the original study_ provides task-specific notions of complexity (a critique of the original study! Won’t different representations lead to different complexity scaling behavior? Of course! That’s what software engineering is all about: I need to think at different levels to control my exposure to complexity)
> We don't just keep adding more words to our context window, because it would drive us mad.
That, and we also don't only focus on the textual description of a problem when we encounter one. We don't see the debugger output and go "how do I make this bad output go away?!?". Oh, I am getting an authentication error. Well, maybe I should just delete the token check for that code path... problem solved?!
No. Problem very much not-solved. In fact, problem very much very bigger big problem now, and [Grug][1] find himself reaching for club again.
Software engineers are able to step back, think about the whole thing, and determine the root cause of a problem. I am getting an auth error...ok, what happens when the token is verified...oh, look, the problem is not the authentication at all...in fact there is no error! The test was simply bad and tried to call a higher privilege function as a lower privilege user. So, test needs to be fixed. And also, even though it isn't per-se an error, the response for that function should maybe differentiate between "401 because you didn't authenticate" and "401 because your privileges are too low".
[1]: https://grugbrain.dev
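A minimal sketch of that last point, with invented names and plain Python rather than any particular web framework (mapping the low-privilege case to 403 is one standard choice): keep "not authenticated" (401) and "not allowed" (403) as distinct failures, and have the test call with the privilege the function actually needs.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    role: str  # e.g. "viewer" or "admin"


class NotAuthenticated(Exception):  # maps to HTTP 401
    pass


class NotAuthorized(Exception):     # maps to HTTP 403
    pass


def delete_report(current_user: User | None) -> str:
    if current_user is None:
        raise NotAuthenticated("missing or invalid token")
    if current_user.role != "admin":
        raise NotAuthorized("admin role required")
    return "deleted"


# The "bad test" called this as a low-privilege user and then blamed auth;
# the fix is to exercise the function with the privilege it actually needs.
def test_admin_can_delete_report():
    assert delete_report(User("alice", "admin")) == "deleted"
```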