It matters what you measure. The studies only looked at Copilot usage.
I’m an experienced engineer. Copilot is worse than useless for me. I spend most of my time understanding the problem space, understanding the constraints and affordances of the environment I’m in, and thinking about the code I’m going to write. When I start typing code, I know what I’m going to write, so a “helpful” Copilot autocomplete is just a distraction for me. It makes my workflow much, much worse.
On the other hand, AI is incredibly useful for all of those steps I do before actually coding. And sometimes getting the first draft of something is as simple as a well-crafted prompt (informed by all the thinking I’ve done prior to starting). After that, pairing with an LLM to get quick answers for all the little unexpected things that come up is extremely helpful.
So, contrary to this report, I think that if experienced developers use AI well, they could benefit MORE than inexperienced developers.
I wonder if the study includes the technical debt that more experienced developers had to tackle after the less experienced devs contributed their AI-driven efforts. Because my personal experience has involved a lot of that at one of the companies listed in the study.
Also, I've personally seen more interest in AI among devs who have little interest in technology but a big interest in delivering. PMs love them, though.
It's probably worth going a bit deeper into the paper before drawing conclusions. And I think the study could do a better job of summarizing its results.
The abstract and the conclusion only give a single percentage figure (26.08% increase in productivity, which probably has too many decimals) as the result. If you go a bit further, they give figures of 27 to 39 percent for juniors and 8 to 13 percent for seniors.
But if you go deeper, it looks like there's a lot of variation not only by seniority but also by company. Besides pull requests, results on the other outcome measures (commits, builds, build success rate) don't seem to be statistically significant at Microsoft, from what I can tell. And the PR increases only seem to be statistically significant for Microsoft, not for Accenture. And even then possibly only for juniors, though I'm not sure I've understood that correctly.
Of course the abstract and the conclusion have to summarize. But it really looks like the outcomes vary so much depending on the variables that I'm not sure it makes sense to give a single overall number even as a summary. Especially since statistical significance seems a bit hit-and-miss.
edit: better readability
My hunch - it's just a hunch - is that LLM-assisted coding is detrimental to one's growth as a developer. I'm fairly certain it can only boost productivity up to a certain level - one where the work may be tedium for more senior developers, but formative for juniors.
My experience is that the LLM isn't just used for "boilerplate" code, but rather called into action when a junior developer is faced with a fairly common task they've still not (fully) understood. The process of experimenting, learning and understanding is then largely replaced by the LLM, and the real skill becomes applying prompt tweaks until it looks like stuff works.
The most interesting thing about this study for me is that when they break it down by experience levels, developers who are above the median tenure show no statistically significant increase in 'productivity' (for some bad proxies of productivity), with the 95% confidence intervals actually dipping deep into the negatives on all metrics (though leaning slightly positive).
This tracks with my own experience: Copilot is nice for resolving some tedium and freeing up my brain to focus more on deeper questions, but it's not as world-altering as junior devs describe it. It's also frequently subtly wrong in ways that a newer dev wouldn't catch, which requires me to stop and tweak most things it generates in a way that a less experienced dev probably wouldn't know to. A few years into it I now have a pretty good sense for when to use Copilot and when not to, so I think it's probably a net positive for me now, but it certainly wasn't always that way.
I also wonder if the possibly-decreased 'productivity' for more senior devs stems in part from the increase in 'productivity' from the juniors in the company. If the junior devs are producing more PRs that have more mistakes and take longer to review, this would potentially slow down seniors, reducing their own productivity gains proportionally.
A 26% productivity increase sounds in line with my experience. I think one dimension they should explore is whether you're working with a new technology or one you're already familiar with. AI helps me much more with languages/frameworks that I'm trying to learn.
It lets people make more PRs. Woohoo. Who cares?
Does it increase the number of things that pass QA?
Do the things done with AI assistance have fewer bugs caught after QA?
Are they easier to extend or modify later? Or do they have rigid and inflexible designs?
A tool that can help turn developers into unknown-quality code monkeys is not something I’m looking for. I’m looking for a tool that helps developers find bugs or design flaws in what they’re doing. Or maybe write well-designed tests.
Just counting PRs doesn’t tell me anything useful. But it triggers my gut feeling that more code per unit time = lower average quality.
This was likely Copilot based on GPT 3.5.
Microsoft: September 2022 to May 3rd, 2023
Accenture: July 2023 to December 2023
Anonymous Company: October 2023 to ?
Copilot _Chat_ update to GPT-4 was Nov 30, 2023: https://github.blog/changelog/label/copilot/
For me AI just brought back documentation. All new frameworks lack documentation big time. The last good one for me was a DOS book! I don't think newer developers even have an idea of what good documentation looks like.
Even so, AI will propose different things at different times and you still need an experienced developer to make the call. In the end it replaces documentation and typing.
The result is "less experienced people got more stuff done". I do not see an assessment of whether the stuff that got done was well done.
The output of these tools today is unsafe to use unless you possess the ability to assess its correctness. The less able you are to perform that assessment, the more likely you are to use these tools.
Only one of many problems with this direction, but gravity sucks, doesn't it.
When I'm using genai to write some code for me, I lose the internal mental state of what my code is doing.
As such, when I do have to debug problems myself, or dream up ideas of improvements, I no longer can do this properly due to lack of internal mental state.
Wonder how people who have used genai coding successfully get around this?
They also added lots of technical debt, as I'm sure they used the AI to generate tests, and some of those tests may actually be testing bugs as the correct behavior.
I've already fixed a couple of tests like this, where people clearly used AI and didn't think about it, when in reality the test was checking for the wrong behavior.
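To make the pattern concrete, here is a made-up sketch (hypothetical names, not from the actual codebase I was describing): the implementation has a bug, and the generated test asserts the buggy output, so the bug now looks like intended behavior.

    # Hypothetical example of a test that codifies a bug instead of catching it.
    def truncate(text: str, limit: int) -> str:
        """Supposed to return at most `limit` characters, ellipsis included."""
        if len(text) <= limit:
            return text
        # Bug: the ellipsis is appended *after* slicing, so the result is
        # limit + 3 characters long instead of at most `limit`.
        return text[:limit] + "..."

    def test_truncate_respects_limit():
        # This passes, but it asserts the buggy 13-character output rather
        # than a 10-character result, so the bug is now "correct" behavior.
        assert truncate("hello world!", 10) == "hello worl..."

Once a test like that is merged, the bug is protected by CI until someone notices and fixes both the code and the test.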
Not to mention the rest of the technical debt added... measuring productivity in software development by the number of tasks completed is so wrong.
Empirical studies like this are hard to conduct... I'm curious though. This study was authored by at least two folks from Microsoft and one of the sample groups in the study was also from Microsoft. The reason this seems to stand out to me as odd is that Microsoft also owns the AI tool being used in the study and would definitely want a favourable conclusion in this paper.
Is that a flag we should be watching out for?
Can someone potentially smarter than me explain how the data, which in Table I clearly shows the majority of the means for each experiment metric being smaller than the SD, could even hope to be salvaged? Taken blindly, the results range from simply unbelievable to outright lying, the sort of thing you see submitted to garbage open-access journals. The text describing the model they employ afterwards is not convincing enough for me and seems light on details. I mean, wouldn't any reasonable reviewer demand more?
I know preprints don't need polish but this is even below the standard of a preprint, imo.
Reminds me of a situation I've been in a few times already:
Dev: Hey einpoklum, how do I do XYZ?
Me: Hmm, I think I remember that... you could try AB and then C.
Dev: Ok, but isn't there a better/easier way? Let me ask ChatGPT.
...
Dev: Hey einpoklum, ChatGPT said I should do AB and then C.
Me: Let me have a look at that for a second.
Me: Right, so it's just what I read on StackOverflow about this, a couple of years ago.
Sometimes it's even the answer that _I_ wrote on StackOverflow, and then I feel cheated.
I'm guiding a few, and sometimes they write pretty good code with the help of GPT, but then in meetings they have trouble understanding and explaining things.
I think it's a big productivity boost, but also a chance that the learning rate might actually be significantly slower.
Microsoft people on the research team proving Microsoft tools are good. Elsevier will now do ads in research papers.
This is a decently-thorough study, using PRs as a productivity metric while also tracking build failures (which remained constant at MSFT but increased at Accenture).
Would love to see it replicated by researchers at a company that does not have a clear financial interest in the outcome (the corresponding author here was working at Microsoft Research during the study period).
> Before moving on, we discuss an additional experiment run at Accenture that was abandoned due to a large layoff affecting 42% of participants
Eek
Study funded by Microsoft, conducted by Microsoft engineers, says Microsoft product makes engineers +X% more productive.
It's very exciting that generative AI lets people get more code written, especially in unfamiliar domains. It feels great to ship loads of code that does a thing. The sure result is an unprecedented increase in the amount of code in products.
A minor drawback to that enthusiasm is that a lot of the code I read didn't need to exist in the first place, even before this wave. Lots of it can be attributed to the path dependence of how it was created rather than what it's trying to do. This should be a rich time to move into security / exploit work: the random-search tools are great and the target just keeps getting easier.
What our industry really desperately needed was to drive the quality of implementation right down. It's going to be an exciting time to be alive.
> Notably, less experienced developers showed higher adoption rates and greater productivity gains.
And that is why demand for senior developers is going to go through the roof. Who is going to unfuck the giant balls of mud those inexperienced devs are slinging together? Who’s going to keep the lights on?
I've used Copilot and chatGPT to help with algorithms where I'm unsure where to start. Actual case: "Write a performant algorithm to find the number of work days given a future date". It's trickier than you think and makes a great interview question.
Both AI tools came back with... garbage. Loops within loops within loops, iterating through each day to check whether it's a weekend, whether the year is a leap year (to account for the extra day), whether it's a holiday, etc.
However, chatGPT provided a clever division to cut the dataset down to weeks, then process the result. I ended up using that portion in my final algorithm creation.
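For what it's worth, here's roughly how that week-division trick looks once cleaned up. This is my own simplified reconstruction (holidays deliberately left out), not the exact code ChatGPT or I ended up with:

    from datetime import date, timedelta
    from typing import Optional

    def workdays_until(target: date, start: Optional[date] = None) -> int:
        """Count weekdays (Mon-Fri) after `start` up to and including `target`.

        Simplified sketch: whole weeks contribute exactly 5 workdays each,
        so only the leftover 0-6 days need a day-by-day check. Holidays are
        ignored here to keep the illustration short.
        """
        start = start or date.today()
        if target <= start:
            return 0
        total_days = (target - start).days
        full_weeks, remainder = divmod(total_days, 7)
        workdays = full_weeks * 5
        # The leftover tail has the same weekday pattern as the first
        # `remainder` days after `start`, since weekdays repeat every 7 days.
        for i in range(1, remainder + 1):
            if (start + timedelta(days=i)).weekday() < 5:  # 0-4 = Mon-Fri
                workdays += 1
        return workdays

    # e.g. workdays_until(date(2024, 12, 31), start=date(2024, 12, 20)) == 7

The point of the trick is that once the whole-weeks arithmetic is in place, the loop only ever touches at most six days, no matter how far out the target date is.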
So, my take on AI coding tools is: "Buyer beware. Your results may vary".
The study doesn't cover the long term effects of using generative AI on a project, which is the deskilling of your workforce.
Because development will become an auction-like activity where the one that accepts more suggestions wins.
"Numbers of completed tasks" might not be the best metric to use here.
No two tasks are the same level of complexity and one task may take 5x longer than another to complete.
I am an average developer with more than five years' experience in Python. I was using ChatGPT to create prototypes of something I was familiar with, and I was able to debug the result to make it work. My mistake was not being specific enough to say that the e-paper display had seven colors instead of black and white.
When I was using ChatGPT to do the qualifiers for a CTF called Hack A Sat at DEF CON 31, I could not get anything to work, such as GNU Radio programs.
If you have the ability to debug, then in my experience it is productive; but when you don't understand the output, you run into problems.
The effect it’s had on my company: firing all of the overemployed retail devs, because all they submit is GPT slop that doesn’t even work.
On the surface, this appears to be great news.
However, there's a big question as to whether these are short-term productivity gains vs. longer-lasting gains. There's a hypothesis that AI-generated code will slowly spaghetti-fy a codebase.
Is 1-2 years long enough to take this into consideration? Or to disprove the spaghettification?
I think that over time, AI will tend to make all programmers into average programmers.
For those who are beginners, it can bring their skills up and make them look like better developers than they are.
More insidiously, expert programmers who overuse AI might also regress to the mean as their skills deteriorate.
> Notably, less experienced developers showed higher adoption rates and greater productivity gains.
This is what I’ve seen too. I don't think less experienced developers have gotten better in their understanding of anything, just more exposed and quicker, while I do think more experienced developers have stagnated.
B(i)ased as fuck.
I think this post from the other day adds some important context[0]. In that study, kids with access to GPT did way more practice problems but did worse on the test. But the most important part was the finding that while GPT usually got the final answer right, the logic behind it was wrong, which means the answer is effectively wrong. This is true for math and for code.
There's the joke that there are two types of 10x devs: those who do 10x work and those who finish 10x Jira tickets. The problem with this study is the assumptions it makes, which are quite common and naive in our industry. They assume PRs and commits are measures of productivity, and they assume passing review is a good quality signal. These are so variable between teams. Plenty of reviews are just "lgtm".
The issue here is that there's no real solid metric for things like good code. Meeting the goals of a ticket doesn't mean you haven't solved the problem so poorly that you're the reason 10 new tickets will be created. This is the real issue here, and the only real way to measure it is Potter Stewart's test (I know it when I see it), which requires an expert evaluator. In other words, tech debt. Which is something we're seeing a growing amount of, all the fucking enshittification.
So I don't think the study here contradicts [0]; in fact, I think they're aligned. But I suspect people who are poor programmers (or non-programmers) will use this as evidence for what they want to see, believing naive things like lines of code or number of commits/PRs are measures of productivity rather than hints at a measure. I'm all for "move fast and break things" as long as there's time set aside to clean up the fucking mess you left behind. But there never is. It's like we have business ADHD. There's so much lost productivity because so much focus is placed on short-term measurements and thinking. I know medium- and long-term thinking are hard, but humans do hard shit every day. We can do a lot better than a shoddy study like this.
ChatGPT and to a lesser degree Copilot have been very valuable to me this year.
Copilot often saves me a lot of typing on a 1-3 line scope, occasionally surprising me with exactly what I was about to write on a 5-10 line scope. It’s really good during rearrangement and early refactoring (as you are building a new thing and changing your mind as you go about code organization).
ChatGPT, or “Jimmy” - as I like to call him - has been great for answering syntax questions, idiom questions, etc. when applying my general skills based on other languages to ones I’m less familiar with.
It has also been good for “discussing” architecture approaches to a problem with respect to a particular toolset.
With proper guidance and very clear prompting, I usually get highly valuable responses.
My rough guess is that these two tools have saved me 2-3 months of solo time this year - nay, since April.
Once I get down into the deep details, I use Jimmy much less often. But when I hit something new, or something I long since forgot, he’s ready to be a relative expert / knowledge base.
You can easily evaluate the impact of GitHub Copilot on your org using Faros AI - https://github.com/marketplace/faros-ai (disclaimer - I work there)
I used to be super productive at the raw keyboard. Then RSI got to me. But with CoPilot, I’m back to my normal productivity. For me it’s a life-saver as it allows fast typing with minimal hand strain.
>Notably, less experienced developers showed higher adoption rates and greater productivity gains.
If you start low it's easier to get greater growth rates.
The biggest step is the first one: 0% to 1% is infinite growth.
Interestingly, they use the number of pull requests as the statistic for productivity. Not that I know of a better metric, but I wonder if it's an accurate one. It seems about as misguided as looking at lines of code.
If an AI tool makes me more productive, I would probably either spend the time won browsing the internet, or use it to attempt different approaches to solve the problem at hand. In the latter case, I would perhaps make more reliable or more flexible software. Which would also be almost impossible to measure in a scientific investigation.
In my experience, the differences in developer productivity are so enormous (depending on existing domain knowledge, motivation, or management approach), that it seems pretty hard to make any scientific claim based on looking at large groups of developers. For now, I prefer the individual success story.
Didn't Microsoft find way higher gains in a previous study? I wonder where the differences (also between companies) come from.
This is obvious. Right, of course you get an increase in productivity, especially as a junior - when current AI is able to solve leetcode.
BUT, as I think a lot of people have mentioned, you get code that the person who wrote it does not understand. So the next time there's a bug in it, good luck fixing it.
My take so far: AI is great, but only for non-critical, non-core code. Everything done for plotting and scripting is awesome (things that could take days to implement get done in minutes with AI), but core lib functions I wouldn't outsource to the AI right now.
It beats StackOverflow, but that might be saying more about SO.
Yes, but what about all the bugs created and time to debug?
I sometimes wonder if LeetCode killed the Software Star
Probably similar results as pair programming.
An interesting topic that will, given this site's community, attract a lot of strong opinions.
I, for one, only need to decide whether Copilot's productivity increase is worth the $10 it costs per month.
It doesn't really matter whether you're an employer getting a 3–30% increase in productivity or whether you pay for it personally, finish 2 hours earlier every week, and log off illegally. It's easily worth its money. What more is there to consider?
I sometimes wonder about whether the decline in IT worker quality is down to companies trying to force more and more roles onto each worker to reduce headcount.
Developers, Operations, and Security used to be dedicated roles.
Then we made DevOps and some businesses took that to mean they only needed 2/3 of the headcount, rather than integrating those teams.
Then we made DevSecOps, and some businesses took that to mean they only needed 1/3 the original roles, and that devs could just also be their operations and appsec team.
That's not a knock on shift-left and integrated operations models; those are often good ideas. It's just the logical outcome of those models when execs think they can get a bigger bonus by cutting costs through headcount reductions.
Now you have new devs coming into insanely complex n-microservice environments, being asked to learn the existing codebase, being asked to learn their 5-tool CI/CD pipelines (and that ain't being taught in school), being asked to learn to be DBAs, and also to keep up a steady code release cycle.
Is anyone really surprised they are using ChatGPT to keep up?
This is going to keep happening until IT companies stop cutting headcount just to make the line go up, rather than as part of a good business strategy.