simonw17 hours ago
Here's the full paper, which has a lot of details missing from the summary linked above: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.
They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.
So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.
A quarter of the participants saw increased performance, 3/4 saw reduced performance.
One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:
> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.
My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.
ivanovm11 hours ago
I find the very popular response of "you're just not using it right" to be a big cop-out for LLMs, especially at the scale we see today. It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user. Typically if a user doesn't find value in the product, we agree that the product is poorly designed/implemented, not that the user is bad. But AI seems somehow exempt from this sentiment.
viraptor11 hours ago
> It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.
It's completely normal in development. How many years of programming experience do you need for almost any language? How many days or weeks do you need to use debuggers effectively? How long from first contact with version control until you really get git?
I think it's the opposite actually - it's common that new classes of tools in tech need experience to use well. Much less if you're moving to something different within the same class.
intended6 hours ago
> LLMs, especially at the scale we see today
The OP is pointing out that the marketing cycle for this product is beyond extreme, in a category of its own.
Normal people are being told to worry about AI ending the world, or all jobs disappearing.
Simply saying “the problem is the user”, without acknowledging the degree of hype and expectation setting, is irresponsible.
TeMPOraL4 hours ago
AI marketing isn't extreme - not on the LLM vendor side, at least; the hype is generated downstream of it, for various reasons. And it's not the marketing that's saying "you're using it wrong" - it's other users. So, unless you believe everyone reporting good experience with LLMs is a paid shill, there might actually be some merit to it.
intended3 hours ago
It is extreme, and on the vendor side. The OpenAI non-profit vs. for-profit saga was about profit seeking vs. the future of humanity. People are talking about programming 3.0.
I can appreciate that it’s other users who are saying it’s wrong, but that doesn’t address the point about ignoring the context.
Moreover, it’s unhelpful communication. It gives up on acknowledging a mutually shared context: the natural confusion that would arise from the ambiguous, high-level hype versus the actual down-to-earth reality.
Even if you have found a way to make it work, having someone understand your workflow can’t happen without connecting the dots between their frame of reference and yours.
pera20 minutes ago
It really is. For example, here is a quote from AI 2027:
> By early 2030, the robot economy has filled up the old SEZs, the new SEZs, and large parts of the ocean. The only place left to go is the human-controlled areas. [...]
> The new decade dawns with Consensus-1’s robot servitors spreading throughout the solar system. By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun. The surface of the Earth has been reshaped into Agent-4’s version of utopia: datacenters, laboratories, particle colliders, and many other wondrous constructions doing enormously successful and impressive research.
This scenario prediction, which is co-authored by a former OpenAI researcher (now at Future of Humanity Institute), received almost 1 thousand upvotes here on HN and the attention of the NYT and other large media outlets.
If you read that and still don't believe the AI hype is _extreme_ then I really don't know what else to tell you.
--
OccamsMirror2 hours ago
I think the relentless podcast blitz by OpenAI and Anthropic founders suggests otherwise. They're both keen to confirm that yes, in 5 - 10 years, no one will have any jobs any more. They're literally out there discussing a post employment world like it's an inevitability.
That's pretty extreme.
disgruntledphd2 an hour ago
Those billions won't raise themselves, you know.
More generally, these execs are talking their book: they're in a low-margin, capital-intensive business whose future is entirely dependent on raising a bunch more money, so hype and insane claims are necessary for funding.
Now, maybe they do sortof believe it, but if so, why do they keep hiring software engineers and other staff?
carschno3 hours ago
It's called grassroots marketing. It works particularly well in the context of GenAI because it is fed with esoteric and ideological fragments that overlap with common beliefs and political trends. https://en.wikipedia.org/wiki/TESCREAL
Therefore, classical marketing is less dominant, although more present at down-stream sellers.
TeMPOraL3 hours ago
Right. Let's take a bunch of semi-related groups I don't like, and make up an acronym for them so any of my criticism can be applied to some subset of those groups in some form, thus making it seem legitimate and not just a bunch of half-assed strawman arguments.
Also, I guess you're saying I'm a paid shill, or have otherwise been brainwashed by marketing of the vendors, and therefore my positive experiences with LLMs are a lie? :).
I mean, you probably didn't mean that, but part of my point is that you see those positive reports here on HN too, from real people who've been in this community for a while and are not anonymous Internet users - you can't just dismiss that as "grassroots marketing".
carschno an hour ago
> I mean, you probably didn't mean that
Correct, I think you've read too much into it. Grassroots marketing is not a pejorative term, either. Its strategy is indeed to trigger positive reviews of your product, ideally by independent, credible community members.
That implies that those community members have motivations other than being paid. Ideologies and shared beliefs can be some of them. Being happy about the product is a prerequisite, whatever that means for the individual user.
Avshalom10 hours ago
Linus did not show up in front of congress talking about how dangerously powerful unregulated version control was to the entirety of human civilization a year before he debuted Git and charged thousands a year to use it.
sanderjd9 hours ago
This seems like a non sequitur. What does this have to do with this thread?
Avshalom2 hours ago
It is completely reasonable to hold cursor/claude to a different standard than gdb or git.
staunton 41 minutes ago
What standard would that be?
viraptor9 hours ago
Ok. You seem to be talking about a completely different issue: regulation.
KaiserPro3 hours ago
> How many days/weeks you need to use debuggers effectively
I understand your point, but would counter with: gdb isn't marketed as a cuddly tool that can let anyone do anything.
blub6 hours ago
It is completely typical, but at the same time abnormal to have tools with such poor usability.
A good debugger is very easy to use. I remember the Visual Studio debugger or the C++ debugger on Windows were a piece of cake 20 years ago, while gdb is still painful today. Java and .NET had excellent integrated debuggers while golang had a crap debugging story for so long that I don’t even use a debugger with it. In fact I almost never use debuggers any more.
Version control - same story. CVS for all its problems I had learned to use almost immediately and it had a GUI that was straightforward. git I still have to look up commands for in some cases. Literally all the good git UIs cost a non-trivial amount of money.
Programming languages are notoriously full of unnecessary complexity. Personal pet peeve: Rust lifetime management. If this is what it takes, just use GC (and I am - golang).
pbasista4 hours ago
> git I still have to look up commands for in some cases
I believe that this is okay. One does not need to know the details about every specific git command in order to be able to use it efficiently most of the time.
It is the same with a programming language. Most people are unfamiliar with every peculiarity of every standard library function that the language offers. And that is okay. It does not prevent them from using the language efficiently most of the time.
Also in other aspects of life, it is unnecessary to know everything from memory. For example, one does not need to remember exactly how to replace a blade on a lawn mower. But that is okay. It does not prevent them from using it efficiently most of the time.
The point is that if something is done less often, it is unnecessary to remember the specifics of it. It is fine to look it up when needed.
zingar4 hours ago
Nitpick: magit for emacs is good enough that everyone I’ve seen talk about it describes it as “the best git client”, and it is completely free.
Lerc10 hours ago
>It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.
Is that perhaps because of the nature of the category of 'tech product'? In other domains, this certainly isn't the case. Especially if the goal is to get the best result instead of the optimum output/effort balance.
Musical instruments are a clear case where the best results are down to the user. Most crafts are similar. There is the proverb "A bad craftsman blames his tools" that highlights that there are entire fields where the skill of the user is considered to be the most important thing.
When a product is aimed at as many people as the marketers can find, that focus on individual ability is lost and the product targets the lowest common denominator.
They are easier to use, but less capable at their peak. I think of the state of LLMs as analogous to home computing at a stage of development somewhere around Altair to TRS-80 level. These are the first ones on the scene, people are exploring what they are good for, how they work, and sometimes putting them to effective use in new and interesting ways. It's not unreasonable to expect a degree of expertise at this stage.
The LLM equivalent of a Mac will come, plenty of people will attempt to make one before it's ready. There will be a few Apple Newtons along the way that will lead people to say the entire notion was foolhardy. Then someone will make it work. That's when you can expect to use something without expertise. We're not there yet.
sanderjd9 hours ago
> It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.
Maybe, but it isn't hard to think of developer tools where this is the case. This is the entire history of editor and IDE wars.
Imagine running this same study design with vim. How well would you expect the not-previously-experienced developers to perform in such a study?
fingerlocks5 hours ago
No one is claiming 10x perf gains in vim.
It’s just a fun geeky thing to use with a lot of zany customizations. And after two hellish years of memory muscling enough keyboard bindings to finally be productive, you earned it! It’s a badge of pride!
But we all know you’re still fat fingering ggdG on occasion and silently cursing to yourself.
TeMPOraL4 hours ago
> No one is claiming 10x perf gains in vim.
Sure they are - or at least were, until the last couple years. Same thing with Emacs.
It's hard to claim this now, because the entire industry shifted towards webshit and cloud-based practices across the board, and the classical editors just can't keep up with VS Code. Despite the latter introducing LSP, which leveled the playing field wrt. code intelligence itself, the surrounding development process and the ecosystem increasingly demands you use web-based or web-derived tools and practices, which all see a browser engine as a basic building block. Classical editors can't match the UX/DX on that, plus the whole thing breaks basic assumptions about UI that were the source of the "10x perf gains" in vim and Emacs.
Ironically, a lot of the perf gains from AI come from letting you avoid dealing with the brokenness of the current tools and processes, that vim and Emacs are not equipped to handle.
fingerlocks2 hours ago
Yeah I’m in my 40s and have been using vim for decades. Sure there was an occasional rando stirring up the forums about made-up productivity gains to get some traffic to their blog, but that was it. There has always been push back from many of the strongest vim advocates that the appeal is not about typing speed or whatever it was they were claiming. It’s just ergonomics and power.
It’s just not comparable to the LLM crazy hype train.
And to belabor your other point, I have treesitter, lsp, and GitHub Copilot agent all working flawlessly in neovim. Ts and lsp are neovim builtins now. And it’s custom built for exactly how I want it to be, and none of that blinking shit or nagging dialog boxes all over VSCode.
I have VScode and vim open to the same files all day quite literally side by side, because I work at Microsoft, share my screen often, and there are still people that have violent allergic reactions to a terminal and vim. Vim can do everything VSCode does and it’s not dogshit slow.
oytis3 hours ago
What I like about IDE wars is that it remained a dispute between engineers. Some engineers like fancy pants IDEs and use them, some are good with vim and stick with that. No one ever assumed that Jetbrains autocomplete is going to replace me or that I am outdated for not using it - even if there might be a productivity cost associated with that choice.
edmundsauto10 hours ago
New technologies that require new ways of thinking are always this way. "Google-fu" was literally a hirable career skill in 2004 because nobody knew how to search to get optimal outcomes. They've done alright improving things since then - let's see how good Cursor is in 10 years.
lmeyerov2 hours ago
It's a specialist tool. You wouldn't be surprised that it took a while for people to get the hang of typed programming, parallel programming, Docker, IaC, etc. either.
We have 2 sibling teams, one the genAI devs and the other the regular GPU product devs. It is entirely unsurprising to me that the genAI developers are successfully using coding agents with long-running plans, while the GPU developers are still more at the level of chat-style back-and-forth.
At the same time, everyone sees the potential, and just like other automation movements, are investing in themselves and the code base.
Maxious7 hours ago
>It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.
Apple's Response to iPhone 4 Antenna Problem: You're Holding It Wrong https://www.wired.com/2010/06/iphone-4-holding-it-wrong/
wiether5 hours ago
I don't see how the Antennagate can be qualified as "acceptable" since it caused a big public uproar and Apple had to settle a class action lawsuit.
https://www.businessinsider.com/apple-antennagate-scandal-ti...
8note5 hours ago
It didn't end the iPhone as a brand, or end smartphones altogether, though.
How much did that uproar and settlement matter?
davely4 hours ago
Mobile phone manufacturers were telling users this long before the iPhone was ever invented.
e.g., Nokia 1600 user guide from 2005 (page 16) [0]
[0] https://www.instructionsmanuals.com/sites/default/files/2019...
TeMPOraL4 hours ago
The important difference is that in your example, it was the manufacturer telling customers they're holding it wrong. With LLMs, the vendors say no such things - it's the actual users that are saying this to their peers.
jeswin6 hours ago
Not every tool can be figured out in a day (or a week or more). That doesn't mean that the tool is useless, or that the user is incapable.
milchek9 hours ago
I think the reason for that is maybe you’re comparing to traditional products that are deterministic or have specific features that add value?
If my phone keeps crashing or if the browser is slow or clunky then yes, it’s not on me, it’s the phone, but an LLM is a lot more open ended in what it can do. Unlike the phone example above where I expect it to work from a simple input (turning it on) or action (open browser, punch in a url), what an LLM does is more complex and nuanced.
Even the same prompt from different users might result in different output - so there is more onus on the user to craft the right input.
Perhaps that’s why AI is exempt for now.
DanielVZ3 hours ago
> It's hard to think of any other major tech product where it's acceptable to shift so much blame on the user.
Sorry to be pedantic but this is really common in tech products: vim, emacs, any second-brain app, effectiveness of IDEs depending on learning its features, git, and more.
ndsipa_pomu3 hours ago
Well, surely vim is easy to use - I started it and haven't stopped using it yet (one day I'll learn how to exit).
[deleted] 8 hours ago
ay9 hours ago
Just a few examples: Bicycle. Car (driving). Airplane (piloting). Welder. CNC machine. CAD.
All take quite an effort to master; until then they might slow one down or outright kill.
narush17 hours ago
Hey Simon -- thanks for the detailed read of the paper - I'm a big fan of your OS projects!
Noting a few important points here:
1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, the only concern most external reviewers had about experience was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (not because AI got better, but because the without-AI baseline got worse). In other words, we're sorta in between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!
(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)
jspdown2 minutes ago
With today's state of LLMs and agents, they're still not good for all tasks. It took me a couple of weeks before being able to correctly adjust what I can ask and what I can expect. As a result, I don't use Claude Code for everything, and I think I'm able to better pick the right task and the right size of task to give it. These adjustments depend on what you are doing, and on the complexity and maturity of the project at play.
Very often, I have entire tasks that I can't offload to the Agent. I won't say I'm 20x more productive, it's probably more in the range of 15% to 20% (but I can't measure that obviously).
bilbo-b-baggins13 minutes ago
Your next study should be very experienced devs working in new or early life repos where AI shines for refactoring and structured code suggestion, not to mention documentation and tests.
It’s much more useful getting something off the ground than maintaining a huge codebase.
amirhirsch15 hours ago
Figure 6, which breaks down the time spent doing different tasks, is very informative -- it suggests: 15% less active coding, 5% less testing, 8% less research and reading, 4% more idle time, and 20% more AI interaction time.
The 28% less coding/testing/research is why developers reported 20% less work. You might be spending 20% more time overall "working" while you are really idle 5% more time and feel like you've worked less because you were drinking coffee and eating a sandwich between waiting for the AI and reading AI output.
I think the AI skill boost comes from having workflows that let you shave half that git-ops time, cut an extra 5% off coding, and trade the idle/waiting for prompting parallel agents and a bit more testing -- then you really are a 2x dev.
viraptor10 hours ago
> You might be spending 20% more time overall "working" while you are really idle 5% more time and feel like you've worked less because you were drinking coffee and eating a sandwich between waiting for the AI and reading AI output.
This is going to be interesting long-term. Realistically people don't spend anywhere close to 100% of time working and they take breaks after intense periods of work. So the real benefit calculation needs to include: outcome itself, time spent interacting with the app, overlap of tasks while agents are running, time spent doing work over a long period of time, any skill degradation, LLM skills, etc. It's going to take a long time before we have real answers to most of those, much less their interactions.
amirhirsch15 hours ago
I just realized the figure is showing the time breakdown as a percentage of total time. It would be more useful to show absolute time (hours) for those side-by-side comparisons, since the implied hours would boost the AI bars' height by 18%.
narush15 hours ago
There's additional breakdown per-minute in the appendix -- see appendix E.4!
Guillaume86 an hour ago
Using devs working in their own repository is certainly understandable, but it might also explain in part the results. Personally I barely use AI for my own code, while on the other hand when working on some one off script or unfamiliar code base, I get a lot more value from it.
simonw17 hours ago
Thanks for the detailed reply! I need to spend a bunch more time with this I think - above was initial hunches from skimming the paper.
narush17 hours ago
Sounds great. Looking forward to hearing more detailed thoughts -- my emails in the paper :)
jdp23 16 hours ago
Really interesting paper, and thanks for the followon points.
The over-optimism is indeed a really important takeaway, and agreed that it's not tool-dependent.
paulmist17 hours ago
Were participants given time to customize their Cursor settings? In my experience tool/convention mismatch kills Cursor's productivity - once it gets going with a wrong library or doesn't use the project's functions I will almost always reject the code and re-prompt. But, especially for large projects, having a well-crafted repo prompt mitigates most of these issues.
gojomo16 hours ago
Did each developer do a large enough mix of AI/non-AI tasks, in varying orders, that you have any hints in your data whether the "AI penalty" grew or shrunk over time?
narush15 hours ago
You can see this analysis in the factor analysis of "Below-average use of AI tools" (C.2.7) in the paper [1], which we mark as an unclear effect.
TLDR: over the first 8 issues, developers do not appear to get majorly less slowed down.
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
gojomo14 hours ago
Thanks, that's great!
But: if all developers did 136 AI-assisted issues, why only analyze excluding the 1st 8, rather than, say, the first 68 (half)?
narush13 hours ago
Sorry, this is the first 8 issues per-developer!
grey-area16 hours ago
Well, there are two possible interpretations here of 75% of participants (all of whom had some experience using LLMs) being slower using generative AI:
LLMs have a v. steep and long learning curve as you posit (though note the points from the paper authors in the other reply).
Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.
atiedebee14 hours ago
Let me bring you a third (not necessarily true) interpretation:
The developer who has experience using cursor saw a productivity increase not because he became better at using cursor, but because he became worse at not using it.
literalAardvark34 minutes ago
Became worse is possible
Became worse in 50 hours? Super unlikely
card_zero14 hours ago
Or, one person in 16 has a particular personality, inclined to LLM dependence.
cutemonster12 hours ago
Didn't they rather mean:
Developers' own skills might atrophy, when they don't write that much code themselves, relying on AI instead.
And now when comparing with/without AI they're faster with. But a year ago they might have been that fast or faster without an AI.
I'm not saying that that's how things are. Just pointing out another way to interpret what GP said
runarberg13 hours ago
Invoking personality is to the behavioral sciences as invoking God is to the natural sciences. One can explain anything by appealing to personality, and as such it explains nothing. Psychologists have been trying to make sense of personality for over a century without much success (the best effort so far has been a five-factor model [Big 5], which has ultimately pretty minor predictive value), which is why most behavioral scientists have learned to simply leave personality to the philosophers and concentrate on much simpler theoretical frameworks.
A much simpler explanation is what your parent offered. And to many behavioralists it is actually the same explanation, as to a true scotsm... [cough] behavioralist personality is simply learned habits, so—by Occam’s razor—you should omit personality from your model.
suddenlybananas5 hours ago
Behaviorism is a relic of the 1950s
card_zero13 hours ago
Fair comment, but I'm not down with behavioralism, and people have personalities, regrettably.
runarberg13 hours ago
This is still ultimately research within the field of the behavioral sciences, and as such the laws of human behavior apply, where behaviorism offers a far more successful theoretical framework than personality psychology.
Nobody is denying that people have personalities btw. Not even true behavioralists do that, they simply argue from reductionism that personality can be explained with learning contingencies and the reinforcement history. Very few people are true behavioralists these days though, but within the behavior sciences, scientists are much more likely to borrow missing factors (i.e. things that learning contingencies fail to explain) from fields such as cognitive science (or even further to neuroscience) and (less often) social science.
What I am arguing here, however, is that the appeal to personality is unnecessary when explaining behavior.
As for figuring out what personality is, that is still within the realm of philosophy. Maybe cognitive science will do a better job at explaining it than psychometricians have done for the past century. I certainly hope so, it would be nice to have a better model of human behavior. But I think even if we could explain personality, it still wouldn’t help us here. At best we would be in a similar situation as physics, where one model can explain things traveling at the speed of light, while another model can explain things at the sub-atomic scale, but the two models cannot be applied together.
burnte14 hours ago
> Current LLMs just are not as good as they are sold to be as a programming assistant and people consistently predict and self-report in the wrong direction on how useful they are.
I would argue you don't need the "as a programming assistant" phrase: from my experience over the past 2 years, literally every single AI tool right now is massively oversold as to its utility. I've literally not seen a single one that delivers on what it's billed as capable of.
They're useful, but right now they need a lot of handholding and I don't have time for that. Too much fact checking. If I want a tool I always have to double check, I was born with a memory so I'm already good there. I don't want to have to fact check my fact checker.
LLMs are great at small tasks. The larger the single task is, or the more tasks you try to cram into one session, the worse they fall apart.
[deleted] 12 hours ago
steveklabnik15 hours ago
> Current LLMs
One thing that happened here is that they aren't using current LLMs:
> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.
That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.
blibble15 hours ago
> One thing that happened here is that they aren't using current LLMs
I've been hearing this for 2 years now
the previous model retroactively becomes total dogshit the moment a new one is released
convenient, isn't it?
nalllar14 hours ago
If you interact with internet comments and discussions as an amorphous blob of people you'll see a constant trickle of the view that models now are useful, and before were useless.
If you pay attention to who says it, you'll find that people have different personal thresholds for finding llms useful, not that any given person like steveklabnik above keeps flip-flopping on their view.
This is a variant on the goomba fallacy: https://englishinprogress.net/gen-z-slang/goomba-fallacy-exp...
steveklabnik15 hours ago
Sorry, that’s not my take. I didn’t think these tools were useful until the latest set of models, that is, they crossed the threshold of usefulness to me.
Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.
mattmanser14 hours ago
Do you really see a massive jump?
For context, I've been using AI, a mix of OpenAi + Claude, mainly for bashing out quick React stuff. For over a year now. Anything else it's generally rubbish and slower than working without. Though I still use it to rubber duck, so I'm still seeing the level of quality for backend.
I'd say they're only marginally better today than they were even 2 years ago.
Every time a new model comes out you get a bunch of people raving how great the new one is and I honestly can't really tell the difference. The only real difference is reasoning models actually slowed everything down, but now I see its reasoning. It's only useful because I often spot it leaving out important stuff from the final answer.
simonw13 hours ago
The massive jump in the last six months is that the new set of "reasoning" models got really good at reasoning about when to call tools, and were accompanied by a flurry of tools-in-loop coding agents - Claude Code, OpenAI Codex, Cursor in Agent mode, etc.
An LLM that can test the code it is writing and then iterate to fix the bugs turns out to be a huge step forward from LLMs that just write code without trying to then exercise it.
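In sketch form, that loop looks something like the snippet below (a minimal illustration, not how any particular product is implemented; `call_llm` is a hypothetical stub for whatever model API is in use, and pytest is assumed as the test runner):

    import subprocess

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a chat-completion call to your model of choice.
        raise NotImplementedError("wire up a real model client here")

    def run_tests() -> tuple[bool, str]:
        # Run the project's test suite and capture its output.
        proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def agent_loop(task: str, path: str, max_rounds: int = 5) -> bool:
        # Write code, run the tests, and feed real failures back until they pass.
        prompt = f"Task: {task}\nReturn the complete contents of {path}."
        for _ in range(max_rounds):
            code = call_llm(prompt)
            with open(path, "w") as f:
                f.write(code)
            ok, output = run_tests()
            if ok:
                return True
            # The step that changed things: the model sees the actual failure output.
            prompt = (f"Task: {task}\nThe tests failed with:\n{output}\n"
                      f"Return a corrected, complete {path}.")
        return False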
vidarh12 hours ago
I've gone from asking the tools how to do things, and cut and pasting the bits (often small) that'd be helpful, via using assistants that I'd review every decision of and often having to start over, to now often starting an assistant with broad permissions and just reviewing the diff later, after they've made the changes pass the test suite, run a linter and fixed all the issues it brought up, and written a draft commit message.
The jump has been massive.
steveklabnik14 hours ago
Yes. In January I would have told you AI tools are bullshit. Today I’m on the $200/month Claude Max plan.
As with anything, your miles may vary: I’m not here to tell anyone that thinks they still suck that their experience is invalid, but to me it’s been a pretty big swing.
Uehreka14 hours ago
> In January I would have told you AI tools are bullshit. Today I’m on the $200/month Claude Max plan.
Same. For me the turning point was VS Code’s Copilot Agent mode in April. That changed everything about how I work, though it had a lot of drawbacks due to its glitches (many of these were fixed within 6 or so weeks).
When Claude Sonnet 4 came out in May, I could immediately tell it was a step-function increase in capability. It was the first time an AI, faced with ambiguous and complicated situations, would be willing to answer a question with a definitive and confident “No”.
After a few weeks, it became clear that VS Code’s interface and usage limits were becoming the bottleneck. I went to my boss, bullet points in hand, and easily got approval for the Claude Max $200 plan. Boom, another step-function increase.
We’re living in an incredibly exciting time to be a skilled developer. I understand the need to stay skeptical and measure the real benefits, but I feel like a lot of people are getting caught up in the culture war aspect and are missing out on something truly wonderful.
mattmanser13 hours ago
Ok, I'll have to try it out then. I've got a side project I've 3/4 finished and will let it loose on it.
So are you using Claude Code via the max plan, Cursor, or what?
I think I'd definitely hit AI news exhaustion and was viewing people raving about this agentic stuff as yet more AI fanbois. I'd just continued using the AI separately, as setting up a new IDE seemed like too much work for the fractional gains I'd been seeing.
steveklabnik13 hours ago
I had a bad time with Cursor. I use Claude Code inside of VS Code. You don't necessarily need Max, but you can spend a lot of money very quickly on API tokens, so I'd recommend to anyone trying: start with the $20/month one, no need to spend a ton of money just to try something out.
There is a skill gap, like, I think of it like vim: at first it slows you down, but then as you learn it, you end up speeding up. So you may also find that it doesn't really vibe with the way you work, even if I am having a good time with it. I know people who are great engineers who still don't like this stuff, just like I know ones that do too.
mh-10 hours ago
Worth noting for the folks asking: there's an official Claude Code extension for VS Code now [0]. I haven't tried it personally, but that's mostly because I mainly use the terminal and vim.
[0]: https://marketplace.visualstudio.com/items?itemName=anthropi...
steveklabnik8 hours ago
Yes, it’s not necessary but it is convenient for viewing diffs in Code’s diff view. The terminal is a fine way to interact with it though.
8note4 hours ago
I'd say that's not gonna be the best use for it, unless what you really want is to first document in detail everything about it.
I'm using Claude + VS Code's Cline extension for the most part, but where it tends to excel is helping you write documentation, and then using that documentation to write reasonable code.
If you're 3/4 of the way done, a lot of the docs it wants in order to work well are gonna be missing, and so a lot of your intentions about why you did or didn't make certain choices will be missing. If you've got good docs, make sure to feed those in as context.
The agentic tool on its own is still kinda meh if you only try to write code directly from it. Definitely better than the non-agentic stuff, but if you start by getting it to document stuff, and have it ask you questions about what it should know in order to make the change, it's pretty good.
Even if you don't get perfect code, or it spins in a feedback loop where it's lost the plot, those questions it asks can be super handy in terms of code patterns that you haven't thought about that apply to your code, and things that would usually be undefined behaviour.
My raving is that I get to leave behind useful docs in my code packages, and my team members get access to and use those docs, without the usual discoverability problems. And I get those docs for... somewhat slower than I could have written the code myself, but much, much faster than if I also had to write those docs.
hombre_fatal14 hours ago
I see a massive jump every time.
Just two years ago, this failed.
> Me: What language is this: "esto está escrito en inglés"
> LLM: English
Gemini and Opus have solved questions that took me weeks to solve myself. And I'll feed some complex code into each new iteration and it will catch a race condition I missed even with testing and line by line scrutiny.
Consider how many more years of experience you need as a software engineer to catch hard race conditions just from reading code than someone who couldn't do it after trying 100 times. We take it for granted already since we see it as "it caught it or it didn't", but these are massive jumps in capability.
ipaddr14 hours ago
Wait until the next set. You will find that the previous ones weren't useful after all.
steveklabnik14 hours ago
This makes no sense to me. I’m well aware that I’m getting value today, that’s not going to change in the future: it’s already happened.
Sure they may get even more useful in the future but that doesn’t change my present.
bix6 13 hours ago
Everything actually got better. Look at the image generation improvements as an easily visible benchmark.
I do not program for my day job and I vibe coded two different web projects. One in twenty mins as a test with cloudflare deployment having never used cloudflare and one in a week over vacation (and then fixed a deep safari bug two weeks later by hammering the LLM). These tools massively raise the capabilities for sub-average people like me and decrease the time / brain requirements significantly.
I had to make a little update to reset the KV store on cloudflare and the LLM did it in 20s after failing the syntax twice. I would’ve spent at least a few minutes looking it up otherwise.
mwigdahl12 hours ago
I've been a proponent for a long time, so I certainly fit this at least partially. However, the combination of Claude Code and the Claude 4 models has pushed the response to my demos of AI coding at my org from "hey, that's kind of cool" to "Wow, can you get me an API key please?"
It's been a very noticeable uptick in power, and although there have been some nice increases with past model releases, this has been both the largest and the one that has unlocked the most real value since I've been following the tech.
achierius11 hours ago
Is that really the case vs. 3.7? For me that was the threshold, and since then the improvements have been nice but not as significant.
mwigdahl11 hours ago
I would agree with you that the jump from Sonnet 3.7 to Sonnet 4 feels notable but not shocking. Opus 4 is considerably better, and Opus 4 combined with the Claude Code harness is what really unlocks the value for me.
cfst14 hours ago
The current batch of models, specifically Claude Sonnet and Opus 4, are the first I've used that have actually been more helpful than annoying on the large mixed-language codebases I work in. I suspect that dividing line differs greatly between developers and applications.
Aeolun12 hours ago
It’s true though? Previous models could do well in specifically created settings. You can throw practically everything at Opus, and it’ll work mostly fine.
simonw15 hours ago
The previous model retroactively becomes not as good as the best available models. I don't think that's a huge surprise.
cwillu15 hours ago
The surprise is the implication that the crossover between net-negative and net-positive impact happened to be in the last 4 months, in light of the initial release 2 years ago and sufficient public attention for a study to be funded and completed.
Yes, it might make a difference, but it is a little tiresome that there's always a “this is based on a model that is x months old!” comment, because it will always be true: an academic study does not get funded, executed, written up, and published in less time.
Ntrails14 hours ago
Some of it is just that (probably different) people said the same damn things 6 months ago.
"No, the 2.8 release is the first good one. It massively improves workflows"
Then, 6 months later, the study comes out.
"Ah man, 2.8 was useless, 3.0 really crossed the threshold on value add"
At some point, you roll your eyes and assume it is just snake oil sales
steveklabnik14 hours ago
There’s a lot of confounding factors here. For example, you could point to any of these things in the last ~8 months as being significant changes:
* the release of agentic workflow tools
* the release of MCPs
* the release of new models, Claude 4 and Gemini 2.5 in particular
* subagents
* asynchronous agents
All or any of these could have made for a big or small impact. For example, I’m big on agentic tools, skeptical of MCPs, and don’t think we yet understand subagents. That’s different from those who, for example, think MCPs are the future.
> At some point, you roll your eyes and assume it is just snake oil sales
No, you have to realize you’re talking to a population of people, and not necessarily the same person. Opinions are going to vary, they’re not literally the same person each time.
There are surely snake oil salesman, but you can’t buy anything from me.
Filligree14 hours ago
Or you accept that different people have different skill levels, workflows and goals, and therefore the AIs reach usability at different times.
rsynnott15 minutes ago
The complication is that, as noted in the above paper, _people are bad at self-reporting on whether the magic robot works for them_. Just because someone _believes_ they are more effective using LLMs is not particularly strong evidence that they actually are.
foobarqux14 hours ago
That's not the argument being made, though. The argument is that it does "work" now, implying that it didn't quite work before; except that this is the same thing the same people say for every model release, including at the time of release of the previous one, which is now acknowledged to be seriously flawed; and including the future one, at which point the current models will similarly be acknowledged to be not only less performant than the future models, but inherently flawed.
Of course it's possible that at some point you get to a model that really works, irrespective of the history of false claims from the zealots, but it does mean you should take their comments with a grain of salt.
steveklabnik14 hours ago
> That's not the argument being made though, which is that it does "work" now and implying that actually it didn't quite work before
Right.
> except that that is the same thing the same people say for every model release,
I did not say that, no.
I am sure you can find someone who is in a Groundhog Day about this, but it’s just simpler than that: as tools improve, more people find them useful than before. You’re not talking to the same people, you are talking to new people each time who now have had their threshold crossed.
blibble14 hours ago
> You’re not talking to the same people, you are talking to new people each time who now have had their threshold crossed.
no, it's the same names, again and again
simonw13 hours ago
Got receipts?
That sounds like a claim you could back up with a little bit of time spent using Hacker News search or similar.
(I might try to get a tool like o3 to run those searches for me.)
blibble13 hours ago
try asking it what sealioning is
maxbond9 hours ago
You've no obligation to answer, no one is entitled to your time, but it's a reasonable request. It's not sealioning to respectfully ask for directly relevant evidence that takes about 10-15m to get.
pdabbadabba15 hours ago
Maybe it's convenient. But isn't it also just a fact that some of the models available today are better than the ones available five months ago?
bryanrasmussen14 hours ago
Sure, but after having spent some time trying to get anything useful - programmatically - out of previous models and not getting anything, how much time should one spend once a new one is announced?
Sure, you may end up missing out on a good thing and then having to come late to the party, but coming early to the party too many times, only to find the beer watered down and the food full of grubs, is apt to make you cynical the next time a party announcement comes your way.
Terr_14 hours ago
Plus it's not even possible to miss the metaphorical party: If it gets going, it will be quite obvious long before it peaks.
(Unless one believes the most grandiose prophecies of a technological-singularity apocalypse, that is.)
Terr_14 hours ago
That's not the issue. Their complaint is that proponents keep revising what ought to be fixed goalposts... Well, fixed unless you believe unassisted human developers are also getting dramatically better at their jobs every year.
Like the boy who cried wolf, it'll eventually be true with enough time... But we should stop giving them the benefit of the doubt.
_____
Jan 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Feb 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Mar 2025: "Ignore last month's models, they aren't good enough to show a marked increase in human productivity, test with this month's models and the benefits are obvious."
Apr 2025: [Ad nauseam, you get the idea]
pdabbadabba14 hours ago
Fair enough. For what it's worth, I've always thought that the more reasonable claim is that AI tools make poor-average developers more productive, not necessarily expert developers.
bluefirebrand12 hours ago
Personally I don't want poor-average developers to be more productive, I want them to be more expert
pdabbadabba9 hours ago
Sure. But what would you suppose the ratio is between expert, average, and mediocre coders in the average organization? I think a small minority would be in the first category, and I don’t see a technology on the horizon that will change that except for LLMs, which seem like they could make mediocre coders both more productive and produce higher quality output.
bluefirebrand5 hours ago
They definitely aren't producing higher quality output imo, but definitely producing low quality output faster
That's not a tradeoff that I like
Terr_9 hours ago
"Compared to last quarter, we've shipped 40% more spaghetti-code!"
jstummbillig15 hours ago
Convenient for whom and what...? There is nothing tangible to gain from you believing or not believing that someone else does (or does not) get a productivity boost from AI. This is not a religion and it's not crypto. An AI user's net worth is not tied to anyone else's use of or stance on AI (if anything, it's the opposite).
More generally, this phenomenon is quite simply explained and not surprising: new things improve, quickly. That does not mean that something is good or valuable, but it is how new tech gets introduced every single time, and it readily explains changing sentiment.
leshow13 hours ago
I think you're missing the broader context. There are a lot of people very invested in the maximalist outcome, which does create pressure for people to be boosters. You don't need a digital token for that to happen. There's a social media aspect as well that creates a feedback loop about claims.
We're in a hype cycle, and it means we should be extra critical when evaluating the tech so we don't get taken in by exaggerated claims.
jstummbillig12 hours ago
I mostly don't agree. Yes, there is always social pressure with these things, and we are in a hype cycle, but the people "buying in" are simply not doing much at all. They are mostly consumers, waiting for the next model, which they have no control over or stake in creating (by and large).
The people not buying into the hype, on the other hand, are actually the ones that have a very good reason to be invested, because if they turn out to be wrong they might face some very uncomfortable adjustments in the job landscape and in a lot of the skills that they worked so hard to gain and believed to be valuable.
As always, be wary of any claims, but the tension here is very much the reverse of crypto and I don't think that's well appreciated.
card_zero14 hours ago
I saw that edit. Indeed you can't predict that rejecting a new thing is part of a routine of being wrong. It's true that "it's strange and new, therefore I hate it" is a very human (and adorable) instinct, but sometimes it's reasonable.
jstummbillig14 hours ago
"I saw that edit" lol
card_zero14 hours ago
Sorry, just happened to. Slightly rude of me.
jstummbillig13 hours ago
Ah, you do you. It's just a fairly kindergarten thing to point out and not something I was actively trying to hide. Whatever it was.
Generally, I do a couple of edits for clarity after posting and reading again. Sometimes that involves removing something that I feel could have been said better. If it does not work, I will just delete the comment. Whatever it was must not have been a super huge deal (to me).
maxbond8 hours ago
FYI there's a "delay" setting in your profile that allows you to make your comment invisible for up to ten minutes.
grey-area14 hours ago
Honestly the hype cycle feels very like crypto, and just like crypto prominent vcs have a lot of money riding on the outcome.
jstummbillig14 hours ago
Of course, lots of hype, but my point is that the reason why is very different, and it matters: as an early bc adopter, making you believe in bc is super important to my net worth (and you not believing in bc makes me look like an idiot and lose a lot of money).
In contrast, what do I care if you believe in code generation AI? If you do, you are probably driving up pricing. I mean, I am sure that there are people that care very much, but there is little inherent value for me in you doing so, as long as the people who are building the AI are making enough profit to keep it running.
With regards to the VCs, well, how many VCs are there in the world? How many of the people who have something good to say about AI are likely VCs? I might be off by an order of magnitude, but even then it would really not be driving the discussion.
leshow13 hours ago
I don't find that a compelling argument, lots of people get taken in by hype cycles even when they don't profit directly from it.
steveklabnik14 hours ago
I agree with you, and I think that’s coloring a lot of people’s perceptions. I am not a crypto fan but am an LLM fan.
Every hype cycle feels like this, and some of them are nonsense and some of them are real. We’ll see.
giantg2 11 hours ago
The third option is that the person who used Cursor before had some sort of skill atrophy that led to lower unassisted speed.
I think an easy measure to help identify why a slowdown is happening would be to measure how much refactoring happened on the AI-generated code. Oftentimes it seems to be missing stuff like error handling, or adds in unnecessary stuff. Of course this assumes it even had a working solution in the first place.
Terr_15 hours ago
> people consistently predict and self-report in the wrong direction
I recall an adage about work-estimation: As chunks get too big, people unconsciously substitute "how possible does the final outcome feel" with "how long will the work take to do."
People asked "how long did it take" could be substituting something else, such as "how alone did I feel while working on it."
sandinmyjoints15 hours ago
That’s an interesting adage. Any ideas of its source?
Dilettante_15 hours ago
It might have been in Kahneman's "Thinking, Fast and Slow"
Terr_15 hours ago
I'm not sure, but something involving Kahneman et al. seems very plausible: The relevant term is probably "Attribute Substitution."
robwwilliams14 hours ago
Or a sampling artifact. 4 vs 12 does seem significant within a study, but consider a set of N such studies.
I assume that many large companies have tested efficiency gains and losses of their programmers much more extensively than the authors of this tiny study.
A survey of companies and their evaluation and conclusions would carry more weight -- excluding companies selling AI products, of course.
rs186 12 hours ago
If you use a binomial test on 4 of 16 under a 50/50 null, P(X<=4) is about 0.038, which gives a two-sided p of about 0.077.
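A quick sketch of that calculation in Python (assuming the 4-of-16 split reported upthread and a 50/50 null), using scipy:

    from scipy.stats import binom, binomtest

    k, n = 4, 16  # 4 of 16 developers saw a speedup; null: each is a coin flip

    print(binom.cdf(k, n, 0.5))                                    # one-sided tail, ~0.038
    print(binomtest(k, n, p=0.5, alternative="two-sided").pvalue)  # ~0.077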
bilbo-b-baggins17 minutes ago
I can say that in my experience AI is very good at early codebases and refactoring tasks that come with that.
But for very large stable codebases it is a mixed bag of results. Their selection of candidates is valid but it probably illustrates a worst case scenario for time based measurement.
If an AI code editor cannot make changes quicker than a dev, or cannot provide relevant suggestions quickly enough and without being distracting, then you lose time.
furyofantares17 hours ago
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
I totally agree with this. Although also, you can end up in a bad spot even after you've gotten pretty good at getting the AI tools to give you good output, because you fail to learn the code you're producing well.
A developer gets better at the code they're working on over time. An LLM gets worse.
You can use an LLM to write a lot of code fast, but if you don't pay enough attention, you aren't getting any better at the code while the LLM is getting worse. This is why you can get like two months of greenfield work done in a weekend but then hit a brick wall - you didn't learn anything about the code that was written, and while the LLM started out producing reasonable code, it got worse until you have a ball of mud that neither the LLM nor you can effectively work on.
So a really difficult skill in my mind is continually avoiding temptation to vibe. Take a whole week to do a month's worth of features, not a weekend to do two months' worth, and put in the effort to guide the LLM to keep producing clean code, and to be sure you know the code. You do want to know the code and you can't do that without putting in work yourself.
danieldk16 hours ago
So a really difficult skill in my mind is continually avoiding temptation to vibe.
I agree. I have found that I can use agents most effectively by letting them write code in small steps. After each step I review the changes and polish them up (either by doing the fixups myself or by prompting). I have found that this helps me understand the code, but it also keeps the model from getting into a bad solution space or producing unmaintainable code.
I also think this kind of close-loop is necessary. Like yesterday I let an LLM write a relatively complex data structure. It got the implementation nearly correct, but was stuck, unable to find an off-by-one comparison. In this case it was easy to catch because I let it write property-based tests (which I had to fix up to work properly), but it's easy for things to slip through the cracks if you don't review carefully.
(This is all using Cursor + Claude 4.)
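For readers who haven't used the technique, a property-based test in that spirit might look like the sketch below (Python with hypothesis; the hand-rolled binary search is a stand-in for the data structure in the comment, not the actual code):

    from bisect import bisect_left
    from hypothesis import given, strategies as st

    def my_bisect(xs: list[int], target: int) -> int:
        # Hand-rolled binary search: the kind of code where an off-by-one in a
        # comparison (< vs <=) is easy to miss in review.
        lo, hi = 0, len(xs)
        while lo < hi:
            mid = (lo + hi) // 2
            if xs[mid] < target:
                lo = mid + 1
            else:
                hi = mid
        return lo

    @given(st.lists(st.integers()), st.integers())
    def test_matches_reference(xs, target):
        xs.sort()
        # Property: agree with the standard library on every generated input.
        assert my_bisect(xs, target) == bisect_left(xs, target)

Hypothesis then generates many random lists and targets, and shrinks any failing case to a minimal counterexample, which is what makes off-by-one bugs easy to spot.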
bluefirebrand16 hours ago
> Take a whole week to do a month's worth of features
Everything else in your post is so reasonable and then you still somehow ended up suggesting that LLMs should be quadrupling our output
furyofantares15 hours ago
I'm specifically talking about greenfield work. I do a lot of game prototypes, it definitely does that at the very beginning.
Dzugaru14 hours ago
This is really interesting, because I do game jams from time to time, and every time I try to make it work, but I'm still quite a lot faster doing things myself.
This is visible under the extreme time pressure of producing a working game in 72 hours (our team consistently scores in the top 100 in Ludum Dare, which is a reasonably high standard).
We use Unity, a popular game engine that all LLMs have a wealth of experience with (as with game development in general), but 80% of the output is so strangely "almost correct but not usable" that I can't afford the luxury of letting it figure things out, so I use it as fancy autocomplete. And I also still check docs and Stack Overflow-style forums a lot, because of the stuff it plainly makes up.
One reason may be that our game mechanics are often a bit off the beaten path, though the last game we made was literally a platformer with rope physics (the LLM could not produce a good idea for stable, simple rope physics that was codeable within our 3-hour constraint).
bluefirebrand15 hours ago
Greenfield is still such a tiny percentage of all software work going on in the world though :/
Filligree14 hours ago
It’s a tiny percentage of software work because the programming is slow, and setting up new projects is even slower.
It’s been a majority of my projects for the past two months. Not because work changed, but because I’ve written a dozen tiny, personalised tools that I wouldn’t have written at all if I didn’t have Claude to do it.
Most of them were completed in less than an hour, to give you an idea of the size. Though it would have easily been a day on my own.
furyofantares14 hours ago
I agree, that's fair. I think a lot of people are playing around with AI on side projects and making some bad extrapolations from their initial experiences.
It'll also apply to isolated-enough features, which is still a small amount of someone's work (not often something you'd work on for a full month straight), but more people will have experience with this.
lurking_swe14 hours ago
greenfield development is also the “easiest” and most fun part of software development. As the famous saying goes, the last 10% of the project takes 90% of the time lol.
I’ve also noticed that, generally, nobody likes maintaining old systems.
so where does this leave us as software engineers? Should I be excited that it’s easy to spin up a bunch of code that I don’t deeply understand at the beginning of my project, while removing the fun parts of the project?
I’m still grappling with what this means for our industry in 5-10 years…
WD-4215 hours ago
I feel the same way. I use it for super small chunks, still understand everything it outputs, and often manually copy/paste or just write the code myself. I don't know if I'm actually faster than before, but it feels more comfortable than alt-tabbing to Stack Overflow, which is what I feel it has mostly replaced.
Poor stack overflow, it looks like they are the ones really hurting from all this.
jona777than14 hours ago
> but then hit a brick wall
This is my intuition as well. I had a teammate use a pretty good analogy today. He likened vibe coding to vacuuming up a string in four tries when it only takes one try to reach down and pick it up. I thought that aligned well with my experience with LLM assisted coding. We have to vacuum the floor while exercising the "difficult skill [of] continually avoiding temptation to vibe"
smokel17 hours ago
I notice that some people have become more productive thanks to AI tools, while others have not.
My working hypothesis is that people who are fast at scanning lots of text (or code for that matter) have a serious advantage. Being able to dismiss unhelpful suggestions quickly and then iterating to get to helpful assistance is key.
Being fast at scanning code correlates with seniority, but there are also senior developers who can write at a solid pace, but prefer to take their time to read and understand code thoroughly. I wouldn't assume that this kind of developer gains little profit from typical AI coding assistance. There are also juniors who can quickly read text, and possibly these have an advantage.
A similar effect has been around with being able to quickly "Google" something. I wouldn't be surprised if this is the same trait at work.
luxpir14 hours ago
Just to thank you for that point. I think it's likely more true than most of us realise. That and maybe the ability to mentally scaffold or outline a system or solution ahead of time.
Filligree14 hours ago
An interesting point. I wonder how much my decades-old habit of watching subtitled anime helps there—it’s definitely made me dramatically faster at scanning text.
blub6 hours ago
One has to take time to review code and think through different aspects of execution (like memory management, concurrency, etc). Plenty of code cannot be scanned.
That said, if the language has GC and other helpers, it makes it easier to scan.
Code and architecture review is an important part of my role, and I catch issues that others miss because I spend more time. I did use AI for review (GPT-4.1), but only as an addition, since it's not reliable enough.
bgwalter16 hours ago
We have heard variations of that narrative for at least a year now. It is not hard to use these chatbots and no one who was very productive in open source before "AI" has any higher output now.
Most people who subscribe to that narrative have some connection to "AI" money, but there might be some misguided believers as well.
thesz14 hours ago
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
This is what I heard about strong type systems (especially Haskell's) about 15-20 years ago. "History does not repeat, but it rhymes."
If we rhyme "strong types will change the world" with "agentic LLMs will change the world," what do we get?
My personal theory is that we will get the same: some people will get modest-to-substantial benefits there, but changes in the world will be small if noticeable at all.
leshow13 hours ago
I don't think that's a fair comparison. Type systems don't produce probabilistic output. Their entire purpose is to reduce the scope of possible errors you can write. They kind of did change the world, didn't they? I mean, not everyone is writing Haskell but Rust exists and it's doing pretty well. There was also not really a case to be made where type systems made software in general _worse_. But you could definitely make the case that LLM's might make software worse.
thesz3 hours ago
That probabilistic output has to be symbolically constrained - SQL/JSON/other code is generated through syntax constrained beam search.
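To illustrate what "syntax-constrained" generation means here, a toy sketch follows (greedy rather than beam search, and with a stand-in for the model's scores); the vocabulary and micro-grammar are made up purely for illustration.

```python
# Toy sketch of syntax-constrained decoding: the "model" scores every token,
# but tokens the grammar forbids at this position are masked out before one
# is chosen. Real systems do this over an LLM's logits, usually with beam
# search; the fixed scores and micro-grammar here are illustrative only.
VOCAB = ["{", "}", '"name"', ":", '"Ada"', "42", ","]

def fake_model_scores(prefix):
    # Pretend the model slightly prefers closing the object too early.
    return {tok: (2.0 if tok == "}" else 1.0) for tok in VOCAB}

def allowed_next(prefix):
    # Micro-grammar for exactly one key-value object, e.g. { "name" : "Ada" }
    stages = [["{"], ['"name"'], [":"], ['"Ada"', "42"], ["}"]]
    return stages[len(prefix)] if len(prefix) < len(stages) else []

def generate():
    out = []
    while True:
        allowed = allowed_next(out)
        if not allowed:
            return out
        scores = fake_model_scores(out)
        out.append(max(allowed, key=lambda t: scores[t]))  # best *allowed* token

print(" ".join(generate()))  # { "name" : "Ada" } -- never syntactically invalid
```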
You brought up Rust, it is fascinating.
Rust's type system differs from a typical Hindley-Milner system by having operations that can remove definitions from the environment of the scope.
Rust was conceived in 2006.
In 2006 there were already HList papers by Oleg Kiselyov [1] that had shown how to keep type-level key-value lists with addition, removal and lookup, and type-level stateful operations like in [2] were already possible, albeit, most probably, not with nice monadic syntax support.
[1] https://okmij.org/ftp/Haskell/HList-ext.pdf
[2] http://blog.sigfpe.com/2009/02/beyond-monads.html
It was entirely possible to have a prototype Rust embedded into Haskell, with the borrow checker implemented as type-level manipulation over a doubly parameterized state monad. But it was not; Rust was not embedded into Haskell, and now it will never get effects (even ones as weak as monad transformers) and, as a consequence, will never get proper high-performance software transactional memory.
So here we are: everything in Haskell's strong type system world that would make Rust better was there at the very beginning of the Rust journey, but had no impact on Rust.
Rhyme that with LLM.
atlintots12 hours ago
It's too bad the management people never pushed Haskell as hard as they're pushing AI today! Alas.
ruszki13 hours ago
Maybe it depends on the task. I'm 100% sure that if you think a type system is a drawback, you have never coded in a diverse, large codebase. Our 1.5-million-LOC, 30-year-old monolith would be completely unmaintainable without one. But seriously, anything above 10 LOC without a formal type system becomes unmaintainable after a few years. An informal one is fine for a while, but not for long. In 30-year-old code, basically every informal rule has been broken.
Also, my long experience is that even in the PoC phase, using a type system adds almost zero extra time… of course, that assumes you know the type system, which should be trivial in any case after you've seen a few.
thesz3 hours ago
On the contrary, I believe a strong type system is a plus. Please look at my other comment: https://news.ycombinator.com/item?id=44529347
My original point was about history and about how we can extract a possible outcome from it.
My other comment tries to amplify that too. Type systems have been strong enough for several decades now, and had everything Rust needed and more, years before Rust began, yet they have had little penetration into the real world, the example being that fancy-dandy Rust language.
sfn423 hours ago
It's generally trivial for conventional class-based type systems like those in Java and C#, but TypeScript is a different beast entirely. On the surface it seems similar but it's so much deeper than the others.
I don't like it. I know it is the way it is because it's supposed to support all the cursed weird stuff you can do in JS, but to me as a fullstack developer who's never really taken the time to deep dive and learn TS properly it often feels more like an obstacle. For my own code it's fine, but when I have to work with third party libraries it can be really confusing. It's definitely a skill issue though.
ruszki2 hours ago
I agree. TypeScript is different for another reason too: it often ignores edge cases, and because of that you can do really, really nice things with it (when it's not broken). I've wondered many times why Java doesn't include a few things that would be appropriate even in that world, and the answer is almost always that Java cares about the edge cases. There are notes about those in TypeScript's docs or issues.
benreesman7 hours ago
I don't even think we know how to do it yet. I revise my whole attitude and all of my beliefs about this stuff every week: I figure out that things that seemed really promising don't pan out, I find stuff that I kick myself for not realizing sooner, and it's still this high-stakes game. I still blow a couple of days and wish I had just done it the old-fashioned way, and then I'll catch a run where it's like, fuck, I was never that good, that's the last 5-10% that breaks a PB.
I very much think that these things are going to wind up being massive amplifiers for people who were already extremely sophisticated and then put massive effort into optimizing them and combining them with other advanced techniques (formal methods, top-to-bottom performance orientation).
I don't think this stuff is going to democratize software engineering at all; I think it's going to take the difficulty level so high that it's like back when Dijkstra or Tony Hoare was a fairly typical computer programmer.
jprokay137 hours ago
My personal experience was that of a decrease in productivity until I spent significant time with it. Managing configurations, prompting it the right way, asking other models for code reviews… And I still see there is more I can unlock with more time learning the right interaction patterns.
For nasty, legacy codebases there is only so much you can do IMO. With green field (in certain domains), I become more confident every day that coding will be reduced to an AI task. I’m learning how to be a product manager / ideas guy in response
gexla8 hours ago
In addition to the learning curve of the tooling, there's also the learning curve of the models. Each have a certain personality that you have to figure out so that you can catch the failure patterns right away.
ummonk14 hours ago
Devil's advocate: it's also possible the one developer hasn't become more productive with Cursor, but rather has atrophied their non-AI productivity due to becoming reliant on Cursor.
bluefirebrand8 hours ago
I suspect you're onto something here but I also think it would be an extremely dramatic atrophy to have occurred in such a short period of time...
rafaelmn15 hours ago
>My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
Are we still selling the "you are an expert senior developer" meme? I can completely see how, once you are working on a mature codebase, LLMs would only slow you down. Especially one that was not created by an LLM and where you are the expert.
bicx15 hours ago
I think it depends on the kind of work you're doing, but I use it on mature codebases where I am the expert, and I heavily delegate to Claude Code. By being knowledgeable of the codebase, I know exactly how to specify a task I need performed. I set it to work on one task, then I monitor it while personally starting on other work.
I think LLMs shine when you need to write a higher volume of code that extends a proven pattern, quickly explore experiments that require a lot of boilerplate, or have multiple smaller tasks that you can set multiple agents upon to parallelize. I've also had success in using LLMs to do a lot of external documentation research in order to integrate findings into code.
If you are fine-tuning an algorithm or doing domain-expert-level tweaks that require a lot of contextual input-output expert analysis, then you're probably better off just coding on your own.
Context engineering has been mentioned a lot lately, but it's not a meme. It's the real trick to successful LLM agent usage. Good context documentation, guides, and well-defined processes (just like with a human intern) will mean the difference between success and failure.
ericmcer16 hours ago
Looking at the example tasks in the pdf ("Sentencize wrongly splits sentence with multiple...") these look like really discrete and well defined bug fixes. AI should smash tasks like that so this is even less hopeful.
mjr0017 hours ago
> My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is high enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learing curve.
Definitely. Effective LLM usage is not as straightforward as people believe. Two big things I see a lot of developers do when they share chats:
1. Talk to the LLM like a human. Remember when internet search first came out, and people were literally "Asking Jeeves" in full natural language? Eventually people learned that you don't need to type, "What is the current weather in San Francisco?" because "san francisco weather" gave you the same, or better, results. Now we've come full circle and people talk to LLMs like humans again; not out of any advanced prompt engineering, but just because it's so anthropomorphized it feels natural. But I can assure you that "pandas count unique values column 'Foo'" is just as effective an LLM prompt as "Using pandas, how do I get the count of unique values in the column named 'Foo'?" (see the pandas sketch after point 2 for what both prompts boil down to). The LLM is also not insulted by you talking to it like this.
2. Don't know when to stop using the LLM. Rather than let the LLM take you 80% of the way there and then handle the remaining 20% "manually", they'll keep trying to prompt to get the LLM to generate what they want. Sometimes this works, but often it's just a waste of time and it's far more efficient to just take the LLM output and adjust it manually.
Much like so-called Google-fu, LLM usage is a skill and people who don't know what they're doing are going to get substandard results.
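For what it's worth, here is a minimal pandas sketch of what both phrasings of the prompt in point 1 are asking for; the DataFrame is made up for illustration.

```python
# Minimal pandas sketch: the terse prompt and the full-sentence prompt above
# both resolve to the same one-liner. The DataFrame here is made up.
import pandas as pd

df = pd.DataFrame({"Foo": ["a", "b", "a", "c", "b", "a"]})

print(df["Foo"].nunique())       # number of unique values in column 'Foo' -> 3
print(df["Foo"].value_counts())  # per-value counts, if that's what was meant
```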
bit199315 hours ago
> Rather than let the LLM take you 80% of the way there and then handle the remaining 20% "manually"
IMO 80% is way too much. LLMs are probably good for things that are outside your domain knowledge and where you can afford to not be 100% correct, like rendering the Mandelbrot set and simple functions like that.
LLMs are not deterministic: sometimes they produce correct code and other times they produce wrong code. This means one has to audit LLM-generated code, and auditing code takes more effort than writing it, especially if you are not the original author of the code being audited.
Code has to be 100% deterministic. As programmers we write code, detailed instructions for the computer (CPU), and we have developed a lot of tools, such as unit tests, to make sure the computer does exactly what we wrote.
A codebase has a lot of context that you gain by writing the code: some things just look wrong, and you know exactly why because you wrote the code. There is also a lot of context that you should keep in your head as you write the code, context that you miss by simply prompting an LLM.
Jaxan17 hours ago
> Effective LLM usage is not as straightforward as people believe
It is not as straightforward as people are told to believe!
sleepybrett16 hours ago
^ this, so much this. The amount of bullshit that gets shoveled into hacker news threads about the supposed capabilities of these models is epic.
badsectoracula7 hours ago
> But I can assure you that "pandas count unique values column 'Foo'" is just as effective an LLM prompt as "Using pandas, how do I get the count of unique values in the column named 'Foo'?"
While the results are going to be similar, typing a question in full can help you think about it yourself too, as if the LLM is a rubber duck that can respond back.
I've found myself adjusting and rewriting prompts before I ask the LLM anything, because as I was writing the prompt I was thinking about the problem at the same time.
Of course for simple queries like "write me a function in C that calculates the length of a 3d vector using vec3 for type" you can write it like "c function vec3 length 3d" or something like that instead and the LLM will give more or less the same response (tried it with Devstral).
But TBH to me that sounds like programmers using Vim claiming they're more productive than users of other editors because they have to use less keystrokes.
lukan17 hours ago
"But I can assure you that "pandas count unique values column 'Foo'" is just as effective an LLM prompt as "Using pandas, how do I get the count of unique values in the column named 'Foo'?""
How can you be so sure? Did you compare in a systematic way or read papers by people who did it?
Now, I certainly get results when giving the LLM only snippets and keywords, but for anything complex I do notice differences depending on how I articulate the prompt. I'm not claiming there is a significant difference, but that's how it seems to me.
mjr0016 hours ago
> How can you be so sure? Did you compare in a systematic way or read papers by people who did it?
No, but I didn't need to read scientific papers to figure how to use Google effectively, either. I'm just using a results-based analysis after a lot of LLM usage.
lukan16 hours ago
Well, I did need some tutorials to use Google efficiently in the old days, when + meant something specific.
skybrian16 hours ago
Other people don't have benefit of your experience, though, so there's a communications gap here: this boils down to "trust me, bro."
How do we get beyond that?
mjr0015 hours ago
This is the gap between capability (what can this tool do?) versus workflow (what is the best way to use this tool to accomplish a goal?). Capabilities can be strictly evaluated, but workflow is subjective. Saying "Google has the site: and before: operators" is capability, saying "you should use site:reddit.com before:2020 in Google queries" is workflow.
LLMs have made the distinction ambiguous because their capabilities are so poorly understood. When I say "you should talk to an LLM like it's a computer", that's a workflow statement; it's a more efficient way to accomplish the same goal. You can try it for yourself and see if you agree. I personally liken people who talk to LLMs in full, proper English, capitalization and all, to boomers who still type in full sentences when running a Google query. Is there anything strictly wrong with it? Not really. Do I believe it's a more efficient workflow to just type the keywords that will give you the same result? Yes.
Workflow efficiencies can't really be scientifically evaluated. Some people still prefer to have desktop icons for programs on Windows; my workflow is pressing winkey -> typing the first few characters of the program -> enter. Is one of these methods scientifically more correct? Not really.
So, yeah -- eventually you'll either find your own workflow or copy the workflow of someone you see who is using LLMs effectively. It really is "just trust me, bro."
skybrian13 hours ago
Maybe it would help if more people wrote tutorials? It doesn't seem reasonable for people who don't have a buddy to learn from to have to figure it out on their own.
frotaur17 hours ago
I'm not sure about your example about talking to LLMs. There is good reason to think that speaking to it like a human might produce better results, as that's what most of the training data is composed of.
I don't have any studies, but it seems reasonable to assume.
(Unlike google, where presumably it actually used keywords anyway)
mjr0017 hours ago
> I'm not sure about your example about talking to LLMs. There is good reason to think that speaking to it like a human might produce better results, as that's what most of the training data is composed of.
In practice I have not had any issues getting information out of an LLM when speaking to them like a computer, rather than a human. At least not for factual or code-related information; I'm not sure how it impacts responses for e.g. creative writing, but that's not what I'm using them for anyway.
gedy17 hours ago
> Talk to the LLM like a human
Maybe the LLM doesn't strictly need it, but typing it out does bring some clarity for the asker. I've found it helps a lot to catch myself: what do I even want from this?
rukuu0019 hours ago
I'm sympathetic to the argument re experience with the tools paying off, because my personal anecdata matches that. It hasn't been until the last 6 weeks, after watching a friend demo their workflow, that my personal efficiency has improved dramatically.
The most useful thing of all would have been to have screen recordings of those 16 developers working on their assigned issues, so they could be reviewed for varying approaches to AI-assisted dev, and we could be done with this absurd debate once and for all.
theshrike79an hour ago
LLMs are good for things you know how to do, but can't be arsed to. Like small tools with extensive use of random APIs etc.
For example, I whipped together a Steam-API-based tool that fetches my game library and enriches it with additional data, in maybe 30 minutes of active work.
The LLM (Cursor with Gemini Pro + Claude 3.7 at the time IIRC) spent maybe 2-3 hours on it while I watched some shows on my main display and it worked on my second screen with me directing it.
Could I have done it myself from scratch like a proper artisan? Most definitely. Would I have bothered? Nope.
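For the curious, here is a rough sketch of that kind of throwaway Steam-library tool (not the commenter's actual code). The endpoint and field names are the public Steam Web API as I recall it, so treat them as assumptions and check the docs; STEAM_KEY and STEAM_ID are placeholders you would supply yourself.

```python
# Rough sketch of a "fetch my Steam library" tool, as described above.
# Endpoint/field names are assumptions based on the public Steam Web API.
import os
import requests

STEAM_KEY = os.environ["STEAM_KEY"]  # Web API key (placeholder)
STEAM_ID = os.environ["STEAM_ID"]    # 64-bit SteamID of the account (placeholder)

def get_owned_games():
    resp = requests.get(
        "https://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/",
        params={
            "key": STEAM_KEY,
            "steamid": STEAM_ID,
            "include_appinfo": 1,   # include game names, not just appids
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["response"].get("games", [])

if __name__ == "__main__":
    games = get_owned_games()
    # "Enrichment" would go here, e.g. per-app calls to a storefront/details API.
    top = sorted(games, key=lambda g: g.get("playtime_forever", 0), reverse=True)[:20]
    for g in top:
        hours = g.get("playtime_forever", 0) // 60
        print(f'{g.get("name", g["appid"])!s:50} {hours:>5} h')
```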
Aurornis14 hours ago
> A quarter of the participants saw increased performance, 3/4 saw reduced performance.
The study used 246 tasks across 16 developers, for an average of 15 tasks per developer. Divide that further in half because tasks were assigned as AI or not-AI assisted, and the sample size per developer is still relatively small. Someone would have to take the time to review the statistics, but I don’t think this is a case where you can start inferring that the developers who benefited from AI were just better at using AI tools than those who were not.
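To make the sample-size point concrete, here is a back-of-the-envelope simulation (not from the paper; all distributional choices are assumptions): even if AI genuinely made a developer 20% faster, roughly eight noisy tasks per condition often are not enough for that to show up in the observed averages.

```python
# Back-of-the-envelope simulation of the per-developer sample-size concern.
# Assumptions (not from the paper): ~8 tasks per condition per developer,
# log-normal task durations, and a true 20% speedup on AI tasks.
import numpy as np

rng = np.random.default_rng(0)
n_sim, tasks_per_arm = 10_000, 8
sigma = 0.8              # log-scale spread of task durations (assumed)
true_speedup = 0.8       # AI tasks take 80% as long on average (assumed)

ai = rng.lognormal(mean=np.log(true_speedup), sigma=sigma, size=(n_sim, tasks_per_arm))
no_ai = rng.lognormal(mean=0.0, sigma=sigma, size=(n_sim, tasks_per_arm))

observed_faster = (ai.mean(axis=1) < no_ai.mean(axis=1)).mean()
print(f"fraction of simulated devs who *look* faster with AI: {observed_faster:.2f}")
# With this much per-task noise, a meaningful chunk of genuinely-helped
# developers will still look slower over ~8 tasks, and vice versa.
```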
I do agree that it would be interesting to repeat a similar test on developers who have more AI tool assistance, but then there is a potential confounding effect that AI-enthusiastic developers could actually lose some of their practice in writing code without the tools.
bluefirebrand8 hours ago
> potential confounding effect that AI-enthusiastic developers could actually lose some of their practice in writing code without the tools
I don't think this is a confounding effect
This is something that we definitely need to measure and be aware of, if there is a risk of it
nicman234 hours ago
I just treat AI as a very long autocomplete. Sometimes it surprises me. On things I don't know, like Windows C calls, I think I ought to just search the documentation.
heavyset_go13 hours ago
Any "tricks" you learn for one model may not be applicable to another, it isn't a given that previous experience with a company's product will increase the likelihood of productivity increases. When models change out from under you, the heuristics you've built up might be useless.
mnky9800n15 hours ago
I feel like I get better at it as I use Claude Code more, because I both understand its strengths and weaknesses and also understand what context it's usually missing. Like today, I was struggling to debug an issue and realised that Claude's idea of the coordinate system was 90 degrees rotated from mine, and thus it was getting confused because I was confusing it.
throwawayoldie15 hours ago
One of the major findings is that people's perception--that is, what it felt like--was incorrect.
devin15 hours ago
It seems really surprising to me that anyone would call 50 hours of experience a "high skill ceiling".
bc100000316 hours ago
"My intiution is that..." - AGREED.
I've found that there are a couple of things you need to do to be very efficient.
- Maintain an architecture.md file (with AI assistance) that answers many of the questions and clarifies a lot of the ambiguity in the design and structure of the code.
- A bootstrap.md file (or files) is also useful for a lot of tasks: having the AI read it and start with a correct idea about the subject saves time on a wide variety of tasks.
- Regularly asking the AI to refactor code, simplify it, modularize it - this is what the experienced dev is for. Vibe coding generally doesn't work, as AIs tend to write messy, non-modular code unless you tell them otherwise. But if you review the code and ask for specific changes, they happily comply.
- Read the code produced, and carefully review it. And notice and address areas where there are issues, have the AI fix all of these.
- Take over when there are editing tasks you can do more efficiently.
- Structure the solution/architecture in ways you know the AI will work well with: things it knows about, its general sweet spots.
- Know when to stop using the AI and code it yourself, particularly when the AI has entered the confusion doom loop. The time wasted trying to get the AI to figure out something it never will is better spent just fixing it yourself.
- Know when to just not ever try to use AI. Intuitively you know there's just certain code you can't trust the AI to safely work on. Don't be a fool and break your software.
----
I've found there's no guarantee that AI assistance will speed up any one project (and in some cases it slows it down), but measured across all tasks and projects, the benefits are pretty substantial. That's probably others' experience at this point too.
onlyrealcuzzo17 hours ago
How were "experienced engineers" defined?
I've found AI to be quite helpful in pointing me in the right direction when navigating an entirely new code-base.
When it's code I already know like the back of my hand, it's not super helpful, other than maybe doing a few automated tasks like refactoring, where there have already been some good tools for a while.
smj-edison15 hours ago
> To directly measure the real-world impact of AI tools on software development, we recruited 16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code) that they’ve contributed to for multiple years.
eightysixfour13 hours ago
I have been teaching people at my company how to use AI code tools. The learning curve is way worse for developers, and I have had to come up with some exercises to try to break through it. Some seemingly can't get it.
The short version is that devs want to give instructions instead of asking for the outcome they want. When it doesn't follow the instructions, they double down by being more precise, which is the worst thing you can do. When non-devs don't get what they want, they add more detail to the description of the desired outcome.
Once you get past the control problem, then you have a second set of issues for devs where the things that should be easy or hard don’t necessarily map to their mental model of what is easy or hard, so they get frustrated with the LLM when it can’t do something “easy.”
Lastly, devs keep a shitload of context in their heads - the project, what they are working on, application state, etc. - and they need to do that for LLMs too, but you have to repeat yourself often and "be" the external memory for the LLM. Most devs I have taught hate that; they would rather have it the other way around, where they get help with context and state but instruct the computer on their own.
Interestingly, the best AI assisted devs have often moved to management/solution architecture, and they find the AI code tools brought back some of the love of coding. I have a hypothesis they’re wired a bit differently and their role with AI tools is actually closer to management than it is development in a number of ways.
BigGreenJorts11 hours ago
> Interestingly, the best AI assisted devs have often moved to management/solution architecture, and they find the AI code tools brought back some of the love of coding. I have a hypothesis they’re wired a bit differently and their role with AI tools is actually closer to management than it is development in a number of ways.
The CTO and VPEng at my company (very small; they still do technical work occasionally) both love the agent stuff so much. Part of it for them is that it gives them the opportunity to do technical work again in the limited time they have. Without having to distract an actual dev, or spend a long time reading through the codebase, they can quickly get context for and build small items themselves.
rester32411 hours ago
> Interestingly, the best AI assisted devs have often moved to management/solution architecture, and they find the AI code tools brought back some of the love of coding
This suggests to me, though, that they were bad at coding; otherwise they would have stayed longer. And I can't find anything in your comment that corroborates the opposite. So what gives?
I am not saying what you say is untrue, but you didn't give us any convincing arguments to believe otherwise.
Also, you didn't define the criteria for getting better. Getting better in terms of what, exactly?
qingcharles6 hours ago
I'm not bad at coding. I would say I'm pretty damned good. But coding is a means-to-an-end. I come up with an idea, then I have the long-winded middle bit where I have to write all the code, spin up a DB, create the tables, etc.
LLMs have given me a whole new love of coding, getting rid of the dull grind and letting me write code an order of magnitude quicker than before.
eightysixfour8 hours ago
> This suggests me though that they are bad at coding, otherwise they would have stayed longer.
Or they care about producing value, not just the code, and realized they had more leverage and impact in other roles.
> And I can't find anything in your comment that would corroborate the opposite.
I didn’t try and corroborate the opposite.
Honestly, I don’t care about the “best coders.” I care about people who do their job well, sometimes that is writing amazing code but most of the time it isn’t. I don’t have any devs in my company who work in a magical vacuum where they are handed perfectly written tasks, they complete them, and then they do the next one.
If I did, I could replace them with AI faster.
> Also, you didn't define the criteria of getting better. Getting better in terms of what exactly?
Delivery velocity - bug fixes, features, etc. that pass testing/QA and go to prod.
rester3247 hours ago
> Honestly, I don’t care about the “best coders.”
> Interestingly, the best AI assisted devs have often moved to management/solution architecture
Is it just me? Or does it seem to others as well that you are effectively ranking these people, and that your first comment contradicts your second? Especially when you admit that you rank them based on velocity.
I am not saying you shouldn't do that, but it feels to me like rating road construction workers on the number of potholes fixed, even though it's very possible that the potholes are caused by the sloppy work to begin with.
Not what I would want to do.
eightysixfour5 hours ago
> Is it just me? Or does it seem to others as well that you pretty much rank these people even at the moment and your first comment contradicts your second comment?
I think you are reading what you want to read and not what I said, so yes it is you. The most productive, valuable people with developer titles in my organizations are not the ones who write the cleanest, most beautiful, most perfect code. They do all of the other parts of the job well and write solid code.
Following the introduction of AI tools, many of the people in my organization who most effectively learned to use those tools are people who previously chose to move to manager and SA roles.
Not only are these not contradictory, they fit quite well together. People who do the things around coding well, but maybe have to work hard at writing the actual code, are better at using the AI tools than exceptional coders. For my organization, the former are generally more valuable than the latter without AI, and that is increasing as a result of AI.
> I am not saying you shouldn't do that, but it feels to me like rating road construction workers on the number of potholes fixed, even though it's very possible that the potholes are caused by the sloppy work to begin with.
Not if your measurement includes quality testing the pothole repairs, which mine does, as I explicitly called out. I work in industries with extensive, long testing cycles, we are (imperfectly, of course) able to measure productivity based on things which make it through those cycles.
You are trying very hard to find ways to ignore what I am saying. It is fine if you don’t want to believe me, but these things have been true based on our observations:
A. Great “coders” have a much harder time picking up AI dev tools and using them effectively, and when they see how others use them they will admit that isn’t how they use them. They will revert to their previous habits and give up on the tools.
B. The productivity gains for the people who are good at using the tools, as measured by velocity with a minimum bar for quality (with substantial QA), are very high.
C. We have measured these things to thoroughly understand the ROI and we are accelerating our investment in AI coding tools as a result.
Some caveats I am absolutely willing to make - we are not working on bleeding edge tech doing things no one has ever done before.
We failed to effectively use AI many times before we started to get it right.
There are developers who are slower with the AI code tools than without it.
rester3244 hours ago
I am not convinced.
If what you write were true, then the bug rate of those incredible devs would eventually fall to zero, and at that point they would become legends we all would have heard of by now. So the whole story sounds too fishy for my taste.
It's OK if you want to manage your team this way. Everyone needs some external feedback to confirm their own bias. It seems you found yours and it works for you.
It's just not a good argument in support of AI or AI assisted development.
It's too anecdotal.
And since it's you telling me that you are right, rather than others, it makes me even more skeptical about the whole story.
keeda14 hours ago
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
Yes, and I'll add that there is likely no single "golden workflow" that works for everybody, and everybody needs to figure it out for themselves. It took me months to figure out how to be effective with these tools, and I doubt my approach will transfer over to others' situations.
For instance, I'm working solo on smallish, research-y projects and I had the freedom to structure my code and workflows in a way that works best for me and the AI. Briefly: I follow an ad-hoc, pair-programming paradigm, fluidly switching between manual coding and AI-codegen depending on an instinctive evaluation of whether a prompt would be faster. This rapid manual-vs-prompt assessment is second nature to me now, but it took me a while to build that muscle.
I've not worked with coding agents, but I doubt this approach will transfer over well to them.
I've said it before, but this is technology that behaves like people, and so you have to approach it like working with a colleague, with all their quirks and fallibilities and potentially-unbound capabilities, rather than a deterministic, single-purpose tool.
I'd love to see a follow-up of the study where they let the same developers get more familiar with AI-assisted coding for a few months and repeat the experiment.
Filligree14 hours ago
> I've not worked with coding agents, but I doubt this approach will transfer over well to them.
Actually, it works well so long as you tell them when you’ve made a change. Claude gets confused if things randomly change underneath it, but it has no trouble so long as you give it a short explanation.
lupusreal2 hours ago
A friend of mine, a complete non-programmer, has been trying to use ChatGPT to write a phone app. I've been as hands-off as I feel I can be, watching how the process goes for him. My observation so far is that it's not going well: he doesn't understand what questions he should be asking, so the answers he's getting aren't useful. I encourage him to ask it to teach him the relevant programming, but he asks it to help him make the app without programming at all.
With more coaching from me, which I might end up doing, I think he would get further. But I expected the chatbot to get him further through the process than this. My conclusion so far is that this technology won't meaningfully shift the balance of programmers to non-programmers in the general population.
dmezzetti15 hours ago
I'm the developer of txtai, a fairly popular open-source project. I don't use any AI-generated code and it's not integrated into my workflows at the moment.
AI has a lot of potential but it's way over-hyped right now. Listen to the people on the ground who are doing real work and building real projects; none of them are over-hyping it. The hype mostly comes from those who have only tangentially used LLMs.
It's also not surprising that many in this thread are clinging to a basic premise that it's 3 steps backwards to go 5 steps forward. Perhaps that is true but I'll take the study at face value, it seems very plausible to me.
AndrewKemendo10 hours ago
What you described has been true of the adoption of every technology ever
Nothing new this time except for people who have no vision and no ability to work hard not “getting it” because they don’t have the cognitive capacity to learn
th0ma513 hours ago
Simon's opinion is unsurprisingly that people need to read his blog and spam on every story on HN lest we be left behind.
Uehreka16 hours ago
> My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.
You hit the nail on the head here.
I feel like I’ve seen a lot of people trying to make strong arguments that AI coding assistants aren’t useful. As someone who uses and enjoys AI coding assistants, I don’t find this research angle to be… uh… very grounded in reality?
Like, if you’re using these things, the fact that they are useful is pretty irrefutable. If one thinks there’s some sort of “productivity mirage” going on here, well OK, but to demonstrate that it might be better to start by acknowledging areas where they are useful, and show that your method explains the reality we’re seeing before using that method to show areas where we might be fooling ourselves.
I can maybe buy that AI might not be useful for certain kinds of tasks or contexts. But I keep pushing their boundaries and they keep surprising me with how capable they are, so it feels like it’ll be difficult to prove otherwise in a durable fashion.
furyofantares16 hours ago
I think the thing is there IS a learning curve, AND there is a productivity mirage, AND they are immensely useful, AND it is context dependent. All of this leads to a lot of confusion when communicating with people who are having a different experience.
Uehreka16 hours ago
Right, my problem is that while some people may be correct about the productivity mirage, many of those people are getting out over their skis and making bigger claims than they can reasonably prove. I’m arguing that they should be more nuanced and tactical.
GoatInGrey16 hours ago
It always comes back to nuance!
TechDebtDevin16 hours ago
Still odd to me that the only vibe-coded software that gets acquired is bought by companies that sell tools or want to promote vibe coding.
Uehreka16 hours ago
Pardon my caps, but WHO CARES about acquisitions?!
You’ve been given a dubiously capable genie that can write code without you having to do it! If this thing can build first drafts of those side projects you always think about and never get around to, that in and of itself is useful! If it can do the yak-shaving required to set up those e2e tests you know you should have but never have time for, that is useful!
Have it try out all the dumb ideas you have that might be cool but don’t feel worth your time to boilerplate out!
I like to think we’re a bunch of creative people here! Stop thinking about how it can make you money and use it for fun!
TechDebtDevin9 hours ago
I have great code gen tools I've built for myself that build my perfect scaffolding/boilerplate every time, for any project in about 30 seconds.
Took me a week to build those tools. It's much more reliable (and flexible) than any LLM and cost me nothing.
It comes with secure auth, email, admin, etc. It doesn't cost me a dime and almost never has a common vulnerability.
Best part about it. I know how my side project runs.
fwip15 hours ago
Unfortunately, HN is YC-backed, and attracts these types by design.
Uehreka14 hours ago
I mean sure, but HN/YC’s founder was always going on about the kinship between “Hackers and Painters” (or at least he used to). It hasn’t always been like this, and definitely doesn’t have to be. We can and should aspire to better.
furyofantares16 hours ago
That's not odd. These things are incredibly useful and vibe coding mostly sucks.
rcruzeiro15 hours ago
Exactly. The people who say that these assistants are useless or "not good enough" are basically burying their heads in the sand. The people who claim that there is no mirage are burying their head in the sand as well...
noisy_boy18 hours ago
It is 80/20 again - it gets you 80% of the way in 20% of the time and then you spend 80% of the time to get the rest of the 20% done. And since it always feels like it is almost there, sunk-cost fallacy comes into play as well and you just don't want to give up.
I think an approach that I tried recently is to use it as a friction remover instead of a solution provider. I do the programming but use it to remove pebbles such as that small bit of syntax I forgot, basically to keep up the velocity. However, I don't look at the wholesale code it offers. I think keeping the active thinking cap on results in code I actually understand while avoiding skill atrophy.
reverendsteveii17 hours ago
Well, we used to have a sort of inverse Pareto, where 80% of the work took 80% of the effort and the remaining 20% of the work also took 80% of the effort.
I do think you're onto something with getting pebbles out of the road, inasmuch as once I know what I need to do, AI coding makes the doing much faster. Just yesterday I was playing around with removing things from a List object using the Java streams API, and I kept running into ConcurrentModificationExceptions, which happen when the list is structurally modified while it's being iterated. I spent about an hour trying to write a method that deep copies the list, makes the change and then returns the copy, running into all sorts of problems, until I asked AI to build me a safe list-mutation method and it was like "Sure, this is how I'd do it, but also the API you're working with already has a method that just... does this." Cases like this are where AI is supremely useful - intricate but well-defined problems.
cwmoore17 hours ago
Code reuse at scale: 80 + 80 = 160% ~ phi...coincidence?
I think this may become a long horizon harvest for the rigorous OOP strategy, may Bill Joy be disproved.
Gray goo may not [taste] like steel-cut oatmeal.
visarga15 hours ago
1.6x multiplier is low, we usually need to apply 5x
Sharlin16 hours ago
It's often said that π is the factor by which one should multiply all estimates – reducing it to ɸ would be a significant improvement in estimation accuracy!
emodendroket17 hours ago
I think it’s most useful when you basically need Stack Overflow on steroids: I basically know what I want to do but I’m not sure how to achieve it using this environment. It can also be helpful for debugging and rubber ducking generally.
some-guy17 hours ago
All those things are true, but it's such a small part of my workflow at this point that the savings, while nice, aren't nearly as life-changing to my job as my CEO is forcing us to think they are.
Until AI can actually untangle our 14-year-old codebase full of hodge-podge code, and read every commit message, JIRA ticket, and Slack conversation related to the changes in full context, it's not going to solve a lot of the hard problems at my job.
emodendroket11 hours ago
Some of the “explain what it does” functionality is better than you might think, but to be honest I find myself called on to work with unfamiliar tools all the time, so I find plenty of value.
skydhash17 hours ago
The issue is that it is slow and verbose, at least in its default configuration. The amount of reading is non-trivial. There’s a reason most references are dense.
lukan17 hours ago
You can partly solve those issues by changing the prompt to tell it to be concise and not explain its code.
But nothing will make them stick to the one API version I use.
diggan16 hours ago
> But nothing will make them stick to the one API version I use.
Models trained for tool use can do that. When I use Codex for some Rust stuff, for example, it can grep through the source files in the directory where dependencies are stored, so looking up the current APIs is trivial for it. The same works for JavaScript and a bunch of other languages, as long as the sources are accessible somewhere via the tools it has available.
lukan16 hours ago
Hm, I haven't tried Codex so far, but I've tried quite a few other tools and models, and none could help me in a consistent way. I remain sceptical, because even if I tell them explicitly to only use one specific version, they might or might not do so, depending on their training corpus and temperature, I assume.
malfist12 hours ago
The less verbosity you allow the dumber the LLM is. It thinks in tokens and if you keep it from using tokens it's lobotomized.
lukan3 hours ago
It can think as much as it wants and still return just code in the end.
emodendroket10 hours ago
Well, compared to what? What method would be faster for answering that kind of question?
skydhash21 minutes ago
Learning the thing. It’s not like I have to use every library in the world at my job. You can really fly through reference documentation if you’re familiar with the domain.
threetonesun17 hours ago
Absolutely this. For a while I was working with a language I was only partially familiar with, and I'd say "here's how I would do this in [primary language], rewrite it in [new language]" and I'd get a decent piece of code back. A little searching in the project to make sure it was stylistically correct and then done.
emodendroket10 hours ago
Those kind of tasks are good for it, yeah. “Here’s some JSON. Please generate a Java class I can deserialize it into” is similar.
GuinansEyebrows17 hours ago
> rubber ducking
i don't mean to pick on your usage of this specifically, but i think it's noteworthy that the colloquial definition of "rubber ducking" seems to have expanded to include "using a software tool to generate advice/confirm hunches". I always understood the term to mean a personal process of talking through a problem out loud in order to methodically, explicitly understand a theoretical plan/process and expose gaps.
based on a lot of articles/studies I've seen (admittedly I haven't dug into them too deeply), it seems like the use of chatbots to perform this type of task actually has negative cognitive impacts on some groups of users - the opposite of the personal value I thought rubber-ducking was supposed to provide.
jonathanlydall11 hours ago
There is something that happens to our thought processes when we verbalise or write down our thoughts.
I like to think of it that instead of having seemingly endless amounts of half thoughts spinning around inside your head, you make an idea or thought more “fully formed” when you express it verbally or with written (or typed) words.
I believe this is part of why therapy can work, by actually expressing our thoughts, we’re kind of forced to face realities and after doing so it’s often much easier to reflect on it. Therapists often recommend personal journals as they can also work for this.
I believe rubber ducking works because in having to explain the problem, it forces you to actually gather your thoughts into something usable from which you can more effectively reflect on.
I see no reason why doing the same thing except in writing to an LLM couldn’t be equally effective.
emodendroket11 hours ago
Well OK, sure. But I’m still having a "conversation" with nobody. I’m surprised how often the AI gives me a totally wrong answer, but a combination of formulating the question and something in the answer makes me think of the right thing after all.
danparsonson11 hours ago
Indeed the duck is supposed to sit there in silence while the speaker does the thinking ^^
This is what human language does though, isn't it? Evolves over time, in often weird ways; like how many people "could care less" about something they couldn't care less about.
wmeredith17 hours ago
> and then you spend 80% of the time to get the rest of the 20% done
This was my pre-AI experience anyway, so getting that first chunk of time back is helpful.
Related: one of the better takes I've seen on AI from an experienced developer was, "90% of my skills just became worthless, and the other 10% just became 1,000 times more valuable." There's some hyperbole there, but I like the gist.
skydhash17 hours ago
It’s not funny when you find yourself redoing the first 80%, as the only way to complete the second 80%.
bluefirebrand15 hours ago
Let us know if that dev you're talking about winds up working 90% less for the same amount, or earning 1000x more
Otherwise he can shut the fuck up about being 1000x more valuable imo
qingcharles6 hours ago
This is just not true in my experience. Not with the latest models. I routinely manage to one-shot a whole "thing." E.g., yesterday I needed a WordPress plugin for one-time use to clean up a friend's site. I described exactly what I needed, it produced the code, it ran perfectly the first time, and the UI looked like a million dollars. It got me 100% of the way in 0% of the time.
I'm the biggest skeptic, but more and more I'm seeing it get me the bulk of the way with very little back-and-forth. If it was even more heavily integrated in my dev environment, it would save me even more time.
causal17 hours ago
Agreed and +1 on "always feels like it is almost there" leading to time sink. AI is especially good at making you feel like it's doing something useful; it takes a lot of skill to discern the truth.
eknkc17 hours ago
It works great on adding stuff to an already established codebase. Things like “we have these search parameters, also add foo”. Remove anything related to x…
antonvs17 hours ago
Exactly. If you can give it a contract and a context, essentially, and it doesn't need to write a large amount of code to fulfill it, it can be great.
I just used it to write about 80 lines of new code like that, and there's no question it saves time.
0110001117 hours ago
As an old dev this is really all I want: a sort of autocorrect for my syntactical errors to save me a couple compile-edit cycles.
pferde16 hours ago
What I want is not autocorrect, because that won't teach me anything. I want it to yell at me loudly and point to the syntactical error.
Autocorrect is a scourge of humanity.
i_love_retros14 hours ago
The problem is I then have to also figure out the code it wrote to be able to complete the final 20%. I have no momentum and am starting from almost scratch mentally.
narush17 hours ago
Hey HN, study author here. I'm a long-time HN user -- and I'll be in the comments today to answer questions/comments when possible!
If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.
causal17 hours ago
Hey I just wanted to say this is one of the better studies I've seen - not clickbaity, very forthright about what is being claimed, and presented in such an easy-to-digest format. Thanks so much for doing this.
narush16 hours ago
Thanks for the kind words!
isoprophlex15 hours ago
I'll just say that the methodology of the paper and the professionalism with which you are answering us here is top notch. Great work.
narush14 hours ago
Thank you!
igorkraw16 hours ago
Could you either release the dataset (raw but anonymized) for independent statistical evaluation, or at least add the absolute times of each dev per task to the paper? I'm curious what the absolute times of each dev with/without AI were, and whether the one guy with lots of Cursor experience was actually faster than the rest or just a slow typist getting a big boost out of LLMs.
Also, cool work. Very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.
narush16 hours ago
Yep, sorry, meant to post this somewhere but forgot in final-paper-polishing-sprint yesterday!
We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).
Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.
igorkraw14 hours ago
Cool, thanks a lot. Btw, I have a very tiny (50-to-100-listener) podcast where we try to give context to what we call the "muck" of AI discourse (trying to ground claims in what we would call objectively observable facts/evidence, and then _separately_ giving our own biased takes). If you would be interested in coming on it for a chat => contact email is in my profile.
ryanar12 hours ago
podcast link?
jsnider317 hours ago
It's good to know that Claude 3.7 isn't enough to build Skynet!
yawnxyz7 hours ago
Does this reproduce for early/mid-career engineers who aren't at the top of their game?
narush6 hours ago
How these results transfer to other settings is an excellent question. Previous literature would suggest speedup -- but I'd be excited to run a very similar methodology in those settings. It's already challenging as models + tools have changed!
JackC14 hours ago
(I read the post but not paper.)
Did you measure subjective fatigue as one way to explain the misperception that AI was faster? As a developer-turned-manager I like AI because it's easier when my brain is tired.
narush14 hours ago
We attempted to! We explore this more in the section Trading speed for ease (C.2.5) in the paper (https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf).
TLDR: mixed evidence that it makes the work less effortful for developers, from both quantitative and qualitative reports. The effect is unclear.
antonvs16 hours ago
Was any attention paid to whether the tickets being implemented with AI assistance were an appropriate use case for AI?
If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways to use AI that help a lot, and other ways that hurt more than it helps.
If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but reading the paper I didn't see any indication of that.
narush16 hours ago
The instructions given to developers were not just "implement with AI" - rather, they could use AI if they deemed it helpful, but they did _not need to use AI if they didn't think it would be helpful_. In about 16% of labeled screen recordings where developers were allowed to use AI, they chose to use no AI at all!
That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in section "Experimentally driven overuse of AI (C.2.1)" [1]
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
pera16 hours ago
Wow these are extremely interesting results, especially this part:
> This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
I wonder what could explain such large difference between estimation/experience vs reality, any ideas?
Maybe our brains are measuring mental effort and distorting our experience of time?
rsynnott6 minutes ago
> I wonder what could explain such large difference between estimation/experience vs reality, any ideas?
This bit I wasn't at all surprised by, because this is _very common_. People who are doing a [magic thing] which they believe in often claim that it is improving things even where it empirically isn't; very, very common with fad diets and exercise regimens, say. You really can't trust subjects' claims of efficacy of something that's being tested on them, or that they're testing on themselves.
And particularly for LLM tools, there is this strong sense amongst many fans that they are The Future, that anyone who doesn't get onboard is being Left Behind, and so forth. I'd assume a lot of users aren't thinking particularly rationally about them.
evanelias14 hours ago
Here's a scary thought, which I'm admittedly basing on absolutely nothing scientific:
What if agentic coding sessions are triggering a similar dopamine feedback loop as social media apps? Obviously not to the same degree as social media apps, I mean coding for work is still "work"... but there's maybe some similarity in getting iterative solutions from the agent, triggering something in your brain each time, yes?
If that was the case, wouldn't we expect developers to have an overly positive perception of AI because they're literally becoming addicted to it?
EarthLaunch14 hours ago
> The LLMentalist Effect: how chat-based Large Language Models replicate the mechanisms of a psychic’s con
https://softwarecrisis.dev/letters/llmentalist/
Plus there's a gambling mechanic: Push the button, sometimes get things for free.
lll-o-lll12 hours ago
This is very interesting and disturbing. We are outsourcing our decision making to an algorithmic “Mentalist” and will reap a terrible reward. I need to wean myself off the comforting teat of the chatbot psychic.
jwrallie11 hours ago
Like the feeling of the command line being always faster than using the GUI? Different ways we engage with a task can change our time perception.
I wish there was a simple way to measure energy spent instead of time. Maybe nature is just optimizing for something else.
csherratt14 hours ago
That's my suspicion too.
My issue with this being a 'negative' thing is that I'm not sure it is. It works off of the same hunting / foraging instincts that keep us alive. If you feel addiction to something positive, is it bad?
Social media is negative because it addicts you to mostly low-quality filler content. Content that doesn't challenge you. You are reading shitposts instead of reading a book or doing something better for you in the long run.
One could argue that's true for AI, but I'm not confident enough to make such a statement.
evanelias14 hours ago
The study found AI caused a "significant slowdown" in developer efficiency though, so that doesn't seem positive!
coffeefirst6 hours ago
This is fascinating and would go a long way toward explaining why people seem to have totally different experiences with the same machines.
alfalfasprout15 hours ago
I would speculate that it's because there's been a huge concerted effort to make people want to believe that these tools are better than they are.
The "economic experts" and "ml experts" are in many cases effectively the same group-- companies pushing AI coding tools have a vested interest in people believing they're more useful than they are. Executives take this at face value and broadly promise major wins. Economic experts take this at face value and use this for their forecasts.
This propagates further, and now novices and casual individuals begin to believe in the hype. Eventually, as an experienced engineer it moves the "baseline" expectation much higher.
Unfortunately this is very difficult to capture empirically.
longwave15 hours ago
I also wonder how many of the numerous AI proponents in HN comments are subject to the same effect. Unless they are truly measuring their own performance, is AI really making them more productive?
malfist12 hours ago
How would you even measure your own performance? You can't go and redo something while forgetting everything you learned along the way the first time.
jwrallie11 hours ago
You could go the same way as the study: flip a coin to decide whether to use AI, then write down the task you just did, the time you thought it took, and the actual clock time. Repeat and self-evaluate.
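In sketch form, that log only needs a couple of fields per task (the shapes below are illustrative, not from the study):

    interface SelfExperimentEntry {
      task: string;
      usedAI: boolean;           // decided by the coin flip
      estimatedMinutes: number;  // how long you felt it took
      actualMinutes: number;     // wall-clock time
    }

    // Average perception gap per condition: positive means you underestimated the time.
    function meanPerceptionGap(log: SelfExperimentEntry[], usedAI: boolean): number {
      const entries = log.filter(e => e.usedAI === usedAI);
      if (entries.length === 0) return NaN;
      return entries.reduce((sum, e) => sum + (e.actualMinutes - e.estimatedMinutes), 0) / entries.length;
    }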
malfist10 hours ago
Sample size of 16 is already hard enough to draw conclusions from. Sample size of 1 is even worse.
JelteF4 hours ago
It's the most representative sample size if you're interested in your own performance though. I really don't care if other people are more productive with AI, if I'm the outlier that's not then I'd want to know.
fiddlerwoaroof6 hours ago
I think just about every developer hack turns out this way: static vs dynamic types; keyboard shortcuts vs mice; etc. But I think it’s also possible to over-interpret these findings: using the tools that make your work enjoyable has important second-order effects even if they aren’t the productivity silver bullet everyone claims they are.
qingcharles5 hours ago
Part of it is that I feel I don't have to put as much mental energy into the coding part. I use my mental energy on the design and ideas, then kinda breeze through the coding now with AI at a much lower mental energy state than I would have when I was typing every single character of every line.
chamomeal14 hours ago
It’s funny cause I sometimes have the opposite experience. I tried to use Claude code today to make a demo app to show off a small library I’m working on. I needed it to set up some very boilerplatey example app stuff.
It was fun to watch, it’s super polished and sci-fi-esque. But after 15 minutes I felt braindead and was bored out of my mind lol
kokanee18 hours ago
> developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.
I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs with. The other is that it's tempting to time an AI with metrics like how long until the PR was opened or merged. But the AI workflow fundamentally shifts engineering hours so that a greater percentage of time is spent on refactoring, testing, and resolving issues later in the process, including after the code was initially approved and merged. I can see how it's easy for a developer to report that AI completed a task quickly because the PR was opened quickly, discounting the amount of future work that the PR created.
narush16 hours ago
Qualitatively, we don't see a drop in PR quality between the AI-allowed and AI-disallowed conditions in the study; the devs who participate are generally excellent, know their repositories' standards super well, and aren't really into the 'put up a bad PR' vibe -- the median review time on the PRs in the study is about a minute.
Developers do spend their time totally differently, though - this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time when they have AI vs. not - in general, when these devs have AI, they spend a smaller % of time writing code, and a larger % of time working with AI (which... makes sense).
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
qsort18 hours ago
It's really hard to attribute productivity gains/losses to specific technologies or practices, I'm very wary of self-reported anecdotes in any direction precisely because it's so easy to fool ourselves.
I'm not making any claim in either direction, the authors themselves recognize the study's limitations, I'm just trying to say that everyone should have far greater error bars. This technology is the weirdest shit I've seen in my lifetime, making deductions about productivity from anecdotes and dubious benchmarks is basically reading tea leaves.
gitremote15 hours ago
> I feel like there are two challenges causing this. One is that it's difficult to get good data on how long the same person in the same context would have taken to do a task without AI vs with.
The standard experimental design that solves this is to randomly assign participants to the experiment group (with AI) and the control group (without AI), which is what they did. This isolates the variable (with or without AI), taking into account uncontrollable individual, context, and environmental differences. You don't need to know how the single individual and context would have behaved in the other group. With a large enough sample size and effect size, you can determine statistical significance, and that the with-or-without-AI variable was the only difference.
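A minimal sketch of that design, with hypothetical data shapes rather than the study's actual analysis code: each issue gets its condition assigned at random before work starts, and the comparison is between condition means across many issues, not a re-run of any single task.

    interface CompletedIssue {
      developer: string;
      aiAllowed: boolean;   // assigned by randomization before work starts
      hours: number;        // observed completion time
    }

    // Random assignment isolates the with/without-AI variable from everything else.
    const assignCondition = (): boolean => Math.random() < 0.5;

    function meanHours(issues: CompletedIssue[], aiAllowed: boolean): number {
      const group = issues.filter(i => i.aiAllowed === aiAllowed);
      return group.reduce((sum, i) => sum + i.hours, 0) / group.length;
    }

    // Effect estimate: > 0 means the AI-allowed condition took longer on average.
    const estimatedSlowdown = (issues: CompletedIssue[]): number =>
      meanHours(issues, true) / meanHours(issues, false) - 1;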
yorwba17 hours ago
Figure 21 shows that initial implementation time (which I take to be time to PR) increased as well, although post-review time increased even more (but doesn't seem to have a significant impact on the total).
But Figure 18 shows that time spent actively coding decreased (which might be where the feeling of a speed-up was coming from) and the gains were eaten up by time spent prompting, waiting for and then reviewing the AI output and generally being idle.
So maybe it's not a good idea to use LLMs for tasks that you could've done yourself in under 5 minutes.
groos16 hours ago
One thing I've experienced in trying to use LLMs to code in an existing large code base is that it's _extremely_ hard to accurately describe what you want to do. Oftentimes, you are working on a problem with a web of interactions all over the code and describing the problem to an LLM will take far longer than just doing it manually. This is not the case with generating new (boilerplate) code for projects, which is where users report the most favorable interaction with LLMs.
9dev16 hours ago
That’s my experience as well. It’s where Knuth comes in again: the program doesn’t just live in the code, but also in the minds of its creators. Unless I communicate all that context from the start, I can’t just dump years of concepts and strategy out of my brain into the LLM without missing details that would be relevant.
phyzome8 hours ago
Hell, a lot of times I can't even explain an idea to my coworkers in a conversation, and I eventually say "I'll just explain it in code instead of words." And I just quickly put up a small PR that makes the skeleton of the changes (or even the entire changeset) and then we continue our conversation (or just do the review).
geerlingguy17 hours ago
So far in my own hobby OSS projects, AI has only hampered things as code generation/scaffolding is probably the least of my concerns, whereas code review, community wrangling, etc. are more impactful. And AI tooling can only do so much.
But it's hampered me in the fact that others, uninvited, toss an AI code review tool at some of my open PRs, and that spits out a 2-page document with cute emoji and formatted bullet points going over all aspects of a 30 line PR.
Just adds to the noise, so now I spend time deleting or hiding those comments in PRs, which means I have even _less_ time for actual useful maintenance work. (Not that I have much already.)
NewsaHackO17 hours ago
So they paid developers 300 x 246 = about $73K just for developer recruitment, for a study that is not in any academic journal and has no peer review? The underlying paper looks quite polished and not overtly AI generated, so I don't want to say it's entirely made up, but how were they even able to get funding for this?
narush17 hours ago
Our largest funding was through The Audacious Project -- you can see an announcement here: https://metr.org/blog/2024-10-09-new-support-through-the-aud...
Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate
iLoveOncall17 hours ago
This is really disingenuous when you also say that OpenAI and Anthropic have provided you with access and compute credits (on https://metr.org/about).
Not all payment is cash. Compute credits are still very much compensation.
golly_ned15 hours ago
Those are compute credits that are directly spent on the experiment itself. It's no more "compensation" than a chemistry researcher being "compensated" with test tubes.
iLoveOncall15 hours ago
> Those are compute credits that are directly spent on the experiment itself.
You're extrapolating, it's not saying this anywhere.
> It's no more "compensation" than a chemistry researcher being "compensated" with test tubes.
Yes, that's compensation too. Thanks for contributing another example. Here's another one: it's no more compensation than a software engineer being compensated with a new computer.
Actually the situation here is way worse than your example. Unless the chemistry researcher is commissioned by Big Test Tube Corp. to conduct research on the outcome of using their test tubes, there's no conflict of interest in that case. But there is an obvious conflict of interest in AI research being financed by credits that AI companies grant for the use of their own AI tools.
dolebirchwood16 hours ago
Is it "really" disingenuous, or is it just a misinterpretation of what it means to be "compensated for"? Seems more like quibbling to me.
iLoveOncall15 hours ago
I was actually being kind by saying it's disingenuous. I think it's an outright lie.
gtsop17 hours ago
Are you willing to be compensated with compute credits for your job?
Such companies spit out "credits" all over the place in order to gain traction and establish themselves. I remember when cloud providers gave VPS credits to startups like they were peanuts. To me, it really means absolutely nothing.
bawolff16 hours ago
I wouldn't do my job for $10, but if somehow someone did pay me $10 to do something, i wouldn't claim i wasn't compensated.
In-kind compensation is still compensation.
iLoveOncall16 hours ago
> Are you willing to be compensated with compute credits for your job?
Well, yes? I use compute for some personal projects so I would be absolutely fine if a part of my compensation was in compute credits.
As a company, even more so.
bee_rider17 hours ago
Companies produce whitepapers all the time, right? They are typically some combination of technical report, policy suggestion, and advertisement for the organization.
resource_waste17 hours ago
>which is not in any academic journal, or has no peer reviews?
As a philosopher who is into epistemology and ontology, I find this to be as abhorrent as religion.
'Science' doesn't matter who publishes it. Science needs to be replicated.
The psychology replication crisis is a prime example of why peer reviews and publishing in a journal matters 0.
bee_rider15 hours ago
> The psychology replication crisis is a prime example of why peer reviews and publishing in a journal matters 0.
Specifically, it works as an example of a specific case where peer review doesn’t help as much. Peer review checks your arguments, not your data collection process (which the reviewer can’t audit for obvious reasons). It works fine in other scenarios.
Peer review is unrelated to replication problems, except to the extent to which confused people expect peer review to fix totally unrelated replication problems.
raincole13 hours ago
Peer reviews are very important for filtering out obviously low-effort stuff.
...Or should I say "were" very important? With the help of today's GenAI, any low-effort submission can look high-effort without much extra effort.
fabianhjr17 hours ago
Most of the world provides funding for research, the US used to provide funding but now that has been mostly gutted.
iLoveOncall17 hours ago
https://metr.org/about Seems like they get paid by AI companies, and they also get government funding.
keerthiko15 hours ago
IME AI coding is excellent for one-off scripts, personal automation tooling (I iterate on a tool to scrape receipts and submit expenses for my specific needs) and generally stuff that can be run in environments where the creator and the end user are effectively the same (and only) entity.
Scaled up slightly, we use it to build plenty of internal tooling in our video content production pipeline (syncing between encoding tools and a status dashboard for our non-technical content team).
Using it for anything more than boilerplate code, well-defined but tedious refactors, or quickly demonstrating how to use an unfamiliar API in production code, before a human takes a full pass at everything is something I'm going to be wary of for a long time.
fritzo18 hours ago
As an open source maintainer on the brink of tech debt bankruptcy, I feel like AI is a savior, helping me keep up with rapid changes to dependencies, build systems, release methodology, and idioms.
candiddevmike15 hours ago
If you stewarded that much tech debt in the first place, how can you be sure an LLM will help prevent it going forward? In my experience, LLMs add more tech debt because their edits lack cohesion with the existing code.
aerhardt17 hours ago
But what about producing actual code?
fritzo17 hours ago
Producing code is overrated. There's lots of old code whose lifetime we can extend.
fhd215 hours ago
Very, very much this.
resource_waste17 hours ago
I find it useful for simple algorithms and error solving.
tcdent17 hours ago
This study neglects to incorporate the fact that I have forgotten how to write code.
narush16 hours ago
Honestly, this is a fair point -- and speaks to the difficulty of figuring out the right baseline to measure against here!
If we studied folks with _no_ AI experience, then we might underestimate speedup, as these folks are still learning the tools (see a discussion of learning effects in section (C.2.7) - Below-average use of AI tools - in the paper). If we studied folks with _only_ AI experience, then we might overestimate speedup, as perhaps these folks can't really program without AI at all.
In some sense, these are just two separate and interesting questions - I'm excited for future work to really dig in on both!
resource_waste17 hours ago
I'm curious what space people are working in where AI does their job entirely.
I can use it for parts of code, algorithms, error solving, and maybe sometimes a 'first draft'.
But there is no way I could finish an entire piece of software with AI only.
asdff16 hours ago
Not a lot of people are empowered to create an entire piece of software. Most are probably in the trenches squashing tickets.
tcdent15 hours ago
I do create entire pieces of software, and while my workflow is always evolving, it goes something like this:
Define schemas, interfaces, and perhaps some base classes that define the attributes I'm thinking about.
Research libraries that support my cause, and include them.
Reference patterns I have established in other parts of the codebase; internal tooling for database, HTTP services, etc.
Instruct the agent to come up with a plan for a first pass at execution in markdown format. Iterate on this plan; "what about X?"
Splat a bunch of code down that supports the structure I'm looking for. Iterate. Cleanup. Iterate. Implement unit tests and get them to pass.
Go back through everything manually and adjust it to suit my personal style, while at the same time fully understanding what's being done and why.
I use STT a lot to have conversations with the agent as we go, and very rarely allow it to make sequential edits without reviewing first; this is a great opportunity to go back and forth and refine what's being written.
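For flavor, the schema/interface step at the top of that list is often nothing more than a skeleton like this (names are illustrative, not from any real codebase) that the agent then has to build against:

    // Hypothetical starting point handed to the agent before any implementation exists.
    export interface LineItem {
      description: string;
      quantity: number;
      unitPriceCents: number;
    }

    export interface Invoice {
      id: string;
      customerId: string;
      lineItems: LineItem[];
      issuedAt: Date;
    }

    // The storage contract the generated code must satisfy; the concrete backend comes later.
    export interface InvoiceStore {
      save(invoice: Invoice): Promise<void>;
      findByCustomer(customerId: string): Promise<Invoice[]>;
    }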
asdff15 hours ago
You are going well above and beyond what a lot of people do to be fair. There are people in senior roles who are just futzing with json files.
joks12 hours ago
I think the question still stands.
codyb12 hours ago
So, slower until the learning curve is climbed (or, as one user posited, "until you forget how to work without it").
But isn't the important thing to measure... how long does it take to debug the resulting code at 3AM when you get a PagerDuty alert?
Similarly... how about the quality of this code over time? It's taken a lot of effort to bring some of the code bases I work in into a more portable, less coupled, more concise state through the hard work of
- bringing shared business logic up into shared folders
- working to ensure call chains flow top down towards root then back up through exposed APIs from other modules as opposed to criss-crossing through the directory structure
- working to separate business logic from API logic from display logic
- working to provide encapsulation through the use of wrapper functions creating portability
- using techniques like dependency injection to decouple concepts allowing for easier testing
etc
So, do we end up with better code quality that ends up being more maintainable, extensible, portable, and composable? Or do we just end up with lots of poor quality code that eventually grows to become a tangled mess we spend 50% of our time fighting bugs on?
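(To illustrate the dependency-injection point above with a hedged sketch - not code from any real codebase - the kind of decoupling I mean looks roughly like this:)

    // Business logic depends on an interface rather than a concrete API client,
    // so it stays portable and can be unit-tested with a stub.
    interface PaymentGateway {
      charge(accountId: string, amountCents: number): Promise<boolean>;
    }

    class BillingService {
      constructor(private readonly gateway: PaymentGateway) {}

      async chargeMonthlyFee(accountId: string): Promise<boolean> {
        return this.gateway.charge(accountId, 999);
      }
    }

    // Tests inject a stub; production wiring injects the real client.
    const stubGateway: PaymentGateway = { charge: async () => true };
    const billing = new BillingService(stubGateway);

The open question is whether agent-generated code preserves seams like that or quietly re-couples them.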
30minAdayHN17 hours ago
This study focused on experienced OSS maintainers. Here is my personal experience, from a very different persona (essentially the opposite of the one in the study). I always wanted to contribute to OSS but never had the time. I was finally able to do that, thanks to AI. Last month I contributed to 4 different repositories that I would never have dreamed of touching. I was using an async coding agent I built[1] to generate PRs given a GitHub issue. Some PRs took a lot of back and forth, and some were accepted as is. Without AI, there is no way I would have contributed to those repositories.
One thing that did work in my favor is that I was clearly creating a failing repro test case and including before-and-after results along with the PR. That helped get the PRs landed.
There are also a few PRs that never got accepted because the repro wasn't as strong or clear.
ares6233 hours ago
Did you make the contributions though? Or did the LLM?
This is not directed at you, but I am worried that contributors that use AI "exclusively" to contribute to OSS projects are extracting the value (street cred, being seen as part of the project community) without actually contributing anything (by being one more person that knows the codebase and can help steward it).
It's the same thing we've seen out of enshittification of everything. Value extraction without giving back.
Maybe I'm too much of a cynic. Maybe the majority of OSS projects don't care. But I know I will be saddened if one of the OSS projects I care about gets taken over by such "value extractors".
30minAdayHN2 hours ago
Did not take it personally. You brought up a good point.
I have a slightly different perspective. Imo, using OSS without contributing is value extraction without giving back.
If someone can fix a bunch of chores (that would still take human time) with the use of AI, even though they don't become stewards, I still see it as giving back. Of course, there is a value chain - contributing with AI without understanding the code is the bottom of value creation, and being a steward, like you mentioned, is the top. Along the way, if the contributor builds some sort of reputation that helps their career or other outcomes, so be it.
So in that sense, I don't see it as enshittification. AI might open a pathway to resolve a bunch of things that otherwise wouldn't be resolved. In fact, this was the line of thinking for the tool we built. Instead of people making these mindless PRs, can we build an agent that takes care of 'trivial' tasks? I manually created PRs to test that hypothesis.
There is also a natural self-selection here. If someone was able to fix something without understanding any code, that is also indicative of how trivial the task is. There is a reverse effect to my argument, though: these "AI contributors" can create a ton of PRs that generate a lot of review work for maintainers.
In my case, I was upfront about how I'm raising PRs and asked permission before working on certain issues. Maintainers are quite open and inviting.
ares623an hour ago
Thanks, I like your perspective. Hard to be optimistic these days.
Amaury-El9 hours ago
I've been using LLMs almost every day for the past year. They're definitely helpful for small tasks, but in real, complex projects, reviewing and fixing their output can sometimes take more time than just writing the code myself.
We probably need a bit less wishful thinking. Blindly trusting what the AI suggests tends to backfire. The real challenge is figuring out where it actually helps, and where it quietly gets in the way.
doctoboggan15 hours ago
For me, the measurable gain in productivity comes when I am working with a new language or technology. If I were to use Claude Code to help implement a feature in a Python library I've worked on for years, I don't think it would help much (maybe even hurt). However, if I use Claude Code on some Go code I have very little experience with, or use it to write/modify Helm charts, then I can definitely say it speeds me up.
But taking a broader view, it's possible that these initial speedups are negated by the fact that I never really learn Go or Helm charts as deeply now that I use Claude Code. Over time, it's possible that my net productivity is still reduced. Hard to say for sure, especially considering I might not have even tackled these more difficult Go library modifications if I didn't have Claude Code to hold my hand.
Regardless, these tools are out there, increasing in effectiveness and I do feel like I need to jump on the train before it leaves me at the station.
mattxxx8 hours ago
agreed - it helps transpose skills.
That said, this comes up often in my office. It's just not giving really good advice in many situations - especially novel ones.
AI is super good at coming up with things that have been written ad nauseam on coding-interview-prep websites.
thepasswordis14 hours ago
I actually think that pasting questions into chatGPT etc. and then getting general answers to put into your code is the way.
“One shotting” apps, or even cursor and so forth seem like a waste of time. It feels like if you prompt it just right it might help but then it never really does.
partdavid13 hours ago
I've done okay with copilot as a very smart autocomplete on: a) very typical codebase, with b) lots of boilerplate, where c) I'm not terribly familiar with the languages and frameworks, which are d) very, very popular but e) I don't really like, so I'm not particularly motivated to become familiar with them. I'm not a frontend developer, I don't like it, but I'm in a position now where I need to do frontend things with a verbose Typescript/React application which is not interesting from a technical point of view (it's a good product, just not one whose value comes from an interesting or demanding front end). Copilot (I use Emacs, so cursor is a non-starter, but copilot-mode works very well for Typescript) has been pretty invaluable for just sort of slogging through stuff.
For everything else, I think you're right, and actually the dialog-oriented method is way better. If I learn an approach and apply some general example from ChatGPT, but I do the typing and implementation myself so I need to understand what I'm doing, I'm actually leveling up and I know what I'm finished with. If I weren't "experienced", I'd worry about what it was doing to my critical thinking skills, but I know enough about learning on my own at this point to know I'm doing something.
I'm not interested in vibe coding at all--it seems like a one-way process to automate what was already not the hard part of software engineering; generating tutorial-level initial implementations. Just more scaffolding that eventually needs to be cleared away.
bit199313 hours ago
It used to be that all you needed to program was a computer and a willingness to RTFM, but now we need to pay for API "tokens" and pray that there's no rug pull in the future.
danparsonson11 hours ago
"It used to be that all you required to write was a pen and paper but now we need to pay for 'electricity'..."
You can still do those things.
Jabrov18 hours ago
Very interesting methodology, but the sample size (16) is way too low. Would love to see this repeated with more participants.
narush17 hours ago
Noting that most of our power comes from the number of tasks that developers complete; it's 246 total completed issues in the course of this study -- developers do about 15 issues (7.5 with AI and 7.5 without AI) on average.
biophysboy15 hours ago
Did you compare the variance within individuals (due to treatment) to the variance between individuals (due to other stuff)?
IshKebab18 hours ago
They paid the developers about $75k in total to do this so I wouldn't hold your breath!
barbazoo18 hours ago
That's a lot of money for many of us. Do you know whether those folks were in a HCOL area?
mapt17 hours ago
It isn't a lot of money for industry research. Changes of +-40% in productivity are an enormous advantage/disadvantage for a large tech company moving tens of billions of dollars a year in cashflow through a pipeline that their software engineers built.
IshKebab17 hours ago
No idea. They don't say who they were; just random popular GitHub projects.
To be clear it wasn't $75k each.
narush15 hours ago
You can see a list of repositories with participating developers in the appendix! Section G.7.
Paper is here: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
lawlessone17 hours ago
Neat, how to sign up??
asdff16 hours ago
I see these things posted on LinkedIn. Usually offering $40/hr though. But it's essentially the same thing the OP outlines: you do some domain-related task, assigned either with or without an AI tool. Check LinkedIn. They will have really vague titles like "data scientist", even though that's not what's being described - it's being a study subject. Maybe set $40/hr as a filter on LinkedIn and see if you can get a few to come up.
IshKebab17 hours ago
Go back in time, create a popular github repo with lots of stars, be lucky.
semireg8 hours ago
Something I don’t see mentioned that’s been helpful to me is having an agent add strict type safety to my TypeScript. I avoid the “any” type, and berating an agent to “make it work” really opens my eyes and forces me to learn how advanced TypeScript can be leveraged. I feel that creating complex types that make my code simpler and make autocomplete work(!) is a great tradeoff in some meta dimension of software dev.
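A hedged example of the kind of type this ends up producing (illustrative only) - replacing an “any” payload with a discriminated union, so the editor can narrow the case and autocomplete the right fields:

    // Before: function handle(event: any) { ... }  -- no narrowing, no autocomplete.
    type JobEvent =
      | { kind: "queued"; queuedAt: Date }
      | { kind: "running"; startedAt: Date; workerId: string }
      | { kind: "failed"; error: string; retryable: boolean };

    function describe(event: JobEvent): string {
      switch (event.kind) {
        case "queued":
          return `queued at ${event.queuedAt.toISOString()}`;
        case "running":
          return `running on worker ${event.workerId}`;
        case "failed":
          return event.retryable ? `will retry: ${event.error}` : `failed: ${event.error}`;
      }
    }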
MYEUHD17 hours ago
This does not mention the open-source developer time wasted while reviewing vibe coded PRs
narush16 hours ago
Yeah, I'll note that this study does _not_ capture the entire OS dev workflow -- you're totally right that reviewing PRs is a big portion of the time that many maintainers spend on their projects (and thanks to them for doing this [often hard] work). In the paper [1], we explore this factor in more detail -- see section (C.2.2) - Unrepresentative task distribution.
There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros to it too!
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
lmeyerov14 hours ago
As someone who has been doing hardcore genai for 2+ years, here's my experience, and what we advise internally:
* 3 weeks to transition from AI pairing to AI delegation to AI multitasking. So work gains mostly come in week 3+. That's 120+ hours in, as someone pretty senior here.
* Speedup is the wrong metric. Think throughput, not latency. Some finite amount of work might take longer, but the volume of work should go up because AI can do more on a task and across different tasks/projects in parallel.
Both perspectives seem consistent with the paper description...
phyzome8 hours ago
Have you actually measured this?
Because one of the big takeaways from this study is that people are bad at predicting and observing their own time spent.
lmeyerov7 hours ago
yes, I keep prompt plan logs
At the same time... that's not why I'm comfortable writing this. It's pretty obvious when you know what good vs bad feels like here and adjust accordingly:
1. Good: You are able to generate a long plan and that plan mostly works. These are big wins _as long as you are multitasking_: you are high throughput, even if the AI is slow. Think running 5-20min at a time for pretty good progress, for just a few minutes of your planning that you'd largely have to do anyways.
2. Bad: You are wasting a lot of attention chatting (so 1-2min runs) and repairing (re-planning from the top, vs progressing). There is no multitasking win.
It's pretty clear what situation you're in, with run duration on its own being a ~10X level difference.
Ex: I'll have ~3 projects going at the same time, and/or whatever else I'm doing. I'm not interacting "much" so I know it's a win. If a project is requiring interaction, well, now I need to jump in, and it's no longer agentic coding IMO, but chat assistant stuff.
At the same time, I power through case #2 in practice because we're investing in AI automation. We're retooling everything to enable long runs, so we'll still do the "hard" tasks via AI to identify & smooth the bumps. Similar to infrastructure-as-code and SDLC tooling, we're investing in automating as much of our stack as we can, so that means we figure out prompt templates, CI tooling, etc to enable the AI to do these so we can benefit later.
mrwaffle15 hours ago
My overall concern, building on the important points mentioned by simonw and narush, has to do with our developer ecosystem. I've been concerned about this for years, but AI reliance seems to be pouring jet fuel on the fire. Particularly troubling is the lack of understanding less-experienced devs will have over time. Does anyone have a counter-argument they can share on why this is a good thing?
partdavid13 hours ago
The shallow analogy is like "why worry about not being able to do arithmetic without a calculator"? Like... the dev of the future just won't need it.
I feel like programming has become increasingly specialized and even before AI tool explosion, it's way more possible to be ignorant of an enormous amount of "computing" than it used to be. I feel like a lot of "full stack" developers only understand things to the margin of their frameworks but above and below it they kind of barely know how a computer works or what different wire protocols actually are or what an OS might actually do at a lower level. Let alone the context in which in application sits beyond let's say, a level above a kubernetes pod and a kind of trial-end-error approach to poking at some YAML templates.
Do we all need to know about processor architectures and microcode and L2 caches and paging and OS distributions and system software and installers and openssl engines and how to make sure you have the one that uses native instructions and TCP packets and envoy and controllers and raft systems and topic partitions and cloud IAM and CDN and DNS? Since that's not the case--nearly everyone has vast areas of ignorance yet still does a bunch of stuff--it's harder to sell the idea that whatever AI tools are doing that we lose skills in will somehow vaguely matter in the future.
I kind of miss when you had to know a little of everything and it also seemed like "a little bit" was a bigger slice of what there was to know. Now you talk to people who use a different framework in your own language and you feel like you're talking to deep specialists whose concerns you can barely understand the existence of, let alone have an opinion on.
ChrisMarshallNY16 hours ago
It's been very helpful for me. I find ChatGPT the easiest to use; not because it's more accurate (it isn't), but because it seems to understand the intent of my questions most clearly. I don't usually have to iterate much.
I use it like a know-it-all personal assistant that I can ask any question to; even [especially] the embarrassing, "stupid" ones.
> The only stupid question is the one we don't ask.
- On an old art teacher's wall
solid_fuel11 hours ago
I would love to see a comparison of the pull requests generated by each workflow, if possible. My experience with Copilot has generally been that it suggests far more code than I would actually write to solve a specific problem - sometimes adding extra checks where they aren't needed, sometimes just being more verbose than I would be, and oftentimes repeating itself where it would be better to use an abstraction.
My personal hypothesis is that seeing the LLM write _so much_ code may create the feeling that the problems it is solving would take longer to solve by yourself.
narush5 hours ago
Check out section AI increasing issue scope (C.2.3) in the paper -- https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
We speak (the best we can) to changes in amount of code -- I'll note that this metric is quite messy and hard to reason about!
asciimov15 hours ago
I'll be curious of the long term impacts of AI.
Such as: do you end up spending more time to find and fix issues, does AI use reduce institutional knowledge, will you be more inclined to start projects over from scratch.
heisenbit17 hours ago
AI sometimes points out hygiene issues that may have been swept under the carpet but, once pointed out, can't be ignored anymore. I know I don't need that error handling - I'm certain for the near future - but maybe it will be needed... Also, the code produced by the AI has some impedance mismatch with my natural code. Then one needs to figure out whether that is due to shifting best practices, best practices I had ignored until now, or the AI being overwhelmed with code from beginners. This all takes time - some of it is transient, some of it actually improves things, and some of it is waste. The jury is still out.
gmaster144015 hours ago
What if the slowdown isn't a bug but a feature? What if AI tools are forcing developers to think more carefully about their code, making them slower but potentially producing better results? AFAIK the study measured speed, not quality, maintainability, or correctness.
The developers might feel more productive because they're engaging with their code at a higher level of abstraction, even if it takes longer. This would be consistent with why they maintained positive perceptions despite the slowdown.
PessimalDecimal14 hours ago
In my experience, LLMs are not causing people to think more carefully about their code.
0xmusubi16 hours ago
I find myself having discussions with AI about different design possibilities and it sometimes comes up with ideas I hadn't thought of or features I wasn't aware of. I wouldn't classify this as "overuse" as I often find the discussions useful, even if it's just to get my thoughts down. This might be more relevant for larger scoped tasks or ones where the programmer isn't as familiar with certain features or libraries though.
atleastoptimal16 hours ago
I’m not surprised that AI doesn’t help people with 5+ years experience in open source contribution, but I’d imagine most people aren’t claiming AI tools are at senior engineer level yet.
Soon, once the tools and how people use them improve, AI won't be a hindrance for advanced tasks like this, and soon after, AI will be able to do these PRs on its own. It's inevitable given the rate of improvement even since this study.
artee_4916 hours ago
Even for senior levels the claim has been that AI will speed up their coding (take it over) so they can focus on higher level decisions and abstract level concepts. These contributions are not those and based on prior predictions the productivity should have gone up.
atleastoptimal11 hours ago
It would be different, I'm sure, if they were making contributions to repos they had less familiarity with. In my experience, and in talking with those who use AI most effectively, it is best leveraged as a way of getting up to speed or creating code for a framework/project you have less familiarity with. The rough ratio determining the effectiveness of non-AI coding vs AI coding is (the user's familiarity with the codebase x the complexity of the codebase) : (the amount of closed-loop abstraction in the tasks the coder needs to carry out).
Currently AI is like a junior engineer, and if you don't have good experience managing junior engineers, AI isn't going to help you as much.
dash218 hours ago
The authors say "High developer familiarity with repositories" is a likely reason for the surprising negative result, so I wonder if this generalizes beyond that.
kennywinker18 hours ago
Like if it generalizes to situations where the developer is not familiar with the repo? That doesn’t seem like generalizing, that seems like specifying. Am I wrong in saying that the majority of developer time is spent in repos that they’re familiar with? Every job and project I’ve worked has been on a fixed set of repos the entire time. If AI is only helpful for the first week or two on a project, that’s not very many cases it’s useful for.
jbeninger13 hours ago
I'd say I write the majority of my code in areas I'm familiar with, but spend the majority of my _time_ on sections I'm not familiar with, and ai helps a lot more with the latter than the former. I've always felt my coding life is speeding through a hundred lines of easy code then getting stuck on the 101st. Then as I get more experienced that hundred becomes 150, then 200, but always speeding through the easy part until I have to learn something new.
So I never feel like I'm getting any faster. 90% of my time is still spent in frustration, even when I'm producing twice the code at higher quality
add-sub-mul-div17 hours ago
Without the familiarity would the work be getting done effectively? What does it mean for someone to commit AI code that they can't fully understand?
ieie336614 hours ago
LLMs are god-tier if you know what you’re doing and prompt them with ”do X”, where X is a SELF-CONTAINED change you would know how to implement manually.
For example, today I asked claude to implement per-user rate-limiting into my nestjs service, then iterated by asking implementing specific unit tests and some refactoring. It one-shot everything. I would say 90% time savings.
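For context, the shape of that change is roughly the following (a from-memory sketch, not the actual diff; it assumes an auth layer already attaches user.id to the request, and a real deployment would likely use @nestjs/throttler or a shared store rather than an in-memory Map):

    import { CanActivate, ExecutionContext, Injectable } from '@nestjs/common';

    @Injectable()
    export class PerUserRateLimitGuard implements CanActivate {
      private readonly hits = new Map<string, { count: number; windowStart: number }>();
      private readonly limit = 100;        // requests allowed per window
      private readonly windowMs = 60_000;  // one-minute window

      canActivate(context: ExecutionContext): boolean {
        const req = context.switchToHttp().getRequest();
        const userId: string = req.user?.id ?? req.ip;  // assumption: auth middleware sets req.user
        const now = Date.now();
        const entry = this.hits.get(userId);
        if (!entry || now - entry.windowStart > this.windowMs) {
          this.hits.set(userId, { count: 1, windowStart: now });
          return true;
        }
        entry.count += 1;
        return entry.count <= this.limit;
      }
    }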
Unskilled people ask them ”i have giant problem X solve it” and end up with slop
lloeki33 minutes ago
I tried exactly that, several times, over and over.
Except on "hello world" situations (which I guess is a solid part of the corpus LLMs are trained with) these tools were consistently slower.
Last time was an area where several files were subtly different in a section that essentially does about the same thing, and needed to be aligned and made consistent†.
Time to - begrudgingly - do it manually: 5min
Time to come up with a one-shot shell incantation: 10min
Time to very dumbly manually mark the areas with ===BEGIN=== and ===END=== and come up with a one-shot shell incantation: 3min
Time to do it for the LLM: 45min††; also it required regular petting every 20ish commands, so zero chance of letting it run and doing something else†††.
Time to review + manually fix the LLM output which missed two sections, left obsolete comments, and modified four files that were entirely unrelated yet clearly declared as out of scope in the prompt: 5min
Consistently, proponents have been telling me "yeah you need to practice more, I'm getting fine results so you're holding it wrong, we can do a session together and I'll show you how to do it", which they do, and then it doesn't work, and they're like "well I'll look into it and circle back" and I never hear from them again.
As for suggestions, for every good completion I accept saying "oh well, why not", 99 get rejected: the majority are complete hallucinations absolutely unrelated to the surrounding logic, a third are either broken or introduce non-working code, and 1-5 are _actively dangerous_ in some way.
The only places where I found LLMs vaguely useful are:
- Asking questions about an unknown codebase. It still hallucinates and misdirects or is excessively repetitive about some things (even with rules) but it can crudely draw a rough "map" and make non-obvious connections about two distant areas, which can be welcome.
- Asking for a quick code review in addition to the one I ask to humans; 70% of such output is laughably useless (although harmless beyond the noise + energy cost), 30% is duplicate of human reviews but I can get it earlier, and sometimes it unearths a good point that has been overlooked.
† No, the specific section cannot+should not be factored out
†† And that's because I interrupted it because it was going about modifying files that it should not have.
††† A bit of a lie because I did the other three ways during that time. Which also is telling because the time to do the other ways would actually be _lower_ because I was interrupted by / had to keep tabs on what the AI agent was doing.
cadamsdotcom13 hours ago
My hot take: Cursor is a bad tool for agentic coding. Had a subscription and canceled it in favor of Claude Code. I don’t want to spend 90% of my time approving every line the agent wants to write. With Claude Code I review whole diffs - 1-2 minutes of the agent’s work at a time. Then I work with the agent at a level of what its approach is, almost never asking about specific lines of code. I can look at 5 files at once in git diff and then ask “why’d you choose that way?” “Can we undo that and try to find a simpler way?”
Cursor’s workflow exposes how differently different people track context. The best ways to work with Cursor may simply not work for some of us.
If Cursor isn’t working for you, I strongly encourage you to try CLI agents like Claude Code.
isoprophlex15 hours ago
Ed Zitron was 100% right. The mask is off and the AI subprime crisis is coming. Reading TFA, it would be hilarious if the bubble burst AND it turns out there's actually no value to be had, at ANY price. I for one can't wait for this era of hype to end. We'll see.
you're addicted to the FEELING of productivity more than actual productivity. even knowing this, even seeing the data, even acknowledging the complete fuckery of it all, you're still gonna use me. i'm still gonna exist. you're all still gonna pretend this helps because the alternative is admitting you spent billions of dollars on spicy autocomplete.
nestorD17 hours ago
One thing I could not find on a cursory read is how used those developers were to AI tools. I would expect someone using them regularly to benefit, while someone who has only played with them a couple of times would likely be slowed down as they deal with the friction of learning to be productive with the tool.
uludag17 hours ago
In this case though you still wouldn't necessarily know if the AI tools had a positive causal effect. For example, I practically live in Emacs. Take that away and no doubt I would be immensely less effective. That Emacs improves my productivity and without it I am much worse in no way implies that Emacs is better than the alternatives.
I feel like a proper study for this would involve following multiple developers over time, tracking how their contribution patterns and social standing changes. For example, take three cohorts of relatively new developers: instruct one to go all in on agentic development, one to freely use AI tools, and one prohibited from AI tools. Then teach these developers open source (like a course off of this book: https://pragprog.com/titles/a-vbopens/forge-your-future-with...) and have them work for a year to become part of a project of their choosing. Then in the end, track a number of metrics such as leadership position in community, coding/non-coding contributions, emotional connection to project, social connections made with community, knowledge of code base, etc.
Personally, my prior probability is that the no-ai group would likely still be ahead overall.
Leo-thorne8 hours ago
I really admire stories like this. Reaching $1M ARR without any funding is rare and feels real. It shows what building something truly takes. Late nights, tough moments, losing users. It's not about big bursts of growth but staying consistent, solving real problems, and growing revenue little by little. There's a lot to learn from that.
semireg8 hours ago
Wrong thread
DrNosferatu8 hours ago
I would say AI has very different impacts on individual coding styles.
AIorNot14 hours ago
Hey guys, why are we making it so complicated? Do we really need a paper and study?
Anyway - AI as the tech currently stands is a new skill to use, and it takes us humans time to learn, but once we do it well, it becomes a force multiplier.
ie see this: https://claude.ai/public/artifacts/221821f0-0677-409b-8294-3...
lompadan hour ago
Because for now, that's just what those financially profiting from the AI-hype tell us. Be it sama, hyung or nadella, they all profit if people _believe_ AI is a force multiplier for everybody. Reality is much more muddy though and it's absolutely not as obvious as those people claim.
And keep in mind that a 5-10x price hike is to be expected if those companies keep spending billions to make millions.
Right now, there is a consistent stream of papers incoming which indicates that AI might be much more of a specialized tool for very particular situations instead of the "solve everything"-tool the hype makes people believe. This is highly significant.
"Just believe me bro" is just not enough.
_jayhack_15 hours ago
This does not take into account the fact that experienced developers working with AI have shifted into roles of management and triage, working on several tasks simultaneously.
Would be interesting (and in fact necessary to derive conclusions from this study) to see aggregate number of tasks completed per developer with AI augmentation. That is, if time per task has gone up by 20% but we clear 2x as many tasks, that is a pretty important caveat to the results published here
budududuroiu10 hours ago
My theory is that, outside of programming skill, amazement by AI tools is inversely proportional to typing/navigating speed.
I already know what I need to write, I just need to get it into the editor. I wouldn’t trade the precision I have with vim macros flying across multiple files for an AI workflow.
I do think AI is a good rubber ducky sometimes tho, but I despise letting it take over editing files.
journal8 hours ago
Would you be worse without it? Now prepare to pay $1000+/month for ChatGPT in a few years when the dust settles.
zzzeek17 hours ago
As a project for work, I've been using Claude CLI all week to do as many tasks as possible. So with my week's experience, I'm now an expert in this subject and can weigh in.
Two things that stand out to me are: 1. it depends a lot on what kind of task you are having the LLM do, and 2. even if the LLM process takes more time, it is very likely your cognitive effort was still way less. For sysadmin kinds of tasks, working with less-often-accessed systems, LLMs can read --help, man pages, and doc sites for you and give you the working command right there (and then run it, look at the output, and tell you why it failed, or how it worked, and what it did). There is absolutely no question that second part is a big deal. Sticking it onto my large open source project to fix a deep, esoteric issue or write some subtle documentation where it doesn't really "get" what I'm doing - yeah, it is not as productive in that realm, and you might want to skip it for the thinking part there. I think everyone is trying to figure out this question of "when and how" for LLMs. I think the sweet spot is tasks involving systems and technologies where you'd otherwise be spending a lot of time googling, stackoverflowing, and reading man pages to get just the right parameters into commands and so forth. This is cognitive grunt work and LLMs can do that part very well.
My week of effort with it was not really "coding on my open source project"; two examples were: 1. running a bunch of ansible playbooks that I wrote years ago on a new host, where OS upgrades had lots of snags. I worked with Claude to debug all the various error messages and places where the newer OS distribution had different packages, missing packages, etc. It was ENORMOUSLY helpful, since I never look at these playbooks and don't even remember what I did; Claude can read them and interpret them as well as you can. 2. I got a bugzilla for a fedora package that I packaged years ago, where they have some change to the directives used in specfiles that everyone has to make. I look at fedora packaging workflows once every three years. I told Claude to read the BZ and just do it. IT DID IT. I had to get involved running the "mock" suite as it needed sudo, but Claude gave me the commands. Zero googling. Zero even reading the new format of the specfile (the BZ linked to a tool that does the conversion). From bug received to bug closed, I didn't do any typing at all outside of the prompt. Had it done before breakfast, since I didn't even need any glucose for the mental energy expended. This would have been a painful and frustrating mental effort otherwise.
so the studies have to get more nuanced and survey a lot more than 16 devs I think
8note4 hours ago
> esoteric issue or write some subtle documentation where it doesnt really "get" what I'm doing, yeah it is not as productive in that realm and you might want to skip it for the thinking part there
I've been sensing in these cases that I don't have a good enough way to express these problems, and that I actually need to figure that out, or the rest of my team, whether they're using AI or not, are gonna have a real hard time understanding the change I made.
LegNeato15 hours ago
For certain tasks it can speed me up 30x compared to an expert in the space: https://rust-gpu.github.io/blog/2025/06/24/vulkan-shader-por...
lpghatguy15 hours ago
This is very disingenuous: we don't know how much spare time Sascha spent, and much of that time was likely spent learning, experimenting, and reporting issues to Slang.
OpenSourceWard15 hours ago
Very cool work! And I love the nuance in your methodology and findings. Anyway, I'm preparing myself for all the "Bombshell news: AI is slowing down developers" posts that are coming.
mwigdahl11 hours ago
Plus the gaslighting to follow for anyone claiming AI improved their productivity.
phyzome8 hours ago
Well, it would be a nice counterweight to all the gaslighting of people who claim AI doesn't improve their productivity...
castratikron17 hours ago
I really like those graphics - does anyone know what tool was used to create them?
narush16 hours ago
The graphs are all matplotlib. The methodology figure is built in Figma! (Source: I'm a paper author :)).
AvAn1216 hours ago
N = 16 developers. Is this enough to draw any meaningful conclusions?
sarchertech16 hours ago
That depends on the size of the effect you're trying to measure. If Cursor provides a 5x, 10x, or 100x productivity boost, as many people are claiming, you'd expect to see that in a sample size of 16 unless there's something seriously wrong with your sample selection.
If you are looking for a 0.1% increase in productivity, then 16 is too small.
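As a back-of-envelope illustration (a simulation I just made up, nothing to do with the paper's actual methodology; the unit noise level and normality are assumptions), you can see how detectability scales with effect size at n = 16:

    # How often would n=16 noisy per-developer measurements detect a true effect?
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 16

    def power(true_effect, noise_sd=1.0, trials=10_000, alpha=0.05):
        hits = 0
        for _ in range(trials):
            sample = rng.normal(true_effect, noise_sd, n)   # each dev's measured speedup
            _, p = stats.ttest_1samp(sample, 0.0)           # is the mean effect nonzero?
            hits += p < alpha
        return hits / trials

    for effect in (0.001, 0.1, 0.5, 2.0):  # tiny tweak ... huge "many-x" style boost
        print(effect, power(effect))

Under those assumptions, the tiny effect should be detected at roughly the false-positive rate, while the huge one should be detected essentially every time: n alone tells you little without the effect size.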
biophysboy15 hours ago
Well it depends on the variance of the random variable itself. You're right that with big, obvious effects, a larger n is less "necessary". I could see individuals having very different "productivities", especially when the idea is flattened down to completion time.
AvAn1214 hours ago
“A quarter of the participants saw increased performance, 3/4 saw reduced performance.” So I think any conclusions drawn on these 16 people don't signify much one way or the other. Cool paper, but how is this anything other than a null finding?
tripletao4 hours ago
They show a 95% CI excluding zero in Figure 1. By the usual standards of social science, that's not a null finding. They give their methodology in Appendix D.
For intuition on why it's insufficient to consider N alone: presumably you'd greatly increase your belief that a coin was unfair long before seeing 16 consecutive heads. As already noted, the size of the effect also matters. That relationship isn't intuitive in general, and attempts to replace the math with feelings tend to fail.
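The coin version in actual numbers (just textbook arithmetic, not anything from the paper):

    # Chance of 16 heads in 16 flips of a fair coin
    from scipy.stats import binomtest
    print(0.5 ** 16)                      # ~1.5e-05
    print(binomtest(16, 16, 0.5).pvalue)  # two-sided test of fairness, ~3e-05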
phyzome8 hours ago
Or N = 246 tasks.
swayvil17 hours ago
AI by design can only repeat and recombine past material. Therefore actual invention is out.
8note4 hours ago
It's not a huge deal. I don't need the AI to invent a replacement for the for loop or the map function; I only want it to use those tools.
I'm the one providing the invention; it's transforming my invention into an implementation, sometimes better than other times.
keeda13 hours ago
Pretty much all invention is a novel combination of known techniques. Anything that introduces a fundamentally new technique is usually in the realm of groundbreaking papers and Nobel prizes.
luibelgo15 hours ago
Is that actually proven?
greenchair15 hours ago
The easiest way to see this for yourself is with an image generator. Try asking for a very specific combination of things that would not normally appear together in an art piece.
atleastoptimal16 hours ago
HN moment
elpakal17 hours ago
underrated comment
thesz14 hours ago
What is interesting here is that all predictions were positive, but the results were negative.
This shows that everyone in the study (economic experts, ML experts, and even the developers themselves, even after gaining experience) is a novice if we look at them through the lens of the Dunning–Kruger effect [1].
[1] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
"The Dunning–Kruger effect is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities."
mattl14 hours ago
I don't understand how anyone doing open source can use something trained on other people's code as a tool for contributions.
I wouldn't accept someone's copy-and-pasted code from another project if it were under an incompatible license, let alone something of unknown origin.
afro8816 hours ago
Early 2025. I imagine the results would be quite different with mid-2025 models and tools.
phyzome8 hours ago
If they used mid 2025 models and tools, the paper would have come out in late 2025, and you would have had the same complaint.
achenet17 hours ago
I find agents useful for showing me how to do something I don't already know how to do, but I could see how for tasks I'm an expert on, I'd be faster without an extra thing to have to worry about (the AI).
IshKebab17 hours ago
I wonder if the discrepancy is that it felt like it was taking less time because they were having to do less thinking, which feels like it is easier and hence faster.
Even so... I still would be really surprised if there wasn't some systematic error here skewing the results, like the developers deliberately picked "easy" tasks that they already knew how to do, so implementing them themselves was particularly fast.
Seems like the authors had about as good a methodology as you can get for something like this. It's just really hard to test stuff like this. I've seen studies proving that code comments don't matter, for example... are you going to stop writing comments? No.
narush16 hours ago
> which feels like it is easier and hence faster.
We explore this factor in section (C.2.5) - "Trading speed for ease" - in the paper [1]. It's labeled as a factor with an unclear effect: some developers seem to think so, and others don't!
> like the developers deliberately picked "easy" tasks that they already knew how to do
We explore this factor in (C.2.2) - "Unrepresentative task distribution." I think the effect here is unclear; these are certainly real tasks, but they are sampled from the smaller end of tasks developers would work on. I think the relative effect on AI vs. human performance is not super clear...
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
IshKebab14 hours ago
Sounds like you've thought of everything!
incomingpain17 hours ago
Essentially an advertisement against Cursor Pro and/or Claude Sonnet 3.5/3.7
Personally, when I tried tools like Void IDE, I was fighting with Void too much. It is beta software, it is buggy, but there's also the big one: the learning curve of the tool.
I haven't had the chance to try Cursor, but I imagine it's going to have a learning curve as a new tool.
So perhaps a slowdown is expected at first; but later, after you get your context and prompting down pat and ask specifically for what you want, you get your speedup.
tarofchaos13 hours ago
Totally flawed study
phyzome8 hours ago
This comment isn't very useful if you don't actually point out some flaws. :-)
dboreham17 hours ago
Any time you see the word "measuring" in the context of software development, you know what follows will be nonsense and probably in service of someone's business model.
inetknght17 hours ago
> We pay developers $150/hr as compensation for their participation in the study.
Can someone point me to these 300k/yr jobs?
phyzome8 hours ago
This is more like a contractor pay rate, not a salary, which is appropriate to the nature of the work here.
akavi15 hours ago
L5 ("Senior") at any FAANG co, L6 ("Staff") at pretty much any VC-backed startup in the bay.
recursive17 hours ago
These are not 300k/yr jobs.