Hacker News

orange_puff
Ask HN: What explains the recent surge in LLM coding capabilities?

It seems like we are in the midst of another AI hype cycle. Many people are calling the current coding models an "inflection point", where capabilities are now so high that future model growth will be explosive. I have heard serious people, like economics writer Noah Smith, make this argument [0].

But it's not just the commentariat. I have seen very serious people in software engineering and tech talk about the ways in which their coding habits have changed drastically.

Benchmarks [1] alone don't seem to capture everything, although there have been jumps in the agentic sections, so maybe they actually do.

My question is: what explains these big jumps in capabilities that many serious people seem to be noticing all at once? Is it simply that we have thrown enough data and compute at the models, or are labs perhaps fine-tuning models to get really good at tool calls, which leads to this new, surprising behavior?

When I explain agents to people, I usually walk them through a manual task one might go through when debugging code. You copy some code into ChatGPT, it asks you for more context, you copy some more code in, it suggests an edit, you apply it and run, there is an error, so you paste that in, and so on. An agent is just an LLM in that loop, able to use tools to do those things automatically. It would not be shocking to me if taking a weaker model like Claude Opus 4.0 and making it 10x better at tool calls produced a much stronger and more impressive model. But is that all that is happening, or am I missing something big?
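
To make that loop concrete, here is a minimal Python sketch of an agent (call_llm and run_tool are hypothetical stand-ins for a chat-completion call and a tool executor, not any particular vendor's API):

    def call_llm(messages):
        # Hypothetical stand-in: would send the conversation to a chat model and
        # return either {"type": "tool_call", "tool": ..., "args": ...}
        # or {"type": "answer", "text": ...}.
        raise NotImplementedError

    def run_tool(name, args):
        # Hypothetical stand-in: read a file, apply an edit, run the tests, etc.
        raise NotImplementedError

    def agent(task, max_steps=20):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_llm(messages)
            if reply["type"] == "answer":        # model says it is done
                return reply["text"]
            result = run_tool(reply["tool"], reply["args"])
            messages.append({"role": "tool", "content": result})  # paste the output back in
        return "gave up"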

[0] https://substack.com/@noahpinion/p-187818379

[1] https://www.anthropic.com/news/claude-opus-4-6


segmondy (a day ago)

RLVR: Reinforcement Learning with Verifiable Rewards. Prior to this it was RLHF, Reinforcement Learning from Human Feedback. The models can now be trained without a human in the loop for coding problems: you give them code to solve and you have a means of verifying the answer, think of a unit test. The model writes the code; if the test fails, it gets a fail, and if it passes, it gets a pass. Do enough of this and the model really learns to code on its own and to operate better as an agent. That's the main thing that has changed between last year and this year.
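
A minimal sketch of that verifiable-reward check, assuming the training harness can drop the model's code and a hidden test file into a sandbox and run pytest (the file names and the use of pytest here are assumptions for illustration, not anything a specific lab has published):

    import os
    import subprocess
    import tempfile

    def verifiable_reward(candidate_code: str, test_code: str) -> float:
        # Reward is 1.0 if the model's code passes the unit tests, else 0.0.
        # The test suite is the judge; no human grader in the loop.
        with tempfile.TemporaryDirectory() as d:
            with open(os.path.join(d, "solution.py"), "w") as f:
                f.write(candidate_code)
            with open(os.path.join(d, "test_solution.py"), "w") as f:
                f.write(test_code)
            proc = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=d, capture_output=True, timeout=60,
            )
            return 1.0 if proc.returncode == 0 else 0.0

RL then pushes the model toward outputs that score 1.0, which is why this scales so much better than paying humans to grade answers.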

numeri (17 hours ago)

And if I were to guess, the latest generation of models (Claude Opus 4.6, GPT-5.3-codex, etc.) differs from Opus 4.5 and GPT-5.2 primarily in the addition of deeper, more difficult tasks (most likely agentic and coding-based, like Terminal Bench) to their RLVR training.

I could be completely off, as my intuition here is fully based on public research papers, but it seems to explain the current state of things fairly well.

softwaredoug (2 days ago)

Codex/Claude gather telemetry by default. That’s why they are subsidized. You’re giving them training data.

If you start with everything on GitHub, plus maybe some manually annotated prompts for fine-tuning, you get a decent base model of “if you see this code, then this other code follows”, but you'll only go so far.

If you can track how thousands of people actually use prompts, and which tool-usage patterns result in success, then you can fine-tune on even more data (and train the model to avoid the unsuccessful patterns). Now you're training with much more data, grounded in how people actually use the product, not theoretical scenarios.
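
A toy sketch of that filtering step (the session fields below are invented for illustration; the point is just "keep the trajectories that ended in success as new training examples"):

    # Invented telemetry records: the prompt, the tool calls the session produced,
    # and whether the session ended in success (tests passed, diff accepted, etc.).
    sessions = [
        {"prompt": "fix the failing date parser",
         "tool_calls": ["read file", "edit file", "run tests"],
         "succeeded": True},
        {"prompt": "add caching to the API client",
         "tool_calls": ["edit file", "run tests"],
         "succeeded": False},
    ]

    # Keep only successful trajectories as fine-tuning examples.
    training_examples = [
        {"input": s["prompt"], "target": s["tool_calls"]}
        for s in sessions
        if s["succeeded"]
    ]

    print(training_examples)  # only the date-parser session survives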

In ML it always boils down to the training data.

dudewhocodes (2 days ago)

The agents are good enough to get results when an experienced person is properly guiding these tools.

However, during the holidays there was a big marketing campaign, mainly on Twitter. Everyone suddenly started posting the same talking points over and over, which triggered a storm of FOMO at the perfect moment, when nobody was working.

There was no sudden huge jump. I've been using AI coding tools since 2024 and was surprised to see this sudden hype, since the tools already worked OK before.

ediblelegible (2 days ago)

I think your intuition is correct and there’s nothing crazy novel happening

The recent surge is mainly predictable growth along the same technical direction we've been trending in. It's just that it got good enough for people to notice.

In the 3 Body Problem series, the author describes a scenario where

(SPOILERS!!)

… humans are blocked on “fundamental” scientific gains but can still develop incredible technologies that make life unrecognizable

I think we’re seeing that scenario. Scaling RL for tool use and reasoning, improved pretraining data quality, larger models, better agent architectures, improved inference efficiency, etc, are all just incremental moves along the same branch, nothing fundamentally new or powerful

Which is not to say it’s not amazing or valuable. Just that it was not the result of anything super innovative from a research perspective. Model T to 2026 Camry was an amazing shift without really changing the combustion engine.

kingkongjaffa (2 days ago)

> Model T to 2026 Camry was an amazing shift without really changing the combustion engine.

A lot of the time, the big jumps in internal combustion engine development have come down to materials science or improvements in manufacturing capability.

The underlying thermodynamics and theoretical limits have not changed, but the individual parts and what we make them out of have steadily improved over time.

The other factor is the need for emissions-reduction strategies as an overriding design constraint.

The analogues to these two in LLMs are:

1. The harnesses and agentic-systems-focused training have gotten better over time, so performance has increased without a step change in the foundation models.

2. The requirements for guardrails, anti-prompt-injection measures, and other mechanisms that make LLMs palatable for use by consumers and businesses.

coder4rover (2 days ago)

Quantum computing, such that permutations of code to prompt are possible as it tries to arrive at some kind of statistically probable solution.

kypro (12 hours ago)

Imo, the base models are still kinda meh, but the tooling built around them has matured significantly, to the point that the average coder can no longer deny their utility.

By early summer 2025, models had gotten good enough at tool calling that agentic coding started to work really well. If you had a piece of code you could write unit tests against, then the results from agentic coding assistants like Devin or Cursor were fairly impressive.

They still struggled with context, though. It also felt like they didn't do enough research before starting tasks and didn't plan well on their own. Typically, if you asked them to do something, they would hack something together while making way too many assumptions, and it would often take a lot of iteration if you didn't do the research and planning for them.

Claude Code has done a really good job of addressing this. You don't have to think that much anymore: Claude Code will do a lot of the research and planning before writing code, so given a relatively simple prompt it will do a fairly good job in most cases.
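
Roughly, the difference is an explicit research/plan phase in front of the edit loop. A toy sketch (all helpers here are made-up placeholders, not Claude Code's actual internals):

    def explore_repo(repo_path, query):
        # Placeholder: grep/read the files relevant to the task first.
        return "relevant snippets for: " + query

    def call_llm(prompt):
        # Placeholder: ask the model for a step-by-step plan before any edits.
        return "1. reproduce the bug\n2. add a failing test\n3. fix and re-run"

    def execute_step(step):
        # Placeholder: edit files / run commands for this step of the plan.
        print("executing:", step)

    def run_task(task, repo_path):
        context = explore_repo(repo_path, task)            # research first
        plan = call_llm(f"Task: {task}\nContext: {context}\nWrite a plan.")
        for step in plan.splitlines():                     # then act on the plan
            execute_step(step)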

Of course, base models have improved a bit too, but not that significantly. People who work closely with the APIs know the raw model output is not a step change in the way the output of agentic coding tools is.

Honestly, I think most people were just unaware of how good things were getting... Then late last year the agentic coding tools got good enough that anyone using them immediately saw the benefit without having to learn how to use them, whereas in the summer you really had to know how to use them to see the benefits.

I feel similarly about the current hype as I did when ChatGPT launched... A lot of people were really blown away, but anyone actually following the progress being made was impressed, just much less so. It was less “oh wow, my computer can now talk to me like a human” and more “oh wow, this is a really good implementation of a SOTA large language model”.

