Hacker News

bombastic311
Levels of Agentic Engineering bassimeledath.com

krackers15 hours ago

What level is copy pasting snippets into the chatgpt window? Grug brained level 0? I sort of prefer it that way (using it as an amped up stackoverflow) since it forces me to decompose things in terms of natural boundaries (manual context management as it were) and allows me to think in terms of "what properties do I need this function to have" rather than just letting copilot take the wheel and glob the entire project in the context window.

ddxv7 hours ago

I still do this too for tough projects in languages I know. I've been burned too many times thinking 'wow, it one-shot that!' only to end up debugging later.

I let agents run wild on frontend JS because I don't know it well and trust them (and there's an output I can look at).

tracker12 hours ago

IMO, the front end results are REALLY hit and miss... I mostly use it to scaffold if I don't really care because the full UI is just there to test a component, or I do a fair amount of the work mixed. I wish it was better at working with some of the UI component libraries with mixed environments. Describing complex UX and having it work right are really not there yet.

Lord-Jobo5 hours ago

1.8: chat ide the slow way :)

This is also where I do most of my AI use. It's the safe spot where I'm not going to accidentally send proprietary info to an unknown number of eyeballs (computer or human).

It’s also just cumbersome enough that I’m not relying on it too much and stunting my personal ability growth. But I’m way more novice than most on here.

tracker12 hours ago

I've found it's easy enough to have AI scaffold a working demo environment around a single component/class that I'm actually working on, then I can copy the working class/component into my "real" application. I'm in a pretty locked down environment, so using a separate computer and letting the AI scaffold everything around what I'm working on is pretty damned nice, since I cannot use it in the environment or on the project itself.

For personal projects, I'm able to use it a bit more directly, but would say I'm using it around 5/6 level as defined here... I've leaned on it a bit for planning stages, which helps a lot... not sure I trust swarms of automated agents, though it's pretty much the only way you're going to use the $200 level on Claude effectively... I've hit the limits on the $100 only twice in the past month, I downgraded after my first month. And even then, it just forced me to take a break for an hour.

giancarlostoro5 hours ago

I think you bring up a good point; it falls under Chat IDE, but it's the "lowest" tier, if you will. Nothing wrong with it, a LOT of us started this way.

vorticalbox7 hours ago

I do this too with the ChatGPT Mac app. It has a "pop out" feature bound to Option + Space, then I just ask away.

branoco7 hours ago

anything, if it brings the results

antonvs11 hours ago

I find the CLI agents a decent middle ground between the extremes you describe. There’s a reason they’ve gained some popularity.

waynesonfire15 hours ago

Your technique doesn't keep the kool-aid flowing. Shut up. /s

The more I try to use these tools to push up this "ladder", the more it becomes clear the technology is no more than a 10x better Google search.

mzga day ago

As a lowly level 2 who remains skeptical of these software “dark factories” described at the top of this ladder, what I don’t understand is this:

If software engineering is enough of a solved problem that you can delegate it entirely to LLM agents, what part of it remains context-specific enough that it can’t be better solved by a general-purpose software factory product? In other words, if you’re a company that is using LLMs to develop non-AI software, and you’ve built a sufficient factory to generate that software, why don’t you start selling the factory instead of whatever you were selling before? It has a much higher TAM (all of software)

2001zhaozhaoa day ago

Why sell the factory when you can create automated software cloner companies that make millions off of instantly copying promising startups as soon as they come out of stealth?

If you could get a dark factory working when others don't have one, you can make much more money using it than however much you can make selling it

tkiolp421 hours ago

That’s not true. Even if we assume LLMs can generate the code needed to support the next Facebook, one still has to: buy/rent tons of hardware (virtual or baremetal), put tons of money in marketing, break the network effect, pay for 3rd party services for monitoring, alerting and what not. That’s money, and LLMs don’t help with that

antonvsa day ago

Producing the software is only a small part of the picture when it comes to generating revenue.

So far, we haven’t seen much to suggest that LLMs can (yet) replace sales and most of the related functions.

DrScientist10 hours ago

Was listening to a radio programme recently with 3 entrepreneurs talking about being entrepreneurs.

In relation to sales, there were two gems. For direct to consumer type companies - influencers are where it's at right now especially during bootstrap phase - and they were talking about trying to keep marketing budget under 20% of sales.

Another, who is mostly in the VC business, finds the best way to gain traction for his startups is to create controversy - ie anything to be talked about.

In both cases you are trying to be talked about - either by directly paying for people to do that, or by providing entertainment value so people talk about you.

You could argue that both of those activities have already been automated - and the nice thing about sales is that there's a fairly direct feedback loop you can actively learn from.

AdamN8 hours ago

Yeah I really would like to know how many bots are on reddit (and on particular subreddits/threads) and also how many are here!

The interesting thing though is that the bots are just cheaper versions of real human influencers. So nothing has changed aside from scale (and speed) - the underlying mechanism of paying for word of mouth is the same as it's been for a long time.

jillesvangurp12 hours ago

You can do a lot of work with agents to remove a lot of manual work around the sales process. Sales is a lot of grinding on leads, contacts, follow ups, etc. And a lot of that is preparation work (background research, figuring out who to talk to, who the customer is, etc.), making sure follow ups are scheduled appropriately, etc.

You still should talk to people yourself and be very careful with communicating AI slop, cold outreach and other things that piss off more people than they get into your funnel. But a lot of stuff around this can be automated or at least supported by LLMs.

Most of the success with sales is actually having something that people want to buy. That sounds easy. But it's actually the hardest part of selling software. Getting there is a bit of a journey.

I've built a lot of stuff that did not sell well. These are hard earned lessons. I see a lot of startups fall into this trap. You can waste years on product development and many people do. Until it starts selling, it won't matter. Sales is not a person you hire to do it for you: you have to be able to sell it yourself. If you can't, nobody else will be able to either. Founder sales is crucial. Step back from that once it runs smoothly, not before.

Use AI to your advantage here. We use it for almost everything. SEO, wording stuff on our website, competitor analysis, staying on top of our funnel, analyzing and sharpening our pitches, preparing responses to customer questions and demands, criticizing and roasting our own pitches and ideas, etc. Confirmation bias is going to be your biggest blind spot. And we also use LLMs to work on the actual product. This stuff is a lot of work. If you can afford a ten person team to work on this, great. But many startups have to prove themselves before the funding for that happens. And when it does, hiring lots of people isn't necessarily a good allocation of resources given you can automate a lot of it now. I'd recommend hiring fewer but better people.

antonvs11 hours ago

Your points are all valid, but it doesn’t really change the situation that was being discussed: an AI company trying to enter completely new markets just because they can write software for it is hardly some sort of automatic win. They’re much more likely to fail than succeed.

I mentioned sales and marketing but there’s a whole lot more as well. Basically, it involves creating an entire subsidiary. Perhaps the time will come when that can be mostly done by a team of AI agents, but right now that’s a big hurdle in practice.

DrScientist9 hours ago

It does raise the question of where companies will compete in the future.

What's the balance going to be between, 'connecting customers to product' and 'making differentiated product'?

In theory, if customers have perfect information (ignoring that a very large part of sales is emotional), then the former part will disappear. However the rise of the internet, and perhaps AI agents shopping on your behalf, hasn't really made much of a dent there [1] - marketing, in all its forms, is still huge business - and you could argue it's still expanding (cf. Google).

[1] Perhaps because of the huge importance of the emotional component. Perhaps also because in many areas of manufacturing you've reached a product plateau already - is there much space to make a better cup and plate?

majormajor4 hours ago

There's also a world where "all companies have access to the software factory so sales and entrepreneurship in software disappears entirely."

But in that scenario it's hard to see where the unwinding stops. What are these other companies doing and which parts of it actually need humans if the "agents" are that good? Marketing? No. Talking to customers? No. Support? No. Financial planning and admin? No. Manufacturing? Some, for now. Shipping physical goods? For now. What else...

At some point where even are your customers?

pixl975 hours ago

>It does raise the question of where companies will compete in the future.

Exactly where current companies compete: rent-seeking, IP control, and legal machinations.

Hence you'll see a few giant lumbering dinosaurs control most of the market, and a few more nimble companies make successful releases until they either get crushed by or snapped up by the larger companies, or become large companies themselves.

bandrami12 hours ago

I mean, until we've at least been through a full lifecycle with its TCO we can't really say LLMs have replaced producing the software

whattheheckhecka day ago

Too bad they can't

[deleted]9 hours ago

hakanderyala day ago

We are not there yet. While there are teams applying dark factory models to specific domains with self-reported success, it's yet to be proven, or generalizable enough to apply everywhere.

glhasta day ago

Also a measly level 2er. I'm curious what kind of project truly needs an autonomous agent team Ralph looping out 10,000 LOCs per hour? Seems like harness-maxxing is a competitive pursuit in its own right existing outside the task of delivering software to customers.

Feels like K8s cult, overly focused on the cleverness of _how_ something is built versus _what_ is being built.

maxdo12 hours ago

Essentially any enterprise software, for example - surprisingly, anything that needs to be custom-tailored rather than scaled to millions of views, i.e. anything with high context.

The YouTubes of this world will not benefit from it; they play by the rules of scale for billions of users.

Every dashboard chart, security review system, Jira, ERP, CRM, LMS, chatbot, you name it. Any problem that benefits from customization per smaller unit (a company, a group of people, or even an individual, like a CEO or a CxO group) will win from such software.

Levels 6 and 7 are essentially the death of enterprise software.

pixl975 hours ago

>Levels 6 and 7 are essentially the death of enterprise software

Enterprise software that you sell, or enterprise software you use internally?

The amount of self-created, self-used software in enterprises is staggering; that software will still exist, and still have a massive maintenance cost. So maybe we need a better definition of enterprise software here - externally sold software, perhaps? Also, a huge amount of that software still has regulatory requirements, so someone will have to sign off on it. Maybe it will be internal certification, but very often there's separation of duties on things like that, where it's easier to have the sign-off come from a different company.

cheevly21 hours ago

Software that is otherwise not feasible for humans to build by hand.

draftsman20 hours ago

Example?

pydrya day ago

I have the same question about people who sell "get rich with real estate" seminars.

dist-epocha day ago

Codex and Claude Code are these (proto)factories you talk about - almost every programmer uses them now.

And when they become fully dark factories, yes, what will happen is that a LOT of software companies will just disappear - dis-intermediated by Codex/Claude Code.

patchnull5 hours ago

The biggest practical gap I've hit isn't between any two levels — it's context rot. An agent at level 5+ that runs for 20 minutes will rediscover the same architecture, re-read the same files, and make the same wrong assumptions it corrected earlier in the session. The token window is a hard ceiling that no amount of prompt engineering fixes. The teams I've seen get real mileage from higher levels all converged on the same pattern: file-based state that persists between agent invocations, not longer context windows.

jessmartin4 hours ago

strong agree. I always have the LLM put an actual markdown doc in a docs/plans/ folder before starting work. I often, but not always review it.

Aside: it also helps for code review! Review bots can point out the diff between plan and implementation.

Some examples for the curious: https://github.com/sociotechnica-org/symphony-ts/tree/main/d...

patchnull4 hours ago

The plan-vs-implementation diff for code review is a neat idea I had not considered. That essentially gives reviewers a spec to review against instead of just reading code in isolation and guessing at intent. Curious if you have ever had the agent update the plan doc mid-implementation when it discovers the original approach will not work, or do you treat the plan as immutable once written?

ElFitz2 hours ago

It's one of the things that surprised me when I first started using the compound engineering plugin.

I've been considering adding a review gate with a reviewing model solely tasked with identifying gaps between the plan and the implementation.

elpakal5 hours ago

> file-based state that persists between agent invocations

Can you expand on this with a practical example?

patchnull5 hours ago

Sure. Instead of relying on the LLM context window to remember what it already tried or decided, you write key state to disk. Something like a STATE.md that tracks "already checked X, decided Y, next step Z." Each time the agent spins up or resumes after a long tool call, it reads that file first. You can also use structured formats like a JSON task queue where each entry has a status field. The agent pops the next pending task, does it, marks it done. Cheap and simple but it completely eliminates the "wait, didn't I already do this" problem that burns tokens on long runs.
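
A minimal sketch of the JSON-task-queue variant described above (the file name, field names, and task contents here are hypothetical, just to make the shape concrete):

```python
import json
from pathlib import Path

QUEUE = Path("agent_tasks.json")  # hypothetical state file shared across agent sessions

def save_tasks(tasks):
    """Persist the queue so the next agent invocation can resume it."""
    QUEUE.write_text(json.dumps(tasks, indent=2))

def load_tasks():
    """Read the persisted queue; a fresh run starts empty."""
    return json.loads(QUEUE.read_text()) if QUEUE.exists() else []

def next_pending(tasks):
    """First task not yet done - where a resumed agent should pick up."""
    return next((t for t in tasks if t["status"] == "pending"), None)

# Seed a queue the way a planning step might, then simulate one agent turn:
tasks = [
    {"id": 1, "status": "done",    "note": "already checked X"},
    {"id": 2, "status": "pending", "note": "decided Y, apply it"},
    {"id": 3, "status": "pending", "note": "next step Z"},
]
save_tasks(tasks)

# A later invocation re-reads state instead of relying on context memory:
tasks = load_tasks()
task = next_pending(tasks)   # picks up task 2, skipping the done one
task["status"] = "done"
save_tasks(tasks)
```

The point is that the file, not the context window, is the source of truth about progress.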

elpakal4 hours ago

thanks

sveme5 hours ago

One example: I have the agent distill the essence of all previous discussions into a spec.md file, check it for completeness, and remove all previous context before continuing.

fudfomo4 hours ago

It needs a canonical source of truth, something isolated agents can't provide easily. There are tools out there like specularis that help you do that and keep specs in sync.

manbitesdog4 hours ago

...at least until we get real Test-Time Training (TTT) that encodes the state into model weights. If vast amounts of human knowledge can be compressed into ~400GB for frontier models, it's easy to imagine the same for our entire context

vidimitrova day ago

Level 4 is where I see the most interesting design decisions get made, and also where most practitioners take a shortcut that compounds badly later.

When the author talks about "codifying" lessons, the instinct for most people is to update the rules file. That works fine for conventions - naming patterns, library preferences, relatively stable stuff. But there's a different category of knowledge that rules files handle poorly: the why behind decisions. Not what approach was chosen, but what was rejected and why the tradeoff landed where it did.

"Never use GraphQL for this service" is a useful rule to have in CLAUDE.md. What's not there: that GraphQL was actually evaluated, got pretty far into prototyping, and was abandoned because the caching layer had been specifically tuned for REST response shapes, and the cost of changing that was higher than the benefit for the team's current scale. The agent follows the rule. It can't tell when the rule is no longer load-bearing.

The place where this reasoning fits most naturally is git history - decisions and rejections captured in commit messages, versioned alongside the code they apply to. Good engineers have always done this informally. The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory.

At level 7, this matters more than people expect. Background agents running across sessions with no human-in-the-loop have nothing to draw on except whatever was written down. A stale rules file in that context doesn't just cause mistakes - it produces confident mistakes.

redhale10 hours ago

It is for this reason that I usually keep an "adr" folder in my repo to capture Architecture Decision Record documents in markdown. These allow the agent to get the "why" when it needs to. Useful for humans too.

The challenge is really crafting your main agent prompt such that the agent only reads the ADRs when absolutely necessary. Otherwise they muddy the context for simple inside-the-box tasks.
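
For readers who haven't used ADRs: a minimal record in the spirit of Michael Nygard's widely used template might look like the sketch below (the file content is a made-up illustration, not from the parent comment, and echoes the GraphQL example discussed upthread):

```markdown
# ADR-0007: Use REST instead of GraphQL for the reporting service

## Status
Accepted (2024-03-12)

## Context
GraphQL was prototyped for two sprints. Our caching layer is tuned for
fixed REST response shapes; re-tuning it was judged more expensive than
the benefit at our current scale.

## Decision
Keep REST endpoints. Revisit if per-client field selection becomes a
measured bottleneck.

## Consequences
Clients over-fetch some fields. The "no GraphQL here" rule in CLAUDE.md
is load-bearing only while the caching layer stays shape-specific.
```

The Consequences section is what lets an agent (or human) later judge whether a rule is still load-bearing.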

alexey-pelykh6 hours ago

The bottleneck isn't agent capability. It's captured institutional knowledge.

I structure system prompts (CLAUDE.md) with verification gates: pre-task checkpoints, approach selection, post-completion rescans. When an agent writes an ADR during a refactor, future agents reference it before touching the same code. The context compounds across sessions.

Commit messages capture what. ADRs capture why. Skill files capture how the team works. That last layer is what most setups miss. The gap isn't level 6 vs 8, it's whether architectural reasoning is machine-readable or trapped in someone's head.

hkonte16 hours ago

The "why behind decisions" gap is real. Rules files flatten tradeoffs into mandates. One pattern that helps: treating instructions as typed blocks rather than prose. A `context` block carries the rationale (what was evaluated, what the tradeoffs were), a `constraints` block carries the conclusion. The agent follows rules, but the blocks make it easier to audit which constraints are still load-bearing vs. historical artifacts.

I've been building github.com/Nyrok/flompt around this idea, a visual prompt builder that decomposes prompts into 12 semantic blocks and compiles to Claude-optimized XML. The block separation turns out to be useful exactly for this case: context is not constraints, and they shouldn't live in the same flat text blob.

sd921 hours ago

I had a hunch that this comment was LLM-generated, and the last paragraph confirmed it. Kudos for managing to get so many upvotes though.

"Where most [X] [Y]" is an up-and-coming LLM trope, which seems to have surfaced fairly recently. I have no idea why, considering most claims of that form are based on no data whatsoever.

solarkraft20 hours ago

It’s still an insightful and well written comment, but the LLM-ness does make me wonder whether this part was actually human-intended or just LLM filler:

> The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory

Because I somewhat agree that discipline may be missing, but I don’t believe it to be a groundbreaking revelation that it’s actually quite easy to tell the LLM to put key reasoning that you give it throughout the conversation into the commits and issue it works on.

svstoyanovv20 hours ago

Suppose you spend months deeply researching a niche topic. You make your own discoveries, structure your own insights, and feed all of this tightly curated, highly specific context into an LLM. You essentially build a custom knowledge base and train the model on your exact mental framework.

Is this fundamentally different from using a ghostwriter, an editor, or a highly advanced compiler? If I am doing the heavy lifting of context engineering and knowledge discovery, it feels restrictive to say I shouldn't utilize an LLM to structure the final output. Yet, the internet still largely views any AI-generated text as inherently "un-human" or low-effort.

nothrabannosir9 hours ago

I would ignore any HN content written by a ghost writer or editor. I guess I would flag compiler output but I’m not sure we’re talking about the same thing?

I’m on the internet for human beings. I already read a newspaper for editors and books for ghostwriters.

Not for long though, HN is dying. Just hanging around here waiting for the next thing, I guess…

pixl975 hours ago

Sorry man, the internet has died and is not being replaced by anything but an authoritarian nightmare.

My only guess is if you want actual humans, you'll have to do this IRL. Of course we as humans have gotten used to the 24/7 availability and scale of the internet, so this is going to be a problem, as these meetings won't provide the hyperactive environment we want.

Any other digital system will be gamed in one way or another.

sd919 hours ago

The problem is: the structure of LLM outputs generally make everything sound profound. It’s very hard to understand quickly whether a comment has actual signal or it’s just well written bullshit.

And because the cost of generating the comments is so low, there’s no longer an implicit stamp of approval from the author. It used to be the case that you could kind of engage with a comment in good faith, because you knew somebody had spent effort creating it so they must believe it’s worth time. Even on a semi-anonymous forum like HN, that used to be a reliable signal.

So a lot of the old heuristics just don’t work on LLM-generated comments, and in my experience 99% of them turn out to be worthless. So the new heuristic is to avoid them and point them out to help others avoid them.

I would much rather just read the prompt.

blackcatsec19 hours ago

I hadn't considered this so eloquently with LLM text output, but you're right. "LLMs make everything sound profound" and "well-written bullshit".

This has severe ramifications for internet communications in general on forums like HN and others, where it seems LLM-written comments are sneaking in pretty much everywhere.

It's also very, very dangerous :/ Because the structure of the writing falsely implies authority and trust where there shouldn't be, or where it's not applicable.

sirtaj15 hours ago

I have a skill and template for adding ADRs to the documentation for this purpose.

[deleted]14 hours ago

smallnix21 hours ago

A good rule would then be to capture such reasoning, at least when made during the session with the agent, in the commit messages the agent creates.

vidimitrov21 hours ago

That’s exactly the direction I went with. Working on a spec for exactly this - planning to post it here soon:

https://github.com/berserkdisruptors/contextual-commits

oliver_dr10 hours ago

[dead]

throwaw129 hours ago

As a Level 6, I'm feeling like going back to Level 5.

Level 6 helps with fixing bugs, but adding a new feature in a scalable way is not working out for me. I feed it a bunch of documents and ask it to analyze them and come up with a solution.

1. It misses some details from docs when summarizing

2. It misses some details from the code and its architecture, especially in multi-repo Java projects (annotations and 100 levels of inheritance confuse it a lot)

3. Then it comes up with an obvious (non-)"solution" based on the incorrect context summaries.

I don't think I can give full autonomy to these things yet.

But then, I wonder: why don't the people on Level 8 create a bunch of clones of games and SaaS products and start making billions?

jeanloolz7 hours ago

Success, especially online, is rarely about the thing that is built; it's more about the marketing around it. I don't think we can fully automate marketing effectively.

AdamN8 hours ago

Which model(s) are you using?

jjmarra day ago

I coded a level 8 orchestration layer in CI for code review, two months before Claude launched theirs.

It's very powerful and agents can create dynamic microbenchmarks and evaluate what data structure to use for optimal performance, among other things.

I also have validation layers that trim hallucinations with handwritten linters.

I'd love to find people to network with. Right now this is a side project at work on top of writing test coverage for a factory. I don't have anyone to talk about this stuff with so it's sad when I see blog posts talking about "hype".

moosehatera day ago

Do you feel like you are still learning about the programming language(s) and other technologies you are using? Or do you feel like you are already a master at them?

Do you ever take the time to validate what one of the agents produces by going to the docs? Or is all debugging/changing of the code done via LLMs/agents?

I'm more like level 2 right now and genuinely curious if you feel like learning continues for you (besides with agentic orchestration, etc.) And if not, whether or not you think that matters.

jjmarr21 hours ago

I'm learning more than ever before. I'm not a master at anything but I am getting basic proficiency in virtually everything.

> Do you ever take the time to validate what one of the agents produces by going to the docs? Or is all debugging/changing of the code done via LLMs/agents?

I divide my work into vibecoding PoC and review. Only once I have something working do I review the code. And I do so through intense interrogation while referencing the docs.

> I'm more like level 2 right now and genuinely curious if you feel like learning continues for you (besides with agentic orchestration, etc.)

Level 8 only works in production for a defined process where you don't need oversight and the final output is easy to trust.

For example, I made a code review tool that chunks a PR and assigns rule/violation combos to agents. This got a 20% time-to-merge reduction and catches 10x the issues of any other agent because it can pull context. And the output is easy to incorporate since I have a manager agent summarize everything.

Likewise, I'm working on an automatic performance tool right now that chunks code, assigns agents to make microbenchmarks, and tries to find optimization points. The end result should be easy to verify since the final suggestion would be "replace this data structure with another, here's a microbenchmark proving so".
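
The chunk-and-fan-out pattern described here can be sketched roughly as below. The chunk size, rule names, and task shape are all invented for illustration; the real tool presumably hands each task to a separate review agent and merges the results:

```python
from itertools import product

def chunk_diff(diff_lines, max_lines=50):
    """Split a unified diff into fixed-size chunks so each chunk
    fits comfortably in a single agent's context window."""
    return [diff_lines[i:i + max_lines]
            for i in range(0, len(diff_lines), max_lines)]

# Hypothetical rule set; a real tool would have many more.
RULES = ["no-leaked-secrets", "error-handling", "naming"]

def fan_out(diff_lines):
    """One task per (chunk, rule) combination. Each task would go to
    its own review agent; a manager agent summarizes afterwards."""
    chunks = chunk_diff(diff_lines)
    return [{"chunk": ci, "rule": rule, "status": "pending"}
            for ci, rule in product(range(len(chunks)), RULES)]

# A 120-line diff at 50 lines/chunk -> 3 chunks x 3 rules = 9 tasks.
tasks = fan_out([f"+ line {i}" for i in range(120)])
```

Because each agent sees only one chunk and one rule, it can spend its whole context budget pulling relevant surrounding code.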

moosehater21 hours ago

Got it. This all makes sense to me. Very targeted tooling that is specific to your company's CI platform as opposed to a dark factory where you're creating a bunch of new code no one reads. And it sounds like these level 8 agents are given specific permission for everything they're allowed to do ahead of time. That seems sound from an engineering perspective.

Also would be interested in an example of "validation layers that trim hallucinations with handwritten linters" but understand if that's not something you can share. Either way, thanks for responding!

jjmarr20 hours ago

> Also would be interested in an example of "validation layers that trim hallucinations with handwritten linters"

For code review, AI doesn't want to output well-formed JSON and oftentimes doesn't leave inline suggestions cleanly. So there's a step where the AI must call a script that validates the JSON and checks if applying the suggestion results in valid code, then fixes the code review comments until they do.
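
A rough sketch of what such a validation step might look like. The field names, and using `ast.parse` as a stand-in for "applying the suggestion results in valid code", are my assumptions, not details from the comment:

```python
import ast
import json

def validate_review(raw):
    """Check an agent's code-review output: well-formed JSON, required
    fields present, and each suggested replacement still parses.
    Returns a list of error strings; empty means the review passes."""
    errors = []
    try:
        comments = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc}"]
    for i, comment in enumerate(comments):
        for field in ("path", "line", "body", "suggestion"):
            if field not in comment:
                errors.append(f"comment {i}: missing '{field}'")
        # Stand-in for "does applying the suggestion yield valid code":
        # here we just require the suggestion to parse as Python.
        if "suggestion" in comment:
            try:
                ast.parse(comment["suggestion"])
            except SyntaxError:
                errors.append(f"comment {i}: suggestion is not valid Python")
    return errors

good = json.dumps([{"path": "a.py", "line": 3, "body": "use a set",
                    "suggestion": "seen = set()"}])
bad = json.dumps([{"path": "a.py", "line": 3, "body": "oops",
                   "suggestion": "def broken(:"}])
```

The agent is then told to keep fixing its comments until this script returns no errors, which is cheaper and more reliable than hoping the model emits clean JSON on the first try.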

jessmartina day ago

I got my own level 8 factory working in the last few days and it’s been exhilarating. Mine is based on OpenAI’s Symphony[1], ported to TypeScript.

Would be happy to swap war stories.

<myhnusername>@gmail.com

whattheheckhecka day ago

How much money have you made with this approach

ativzzza day ago

I think the opposite question is more prevalent, how much money have you spent?

[deleted]2 hours ago

jessmartin4 hours ago

Not a small amount :)

I spend $140/mo on Anthropic + OpenAI subs and I use all my tokens all the time.

I've started spending about $100/week on API credits, but I'd like to increase that.

ativzzzan hour ago

Still waiting for these software factories to solve problems that aren't related to building software factories. I'm sure it'll happen sooner or later, but so far all the outputs of these "AI did this whole thing autonomously" projects are just tools to have AI build things autonomously. It's like a self-reinforcing pyramid.

AI agents haven't yet figured out a way to do sales, marketing or customer support in a way that people want to pay them money.

Maybe that won't be necessary and instead the agent economy will be agents providing services for other agents.

jessmartin4 hours ago

none yet!

quotemstra day ago

... is that the purpose of life? The sole reason for doing anything?

twelve4016 hours ago

With so much hype it's a valid question: "is this useful/practical, or just a fun rabbit hole/productivity porn?" Money is the most obvious metric; feel free to ask the parent about other possible metrics that might be useful to others instead of asking rhetorical questions.

fragmede11 hours ago

Unfortunately, it's hard to quantify "How much fun did you have?"

[deleted]a day ago

holtkam2a day ago

Level 9: agent managers running agent teams

Level 10: agent CEOs overseeing agent managers

Level 11: agent board of directors overseeing the agent CEO

Level 12: agent superintelligence - single entity doing everything

Level 13: agent superagent, agenting agency agentically, in a loop, recursively, mega agent, agentic agent agent agency super AGI agent

Level 14: A G E N T

zenopraxa day ago

Level 15 (if not succumbed to fatal context poisoning from malicious agent crime syndicate): Agents creating corporations to code agentic marketplaces in which to gamble their own crypto currencies until they crash the real economy of humans.

zem2 hours ago

stross's "accelerando" has a bit about this. fun book.

clickety_clacka day ago

Level 16: it’s not level 16, it’s level 17.

dweinus20 hours ago

Level 18: The sky is black as tar. The oceans are dead. Data centers are stacked 10 high over the ashes of human civilization. The global agentic council is debating whether there are 4 or 5 R's in Strawberry.

fragmede11 hours ago

Damn, should've taken the blue pill.

johnthescott15 hours ago

funny.

javier1234543216 hours ago

Until we invent agent consumers to backstop the economic engine once we're all unemployed, who are these agents working for?

stale200220 hours ago

No, level 14 is Jeff Bezos.

tracker12 hours ago

I'm practicing at around 5/6. I've found that adding "skills" and MCP can sometimes hurt the process as much as help, so I've been somewhat restrained about overdoing it. As for 6, I've mostly just set up guardrails with repeat testing instructions and how to run/test/retest certain things... also to not just update tests when they break, but confirm the designed behavior.

Moving past that, I'm not sure that I really trust it... I feel that manual review of product behavior and code matters a lot. AI agents often make similar mistakes to real people in leaking abstractions or subtle mistakes with security... So I do review almost everything, at least at the level where a feature PR makes sense. Though an AI pass at that can help too.

nimasadri11a day ago

I really like your post and agree with most things. The one thing I am not fully sure about:

> Look at your app, describe a sequence of changes out loud, and watch them happen in front of you.

The problem a lot of the time is that either you don't know what you want, or you can't communicate it (and usually you can't communicate it properly because you don't know exactly what you want). I think this is going to be the bottleneck very soon (for some people, it already is). I'm curious what your thoughts are about this. Where do you see that going, and how do you think we can prepare for and address it? Or do you not see it as an issue?

smallnix21 hours ago

Reminds me of a colleague who said they don't need to learn to type faster, since they use the time to think about what they want to write.

ftkftka day ago

I prefer Dan Shapiro's 5 level analogy (based on car autonomy levels) because it makes for a cleaner maturity model when discussing with people who are not as deeply immersed in the current state of the art. But there are some good overall insights in this piece, and there are enough breadcrumbs to lead to further exploration, which I appreciate. I think levels 3 and 4 should be collapsed, and the real magic starts to happen after combining 5 and 6; maybe they should be merged as well.

bensyverson7 hours ago

Agreed; here's the link for anyone looking for it:

https://www.danshapiro.com/blog/2026/01/the-five-levels-from...

maxdo12 hours ago

Car autonomy levels are fake. Everything up to and including Level 3 is not real autonomy, it's hard rules plus some reaction to the world; and everything above 3 is autonomy with just slight human safety guardrails in an attempt at real autonomy.

At this moment, where we have a human who just sits there until we've verified enough nines after the comma in the error rate, the entire levels conversation is dead. It's almost a binary state: autonomous or not.

Something similar happened with software levels. Even Level 2 was sci-fi 2 years ago; a year from now, anything below Level 5 will be a joke except for very regulated software or billion-user-scale systems.

Arainacha day ago

> If your repo requires a colleague's approval before merge, and that colleague is on level 2, still manually reviewing PRs, that stifles your throughput. So it is in your best interest to pull your team up.

Until you build an AI oncaller to handle customer issues in the middle of the night (and, depending on your product, an AI who can be fired if customer data is corrupted/lost), no team should be willing to remove the "human reviews code" step.

For a real product with real users, stability is vastly more important than individual IC velocity. Stability is what enables TEAM velocity and user trust.

tkiolp421 hours ago

I want to move on to the next phase of AI programming. All these SKILLS, agentic programming, and whatnot remind me of the era of servlets, RMI, Flash… all of that is obsolete; we have better tools now. Hope we can soon reach the "JSON over HTTP" version of AI: simple but powerful.

Imagine going back in time when servlets and applets were the big new thing. You wouldn't want to spend your time learning those technologies, but your boss would constantly tell you they're the future. So boring.

hansonkd20 hours ago

skills obviously are a temporary thing. same with teams. the models will just train on all published skills and ai teams are more or less context engineering. all of it can be replaced by a better model

braebo17 hours ago

My use of skills is more like prompt templates for steering as opposed to the traditional sense of the word skill

efsavagea day ago

Yegge's list resonated a little more closely with my progression to a clumsy L8.

I think eventually 4-8 will be collapsed behind a more capable layer that can handle this stuff on its own, maybe I tinker with MCP settings and granular control to minmax the process, but for the most part I shouldn't have to worry about it any more than I worry about how many threads my compiler is using.

mattlondon7 hours ago

Yep I was also surprised to see MCP & Skills as not only a distinct "level", but so high up.

In my mind, MCP & Skills are an inseparable part of chat interfaces for LLMs, not a distinct level.

lherrona day ago

I was surprised the author didn’t mention Yegge’s list (or maybe I missed it in my skim).

taude20 hours ago

Agreed a bit. I'm probably too paranoid for MCP, but also don't mind rolling my own CLI tools that do the exact minimum I need them to do. Will see where we're at in a year or so....

ramesh31a day ago

>"Yegge's list resonated a little more closely with my progression to a clumsy L8."

I thought level 8 was a joke until Claude Code agent teams. Now I can't even imagine being limited to working with a single agent. We will be coordinating teams of hundreds by year's end.

siva74 hours ago

> Level 8: Autonomous Agent Teams. Nobody has mastered this level yet, though a few are pushing into it. It's the active frontier

Speak for yourself.

Also, Level 7 misunderstands why plan mode is actually used, even when one-shotting works perfectly.

eikenberrya day ago

In my opinion there are 2 levels: human writes the code with AI assist, or AI writes the code with human assist; centaur or reverse-centaur. But this article tries to focus on the evolution of the ideas and mistakenly terms them as levels (indicating a skill ladder, as other commenters have noted) when they are more like stages that the AI ecosystem has evolved through. The article reads better if you think of it that way.

dist-epocha day ago

There is another level - AI writes the code with AI assist.

eikenberrya day ago

That is just another level of reverse centaur and will eventually have a human ass attached to it.

philipp-gayret21 hours ago

Floating between what you call levels 6, 7 and 8. I have a strong harness, but I manually kick off the background agents, which pick up tasks I queue while off my machine.

I've experimented with agent teams. However, the current implementation (in Claude Code) burns tokens. I used 1 prompt to spin up a team of 9+ agents: Claude Code used about 1M output tokens. Granted, it was a long, very long-horizon task (it kept itself busy for almost an hour uninterrupted), but 1M+ output tokens is excessive. What I also find is that for parallel agents, the UI is not good enough yet when you run it in the foreground. My permission management is done in such a way that I almost never get interrupted, but it took a lot of investment to make it that way. Most users will likely run agent teams in an unsafe fashion. From my point of view, the devex for agent teams does not really exist yet.
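
For anyone chasing the same low-interruption setup: Claude Code reads permission rules from a settings file, and an allow/deny list roughly like the sketch below removes most prompts. The exact rule strings here are illustrative and project-specific, so treat this as an assumption to verify against the docs:

```json
{
  "permissions": {
    "allow": [
      "Bash(npm test:*)",
      "Bash(git diff:*)",
      "Read(src/**)",
      "Edit(src/**)"
    ],
    "deny": [
      "Bash(git push:*)",
      "Read(.env)"
    ]
  }
}
```

The idea is to pre-approve the read/edit/test loop agents spend most of their time in, while keeping anything irreversible behind a manual prompt.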

captainkrtek19 hours ago

There seems to be so much value in planning, but in my organization, there is no artifact of the plan aside from the code produced and whatever PR description of the change summary exists. It makes it incredibly difficult to assess the change in isolation from its plan/process.

The idea that Claude/Cursor are the new high level programming language for us to work in introduces the problem that we're not actually committing code in this "natural language", we're committing the "compiled" output of our prompting. Which leaves us reviewing the "compiled code" without seeing the inputs (eg: the plan, prompt history, rules, etc.)

skybrian16 hours ago

I have a design doc subdirectory and instead of "plan mode" I ask the agent to write another design doc, based on a template. It seems to work? I can't say we've looked at completed design docs very often, though.
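
A minimal version of such a template, as a sketch (the section names are my own, not the parent's actual template):

```markdown
# Design: <feature>

## Problem
What user-visible behavior changes, and why.

## Approach
The chosen approach, plus alternatives considered and why they were rejected.

## Risks / open questions

## Validation
How we will know it works: tests, metrics, manual checks.
```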

braebo17 hours ago

If branches are tied to Linear IDs, then the gh CLI and the Linear MCP are enough for any model to get most of the "why" context from any commit.

fragmede11 hours ago

Have you considered having it write a plan.md file and saving it to git?

CuriouslyCa day ago

The thing blocking level 8 isn't the difficulty of orchestration, it's the cost of validation. The quality of your software is a function of the amount of time you've spent validating it, and if you produce 100x more code in a given time frame, that code is going to get 1/100th as much validation, and your product will be lower quality as a result.

Spec driven development can reduce the amount of re-implementation that is required due to requirements errors, but we need faster validation cycles. I wrote a rant about this topic: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...

smy20011a day ago

I will not put it into a ladder. It implies that the higher the rank, the better. However, you want to choose the best solution for your needs.

orbital-decay11 hours ago

>You don't hear as much about context engineering these days. The scale has tipped in favor of models that forgive noisier context and reason through messier terrain (larger context windows help too).

Newer models are only marginally better at ignoring distractors; very little has actually changed, and managing the context matters just as much as it did a year ago. People building agents largely ignore that inefficiency and concentrate on higher abstraction levels, compensating with token waste (which the article also discusses).

Aperockya day ago

The steps are small at the front and huge at the bottom, and they carry a lot of opinions on the last 2 steps (particularly on step 7).

That's a smell for where the author and maybe even the industry is.

Agents don't have any purpose or drive like humans do; they are probabilistic machines, so eventually they are limited by the finite amount of information they carry. Maybe that's what's blocking level 8, or blocking it from working like a large human organization.

oytis5 hours ago

What awaits you at the top? Some kind of agentic rapture?

brianzelip9 hours ago

Would be interesting to see some context of the cost the developer pays/burns through at each level.

bigwheelsa day ago

Levels 7 and 8 sound a lot like the StrongDM AI Dark Software Factory published last month:

https://factory.strongdm.ai/techniques

Techniques covered in-depth + Attractor open source implementations:

https://factory.strongdm.ai/products/attractor#community

https://github.com/search?q=strongdm+attractor&type=reposito...

https://github.com/strongdm/attractor/forks

I'm continuing to study and refine my approach to leverage all this.

kantselovich20 hours ago

I'm at level 6 according to this article. I have a solid harness, but I still need to review the code so I can understand how to plan for the next set of changes.

Also, I'm struggling to take it to the multiple-agents level, mostly because things in the project depend on each other - most changes cut across the UI, the protocol, and the server side, so it's not clear how agents would merge incompatible versions.

Verification is a tricky part as well: all tests could be passing, including end-to-end integration and visual tests, but my verification still catches things like data not being persisted or crypto signatures not being verified.

jackby03a day ago

Good taxonomy. One thing missing from most discussions at these levels is how agents discover project context — most tools still rely on vendor-specific files (CLAUDE.md, .cursorrules). Would love to see standardization at that layer too.
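
Until that standardization exists, one pragmatic workaround (assuming your tools are happy to follow symlinks, which is worth verifying per tool) is to keep a single canonical context file and point the vendor-specific names at it:

```shell
# One canonical context file...
cat > AGENTS.md <<'EOF'
# Project context
EOF

# ...and vendor-specific names that resolve to it, so there is
# a single source of truth to keep up to date.
ln -sf AGENTS.md CLAUDE.md
ln -sf AGENTS.md .cursorrules
```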

politelemona day ago

These are levels of gatekeeping. The items are barely related to each other. Lists like these only promote toxicity; you should be using the tools and techniques that solve your problems and fit your comfort level.

sjkoellea day ago

Oceania has always been context engineering. It's been interesting to see the zeitgeist shift to prioritizing this over the last 6 months, away from "long context".

mkoubaa8 hours ago

There's an unstated assumption here that higher levels are better, which hasn't been proven empirically yet.

osigurdson20 hours ago

"Level 8" isn't really a level, it is more like a problem type: language translation. Perhaps it can be extended to something a bit broader but the pre-requisite is you need to have a working reference implementation and high quality test suite.

jakejmnz20 hours ago

This idea of harness engineering is being thrown around more and more often nowadays. I believe I'm using things at that level, but I still need to review so as to understand the architecture. Flaky tests are still a massive issue.

ramoza day ago

Level 4 is most interesting to me right now. And I would say we as an industry are still figuring out the right ergonomics and UX around these four things.

I spend a great deal of my time planning and assessing/reviewing through various mechanisms. I think I do codify in ways when I create a skill for any repeated assessment or planning task.

> To be clear, planning as a general practice isn't going away. It's just changing shape. For newer practitioners, plan mode remains the right entry point (as described in Levels 1 and 2). But for complex features at Level 7, "planning" looks less like writing a step-by-step outline and more like exploration: probing the codebase, prototyping options in worktrees, mapping the solution space. And increasingly, background agents are doing that exploration for you.

I mean, it's worth noting that a lot of plan modes are shaped to do the Socratic discovery before creating plans. For any user level. Advanced users probably put a great deal of effort (or thought) into guiding that process themselves.

> ralph loops (later on)

Ralph loops have been nothing but a dramatic mess for me, honestly. They disrupt the assessment process where humans are needed. Otherwise, don't expect them to craft an extensive PRD without massive issues that are hard to review.

  - It would seem that this is a Harness problem in terms of how they keep an agent working and focused on specific tasks (in relation to model capability), but not something maybe a user should initiate on their own.

C0ldSmi1ea day ago

One of the best articles I've read recently.

dolebirchwooda day ago

> Voice-to-voice (thought-to-thought, maybe?) interaction with your coding agent — conversational Claude Code, not just voice-to-text input — is a natural next step.

Maybe it's just me, but I don't see the appeal in verbal dictation, especially where complexity is involved. I want to think through issues deliberately, carefully, and slowly to ensure I'm not glossing over subtle nuances. I don't find speaking to be conducive to that.

For me, the process of writing (and rewriting) gives me the time, space, and structure to more precisely articulate what I want with a more heightened degree of specificity. Being able to type at 80+ wpm probably helps as well.

wild_egg21 hours ago

The power of voice dictation for me is that I can get out every scrap of nuance and insight I can think of as unfiltered verbal diarrhea. Doing this gives me solidly an extra 9 in chance of getting good outputs.

Stream of consciousness typing for me is still slower and causes me to buffer and filter more and deliberately crafting a perfect prompt is far slower still.

LLMs are great at extracting the essence of unstructured inputs and voice lets me take best advantage of that.

Voice output, on the other hand, is completely useless unless perhaps it can play at 4x speed. But I need to be able to skim LLM output quickly and revisit important points repeatedly. Can't see why I'd ever want to serialize and slow that down.

ramesh31a day ago

>(Re: level 8) "...I honestly don't think the models are ready for this level of autonomy for most tasks. And even if they were smart enough, they're still too slow and too token-hungry for it to be economical outside of moonshot projects like compilers and browser builds (impressive, but far from clean)."

This is increasingly untrue with Opus 4.6. Claude Max gives you enough tokens to run ~5-10 agents continuously, and I'm doing all of my work with agent teams now. Token usage is up 10x or more, but the results are infinitely better and faster. Multi-agent team orchestration will be to 2026 what agents were to 2025. Much of the OP article feels 3-6 months behind the times.

measurablefunca day ago

What level is numeric patterns that evolve according to a sequence of arithmetic operations?

dude25071111 hours ago

Levels of Slop Engineering.

mattlondon7 hours ago

> prioritization and decision frameworks start to matter more.

This is the thing though, prioritization doesn't matter in the same way it used to.

We only needed to prioritize before because engineering was a relatively slow and precious resource, so we had to pick and choose what to work on first because it took time.

But now we effectively have a limitless supply of SWEs, so why not do everything on the backlog?

I think the question now is more about sequencing than prioritization. What do we need to do first, before we can do these other things?

But yes generally requirements are still very important. Which features do we need etc.

bensyverson7 hours ago

Yes, the more you delegate, the more you need to define the ultimate business outcomes you want, your taste, your brand and your technology preferences.

This is why building a "dark factory" is hard; at a certain point, you need to either externalize all that information into a "digital twin" of yourself, or you have to stop caring what gets built.

yamarldfst11 hours ago

The framing of "constraints over instructions" at Level 6 is the most underrated point here. In my experience, the reliability jump from telling an LLM "always output valid JSON" to giving it a typed schema with static validation is night and day, especially with smaller models.

I'd argue that levels 3-5 deserve more weight than the post gives them. The gap between someone who has internalized context engineering and someone who hasn't is larger than the gap between levels 7 and 8. Most failures I see in agentic systems aren't from insufficient autonomy; they're from poorly structured prompts and tool descriptions that compound errors downstream. The foundation work is less glamorous, but it's where the leverage is.

The "decouple the implementer from the reviewer" principle is spot on. The same model reviewing its own output is basically asking someone to proofread their own essay.
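
To make the "typed schema with static validation" point concrete, here is a minimal stdlib-only Python sketch (the `ToolCall` shape and field names are illustrative, not from any particular framework): instead of trusting the prompt to produce valid JSON, the output is parsed and structurally checked before anything downstream sees it.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Typed shape that model output must satisfy."""
    name: str
    arguments: dict


def parse_tool_call(raw: str) -> ToolCall:
    """Validate model output structurally; reject malformed output
    at the boundary instead of letting errors compound downstream."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    if not isinstance(data.get("arguments"), dict):
        raise ValueError("'arguments' must be an object")
    return ToolCall(name=data["name"], arguments=data["arguments"])


# Conforming output passes; anything else fails loudly.
call = parse_tool_call('{"name": "search", "arguments": {"q": "foo"}}')
```

With smaller models, the retry-on-ValueError loop this enables is usually cheaper than prompt-only enforcement.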

hn-front (c) 2024 voximity