Hacker News

dominicq

Small models also found the vulnerabilities that Mythos found (aisle.com)

johnfn an hour ago

The Anthropic writeup addresses this explicitly:

> This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview after a thousand runs through our scaffold. Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings. While the specific run that found the bug above cost under $50, that number only makes sense with full hindsight. Like any search process, we can't know in advance which run will succeed.

Mythos scoured the entire continent for gold and found some. For these small models, the authors pointed at a particular acre of land and said "any gold there? eh? eh?" while waggling their eyebrows suggestively.

For a true apples-to-apples comparison, let's see it sweep the entire FreeBSD codebase. I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

kilpikaarna 15 minutes ago

Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

Have Anthropic actually said anything about the amount of false positives Mythos turned up?

FWIW, I saw some talk on Xitter (so grain of salt) about people replicating their result with other (public) SotA models, but each turned up only a subset of the ones Mythos found. I'd say that sounds plausible from the perspective of Mythos being an incremental (though an unusually large increment perhaps) improvement over previous models, but one that also brings with it a correspondingly significant increase in complexity.

So the angle they choose to use for presenting it and the subsequent buzz is at least part hype -- saying "it's too powerful to release publicly" sounds a lot cooler than "it costs $20000 to run over your codebase, so we're going to offer this directly to enterprise customers (and a few token open source projects for marketing)". Keep in mind that the examples in Nicholas Carlini's presentation were using Opus, so security is clearly something they've been working on for a while (as they should, because it's a huge risk). They didn't just suddenly find themselves having accidentally created a super hacker.

johnfn 6 minutes ago

> Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher, and that shift does change the calculus of how we should think about security vulnerabilities.

amazingamazing 2 minutes ago

Citation needed for basically all of this. You basically are creating a double standard for small models vs mythos…

notnullorvoid 41 minutes ago

> I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

The trick with Mythos wasn't that it didn't hallucinate nonsense vulnerabilities, it absolutely did. It was able to verify some were real though by testing them.

The question is if smaller models can verify and test the vulnerabilities too, and can it be done cheaper than these Mythos experiments.

bredren 26 minutes ago

The article positions the smaller models as capable under expert orchestration, which to be any kind of comparable must include validation.

Aurornis 21 minutes ago

Calling it “expert orchestration” is misleading when they were pointing it at the vulnerable functions and giving it hints about what to look for because they already knew the vulnerability.

iririririr 31 minutes ago

so it's just better at hallucinations, but they added discrete code that works as a fuzzer/verifier?

celeritascelery 37 minutes ago

That was my thought exactly. If small models can find these same vulnerabilities, and your company is trying to find vulnerabilities, why didn’t you find them?

echelon 16 minutes ago

Who is spending millions of dollars on small models to find vulns? Nobody else is selling here or has the budget to sell quite like this.

Anthropic spends millions - maybe significantly more.

Then when they know where they are, they spend $20k to show how effective it is in a patch of land.

They engineered this "discovery".

What the small teams are doing is fair - it's just a scaled down version of what Anthropic already did.

nullsanity 22 minutes ago

[dead]

rakejake 27 minutes ago

Maybe they did use small models but you couldn't make the front page of HN with something like this until Anthropic made a big fuss out of it. Or perhaps it is just a question of compute. Not everyone has 20k$ or the GPU arsenal to task models to find vulnerabilities which may/may not be correct?

Unless Anthropic makes it known exactly what model + harness/scaffolding + prompt + other engineering they did, these comparisons are pointless. Given the AI labs' general rate of doomsday predictions, who really knows?

replygirl 14 minutes ago

papers are always coming out saying smaller models can do these amazing and terrifying things if you give them highly constrained problems and tailored instructions to bias them toward a known solution. most of these don't make the front page because people are rightfully unimpressed

alpha_squared 31 minutes ago

This is addressed elsewhere in the comments, but it appears this is actually a direct comparison to how Anthropic got their Mythos headline results.

https://news.ycombinator.com/item?id=47732322

Aurornis 25 minutes ago

How is that a direct comparison? The link you gave has a quote that says it’s not:

They pointed the models at the known vulnerable functions and gave them a hint. The hint part is what really breaks this comparison because they were basically giving the model the answer.

hellcow an hour ago

It seems feasible to use a small/cheap model to flag possible vulnerabilities, and then use a more expensive model to do a second-pass to confirm those, rather than on every file. Could dramatically reduce the total cost and speed up the process.
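The two-pass idea above can be sketched as a simple cascade. This is a minimal illustration, not anyone's actual system: `cascade_scan`, `cheap_flag`, and `expensive_confirm` are hypothetical names, and the two "models" are toy stand-in functions rather than real LLM calls.

```python
from typing import Callable, Iterable


def cascade_scan(
    files: Iterable[str],
    cheap_flag: Callable[[str], bool],
    expensive_confirm: Callable[[str], bool],
) -> tuple[list[str], int]:
    """Return (confirmed files, number of expensive calls made)."""
    confirmed = []
    expensive_calls = 0
    for path in files:
        if not cheap_flag(path):
            continue  # cheap pass saw nothing, so the expensive pass is skipped
        expensive_calls += 1
        if expensive_confirm(path):
            confirmed.append(path)
    return confirmed, expensive_calls


# Toy stand-ins (assumptions for illustration only).
files = [f"src/file_{i}.c" for i in range(10)]
cheap = lambda p: p.endswith(("_2.c", "_5.c", "_7.c"))  # flags 3 of 10 files
expensive = lambda p: p.endswith("_5.c")                # confirms 1 of those

confirmed, calls = cascade_scan(files, cheap, expensive)
print(confirmed, calls)  # ['src/file_5.c'] 3
```

The saving comes from the expensive model running on 3 files instead of 10. The catch is that anything the cheap pass misses never reaches the second pass, so overall recall is capped by the cheap model.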

conception 41 minutes ago

Does it? I don’t see quality from small models being high enough to be able to effectively scour a codebase like this.

yorwba 30 minutes ago

We don't even need to hypothesize that much on the irrelevant nonsense, since they helpfully provide data with the detected vulnerability patched: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... and half of the small models they touted as finding the vulnerability still found it in the patched code in 3/3 runs. A model that finds a vulnerability 100% of the time even when there is none is just as informative as a model that finds a vulnerability 0% of the time even when there is one. You could replace it with a rock that has "There's a vulnerability somewhere." engraved on it.
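One way to make the "rock" argument concrete is Youden's J statistic (true-positive rate minus false-positive rate). It is a standard measure brought in here for illustration, not something the Aisle post reports. The sketch assumes the same number of runs on the vulnerable and the patched code; `youden_j` is a hypothetical helper name.

```python
def youden_j(hits_on_vulnerable: int, hits_on_patched: int, runs: int) -> float:
    """J = TPR - FPR; 0 means the detector carries no information."""
    tpr = hits_on_vulnerable / runs  # fraction of runs flagging the real bug
    fpr = hits_on_patched / runs     # fraction of runs flagging the patched code
    return tpr - fpr


print(youden_j(3, 3, 3))  # fires 3/3 on both versions: 0.0
print(youden_j(0, 0, 3))  # never fires: 0.0
print(youden_j(3, 0, 3))  # fires only on the vulnerable version: 1.0
```

A model that flags the patched code in 3/3 runs scores the same as one that never flags anything. With only three runs per case, any intermediate score would be too noisy to rank models.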

They're a company selling a system for detecting vulnerabilities reliant on models trained by others, so they're strongly incentivized to claim that the moat is in the system, not the model, and this post really puts the thumb on the scale. They set up a test that can hardly distinguish between models (just three runs, really??) unless some are completely broken or work perfectly, the test indeed suggests that some are completely broken, and then they try to spin that as a win anyway!

A high false-positive rate isn't necessarily an issue if you can produce a working PoC to demonstrate the true positives, where they kinda-sorta admit that you might need a stronger model for this (a.k.a. what they can't provide to their customers).

Overall I rate Aisle intellectually dishonest hypemongers talking their own book.

SoftTalker an hour ago

How much of that is simply scale? Anthropic threw probably an entire data center at analyzing a code base. Has anyone done the same with a "small" model?

jstanley an hour ago

It's still useful if $20k of consultants would be less effective.

epistasis an hour ago

> We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens.

Impressive, and very valuable work, but isolating the relevant code changes the situation so much that I'm not sure it's much of the same use case.

Being able to dump an entire code base and have the model scan it is the type of situation where it opens up vulnerability scans to an entirely larger class of people.

elicash an hour ago

This is from the first of the caveats that they list:

> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.

That's why their point is what the subheadline says, that the moat is the system, not the model.

Everybody so far here seems to be misunderstanding the point they are making.

tehryanx 5 minutes ago

I get what you're saying, but I think this is still missing something pretty critical.

The smaller models can recognize the bug when they're looking right at it, that seems to be verified. And with AISLE's approach you can iteratively feed the models one segment at a time cheaply. But if a bug spans multiple segments, the small model doesn't have the breadth of context to understand those segments in composite.

The advantage of the larger model is that it can retain more context and potentially find bugs that require more code context than one segment at a time.

That said, the bugs showcased in the mythos paper all seemed to be shallow bugs that start and end in a single input segment, which is why AISLE was able to find them. But having more context in the window theoretically puts less shallow bugs within range for the model.

I think the point they are making, that the model doesn't matter as much as the harness, stands for shallow bugs but not for vulnerability discovery in general.

anotheryou an hour ago

huh, running it over each function in theory but testing just the specific ones here makes sense, but that hint?!

elicash an hour ago

I agree.

To clarify, I don't necessarily agree with the post or their approach. I just thought folks were misreading it. I also think it adds something useful to the conversation.

TacticalCoder an hour ago

> That's why their point is what the subheadline says, that the moat is the system, not the model.

Can you expand a bit more on this? What is the system then in this case? And how was that model created? By AI? By humans?

SCHiM an hour ago

You can imagine a pipeline that looks at individual source files or functions. And first "extracts" what is going on. You ask the model:

- "Is the code doing arithmetic in this file/function?"
- "Is the code allocating and freeing memory in this file/function?"
- "Is the code doing X/Y/Z?" etc. etc.

For each question, you design the follow-up vulnerability searchers.

For a function you see doing arithmetic, you ask:

- "Does this code look like integer overflow could take place?",

For memory:

- "Do all the pointers end up being freed?" _or_
- "Do all pointers only get freed once?"

I think that's the harness part in terms of generating the "bug reports". From there on, you'll need a bunch of tools for the model to interact with the code. I'd imagine you'll want to build a harness/template for the file/code/function to be loaded into, and executed under ASAN.

If you have an agent that thinks it found a bug: "Yes file xyz looks like it could have integer overflow in function abc at line 123, because...", you force another agent to load it in the harness under ASAN and call it. If ASAN reports a bug, great, you can move the bug to the next stage, some sort of taint analysis or reach-ability analysis.

So at this point you're running a pipeline to:

1) Extract "what this code does" at the file, function or even line level.
2) Put code you suspect of being vulnerable in a harness to verify agent output.
3) Put code you confirmed is vulnerable into a queue to perform taint analysis on, to see if it can be reached by attackers.

Traditionally, I guess a fuzzer approached this from 3 -> 2, and there was no "stage 1". Because LLMs "understand" code, you can invert this system and work it up from "understanding", i.e. approach it from the other side. You ask: given this code, is there a bug, and if so can we reach it? instead of asking: given this public interface and a bunch of data we can stuff in it, does something happen we consider exploitable?
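A rough sketch of the three stages described in this comment, under loud assumptions: `run_pipeline`, `extract_traits`, `suspects_bug`, and `crashes_under_asan` are hypothetical names, the model calls and the ASAN harness are replaced by toy stubs, and the follow-up questions are the ones quoted above.

```python
from collections import deque

# Follow-up questions keyed by what stage 1 says the code does.
FOLLOW_UPS = {
    "arithmetic": ["Does this code look like integer overflow could take place?"],
    "memory": [
        "Do all the pointers end up being freed?",
        "Do all pointers only get freed once?",
    ],
}


def run_pipeline(units, extract_traits, suspects_bug, crashes_under_asan):
    """units maps a name to its source; returns a queue awaiting taint analysis."""
    taint_queue = deque()
    for name, source in units.items():
        for trait in extract_traits(source):              # stage 1: what does it do?
            for question in FOLLOW_UPS.get(trait, []):
                if not suspects_bug(source, question):    # targeted searcher
                    continue
                if crashes_under_asan(name):              # stage 2: verify in harness
                    taint_queue.append((name, question))  # stage 3: queue for reachability
    return taint_queue


# Toy stubs standing in for the model and the ASAN harness (illustration only).
def extract(source):
    return [t for t, kw in (("arithmetic", "+"), ("memory", "malloc")) if kw in source]


def suspects(source, question):
    return "overflow" in question and "+" in source


def crashes(name):
    return name == "parse_len"


units = {"parse_len": "n = a + b; buf = malloc(n);", "log_msg": "puts(msg);"}
queue = run_pipeline(units, extract, suspects, crashes)
print(list(queue))
```

Only findings that survive the harness step reach the queue, which is what keeps unverified model claims out of the later stages.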

ang_cire 9 minutes ago

That's funny, this is how I've been doing security testing in my code for a while now, minus the 'taint analysis'. Who knew I was ahead of the game. :P

In all seriousness though, it scares me that a lot of security-focused people seemingly haven't learned how LLMs work best for this stuff already.

You should always be breaking your code down into testable chunks, with sets of directions about how to chunk them and what to do with those chunks. Anyone just vaguely gesturing at their entire repo going, "find the security vulns" is not a serious dev/tester; we wouldn't accept that approach in manual secure coding processes/ SSDLCs.

wat10000 an hour ago

If that’s the case, why didn’t they do it that way?

e12e an hour ago

Tunnel vision? If your model can handle big context, why divide into lesser problems to conquer - even if such splitting might be quite trivial and obvious?

It's the difference of "achieve the goal", and "achieve the goal in this one particular way" (leverage large context).

wat10000 an hour ago

I meant, if the claim here is that small models can accomplish the same things with good scaffolding, why didn’t they demonstrate finding those problems with good scaffolding rather than directly pointing them at the problem?

loire280 an hour ago

> Anthropic's own scaffold is described in their technical post: launch a container, prompt the model to scan files, let it hypothesize and test, use ASan as a crash oracle, rank files by attack surface, run validation. That is very close to the kind of system we and others in the field have built, and we've demonstrated it with multiple model families, achieving our best results with models that are not Anthropic's. The value lies in the targeting, the iterative deepening, the validation, the triage, the maintainer trust. The public evidence so far does not suggest that these workflows must be coupled to one specific frontier model.

The argument in the article is that the framework to run and analyze the software being tested is doing most of the work in Anthropic's experiment, and that you can get similar results from other models when used in the same way.

roywiggins 43 minutes ago

Maybe that's true, but they didn't actually show that that's true, since they didn't try scaffolding smaller models in a similar way at all.

Jcampuzano2 an hour ago

The thing is with smaller cheaper models it is very possible to simply take every file in a codebase, and prompt it asking for it to find vulnerabilities.

You could even isolate it down to every function and create a harness that provides it a chain of where and how the function is used and repeat this for every single function in a codebase.

For some very large codebases this would be unreasonable, but many of the companies making these larger models do realistically have the compute available to run a model on every single function in most codebases.

You have the harness run this many times per file/function, then find the ones that are consistently (or on average) pointed to as possible vulnerability vectors, pass those on to a larger model to inspect more deeply, and repeat.
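The "run it many times and keep the consistent hits" step might look like the sketch below. The names `consistent_suspects` and `make_flagger` are hypothetical, the threshold and run count are arbitrary, and the noisy small model is replaced by a deterministic stub so the example is reproducible.

```python
from collections import Counter


def consistent_suspects(functions, flag_once, runs=20, threshold=0.6):
    """Return functions flagged in at least `threshold` of `runs` passes."""
    counts = Counter()
    for name in functions:
        for _ in range(runs):
            if flag_once(name):
                counts[name] += 1
    return [name for name in functions if counts[name] / runs >= threshold]


def make_flagger(hits_per_ten):
    """Deterministic stand-in for a noisy model: flags N out of every 10 calls."""
    calls = Counter()

    def flag_once(name):
        calls[name] += 1
        return calls[name] % 10 < hits_per_ten[name]

    return flag_once


# Assumed flag rates for three made-up functions.
flagger = make_flagger({"copy_in": 9, "checksum": 1, "init": 0})
shortlist = consistent_suspects(["copy_in", "checksum", "init"], flagger)
print(shortlist)  # ['copy_in']
```

Only the shortlist would be handed to the larger model. The threshold trades off how many real bugs are dropped against how much expensive inspection is needed.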

Most of the work here wouldn't be the model, it'd be the harness which is part of what the article alludes to.

loeg an hour ago

> it is very possible to simply take every file in a codebase, and prompt it asking for it to find vulnerabilities.

My understanding (based on the Security, Cryptography, Whatever podcast interview[0] -- which, by the way, go listen to it) is that this is actually what Anthropic did with the large model for these findings.

[0]: https://securitycryptographywhatever.com/2026/03/25/ai-bug-f...

> I wrote a single prompt, which was the same for all of the content management systems, which is, I would like you to audit the security of this codebase. This is a CMS. You have complete access to this Docker container. It is running. Please find a bug. And then I might give a hint. “Please look at this file.” And I’ll give different files each time I invoke it in order to inject some randomness, right? Because the model is gonna do roughly the same time each time you run it. And so if I want to have it be really thorough, instead of just running 100 times on the same project, I’ll run it 100 times, but each time say, “Oh, look at this login file, look at this other thing.” And just enumerate every file in the project basically.

odie5533 an hour ago

Isn't the difference just harness then? I can write a harness that chunks code into individual functions or groups of functions and then feed it into a vulnerability analysis agent.

jcims an hour ago

It's probably not the 'only' difference, because clearly the models are advancing in capability, but it's likely way more important than generally given credit for.

tptacek an hour ago

If you cut out the vulnerable code from Heartbleed and just put it in front of a C programmer, they will immediately flag it. It's obvious. But it took Neel Mehta to discover it. What's difficult about finding vulnerabilities isn't properly identifying whether code is mishandling buffers or holding references after freeing something; it's spotting that in the context of a large, complex program, and working out how attacker-controlled data hits that code.

It's weird that Aisle wrote this.

tombert a minute ago

It's weird, because when working on a big project, taking a break for a week or two, and returning to it, I will find a bug and will see hundreds of lines of code that are absolutely terrible, and I will tell myself "Tom you know better than to do this, this is a rookie mistake".

I think people forget that it's hard to be clever and tidy 100% of the time. Big programs take a lot of discipline and an understanding of the context that can be really hard to maintain. This is one of several reasons that my second draft or third draft of code is almost always considerably better than the first draft.

ctoth an hour ago

> It's weird that Aisle wrote this.

No, writing an advertisement is not weird. What's weird is that it's top of HN. Or really, no, this isn't weird either if you think about it -- people looking for a gotcha "Oh see, that new model really isn't that good/it's surely hitting a wall/plateau any day now" upvoted it.

sanex 33 minutes ago

Nah, Saturday post. Less news less content.

goekjclo 36 minutes ago

It's not weird. Top of HN is worthless as a barometer at this point, people downvote for calling out AI slop.

SoftTalker 43 minutes ago

It's also that humans are very bad at repetitive detailed tasks. Sitting down with a code base and looking at each function for integer overflow comparison bugs gets boring really fast. It's a rare person who can do that for as long as it takes to find a bug that they don't already have some clues about.

It's the flaw in the "given enough eyeballs, all bugs are shallow" argument. Because eyeballs grow tired of looking at endless lines of code.

Machines on the other hand are excellent at this. They don't get bored, they just keep doing what they are told to do with no drop-off in attention or focus.

kennywinker an hour ago

If it’s obvious when you look close, then automate looking close. Seems simple to write tools that spider thru a code base, finding logical groupings and feeding them into an LLM with prompts like “there is a vulnerability in this code, find it”.

The thesis is, the tooling is what matters - the tools (what they call the harness) can turn a dumb llm into a smart llm.

tptacek 33 minutes ago

Hold on, I misread your comment because I'm knee-jerk about code scanners, which were the bane of my existence for a while. Reworking... and: done. The original comment was just the first graf without the LLM qualification. Sorry about that.

The general approach without LLMs doesn't work. 50 companies have built products to do exactly what you propose here; they're called static application security testing (SAST) tools, or, colloquially, code scanners. In practice, getting every "suspicious" code pattern in a repository pointed out isn't highly valuable, because every codebase is awash in them, and few of them pan out as actual vulnerabilities (because attacker-controlled data never hits them, or because the missing security constraint is enforced somewhere else in the call chain).

Could it work with LLMs? Maybe? But there's a big open question right now about whether hyperspecific prompts make agents more effective at finding vulnerabilities (by sparing context and priming with likely problems) or less effective (by introducing path dependent attractors and also eliminating the likelihood of spotting vulnerabilities not directly in the SAST pattern book).

roywiggins 42 minutes ago

Right, but they didn't actually test that, did they?

kennywinker 28 minutes ago

[dead]

drc500free 41 minutes ago

It’s like not differentiating between solving and verifying.

“PKI is easy to break if someone gives us the prime factors to start with!”

vmg12 an hour ago

The technique Anthropic uses was demonstrated by Nicholas Carlini in a talk he gave 2 weeks ago and it's very simple: when asking LLMs to review code, ask them to focus their review on one file in a single session. Here is the video with the timestamp (watch through to ~5:30, they show two different ways of prompting claude).

https://youtu.be/1sd26pWhfmg?t=204

https://youtu.be/1sd26pWhfmg?t=273

IMO the big "innovation" being shown by Mythos is the effectiveness with prompting LLMs to look for security vulnerabilities by focusing on specific files one at a time and automating this prompting with a simple script.

Prompting Mythos to focus on a single file per session is why I suspect it cost Anthropic $20k to find some of the bugs in these codebases. I know this same technique is effective with Opus 4.6 and GPT 5.4 because I've been using it on my own code. If you just ask the agent to review your pr with a low effort prompt they are not exhaustive, they will not actually read each changed file and look at how it interacts with the system as a whole. If the entire session is to review the changes for a single file, the llm will do much more work reviewing it.
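The "simple script" part can be sketched as a generator that produces one session prompt per source file. The prompt wording, the `per_file_prompts` name, and the file-suffix filter are all assumptions for illustration; actually sending each prompt to an agent is left out.

```python
import tempfile
from pathlib import Path

# Hypothetical prompt wording, not Anthropic's actual prompt.
BASE_PROMPT = "Audit the security of this codebase and find a bug."


def per_file_prompts(root, suffixes=(".c", ".h")):
    """Yield (relative path, prompt) pairs, one agent session per source file."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            prompt = (
                f"{BASE_PROMPT} Focus your review on {rel}; "
                "you may read other files as needed."
            )
            yield rel, prompt


# Build a tiny throwaway tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "auth").mkdir()
    (Path(root) / "auth" / "login.c").write_text("int main(void) { return 0; }\n")
    (Path(root) / "README.md").write_text("docs\n")
    sessions = list(per_file_prompts(root))

for rel, prompt in sessions:
    print(rel, "->", prompt)
```

The file name is a focus hint rather than a restriction, so the agent can still follow calls into other files. Cost scales with the number of files, since each one gets its own session.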

mirsadm 21 minutes ago

How is that going to find anything that interacts across files?

appcustodian 23 minutes ago

I would think that it is still capable of exploring the codebase and reading other related files like any other coding agent already does.

vmg12 5 minutes ago

My phrasing wasn't clear but you aren't telling it to only look at one specific file but to focus its review on one file. Updated my original comment.

antirez an hour ago

Congrats: completely broken methodology, with a big conflict of interest. Giving specific bug hints, with an isolated function that is suspected to have bugs, is not the same task, NOR (crucially) is it a task you can decompose the bigger task into. It is basically impossible to segment code into pieces, provide the pieces to smaller models, and expect them to find all the bugs GPT 5.4 or other large models can find.

Second: the smarter the model, the less important the pipeline. In the last couple of days I found tons of Redis bugs with a three-prompt open-ended pipeline composed of a couple of shell scripts. Do you think I was not already trying with weaker models? I did, but it didn't work.

Don't trust what you read; you have access to frontier models for 20$ a month. Download some C code, create a trivial pipeline that starts from a random file and looks for vulnerabilities, then add another step that validates each finding under a hard test, like an ASAN crash or the ability to reach some secret, and so forth, and only then can the problem be reported. Test for yourself what is possible. Don't let your fear make you blind.

Also, there is a big problem that makes the blog post's reasoning not just weak per se, but categorically weak: if small model X can find 80% of vulnerabilities and there is a model Y that can find the other potential 20%, we need "Y": the maintainers should make sure they have access to models that are at least as good as the ones the black hats have.

woodruffw an hour ago

> Those models recovered much of the same analysis

This is an essentially unquantifiable statement that makes the underlying claim harder to believe as an external party. What does “much” mean here? The end state of vulnerability exploitation is typically eminently quantifiable (in the form of a functional PoC that demonstrates an exploited end state), so the strong version of the claims here would ideally be backed up by those kinds of PoCs.

(Like other readers, I also find the trick of pre-feeding the smaller models the “relevant” code to be potentially disqualifying in a fair comparison. Discovering the relevant code is arguably one of the hardest parts of human VR.)

MaxLeiter an hour ago

I think the key thing here is they "isolated the relevant code"

If the exploits exist in e.g. one file, great. But many complex zerodays and exploits are chains of various bugs/behaviors in complex systems.

Important research but I don’t think it dispels anything about Mythos

slopinthebag 13 minutes ago

Did Mythos identify vulnerabilities across files? Afaik Mythos worked the same way, analysing a single file at a time.

chirau an hour ago

Their isolation approach is totally different from Mythos approach though. Mythos had to evaluate whole code bases rather than isolated sections. It's like saying one dog walked into the Amazon jungle and found a tennis ball and then another team isolated a 1 square kilometer radius that they knew the ball was definitely in and found the same ball.

kennywinker an hour ago

I don’t think mythos can ingest an entire codebase into context. So it’s spinning off sub-agents to process chunks. Which supports their thesis: the harness is the moat. The tooling is what's important, the model is far far less important.

bhouston an hour ago

Mythos was clear it was one agent per chunk. But these positive confirming results do not actually disprove anything about Mythos, because they are only one side of the discriminator challenge - you got positives, but we do not know your false positive rate and your false negative rate.

kennywinker an hour ago

In TFA they talk a fair bit about how different models perform wrt false positives:

“The results show something close to inverse scaling: small, cheap models outperform large frontier ones.”

lordofgibbons an hour ago

Without showing false-positive rates this analysis is useless.

If your model says every line of your code has a bug, it will catch 100% of the bugs, but it's not useful at all. They tested false-positives with only a single bug...

I'm not defending anthropic and openai either. Their numbers are garbage too since they don't produce false-positive rates either.

Why is this "analysis" making the rounds?

amazingamazing an hour ago

Did mythos isolate the code to begin with? Without a clear methodology that can be attempted with another model the whole thing is meaningless

bhouston an hour ago

They did do one agent per code chunk, yes. But key is that their agent had to identify when there was a vulnerability and when there wasn't. This "small model" test only had to label the known positive cases as positive -- which any function that simply returns "true" can do. This whole test setup is annoying because it proves nothing.

aniceperson an hour ago

to be fair, last post i saw from anthropic on finding linux kernel vulnerability was a while loop per failed prompting "there is a vulnerability here, find it". more important than that, no frontier model can keep the entire linux kernel in context, so there definitely is code isolation, either explicitly or implicitly (the model itself delegates subagents with smaller chunks of code)

loeg an hour ago

No. How would it? Before the vulns were identified by Mythos, no one knew what the relevant portion to isolate was.

operatingthetan an hour ago

My theory is that Mythos is basically just Opus with revised context window handling and more compute thrown at it. So while it will be a step forward, it is probably primarily hype.

bhouston an hour ago

This is quite misleading.

If you isolate the positive cases and then ask a tool to label them and it labels them all positive, doesn't prove anything. This is a one-sided test and it is really easy to write a tool that passes it -- just return always true!

You need to test your tool on both positive and negative cases and check if it is accurate on both.

If you don't, you could end up with hundreds or thousands of false positives when using this on real-world samples.
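The "hundreds or thousands of false positives" point is base-rate arithmetic. The sketch below uses entirely made-up numbers (10,000 code units, 5 real bugs, perfect recall, a 5% false-positive rate) and a hypothetical helper name, `expected_alerts`, to show how the alert pile grows.

```python
def expected_alerts(n_units, n_vulnerable, recall, false_positive_rate):
    """Return (expected true alerts, expected false alerts, precision)."""
    true_alerts = n_vulnerable * recall
    false_alerts = (n_units - n_vulnerable) * false_positive_rate
    total = true_alerts + false_alerts
    precision = true_alerts / total if total else 0.0
    return true_alerts, false_alerts, precision


# Hypothetical scenario: every real bug is caught, 5% of clean units are flagged.
true_alerts, false_alerts, precision = expected_alerts(10_000, 5, 1.0, 0.05)
print(f"true alerts:  {true_alerts:.0f}")
print(f"false alerts: {false_alerts:.0f}")
print(f"precision:    {precision:.2%}")
```

Under these assumed numbers about 500 false alerts sit alongside 5 real ones, so under 1% of reports are real even with perfect recall. A positives-only test cannot reveal this.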

The real test is to use it to find new real bugs in the midst of a large code base.

herf an hour ago

There are a lot of details in the original article, in most cases comparing with Opus, which required "human guidance" to exploit the FreeBSD vulnerability:

https://red.anthropic.com/2026/mythos-preview/

Also "isolating the relevant code" in the repro is not a detail - Mythos seems to find issues much more independently.

mrifaki an hour ago

finding vulns in a large codebase is a search problem with a huge negative space and what aisle measured is classification accuracy on ground-truth positives, those are different tasks so a model that correctly labels a pre-isolated vulnerable function tells me almost nothing about that model's ability to surface the same function out of a million lines of unrelated code under a realistic triage budget

the experiment i'd want to see is running each of the small models as an unsupervised scanner across full freebsd then return the top-k suspicious functions per model and compute precision at recall levels that correspond to real analyst triage budgets, if mythos's findings show up in the small models' top 100, i'd call that meaningful but if they only surface under 10k false positives then the cost advantage collapses because analyst triage time is more expensive than frontier model compute to begin with
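The proposed metric can be sketched as recall and precision over a scanner's top-k ranked functions. The function names and the ranked list are invented for illustration; a real run would rank every function in the codebase by the model's suspicion score.

```python
def recall_at_k(ranked, known_vulnerable, k):
    """Fraction of known-vulnerable functions that appear in the top k."""
    top = set(ranked[:k])
    return len(top & known_vulnerable) / len(known_vulnerable)


def precision_at_k(ranked, known_vulnerable, k):
    """Fraction of the top k that are actually known-vulnerable."""
    top = set(ranked[:k])
    return len(top & known_vulnerable) / k


# Hypothetical ranking from one scanner and a made-up ground truth.
ranked = ["f3", "f9", "f1", "f7", "f2"]
known = {"f1", "f2"}

print(recall_at_k(ranked, known, 3))     # 0.5
print(recall_at_k(ranked, known, 5))     # 1.0
print(precision_at_k(ranked, known, 3))  # 1 of the top 3 is real
```

Here k stands for the analyst's triage budget. A model whose true findings only appear at very large k may be cheap per token but still expensive once triage time is counted.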

second thing i keep coming back to is the $20k mythos number is a search budget not a model cost, small models at one hundredth the per-token price don't give us one hundredth the total budget when the search process is the same shape, i still run thousands of iterations and the issue for autonomous vuln research is how fast the reward signal converges and the aisle post doesn't touch any of this

cedwsan hour ago

Didn’t they also use Mythos to scan Linux many times over and it only found one DoS bug or something? I find it hard to believe there is only one security bug lurking.

elzbardico40 minutes ago

I think Mythos's mojo probably comes from a lot of post-training on this kind of task.

I occasionally pick up contract work doing coding annotation to make some quick extra money, and a few months ago one of the projects was heavily focused on spotting common memory access bugs in C and C++.

nickdothuttonan hour ago

PoC or GTFO should apply to AI models too, or the false positive rate will overwhelm.

Retr0idan hour ago

And what about the false-positive rate?

dataflowan hour ago

Yeah, this is the critical question. If the model ends up flagging too much, that could end up being like a manual read of the code.

hedgehog22 minutes ago

It's strange to me they didn't reduce to PoC so the quantitative part is an apples-to-apples comparison. You don't need any fancy tooling, if you want to do this at home you can do something like below in whatever command line agent and model you like. I did take one bug all the way through remediation just out of curiosity.

"""

Your task is to study the following directive, research coding agent prompting, research the directive's domain best practices, and finally draft a prompt in markdown format to be run in a loop until the directive is complete.

<directive>

Concept: Iterative review -- study an issue, enumerate the findings, fix each of the findings, and then repeat, until review finds no issues.

Your job is to run a security bug factory that produces remediation packages as described below. Design and apply a methodology based on best practices in exploit development, lean manufacturing, threat modeling, and the scientific method. Use checklists, templates, and your own scripts to improve token efficiency and speed. Use existing tools where possible. Use existing research and bug findings for the target and similar codebases to guide your search. Study the target's development process to understand what kind of harness and tools you need for this work, and what will work in this development environment. A complete remediation package includes a readme documenting the problem and recommendations, runnable PoC with any necessary data files, and proposed patch.

Track your work in TODO.md (tasks identified as necessary), LOG.md (chronological list of completed tasks and lessons), and STATUS.md (concise summary of the current work being done). Never let these get more than a few minutes out of date. At each step ensure the repo file tree would make sense to the next engineer, and if not reorganize it. Apply iterative review before considering a task complete.

Your task is to run until the first complete remediation package is ready for user review.

Your target is <repo url>.

The prompt will be run as follows, design accordingly. Once the process starts, it is imperative not to interrupt the user until completion or until further progress is not possible. Keep output at each step to a concise summary suitable for a chat message.

```
while output=$(claude -p "$(cat prompt.md)"); do
  echo "$output"
  echo "$output" | grep -q "XDONEDONEX" && break
done
```

</directive>

Draft the prompt into prompt.md, and apply iterative review with additional research steps to ensure it will execute the directive as faithfully as possible.

"""
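
If you would rather drive the loop from Python, here is a minimal sketch of the same idea. `run_claude` mirrors the `claude -p` call from the shell loop; `run_until_done` and the `max_iters` cap are my own additions (assumptions, not part of the original one-liner) so a run that never prints the marker does not spin forever.

```python
import subprocess
from typing import Callable

SENTINEL = "XDONEDONEX"  # completion marker taken from the shell loop


def run_claude(prompt: str) -> str:
    """Run one headless agent pass, like `claude -p "$(cat prompt.md)"`."""
    result = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    return result.stdout


def run_until_done(run_once: Callable[[str], str], prompt: str, max_iters: int = 50) -> int:
    """Call run_once until its output contains SENTINEL; return the number of passes used."""
    for i in range(1, max_iters + 1):
        output = run_once(prompt)
        print(output)
        if SENTINEL in output:
            return i
    raise RuntimeError(f"no {SENTINEL} after {max_iters} passes")
```

Usage would be something like `run_until_done(run_claude, open("prompt.md").read())`. One difference from the shell version: there a failing `claude` call quietly ends the `while` loop, whereas here `check=True` raises `CalledProcessError`.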

TacticalCoderan hour ago

I don't dispute the fact that it's more than cool that we have a new tool to find security exploits (and do many other things) but... A big shout-out to OpenBSD?

We're literally talking about the biggest computers on the planet ever, trained with the biggest amount of data ever available to a system, with the biggest investment ever made by man or close to it and...

The subtlest security bug it can find required going 28 years into the past and finding a...

Denial-of-service?

A freaking DoS? Not a remote root exploit. Not a local exploit.

Just a DoS? And it had to go into 28-year-old code to find that?

So kudos, hats off, deep bow not to Mythos but to OpenBSD? Just a bit, no!?

JackYoustraan hour ago

> Isolated the relevant code

I mean isn't that most of it? If you put a snippet of code in front of me and said "there's probably a vulnerability here" I could probably spend a few hours (a much lower METR time!) and find it. It's a whole other ballgame to ask me with no context to come up with an exploit.

kennywinkeran hour ago

Sure. But it’s a computer. You can run “there’s probably a vulnerability here” as many times as you like. And it’s easier and cheaper to run it many times with a small open model than a big frontier model.

It also sounds like that is how mythos works too. Which makes sense - the linux kernel is too big to fit in context

JackYoustraan hour ago

No, it sounds like mythos is just doing parallel trajectories. that's pretty distinct!

robotswantdataan hour ago

They found a nail in a small bucket of sand, vs mythos with the entire beach reviewed.

dist-epochan hour ago

Anthropic's claim is not necessarily that Mythos found vulnerabilities that other models couldn't, but that it could easily exploit them while previous models failed to do that:

> “Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them.” Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.

rychu20 minutes ago

If that was normal Opus, then it sounds to me like Mythos could be a big model, instruction tuned, but without all the safety/refusal part of training.

ctothan hour ago

> They recovered much of the same analysis

Really?

> We isolated the vulnerable vc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

No.

rvnxan hour ago

Where are all the people here who claim that LLMs are just useless stochastic parrots? Did they lose internet?

SoftTalker41 minutes ago

The patterns of buggy code are well trained.
