emeryberger2 hours ago
You should check out the ChatDBG project - which AFAICT goes much further than this work, though in a different direction, and which, among other things, lets the LLM drive the debugging process. We initially did a WinDBG integration but have since focused on lldb/gdb and pdb (the Python debugger), especially for Python notebooks. In particular, for native code, it integrates a language server to let the LLM easily find declarations of and references to variables, for example. We spent considerable time developing an API that enables the LLM to make the best use of the debugger's capabilities. (It also is not limited to post mortem debugging.) ChatDBG has been out since early 2023, though it has of course evolved since that time. Code is here [1] with some videos; it's been downloaded north of 80K times to date. Our technical paper [2] will be presented at FSE (a top software engineering conference) in June. Our evaluation shows that ChatDBG is on its own able to resolve many issues, and that with some slight nudging from humans, it is even more effective.
[1] https://github.com/plasma-umass/ChatDBG (north of 75K downloads to date) [2] https://arxiv.org/abs/2403.16354
Everdred2dx2 hours ago
Is the benefit of using a language server as opposed to just giving access to the codebase simply a reduction in the amount of tokens used? Or are there other benefits?
nicovank18 minutes ago
Beyond saving tokens, this greatly improved the quality and speed of answers: the language server (most notably used to find the declaration/definition of an identifier) gives the LLM:
1. A shorter path to relevant information, by querying for specific variables or functions rather than a longer investigation of the source code. LLMs are typically trained/instructed to keep their answers within a range of tokens, so keeping conversations shorter when possible extends the search space the LLM will be "willing" to explore before outputting a final answer.
2. A good starting point in some cases, by immediately inspecting suspicious variables or function calls. In my experience this happens a lot in our Python implementation, where the first function calls are typically `info` calls to gather background on the variables and functions in the frame.
emeryberger44 minutes ago
Yes. It lets the LLM immediately obtain precise information rather than having to reason across the entire code base (which ChatDBG also enables). For example (from the paper, Section 4.6):
The second command, `definition`, prints the location and source code for the definition corresponding to the first occurrence of a symbol on a given line of code. For example, `definition polymorph.c:118 target` prints the location and source for the declaration of `target` corresponding to its use on that line. The `definition` implementation leverages the `clangd` language server, which supports source code queries via JSON-RPC and Microsoft's Language Server Protocol.
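For the curious, here is a rough sketch (not ChatDBG's actual code) of what such a `textDocument/definition` query to clangd looks like over JSON-RPC; the file path, the position, and the omitted `initialize`/`textDocument/didOpen` handshake are illustrative only:

    import json
    import subprocess

    def lsp_frame(payload: dict) -> bytes:
        # LSP messages are JSON bodies preceded by a Content-Length header.
        body = json.dumps(payload).encode()
        return f"Content-Length: {len(body)}\r\n\r\n".encode() + body

    # Assumes clangd is on PATH; a real client must first send the `initialize`
    # request and a `textDocument/didOpen` notification for the file, both
    # omitted here for brevity.
    clangd = subprocess.Popen(["clangd"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    definition_request = {
        "jsonrpc": "2.0",
        "id": 2,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": "file:///src/polymorph.c"},  # illustrative path
            "position": {"line": 117, "character": 10},          # 0-based, i.e. line 118
        },
    }
    clangd.stdin.write(lsp_frame(definition_request))
    clangd.stdin.flush()
    # The response carries the URI and range of the definition, which the tool
    # can read from disk and hand to the LLM as plain text.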
lowleveldesign11 hours ago
I do a lot of Windows troubleshooting and am still thinking about how to incorporate AI into my work. The posted project looks interesting and it's impressive how fast it was created. Since it's using MCP it should be possible to bind it to local models; I wonder how performant and effective that would be. When working in the debugger, you should be careful with what you send to external servers (for example, Copilot). Process memory may contain unencrypted passwords, usernames, domain configuration, IP addresses, etc. Also, I don't think that vibe-debugging will work without knowing what the eax register is or how to navigate the stack/heap. It will solve some obvious problems, such as most exceptions, but for anything more demanding (bugs in application logic, race conditions, etc.), you will still need to get your hands dirty.
I am actually more interested in improving the debugger interface. For example, an AI assistant could help me create breakpoint commands that nicely print function parameters when you only partly know the function signature and do not have symbols. I used Claude/Gemini for such tasks and they were pretty good at them.
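To give a concrete flavor (a sketch only: x64 calling convention with the first pointer argument in rcx, and KERNELBASE!CreateFileW chosen purely as an illustration), the kind of command I mean looks like:

    bp KERNELBASE!CreateFileW "du @rcx; gc"

That breaks on the function, dumps the Unicode string the first register argument points to, and continues. The value of an assistant is doing the register/offset bookkeeping for you when you only half-remember the signature.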
As a side note, I recall Kevin Gosse also implemented a WinDbg extension [1][2] which used the OpenAI API to interpret debugger command output.
anougaret10 hours ago
this is pretty cool but ultimately it won't be enough to debug real bugs that are nested deep within business logic or that happen because of long chains of events across multiple services/layers of the stack
imo what AI needs to debug is either:
- train with RL to use breakpoints + debugger or to do print debugging, but that'll suck because chains of action are super freaking long and also we know how it goes with AI memory currently, it's not great
- a sort of omniscient debugger always on that can inform the AI of all that the program/services did (sentry-like observability but on steroids). And then the AI would just search within that and find the root cause
neither of the two approaches is going to be easy to make happen, but imo if we all spend 10+ hours every week debugging, it's worth a shot
that's why currently I'm working on approach 2. I made a time travel debugger/observability engine for JS/Python and I'm currently working on plugging it into AI context as efficiently as possible so it can debug even super long sequences of actions in dev & prod, hopefully one day
it's super WIP and not self-hostable yet but if you want to check it out: https://ariana.dev/
indymike3 hours ago
> this is pretty cool but ultimately it won't be enough to debug real bugs that are nested deep within business logic
I'm looking at this as a better way to get the humans pointed in the right direction. Ariana.dev looks interesting!
Narishma3 hours ago
It's more likely to waste your time by pointing you in the wrong direction.
anougaret3 hours ago
hahaha yeah, even real developers cannot anticipate too well what the direction of a bug is on the first try
anougaret3 hours ago
yes, it can be a nice lightweight way to debug with a bit of AI. other tools in that space will probably be higher involvement
ehnto9 hours ago
I think you hit the nail on the head, especially for deeply embedded enterprise software. The long action chains/time taken to set up debugging scenarios is what makes debugging time-consuming. Solving the inference side of things would be great, but I feel it takes too much knowledge that is in neither the codebase nor the LLM to actually make an LLM useful once you are set up with a debugging state.
Like you said, running over a stream of events, states and data for that debugging scenario is probably way more helpful. It would also be great to prime the context with business rules and history for the company. Otherwise LLMs will make the same mistake devs make, not knowing the "why" something is and thinking the "what" is most important.
anougaret3 hours ago
thanks couldn't agree more :)
rafaelmn9 hours ago
Frankly this kind of stuff getting upvoted kind of makes HN less and less valuable as a news source - this is yet another "hey I trivially exposed something to the LLM and I got some funny results on a toy example".
These kinds of demos were cool 2 years ago - then we got function calling in the API, it became super easy to build this stuff - and the reality hit that LLMs were kind of shit and unreliable at using even the most basic tools. Like oh wow, you can get a toy example working on it and suddenly it's a "natural language interface to WinDBG".
I am excited about progress on this front in any domain - but FFS show actual progress or something interesting. Show me an article like this [1] where the LLM did anything useful. Or just show what you did that's not "oh I built a wrapper on a CLI" - did you fine-tune the model to get better performance? Did you compare which model performs better by setting up some benchmark and find one to be impressive?
I am not shitting on OP here because it's fine to share what you're doing and get excited about it - maybe this is step one, but why the f** is this a front-page article?
[1]https://cookieplmonster.github.io/2025/04/23/gta-san-andreas...
anougaret9 hours ago
yeah, it is still truly hard and rewarding to do deep, innovative software, but everyone is regressing to the mean, rushing to low-hanging fruit, and just plugging old A into new B in the hopes it makes them VC money or something
a real, quality AI breakthrough in software creation & maintenance will require a deep rework of many layers in the software stack, low and high level.
kevingadd7 hours ago
fwiw, WinDBG actually has support for time-travel debugging. I've used it before quite successfully, it's neat.
anougaret7 hours ago
usual limits of debuggers = barely usable to debug real scenarios
pjmlp3 hours ago
Since the Borland days on MS-DOS they have served me pretty well in many real scenarios.
Usually what I keep bumping into are people who never bothered to learn how to use their debuggers beyond the "introduction to debuggers" class, if any.
danielovichdk11 hours ago
Claiming to use WinDBG for debugging a crash dump, and the only commands I can find in the MCP code are these? I am not trying to be a dick here, but how does this really work under the covers? Is the MCP learning WinDBG? Is there a model that knows WinDBG? I am asking because I have no idea.
results["info"] = session.send_command(".lastevent")
results["exception"] = session.send_command("!analyze -v")
results["modules"] = session.send_command("lm")
results["threads"] = session.send_command("~")
You cannot debug a crash dump with only these 4 commands all the time.
psanchez10 hours ago
It looks like it is using "Microsoft Console Debugger (CDB)" as the interface to windbg.
Just had a quick look at the code: https://github.com/svnscha/mcp-windbg/blob/main/src/mcp_serv...
I might be wrong, but at first glance I don't think it is only using those 4 commands. It might be using them internally to get context to pass to the AI agent, but it looks like it exposes:
- open_windbg_dump
- run_windbg_cmd
- close_windbg_dump
- list_windbg_dumps
The most interesting one is "run_windbg_cmd" because it might allow the MCP server to send whatever the AI agent wants. E.g.:

    elif name == "run_windbg_cmd":
        args = RunWindbgCmdParams(**arguments)
        session = get_or_create_session(
            args.dump_path, cdb_path, symbols_path, timeout, verbose
        )
        output = session.send_command(args.command)
        return [TextContent(
            type="text",
            text=f"Command: {args.command}\n\nOutput:\n```\n" + "\n".join(output) + "\n```"
        )]
(edit: formatting)
gustavoaca199710 hours ago
I think the magic happens in the function "run_windbg_cmd". AFAIK, the agent will use that function to pass any WinDBG command that the model thinks will be useful. The implementation basically includes the interface between the model and actually calling CDB through CDBSession.
eknkc8 hours ago
Yeah, that seems correct. It's like creating an SQLite MCP server with a single tool, "run_sql". Which is just fine I guess, as long as the LLM knows how to write SQL (or WinDBG commands). And they definitely do know that. I'd even say this is better, because it shifts the capability to the LLM instead of the MCP server.
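For what it's worth, that single-tool pattern is tiny to build. A minimal sketch with the MCP Python SDK (assuming `pip install mcp` and an `example.db` file, both illustrative) would look something like:

    import sqlite3
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("sqlite")

    @mcp.tool()
    def run_sql(query: str) -> str:
        """Run an arbitrary SQL statement and return the rows as text."""
        # The LLM writes the SQL; the server just executes it and returns text,
        # much like run_windbg_cmd does for CDB commands.
        with sqlite3.connect("example.db") as conn:
            rows = conn.execute(query).fetchall()
        return "\n".join(str(row) for row in rows)

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default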
dark-star6 hours ago
The magic happens in the "!analyze -v" part. This does quite a long analysis of a crash dump (https://learn.microsoft.com/en-us/windows-hardware/drivers/d...)
After that, all that is required is interpreting the results and connecting them with the source code.
Still impressive at first glance, but I wonder how well it works with a more complex example (a crash in the Windows kernel due to a broken driver, for example).
JanneVee9 hours ago
> Crash dump analysis has traditionally been one of the most technically demanding and least enjoyable parts of software development.
I for one enjoy crash dump analysis because it is a technically demanding, rare skill. I know I'm an exception, but I enjoy actually learning the stuff so I can deterministically produce the desired result! I even apply it to other parts of the job, like learning the currently used programming language and actually reading the documentation of libraries/frameworks, instead of copy-pasting solutions from the "shortcut du jour" like Stack Overflow yesterday and the LLMs of today!
criddell6 hours ago
Are you using WinDbg? What resources did you use to get really good at it?
Analyzing crash dumps is a small part of my job. I know enough to examine exception context records and associated stack traces and 80% of the time, that’s enough. Bruce Dawson’s blog has a lot of great stuff but it’s pretty advanced.
I’m looking for material to help me jump that gap.
muststopmyths2 hours ago
so, debuggers are really just tools. To get "good" at analyzing crash dumps, you have to understand the OS and its process/threading model, the ABI of the platform, a little (to a lot) of assembler, etc.
There's no magic to getting good at it. Like anything else, it's mostly about practice.
People like Bruce and Raymond Chen had a little bit of a leg up over people outside Microsoft in that if you worked in the Windows division, you got to look at more dumps than you'd have wanted to in your life. That plus being immersed in the knowledge pool and having access to Windows source code helps to speed up learning.
Which is to say, you will eventually "bridge the gap" with them with experience. Just keep plugging at it and eventually you'll understand what to look for and how to find it.
It helps that in a given application domain the nature of crashes will generally be repeated patterns. So after a while you start saying "oh, I bet this is a version of that other thing I've seen devs stumble over all the time".
A bit of a rambling comment to say: don't worry, you'll "get really good at it" with experience.
JanneVee5 hours ago
I didn't say that I was any good, just that I enjoyed it.
I have a dog-eared copy of Advanced Windows Debugging that I've used, but I also have books on reverse engineering and disassembly, plus a little bit of curiosity and practice. I also have the .NET version, which I haven't used as much. I also enjoyed the Vostokov books, even though there is a lack of editing in them.
Edit to add: It is not so much about usage of the tool as it is about understanding what is going on in the dump file. You are ahead in knowledge if you can do stack traces and look at exception records.
the_duke5 hours ago
I feel like current top models (Gemini 2.5 Pro, etc.) would already be good developers if they had the feedback cycle and capabilities that real developers have:
* reading the whole source code
* looking up dependency documentation and code, search related blog posts
* getting compilation/linter warnings and errors
* Running tests
* Running the application and validating output (eg, for a webserver, start the server, send requests, get the response)
The tooling is slowly catching up, and you can enable a bunch of this already with MCP servers, but we are nowhere near the optimum yet.
Expect significant improvements in the near future, even if the models don't get better.
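As one example of the kind of glue involved (a sketch, not an existing server; the tool name and pytest invocation are assumptions), giving a model the "run the tests" feedback loop over MCP is only a few lines:

    import subprocess
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("dev-feedback")

    @mcp.tool()
    def run_tests(path: str = "tests") -> str:
        """Run pytest on the given path and return its combined output."""
        result = subprocess.run(
            ["python", "-m", "pytest", path, "-q"],
            capture_output=True, text=True, timeout=600,
        )
        return result.stdout + result.stderr

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default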
thegeomaster5 hours ago
This is exactly what frameworks like Claude Code, OpenAI Codex, Cursor agent mode, OpenHands, SWE-Agent, Devin, and others do.
It definitely does allow models to do more.
However, the high-level planning, reflection and executive function still aren't there. LLMs can nowadays navigate very complex tasks using "intuition": just ask them to do the task, give them tools, and they do a good job. But if the task is too long or requires too much information, the context length deteriorates the performance significantly, so you have to switch to a multi-step pipeline with multiple levels of execution.
This is, perhaps unexpectedly, where things start breaking down. Having the LLM write down a plan lossily compresses the "intuition", and LLMs (yes, even Gemini 2.5 Pro) cannot understand what's important to include in such a grand plan, how to predict possible externalities, etc. This is a managerial skill and seems distinct from closed-form coding, which you can always RL towards.
Errors, omissions, and assumptions baked into the plan get multiplied many times over by the subsequent steps that follow the plan. Sometimes, the plan heavily depends on the outcome of some of the execution steps ("investigate if we can..."). Allowing the "execution" LLM to go back and alter the plan results in total chaos, but following the plan rigidly leads to unexpectedly stupid issues, where the execution LLM is trying to follow flawed steps, sometimes even recognizing that they are flawed and trying to self-correct inappropriately.
In short, we're still waiting for an LLM which can keep track of high-level task context and effectively steer and schedule lower-level agents to complete a general task on a larger time horizon.
For a more intuitive example, see how current agentic browser use tools break down when they need to complete a complex, multi-step task. Or just ask Claude Code to do a feature in your existing codebase (that is not simple CRUD) the way you'd tell a junior dev.
pjmlp3 hours ago
I expect that if I prompt it the way I would brief an offshored junior dev, spelled out in enough detail that I actually get a swing instead of a tire, then it will get quite close to the desired outcome.
However, this usually takes much more effort than just doing the damn thing myself.
demarq5 hours ago
It’s now a matter of when, and I’m working on that problem.
lgiordano_notte5 hours ago
Curious how you're handling multi-step flows or follow-ups; seems like that's where MCP could really shine, especially compared to brittle CLI scripts. We've seen similar wins with browser agents once structured actions and context are in place.
codepathfinder5 hours ago
Built this around mid-2023 and found interesting results!
cadamsdotcom11 hours ago
Author built an MCP server for windbg: https://github.com/svnscha/mcp-windbg
Knows plenty of arcane commands in addition to the common ones, which is really cool & lets it do amazing things for you, the user.
To the author: most of your audience knows what MCP is, may I suggest adding a tl;dr to help people quickly understand what you've done?
Tepix10 hours ago
Sounds really neat!
How does it compare to using the Ghidra MCP server?
mahmoudimus23 minutes ago
Ghidra is actually a suite of reverse engineering tools, including, but not limited to, a disassembler, a decompiler, and a debugger interface that connects to many debuggers, among other neat things.
A disassembler takes compiled binaries and displays the assembly code the machine executes.
A decompiler translates the disassembled code back to pseudocode (e.g. disassembly -> C).
A debugger lets you step through the disassembly. WinDbg is a pretty powerful debugger, but it has the downside of a fairly unintuitive syntax (but I'm biased, coming from gdb/LLVM debuggers).
Both MCP servers can probably be used together, but they do different things. A neat experiment would be to see if they're aware of each other and can utilize each other to "vibe reverse".
trealira2 hours ago
This isn't a decompiler, but there are LLM tools for decompilation, like LLM4Decompile.
cjbprime9 hours ago
Ghidra's a decompiler and WinDBG is a debugger, so they'd be complementary.
alexvitkov2 hours ago
Watching a guy type at 30 WPM in a chatbox reminds me of those old YouTube tutorials where some dude is typing into a Notepad window, showing you how to make a shortcut to "shutdown -s -t 0" on your school computer and give it the Internet Explorer icon. It's only missing Linkin Park blasting in the background.
If you're debugging from a crash dump you probably have a large, real world program, that actual people have reviewed, deemed correct and released in the wild.
Current LLMs can't produce a sane program over 500 lines; the idea that they can understand a correct-looking program several orders of magnitude larger, well enough to diagnose and fix a subtle issue that the people who wrote it missed, is absurd.
indigodaddy11 hours ago
My word, that's one of the most beautiful sites I've ever encountered on mobile.
Zebfross11 hours ago
Considering AI is trained on the average human experience, I have a hard time believing it would be able to make any significant difference in this area. The best experience I’ve had debugging at this level was using Microsoft’s time travel debugger which allows stepping forward and back.
cjbprime9 hours ago
You should try AI sometime. It's quite good, and can do things (like "analyze these 10,000 functions and summarize what you found out about how this binary works, including adding comments everywhere") that individual humans do not scale to.
voidspark8 hours ago
It can analyze in 2 seconds a crash dump that could take hours for an experienced developer, or be impossible for the "average human".
s3cfast2 hours ago
George Hotz's Qira is a timeless debugger; worth checking out too!
userbinator11 hours ago
[flagged]
danb197411 hours ago
That's not how averages work :)
FirmwareBurner11 hours ago
Yes, but we all understood the essence of what he meant and he's right. Why be a stickler about it.
userbinator11 hours ago
[flagged]
hhh10 hours ago
Is it brigading if people just have the opinion that AI can do well at software?
userbinatoran hour ago
Only half of them.
posnet11 hours ago
Median
spacechild110 hours ago
Human intelligence roughly follows a normal distribution where the median is the same as the mean. In that sense OP was correct that half of the population are below average.
revskill11 hours ago
[flagged]
dogleash4 hours ago
[flagged]
JanSchu6 hours ago
This is one of the most exciting and practical applications of AI tooling I've seen in a long time. Crash dump analysis has always felt like the kind of task that time forgot—vital, intricate, and utterly user-hostile. Your approach bridges a massive usability gap with the exact right philosophy: augment, don't replace.
A few things that stand out:
The use of MCP to connect CDB with Copilot is genius. Too often, AI tooling is skin-deep—just a chat overlay that guesses at output. You've gone much deeper by wiring actual tool invocations to AI cognition. This feels like the future of all expert tooling.
You nailed the problem framing. It’s not about eliminating expertise—it’s about letting the expert focus on analysis instead of syntax and byte-counting. Having AI interpret crash dumps is like going from raw SQL to a BI dashboard—with the option to drop down if needed.
Releasing it open-source is a huge move. You just laid the groundwork for a whole new ecosystem. I wouldn’t be surprised if this becomes a standard debug layer for large codebases, much like Sentry or Crashlytics became for telemetry.
If Microsoft is smart, they should be building this into VS proper—or at least hiring you to do it.
Curious: have you thought about extending this beyond crash dumps? I could imagine similar integrations for static analysis, exploit triage, or even live kernel debugging with conversational AI support.
Amazing work. Bookmarked, starred, and vibed.
Helmut100015 hours ago
I have noticed a lot of improvements in this area too. I recently had a problem with my site-to-site IPsec connection. I had an LLM explain the logs from both sides and together we came to a conclusion. Having it distill the problematic part from the huge logs was a significant effort and time saver.