Hacker News

mohsen1
Show HN: Letting LLMs Run a Debugger (github.com)

Hey HN,

I just built an experimental VSCode extension called LLM Debugger. It’s a proof-of-concept that lets a large language model take charge of debugging. Instead of only looking at the static code, the LLM also gets to see the live runtime state—actual variable values, function calls, branch decisions, and more. The idea is to give it enough context to help diagnose issues faster and even generate synthetic data from running programs.

Here’s what it does:

* Active Debugging: It integrates with Node.js debug sessions to gather runtime info (like variable states and stack traces); a rough sketch of this plumbing follows the list.

* Automated Breakpoints: It automatically sets and manages breakpoints based on both code analysis and LLM suggestions.

* LLM Guidance: With live debugging context, the LLM can suggest actions like stepping through code or adjusting breakpoints in real time.
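
As promised above, here’s a simplified sketch (not the actual extension code; the helper name is made up) of how runtime state can be pulled out of a VS Code debug session with standard DAP requests and flattened for an LLM prompt:

    import * as vscode from "vscode";

    // Sketch: grab the locals of the top stack frame from the active debug
    // session via standard DAP requests (threads -> stackTrace -> scopes ->
    // variables) and flatten them into "name = value" lines for a prompt.
    async function snapshotLocals(): Promise<string> {
      const session = vscode.debug.activeDebugSession;
      if (!session) return "no active debug session";

      const { threads } = await session.customRequest("threads");
      const { stackFrames } = await session.customRequest("stackTrace", {
        threadId: threads[0].id,
      });
      const { scopes } = await session.customRequest("scopes", {
        frameId: stackFrames[0].id,
      });
      const { variables } = await session.customRequest("variables", {
        variablesReference: scopes[0].variablesReference,
      });

      return variables
        .map((v: { name: string; value: string }) => `${v.name} = ${v.value}`)
        .join("\n");
    }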

I built this out of curiosity to see if combining static code with runtime data could help LLMs solve bugs more effectively. It’s rough around the edges and definitely not production-ready.

I’m not planning on maintaining it further. But I thought it was a fun experiment and wanted to share it with you all.

Check out the attached video demo to see it in action. Would love to hear your thoughts and any feedback you might have!

Cheers.


mohsen1 (OP) · a month ago

To people who do this sort of thing:

Does generating synthetic data with this make sense for RL, to get models really good at debugging code? Currently, all LLMs have inhaled all the code in the world, but that data is only the text of the code (maybe plus the changes that fixed bugs, etc.). The amount of insight that could be generated by actually running the code and capturing the runtime values, step by step, is almost infinite.

Is this sort of data useful for training LLMs?

marxism · a month ago

Not only that, it measurably improves coding ability too (according to our tests fine-tuning Llama 2).

Back in 2023 we sold a training dataset (~500B tokens, but we could have generated more) with exactly this kind of data. The dataset was a bunch of ~1-3KB text-file examples: a code snippet, then some variable values observed while the program was at line X, then a prompt asking the LLM to predict what print statements at line Y would show.
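
To make the format concrete, a single (made-up) example looked roughly like this:

    // Made-up example, roughly the shape described above.
    function total(xs: number[]): number {
      let sum = 0;
      for (const x of xs) {
        sum += x;      // line 4
      }
      return sum * 2;  // line 6
    }
    total([3, 5, 7]);

    // Observed at line 4 (second iteration): x = 5, sum = 3
    // Predict what a print statement at line 6 would show:
    // sum = ?, return value = ?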

If you wanted to train on stack traces or control flow graphs we offered that too.

Our MVP used rr to attach to large programs like Excel and Chrome, then used a heuristic to filter samples for more "interesting" examples. The theory at the time was: sure, LLMs can learn a few information-theoretic bits from many noisy examples, but why not spend a little intelligence on our end and knock out a couple of low-hanging "noise" sources? We used a Prolog engine to find traces where all the information required to predict the final memory address was present in the initial program description. This turned out not to matter much, because the LLM would learn from non-deterministic examples too (shrug). Eventually we ended up with a monkey-patched Chromium browser wandering around the internet to collect JavaScript examples.

We sold datasets, and we also offered an on-prem agent where you could generate training examples on demand using the spare CPU cycles on GPU nodes.

It seemed like a beautiful idea at the time but sputtered out because every serious outfit we talked to would ask a bunch of questions and seemed to come to the conclusion that they would rather do it in house.

I would bet dollars to donuts that most of the AI scraping load people are complaining about has less to do with grabbing the text content of their forums and more to do with executing their JavaScript.

Speaking from experience: we didn't really understand V8 internals, so our training-data harvesting bot drove more execution by continuously navigating and reloading pages over and over, rather than doing something smarter and more efficient like isolate snapshotting or other fine-grained manipulation of VM state.

Edit: email in my profile if anyone wants to talk about this. I'm feeling a wave of nostalgia thinking about this project again. Despite commercial failure, it was arrestingly/dangerously interesting.

silveraxe93 · a month ago

I'd be extremely surprised if AI labs are not doing or planning on doing this already.

The same way that reasoning models are trained on chains of thought, why not do it with program state?

Have a "separate" scratchpad where the AI keeps the expected state of the program. You can verify whether that state is correct against the actual execution, then use RL to train the AI to always keep it correct.
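
Concretely, the reward could be as simple as diffing the model's predicted variable values against what the debugger actually observed. A minimal sketch (the JSON scratchpad format and function name are assumptions, not anyone's actual implementation):

    // The model emits its scratchpad as JSON, e.g. {"x": "5", "sum": "8"};
    // we compare it against values captured from the real execution and
    // reward the fraction of variables predicted correctly.
    function scratchpadReward(
      predicted: Record<string, string>,
      actual: Record<string, string>,
    ): number {
      const names = Object.keys(actual);
      if (names.length === 0) return 0;
      const correct = names.filter((n) => predicted[n] === actual[n]).length;
      return correct / names.length;
    }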

hunterbrooks · a month ago

Yes. My business is in the code-review space; synthetic data is very helpful for evals.

llm_trw · a month ago

Probably. You don't know until someone tries.

melvinroest · a month ago

This would be amazing with Smalltalk/Pharo or a similar language where debugging is a first-class citizen (I guess it's the same for certain Lisp languages?).

koito17 · a month ago

In the case of Common Lisp, yes. Other lisps (e.g. Clojure) don't really have interactive debugging, value inspection, live disassembly, or anything like the condition system (which allows programmatic access to the debugger and automatically recovering from certain conditions).

The compiler, debugger, and runtime are always present. When optimizing Common Lisp code, I found it very useful to add types, refactor structs, recompile functions, then disassemble the functions and see if the compiler generated efficient machine code. This is all natively supported by the runtime. Editor tooling does little more than create a nice-looking UI. Doing this in Clojure is not really possible, since it's hard to guess what HotSpot will do with a given sequence of JVM bytecode.

jasonjmcghee · a month ago

Nice! I recently had the same idea and built it using MCP (in order to be client/LLM agnostic) and VS Code (DAP would be even better, but I haven't tried tackling it).

https://github.com/jasonjmcghee/claude-debugs-for-you

mohsen1 (OP) · a month ago

That's a really cool project. Why did you stop developing it? For me, it's a lot of work and I don't have the bandwidth to productionize mine.

jasonjmcghee · a month ago

I wouldn't say I stopped developing it; there are always features that could be added, but it serves the purpose it's meant to! You can directly interact with an LLM and it works just like any other chat, but it can also perform its own debugging/investigations.

pinoy420 · a month ago

Very healthy mindset

jasonjmcghee · a month ago

Wow, this really blew up since I last commented! Congrats! If you're interested in developing this further, I encourage you to check out how I made mine language-agnostic instead of Node.js-only; it's really not much additional effort.

emeryberger · a month ago

Nice UI. We started on a project that does this about two years ago: ChatDBG (https://github.com/plasma-umass/ChatDBG), downloaded about 70K times to date. It integrates into debuggers like `lldb`, `gdb`, and `pdb` (the Python debugger). For C/C++, it also leverages a language server, which makes a huge difference. You can also chat with it. We wrote a paper about it, to be published shortly in a major conference near you (https://arxiv.org/abs/2403.16354). One of the coolest things we found is that the LLM can leverage real-world knowledge to diagnose errors; for example, it successfully debugged a problem where the number of bootstrap samples was too low.

flembat · 25 days ago

Really great idea. I currently work for an AI, compiling and debugging its code; at least, that's what it sometimes feels like. Who is the agent here, exactly? The fact that the AI has no understanding at all of what we are doing, and doesn't apply the information it does know to solve problems, is challenging. At least if it debugged the code, it would be able to see that it's clobbering the same registers it's using, instead of me having to explain that to it. Fortunately I'm talking about my hobby projects; I pity the people doing this for a living now.

dboreham · a month ago

Hopefully work like this has the side effect of people making debuggers work again. In my experience they seldom do these days (except in old-school tech like C, C++, golang), presumably because the younger folks were told in college that debugging wasn't necessary. I don't mean that debuggers simply don't run; rather, they're sufficiently broken that they're not worth using. Perhaps an LLM that adds print statements to code and reads the output would be more in keeping with the times?
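
The print-statement version is at least trivial to prototype. A toy sketch (the helper and its file handling are made up for illustration):

    import { execFileSync } from "node:child_process";
    import { readFileSync, writeFileSync } from "node:fs";

    // Toy sketch: inject a console.log probe at the line the LLM asks about,
    // run the script, and hand the captured output back to the model.
    function probeLine(file: string, line: number, expr: string): string {
      const src = readFileSync(file, "utf8").split("\n");
      src.splice(line - 1, 0, `console.log("probe:", ${expr});`);
      const probed = file.replace(/\.js$/, ".probed.js"); // write a copy
      writeFileSync(probed, src.join("\n"));
      return execFileSync("node", [probed], { encoding: "utf8" });
    }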

ericb · a month ago

Very cool concept! There's a lot of potential in reducing the try-debug-fix cycle for LLMs.

On a related note, here's a Ruby gem I wrote that captures variable state from the moment an Exception is raised. It gets you non-interactive text-based debugging for exceptions.

https://rubygems.org/gems/enhanced_errors

K0IN · a month ago

Hey, this is lovely!

I created an extension to help me debug a while back [0], and I'd thought about this (AI integration) for a long time, but I didn't have the time to tackle it.

Thank you so much for sharing!

I might need to add this approach to my extension as well!

[0] https://github.com/K0IN/debug-graph

bravura · a month ago

A time-traveling debugger for Python + LLM would be amazing.

mettamage · a month ago

Tangent: I now want a 10 hour YouTube video where an LLM gets stuck in some debugging reasoning loop and the loop is just recorded for 10 hours.

Preferably with some lofi music under it.

jasonjmcghee · a month ago

This isn't difficult to do once you have debugging capabilities exposed to the LLM in VS Code. You just need the proper launch.json and to expose "step back".
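
For a Node target the launch.json can be minimal, something like the sketch below (program path and name are placeholders). One caveat: "step back" also requires a debug adapter that implements DAP's stepBack request, which the stock Node adapter doesn't, so you'd need a time-travel-capable adapter for that part.

    {
      "version": "0.2.0",
      "configurations": [
        {
          "type": "node",
          "request": "launch",
          "name": "LLM debug target",
          "program": "${workspaceFolder}/index.js"
        }
      ]
    }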

stuaxo · a month ago

Nice. I did this manually once with ipdb, just cutting and pasting the text to an LLM and having it tell me which variables to inspect and what to press.

crest · a month ago

I like that the first paragraph of the README clearly states this is a research project instead of making a lot of grandiose claims.

bwhiting2356 · a month ago

Thank you for this. Would love to see it integrated into Copilot or Cursor.

jasonjmcghee · a month ago

I added support for SSE transport (and so Cursor) in "claude debugs for you", an autodebugger I've been working on. It works via an MCP server, so you can use it via Composer (Agent) in Cursor. I don't pay for premium in Cursor myself, so I'd be very excited for you to test it out!

https://marketplace.visualstudio.com/items?itemName=JasonMcG...

https://github.com/jasonjmcghee/claude-debugs-for-you

codenote · a month ago

Interesting experiment! This feels like it could really expand the potential applications of LLMs. Exciting to see how AI can assist in debugging with live runtime context!

maeil · a month ago

Extremely LLM-like writing style, do you translate your comments through one or something?

bbarnett · a month ago

Do not let it debug itself!!

jbmsf · a month ago

Honestly, this is the first LLM concept that makes me want to change my workflow. I don't use vscode but I'm excited by the idea.

jasonjmcghee · a month ago

Theoretically there's no reason you need to use VS Code; it's just very easy to make extensions for.
