moezd an hour ago
Git works universally as a storage backend, with some reasoning capacity thanks to commit history. It didn't need to include semantics about the code or build a tree of knowledge. That would be against the Unix philosophy: do one thing and do it well.
You can build whatever you want on top of it to help your AI agents. That would actually be beneficial, so that we stop feeding raw text to this insane machinery for once.
hallh 9 hours ago
We've tackled this problem slightly differently where I work. We have AI agents contribute in a large legacy codebase, and without proper guidance, the agents quickly get lost or reimplement existing functionality.
To help the agents understand the codebase, we indexed our code into a graph database using an AST, allowing the agent to easily find linked pages, features, databases, tests, etc from any one point in the code, which helped it produce much more accurate plans with less human intervention and guidance. This is combined with semantic search, where we've indexed the code based on our application's terminology, so when an agent is asked to investigate a task or bug for a specific feature, it'll find the place in the code that implements that feature, and can navigate the graph of dependencies from there to get the big picture.
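As a minimal sketch of the indexing idea (illustrative Python using the built-in ast module, not our actual pipeline; the sample code and names are made up): extract call edges from the code, then invert the graph so you can ask blast-radius questions from any starting point.

    import ast
    from collections import defaultdict

    # Toy "codebase"; in practice this would be parsed file by file.
    code = """
    def validate(user): pass
    def save_user(user): validate(user)
    def handler(request): save_user(request)
    """

    # caller -> set of callees, extracted from the AST
    calls = defaultdict(set)
    for fn in ast.walk(ast.parse(code)):
        if isinstance(fn, ast.FunctionDef):
            for node in ast.walk(fn):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    calls[fn.name].add(node.func.id)

    # Invert it: callee -> callers, i.e. the blast radius of changing something.
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)

    print(callers["validate"])   # {'save_user'}
    print(callers["save_user"])  # {'handler'}

The real version pushes edges like these into a graph database and has many more node types (pages, features, databases, tests), but the traversal idea is the same.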
We provide these tools to the coding agents via MCP and it has worked really well for us. Devs and QAs can find the blast radius of bugs and critical changes very quickly, and the first draft quality of AI generated plans requires much less feedback and corrections for larger changes.
In our case, I doubt a general-purpose AST would work as well. It might be better than a simple grep, especially for indirect dependencies or relationships. But IMO it'd be far more interesting to start looking at application frameworks, or even programming languages, that provide this direct traversability out of the box. I remember thinking, when reading about Wasp[0], that it would be interesting to see it go this way and provide tooling specifically for AI agents.
[0] https://wasp.sh/
panstromek 11 hours ago
I think I agree (though I think about this maybe one level higher). I wrote about this a while ago in https://yoyo-code.com/programming-breakthroughs-we-need/#edi... .
One interesting thing I got in replies is the Unison language (content-addressed functions; a function is defined by its AST). I also recommend checking out the Dion language demo, an experimental project that stores the program as an AST.
In general I think there's a missing piece between text and storage. Structural editing is likely a dead end, writing text seems superior, but storage format as text is just fundamentally problematic.
I think we need a good bridge that allows editing via text but stores the program like a structured database (I'd go as far as to say a relational database, maybe). This would unlock a lot of IDE-like features for simple programmatic usage, or for manipulating language semantics in some interesting ways; the challenge is of course how to keep the mapping to the textual input in shape.
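To make "storage like a relational database" concrete, a hypothetical sketch (schema and column names invented for illustration, not a real system): one row per AST node, so an edit like a rename becomes a transactional update instead of a text search/replace.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),
        position  INTEGER,        -- order among siblings
        kind      TEXT NOT NULL,  -- 'function', 'call', 'identifier', ...
        text      TEXT            -- token text for leaves, NULL otherwise
    );
    CREATE INDEX node_children ON node(parent_id, position);
    """)

    # A rename is a scoped, transactional UPDATE rather than a regex sweep:
    con.execute(
        "UPDATE node SET text = ? WHERE kind = 'identifier' AND text = ?",
        ("new_name", "old_name"))
    con.commit()

Find-references, import graphs, and structural queries then all become plain SELECTs; the open question is exactly the one above, keeping this in sync with a textual editing surface.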
zokier 16 minutes ago
> but storage format as text is just fundamentally problematic.
Why? The AST needs to be stored as bytes on disk anyway, so what is problematic in having those bytes be human-readable text?
fuhsnn 11 hours ago
Structural diff tools like difftastic[1] are a good middle ground, and still underexplored IMO.
panstromek 11 hours ago
IntelliJ diffs are also really good; they are somewhat semi-structural, I'd say. Not going as far as difftastic, it seems (but I haven't used that one).
flowerbreeze 9 hours ago
I'm quite sure I've read your article before, and I've thought about this one a lot. Not so much from a Git perspective, but about textual representation still being the "golden source" for what the program is when interpreted or compiled.
Of course, text is so universal and allows for so many ways of editing that it's hard to give up. On the other hand, while text is great for input, it comes with overhead and core issues (most are already in the article, but I'm writing them down anyway):
1. Substitutions such as renaming a symbol, where ensuring the correctness of the operation pretty much requires having parsed the text into a graph representation first, or else letting go of the guarantee of correctness and performing a plain text search/replace.
2. Alternative representations requiring full and correct re-parsing such as:
- overview of flow across functions
- viewing graph based data structures, of which there tend to be many in a larger application
- imports graph and so on...
3. Querying structurally equivalent patterns when they have multiple equivalent textual representations and search in general being somewhat limited.
4. Merging changes and diffing come with fewer guarantees than merging graphs or trees.
5. Correctness checks, such as detecting cyclic imports and ensuring the validity of the program itself, are all build-time, unless the IDE effectively maintains a duplicate program graph, continuously re-parsed from the changes, which is still not equivalent to the eventual execution model.
6. Execution and build speed is also a permanent overhead as applications grow, when using text as the source. Yes, parsing methods are quite fast these days and the hardware is far better, but having a correct program graph on hand is always faster than parsing, creating & verifying a new one.
I think input as text is a must-have to start with, no matter what. But what if the parsing step was performed immediately on stop symbols, rather than later, and merged with the program graph right away instead of during a separate build step? Or what if it was like a "staging" step? E.g., write a separate function that gets parsed into the program model immediately, then try executing it, and only later merge it into the main program graph, which can perform all the necessary checks to ensure it remains valid. I think it'd be more difficult to learn, but having these operations and a program graph as a database would give so much when it comes to editing, verifying and maintaining more complex programs.
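A very rough sketch of that staging idea (Python stand-in; a dict plays the "main program graph", and everything here is invented for illustration): parse a unit immediately, check its references against the graph, and only merge if the graph stays valid.

    import ast

    program = {}  # name -> parsed function; stand-in for the program graph

    def stage(source):
        """Parse immediately (the 'on stop symbols' step)."""
        fn = ast.parse(source).body[0]
        assert isinstance(fn, ast.FunctionDef)
        return fn

    def merge(fn):
        """Merge only if every referenced function already resolves."""
        used = {n.func.id for n in ast.walk(fn)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
        missing = used - set(program) - {fn.name}
        if missing:
            raise ValueError(f"graph would become invalid: {missing}")
        program[fn.name] = fn

    merge(stage("def greet(name): return 'hi ' + name"))
    merge(stage("def main(): return greet('world')"))  # ok, greet exists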
zelphirkalt 11 hours ago
Why would structural editing be a dead end? It has nothing to do with the storage format. At least the meaning of the term I am familiar with is about how you navigate and manipulate semantic units of code instead of manipulating its characters: for example, pressing some shortcut keys to invert the nesting of AST nodes, wrap an expression inside another, or change the order of expressions, all at the press of a button or key combo. I think you might be referring to something else, or to a different definition of the term.
panstromek 11 hours ago
I'm referring to UI interfaces that only allow structural editing and usually store only the structural shape of the program (e.g. no whitespace or indentation). I think at this point nobody uses them for programming; they're pretty frustrating to use, because they don't allow edits that break the semantic text structure too much.
I guess the most used one is the styles editor in Chrome dev tools, and that one is only really useful for small tweaks; even just adding new properties is already a pretty frustrating experience.
[edit] Otherwise I agree that structural editing à la IDE shortcuts is useful; I use that a lot.
pegasus 9 hours ago
Some very bright JetBrains folks were able to solve most of those issues. Check out their MPS IDE [1]; its structured/projectional editing experience is in a class of its own.
conartist6 11 hours ago
Come to the BABLR side. We have cookies!
In all seriousness this is being done. By me.
I would say structural editing is not a dead end, because, as you mention, projects like Unison and Smalltalk show us that storing structures is compatible with having syntax.
The real problem is that we need a common way of storing parse tree structures, so that we can build a semantic editor that works on the syntax of many programming languages.
panstromek 11 hours ago
I think neither Unison nor Smalltalk uses structural editing, though.
[edit] on the level of code in a function, at least.
conartist6 9 hours ago
No, I know that. But we do have an example of something that does: the web browser.
PunchyHamster 11 hours ago
No we don't.
And you can build nearly any VCS of your dreams while still using Git as the storage backend, since it is a database of linked snapshots + metadata. Bonus benefit: it will work with existing tooling.
The whole article is "I don't know how git works, let's make something from scratch".
conartist6 11 hours ago
A syntax tree node does not fit into a git object. Too many children. This doesn't mean we shouldn't keep everything that's great about git in a next-gen solution, but it does mean that we'll have to build some new tools to experiment with features like semantic patching and merging.
Also I checked the author out and can confirm that they know how git works in detail.
ongy 11 hours ago
Why do you think it has too many children? If we are talking direct descendants, I have seen way larger directories in (git-managed) file systems than I've ever seen in an AST.
I don't think there's a limit in git. The structure might be a bit deep for git and thus some things might be unoptimized, but the shape is the same.
Tree.
conartist6 9 hours ago
Directories use the `tree` object type in git whereas files use `blob`. What I understand you to suggest is using the tree nodes instead of the blob nodes as the primary type of data.
This is an interesting idea for how to reuse more of git's infrastructure, but it wouldn't be backwards compatible in the traditional sense either. If you checked out the contents of that repo you'd get every node in the syntax tree as a file, and let's just say that syntax nodes as directories aren't going to be compatible with any existing tools.
But even if I wanted to embrace it I still think I'd hit problems with the assumptions baked into the `tree` object type in git. Directories use a fundamentally different model than syntax trees do. Directories tend to look like `<Parent><Child/></>` while syntax trees tend to look like `<Person> child: <Person /> </>`. There's no room in git's `tree` objects to put the extra information you need, and eventually the exercise would just start to feel like putting a square peg in a round hole.
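To make the square-peg point concrete, here's a Python sketch of the two shapes side by side. The git entry fields are what a real tree object actually records; the syntax-node shape is just illustrative:

    from dataclasses import dataclass, field

    # A git `tree` entry: a mode, a name, and a hash. Those are all the slots.
    @dataclass
    class GitTreeEntry:
        mode: str  # "100644" for a blob, "040000" for a subtree
        name: str  # file or directory name
        oid: str   # hash of the referenced object

    # A syntax node: children hang off *named fields*, order matters, and the
    # node carries its own type. None of that fits in the three slots above.
    @dataclass
    class SyntaxNode:
        type: str                                   # e.g. "CallExpression"
        fields: dict = field(default_factory=dict)  # "callee" -> [SyntaxNode]
        text: str = ""                              # token text for leaves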
Instead of learning that I should use exactly git's data structure to preserve compatibility, I think my learning should be that a successful structure needs to be well-suited to the purpose it is being used for.
charcircuit 12 hours ago
>I definitely reject the "git compatible" approach
If your version control system is not compatible with GitHub, it will be dead on arrival. The value of allowing people to gradually adopt a new solution cannot be overstated. There is also value in being compatible with existing git integrations and scripts in projects' build systems.
wavemode 40 minutes ago
I don't think Git/GitHub is really all that big of a lock-in in practice for most projects.
IMO Git is not an unassailable juggernaut. I think if a new SCM came along with a frontend like GitHub and a VSCode plugin, that alone would be enough for many users to adopt it (barring users who are heavy customers of GitHub Actions). It's just that nobody has decided to do this, since there's no money in it and most people are fine with Git.
quadrifoliate 12 hours ago
Based on reading this, I don't see anything that would prevent tracking a repo managed by this database with Git (and therefore GitHub) in addition to the database. I think the "compatible" bit means more that you have to think in terms of Git concepts everywhere.
Curious what the author thinks, though; it looks like it was posted by them.
gritzkoop 11 hours ago
Technically, exporting changes either way is not a challenge. It only becomes difficult if we have multiple gateways for some reason.
One way to do it is to use the new system for the messy part and git/GitHub for "publication".
conartist6 12 hours ago
A system as described could be forwards compatible with git without being backwards compatible with git. In other words: let people migrate easily, but don't force the new system to have all the same flaws as the old.
ongy 11 hours ago
What issues do you see in git's data model that would justify abandoning it as the wire format for syncing?
conartist6 11 hours ago
I wouldn't say I want to abandon anything git is doing as much as evolve it. Objects need to be able to contain syntax tree nodes, and patches need to be able to target changes to particular locations in a syntax tree instead of just by line/col.
ongy 11 hours ago
An AST is a tree as much as the directory structure currently encoded in git.
It shouldn't be hard to build a bijective mapping between a file system and an AST.
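One direction of that mapping really is only a few lines; a toy Python sketch (leaf tokens are dropped here, so a true bijection would need more bookkeeping than this):

    import ast, os

    def dump(node, path):
        """One directory per AST node; field names become child dir names."""
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "kind"), "w") as f:
            f.write(type(node).__name__)
        for name, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            for i, child in enumerate(children):
                if isinstance(child, ast.AST):
                    dump(child, os.path.join(path, f"{name}.{i}"))

    dump(ast.parse("x = f(1)"), "/tmp/ast-as-dirs")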
conartist6 9 hours ago
Right, but for what purpose? I don't see much gain, and now you're left trying to fit a square peg into a round hole. The git CLI would be technically working, but not practically useful. Same with an IDE: if you checked the files out you could technically open them but not easily change your program.
nylonstrung 12 hours ago
Trustfall seems really promising for querying files as if they were a DB.
gfody 11 hours ago
I've had this idea too, and I think about it every time I'm on a PR full of whitespace/non-functional noise: how nice it would be if source code weren't just text and I could look at a cleaner, higher-level diff instead. I think you have to go higher than an AST though; it should at least be language-aware.
gritzkoop 11 hours ago
(Author) In my current codebase, I preserve the whitespace nodes, though whitespace changes do not affect the other nodes. My first attempt to recover whitespace algorithmically didn't exactly fail; it's more that I was unable to verify it was OK enough. We clang-format or go fmt the entire thing anyway, and whitespace changes are mostly noise, but I have not found a 100%-sure approach yet.
gfody an hour ago
I think about, e.g., the "using" section at the top of a .cs file, where order doesn't matter and it's common for folks to use the "Remove and Sort Usings" feature in VS. If that were modeled as a set, then diffs would consist only of added/removed items, and a re-ordering wouldn't even be representable. And then there's every other manner of refactor that noises up a PR: renaming stuff, moving code around, etc. In my fantasies, some perfect high-level model would separate everything that matters from everything that doesn't, and when viewing PRs or change history we could tick "ignore superficial changes" to cut through all the noise when looking for something specific.
To my mind such a thing could only be language-specific, and the model for C# is probably something similar to Roslyn's interior (it keeps "trivia" nodes separate but still models the using section as a list for some reason). Having it all in a queryable database would be glorious for change analysis.
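The using-section case really is as small as it sounds; a trivial Python sketch (file contents invented for illustration):

    # Model the using block as a set: reorderings produce no semantic diff.
    before = {"using System;", "using System.IO;", "using System.Linq;"}
    after  = {"using System.IO;", "using System;", "using System.Linq;"}

    added, removed = after - before, before - after
    print(added, removed)  # set() set() -- "Remove and Sort Usings" is invisible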
zelphirkalt 11 hours ago
Some languages are unfortunately whitespace-sensitive, so a generic VCS cannot discard whitespace at all. But maybe the diffing tools themselves could be made language-aware and hide non-meaningful changes.
em-bee 6 hours ago
hiding non-meaningful changes is not enough. when a block in python changes indentation, i want to see that the block is otherwise unchanged. so indentation changes simply need to be marked differently. if a tool can do that, then it will also work with code where indentation is optional, allowing me to cleanly indent code without messing up the diff.
i saw a diff tool that marked only the characters that changed. that would work here.
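a minimal sketch of the classification i mean (a real tool would have to align the line pairs first):

    # classify a changed line pair; indent-only changes get their own marker
    def classify(old, new):
        if old == new:
            return "same"
        if old.strip() == new.strip():
            return "indent-only"  # render dimmed, not as a real change
        return "changed"

    print(classify("    x = 1", "        x = 1"))  # indent-only
    print(classify("    x = 1", "    x = 2"))      # changed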
procaryote 11 hours ago
You can build a mergetool (https://git-scm.com/docs/git-mergetool)
DannyBee 11 hours ago
Somebody call the VisualAge and ClearCase folks and let them know their time has come again!
guerrilla 11 hours ago
> clear case
Please no.
DannyBee 10 hours ago
What, you don't like 6-hour code checkouts of 10-megabyte projects?
All you had to do was access it over VPN and you could go on vacation because nobody would expect you to get anything done!
guerrilla 9 hours ago
That's what I remember most about it: how much it lowered expectations on us. We barely ever worked because most of the time we couldn't. It was always broken (not just slow). This became a part of the culture where management didn't even care what we did. They knew eventually things would get done (late) but we'd also have to wait. That was just part of it.
DannyBee 9 hours ago
I agree. I will say the slowness I could kind of understand: just poor network protocol optimization and the like on their part. Not that I excused it, but I could at least understand how you get there: you are always on a fast network, so you just don't notice what it's like when you aren't.
The brokenness was always what got me. I had to believe either they had no unit tests, or thousands of them were failing and they released it anyway. Because it was so fragile, it would have been impossible to test it and not notice easily broken things.
whazor 11 hours ago
A big challenge will be the unfamiliarity of such a new system. Many people have found that coding agents work really well with terminals, unix tooling, and file systems. It's proven tech.
Whereas doing DB queries to navigate code would be quite unfamiliar.
benrutter 12 hours ago
> The monorepo problem: git has difficulty dividing the codebase into modules and joining them back
Can anyone explain this one? I use monorepos every day, and although tools like pre-commit can get a bit messy, I've never found git itself to be the issue.
gritzkoop 11 hours ago
Based on my personal experience, big-corp monorepos have all the features of a black hole: they try to suck in all the existing code (vendor it), and once some code starts living in a monorepo, there is no way to separate it again, as it becomes entangled with the entire thing. There is no way to "blend it in", so to say.
This subject deserves a study of its own, but big-big-tech tends to use things other than git.
ongy 11 hours ago
That black hole behavior is a result of corporate processes, though.
Not a result of git.
Business continuity (no uncontrolled external dependencies), and corporate security teams wanting to be able to scan everything. Also wanting to update everyone's dependencies when they backport something.
Once you've got those requirements, most of the benefits of multi-repo / round-tripping over releases just don't hold anymore.
The entanglement can be stronger, but if teams build clean APIs, it's no harder than removing code from a cluster of individual repositories. That might be a pretty load-bearing "if", though.
rbsmith 10 hours ago
> there is no way to separate it
There is
git subtree --help
It may get complicated at the edges, as files move across boundaries. It's a hard problem. And for some subset of those problems, subtree does give a way to split and join.
conartist6 11 hours ago
Let's say I want to fork one of your monorepo modules and maintain a version for myself. It's hard. I might have to fork 20 modules, and 19 will be unwanted. They'll be deleted, go stale, or I'll have to do pointless work to keep them up to date. Either way, the fork-and-merge model that drives OSS value creation is damaged when what should be small, lightweight, focused repos are permanently chained to the weight of arbitrary other code, which, from the perspective of the one thing I want to work on, is dead weight.
You can also just tell that monorepos don't scale, because if you keep consolidating over many generations, eventually all the code in the world would be in just one or two repos. Those repos would then be so massive that breaking off a little independent piece to work on would be quite crucial to being able to make progress.
That's why the alternative to monorepos is multirepos. Git handles multirepos with its submodules feature. Submodules are a great idea in theory, offering git repos the same level of composability in your deps that a modern package manager offers. But unfortunately, submodules are so awful in practice that people cram all their code into one repo just to avoid having to use the submodule feature for the exact thing it was meant for...
zelphirkalt 11 hours ago
Hm, I never had many issues with submodules. It seems to be something that, once one has understood it, one can use; it might just seem scary at first, and one needs to know that a repo uses submodules at all.
conartist6 9 hours ago
Submodule-based multirepos still have a tiny fraction of the adoption that monorepos do. Tooling support is quite poor by comparison.
EPWN3D 7 hours ago
I think a lot of people misunderstand what git really is for large software organizations. It's not their version control system. It's the set of primitives on top of which their version control is built.
Monorepos are one such VCS, which personally I don't like, but that's just me. Otherwise there are plenty of large organizations that manage lots of git repositories in various ways.
Replacing git is a lot like saying we should replace Unix. Like, yeah, it's got its problems, but we're kind of stuck with it.
dist-epoch 11 hours ago
For example, there is no easy way to create a "localized" branch, one that is only allowed to change files in /backend, such that the agent doesn't randomly modify files elsewhere. That way you could let an agent loose on some subpart of the monorepo and not worry that it will break something unrelated.
You can do all kinds of workarounds and sandboxes, but it would be nice for git to support more modularity.
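One of those workarounds, for the record: a hypothetical pre-commit hook (the path and setup are illustrative, and an agent could still bypass it with --no-verify, which is exactly why built-in support would be nicer):

    #!/usr/bin/env python3
    # .git/hooks/pre-commit: reject staged files outside the allowed subtree.
    import subprocess, sys

    ALLOWED = "backend/"  # the only prefix this branch's agent may touch

    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True).stdout.splitlines()

    outside = [p for p in staged if not p.startswith(ALLOWED)]
    if outside:
        sys.exit(f"commit blocked; files outside {ALLOWED}: {outside}")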
zelphirkalt 10 hours ago
Sounds like a submodule, with push-access restrictions managed at the repo level of the submodule.
PunchyHamster 11 hours ago
The author doesn't know how to use git or how git works.
If he knew how to use it, he'd be annoyed at some edge cases.
If he knew how it works, he'd know the storage subsystem is flexible enough to implement any kind of new VCS on top of it. The storage format doesn't need to change to improve or replace the user-facing part.
gritzkoop 11 hours ago
Joe Armstrong had a beautiful LEGO vs. Meccano metaphor. Both things are cool and somewhat similar in their basic idea, but you cannot do with LEGO what you can do with Meccano, and vice versa. Also, you cannot mix them.
zelphirkalt 10 hours ago
But you could create some parts that enable you to combine them easily. Which is what you could do with software: write an adapter of sorts.
fragmede 11 hours ago
Sure you can. Hot glue, E6000, duct tape. This is to say, git's pack format has its shortcomings.
procaryote 11 hours ago
One fundamental, deal-breaking problem with structure-aware version control is that your VCS now needs to know all the versions of all the languages you're writing in. It gets non-trivial fast.
conartist6 8 hours ago
It does! So an extensible parser definition is another key piece of the technological puzzle, along with common formats for parse trees.
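For a flavor of what a common parse tree format buys you (a made-up minimal shape for illustration, not BABLR's actual format): language-specific parsers emit it, and tooling only ever consumes it.

    # A made-up minimal interchange shape; one schema for every language.
    node = {"lang": "python", "type": "Call", "children": [
        {"lang": "python", "type": "Name", "text": "print"},
        {"lang": "python", "type": "String", "text": "'hi'"},
    ]}

    def leaves(n):
        """A language-agnostic query: works on any tree in this shape."""
        return [n["text"]] if "text" in n else sum(map(leaves, n["children"]), [])

    print(leaves(node))  # ["print", "'hi'"]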
stared 11 hours ago
To me, git works. And LLMs understand it, unlike some yet-to-come tool.
If you create a new tool for version control, go for it. Then see how it fares in benchmarks (for end-to-end tools) or vox populi: whether people use your new tool/skill/workflow.
mtsolitary 12 hours ago
I’ve recently been thinking about this too, here is my idea: https://clintonboys.com/projects/lit/
ferroman 11 hours ago
Using a prompt as the source of truth is not reliable, because outputs for the same prompt may vary.
mtsolitary 11 hours ago
Of course. But if this is the direction we're going in, we need to try and do something, no? Did you read the post I linked?
fragmede 11 hours ago
Fascinating! https://github.com/clintonboys/lit-demo-crud isn't deep enough to really get a feel for how well this would work in practice.
gethly 11 hours ago
> Like everyone else with a keyboard, a brain and a vested interest in their future as a programmer, I’ve been using AI agents to write a lot of code recently.
alrighty then..
solarized 12 hours ago
Talk is cheap. Show me the code.
deafpolygon 11 hours ago
No, we don’t. The genius of git is in its simplicity (while also being pretty damn complicated too).
sublinear 12 hours ago
Missing 4 out of 5 parts. The attempt to wrest control away from the open source community doesn't even have the effort to keep going.
bonzini 12 hours ago
The gist was created 1 hour before your comment.
gritzkoop 12 hours ago
Yes, the parts will keep coming, this weekend and next week.
forrestthewoods 12 hours ago
Definitely agree that Git is a mediocre-at-best VCS tool. This has always been the case, but LLMs are finally forcing the issue. It's a shame a whole generation of programmers has only used Git/GitHub and thinks it's good.
Monorepo and large-binary-file support is the way. A good cross-platform virtual file system (VFS) is necessary; a good open-source one doesn't exist today.
Ideally it comes with a copy-on-write system for cross-repo blob caching, but I suppose that's optional. It would let you commit toolchains for open-source projects, which is a dream of mine.
Not sure I agree that LSP-like features need to be built in. That feels wrong; that's just a layer on top.
I do think that agent prompts/plans/summaries need to be a first-class part of commits/merges. Not sure of the full set of features required here.
JodieBenitez 12 hours ago
> It’s a shame a whole generation programmers has only used Git/GitHub and think it’s good.
Well... I used SVN before that and it was way worse.
ryanm101 11 hours ago
Clearly the poster has never encountered CVS or cvs-next.
And clearly OP hasn't heard of VSS..
forrestthewoods 44 minutes ago
I have worked with all those and more.
JodieBenitez 10 hours ago
Where I worked, before SVN we didn't even use any VCS. Most of us were not even familiar with the concept.
fragmede 11 hours ago
RCS, anyone?
EliRivers 11 hours ago
Continuing the theme: a new starter at my place (with about a decade of varied experience, including at an international financial-services information player whose name is well known) had never used git, or indeed any modern, distributed source control system.
HN is a tiny bubble. The majority of the world's software engineers are barely using source control, don't do code reviews, don't have continuous build systems, don't have configuration-controlled release versions, and don't do almost anything that most of HN's visitors consider basic table stakes just to conduct software engineering.
dist-epoch 11 hours ago
> It lets you commit toolchains
Wasn't this done by IBM in the past? Rational Rose something?
DannyBee 11 hours ago
ClearCase.