attogram7 hours ago
"Attention Is All You Need" - I've always wondered if the authors of that paper used such a casual and catchy title because they knew it would be groundbreaking and massively cited in the future....
ruuda27 minutes ago
What about the converse: did the paper become so massively influential partly because of the catchy title? Of course the contents are groundbreaking, but that alone is not enough. A groundbreaking paper that nobody knows about cannot have any impact. Even in research, there is a marketing component to it.
sivm4 hours ago
Attention is all you need for what we have. But attention is a local heuristic. We have brittle coherence and no global state. I believe we need a paradigm shift in architecture to move forward.
radarsat127 minutes ago
To be fair it would be a lot easier to iterate on ideas if a single experiment didn't cost thousands of dollars and require such massive data. Things have really gotten to the point that it's just not easy for outsiders to contribute if you're not part of a big company or university, and even then you have to justify the expenditure (risk). Paradigm shifts are hard to come by when there is so much momentum in one direction and trying something different carries significant barriers.
ACCount372 hours ago
Plenty of "we need a paradigm shift in architecture" going around - and no actual architecture that would beat transformers at their strengths as far as the eye can see.
I remain highly skeptical. I doubt that transformers are the best architecture possible, but they set a high bar. And it sure seems like people who keep making the suggestion that "transformers aren't the future" aren't good enough to actually clear that bar.
airstrikean hour ago
That logic does not hold.
Being able to provide an immediate replacement is not a requirement to point out limitations in current technology.
treyd3 hours ago
Has there been research into some hierarchical attention model that has local attention at the scale of sentences and paragraphs that feeds embeddings up to longer range attention across documents?
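Roughly the shape I have in mind, as a toy sketch (not from any specific paper; the module names, mean-pooling choice, and shapes are all made up just to illustrate the idea):

    # Local attention within each sentence, mean-pool to one embedding per
    # sentence, then global attention across the sentence embeddings.
    import torch
    import torch.nn as nn

    class HierarchicalAttention(nn.Module):
        def __init__(self, dim: int = 64, heads: int = 4):
            super().__init__()
            self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, n_sentences, tokens_per_sentence, dim)
            b, s, t, d = x.shape
            tokens = x.reshape(b * s, t, d)
            local, _ = self.local_attn(tokens, tokens, tokens)  # within-sentence attention
            sent = local.mean(dim=1).reshape(b, s, d)           # pool to sentence embeddings
            doc, _ = self.global_attn(sent, sent, sent)         # long-range attention across sentences
            return doc

    x = torch.randn(2, 8, 16, 64)            # 2 docs, 8 sentences, 16 tokens each
    print(HierarchicalAttention()(x).shape)  # torch.Size([2, 8, 64])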
mxkopy2 hours ago
There’s the Hierarchical Reasoning Model (https://arxiv.org/abs/2506.21734) but it’s very new and largely untested.
Though honestly I don’t think new neural network architectures are going to get us over this local maximum. I think the next steps forward involve something that’s:
1. Non-lossy
2. Readily interpretable
miven2 hours ago
The ARC Prize Foundation ran extensive ablations on HRM for their slew of reasoning tasks and found that the "hierarchical" part of the architecture adds little: a vanilla transformer of the same size, with no extra hyperparameter tuning, performs about as well:
https://arcprize.org/blog/hrm-analysis#analyzing-hrms-contri...
ACCount37an hour ago
By now, I seriously doubt any "readily interpretable" claims.
Nothing about the human brain is "readily interpretable", and artificial neural networks - which, unlike brains, can be instrumented and experimented on easily - tend to resist interpretation nonetheless.
If there were an easy way to reduce ML to "readily interpretable" representations, someone would have done it already. If there were architectures that perform similarly but are orders of magnitude more interpretable, they would be used, because interpretability is desirable. Instead, we get what we get.
lucidrainsan hour ago
It is a reference to the Beatles song, mainly because Noam Shazeer is a music lover.
adastra226 hours ago
Definitely. I always assumed that, having been involved in writing similarly groundbreaking papers… or so we thought at the time. All my coauthors spent significant time thinking about what the best title would be, and strategies like that were common. (It ended up not mattering for us.)
iLoveOncall4 hours ago
I recommend this article on how to get your papers accepted; it argues that a catchy title is the #1 most important thing: https://maxwellforbes.com/posts/how-to-get-a-paper-accepted/ (not a plug, I just saved it because it was interesting)
hyperbovine4 hours ago
It sounds like a typical NeurIPS paper to me. And no, they didn't know what a big deal it would be, else Google never would have given the idea away.
JSR_FDED7 hours ago
Any way to read this without making an account?
qcnguy6 hours ago
Just click the x at the top right of the interstitial?
iLoveOncall4 hours ago
That only works for a few articles per month. But usually opening it in incognito does the trick.
djoldman3 hours ago
Just turn off JS.
mrtesthah8 hours ago
Do we know if any of these techniques are actually used in the so-called "frontier" models?
gchadwick4 hours ago
Who knows what the closed-source models use, but going by what's happening in open models, all the big changes and corresponding gains in capability are in training techniques, not model architecture. Things like GQA and MLA, as discussed in this article, are important techniques for getting better scaling, but they're relatively minor tweaks compared to the evolution in training techniques.
I suspect closed models aren't doing anything too radically different from what's presented here.
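For anyone unfamiliar with GQA, this is roughly the trick - a toy sketch with made-up shapes, just to show how several query heads share one key/value head (which is what shrinks the KV cache):

    import torch
    import torch.nn.functional as F

    def grouped_query_attention(q, k, v, n_groups):
        # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
        # with n_q_heads == n_kv_heads * n_groups
        k = k.repeat_interleave(n_groups, dim=1)  # each KV head serves n_groups query heads
        v = v.repeat_interleave(n_groups, dim=1)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ v

    q = torch.randn(1, 8, 32, 64)  # 8 query heads
    k = torch.randn(1, 2, 32, 64)  # only 2 KV heads cached
    v = torch.randn(1, 2, 32, 64)
    print(grouped_query_attention(q, k, v, n_groups=4).shape)  # torch.Size([1, 8, 32, 64])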
vinithavn017 hours ago
The model names are mentioned under each type of attention mechanism