Ratelman2 days ago
So Minimax just "open-sourced" their model (I put that in quotes because they have a custom license I haven't read through), and it has a context length of 4 million tokens and scored 100% on the needle-in-a-haystack test. It uses lightning attention - so still attention, just a variation? So is this paper potentially not as groundbreaking as its authors hoped, or am I missing something fundamental? Can it scale better? Does it train more efficiently? The test-time inference is amazing - is that what sets this apart, and not necessarily the long-context capability? Will it hallucinate a lot less because it stores long-term memory more efficiently, and thus won't make up facts but rather use what it has remembered in context?
ttula day ago
No, Titans introduces a new "learning memory" module that is actually trained at test time on the input, helping it remember the right bits of that input so the separate sequence-to-sequence model generates better tokens. This is far different from in-context learning, which simply relies on the autoregressive operation of the transformer model to feed output back into input.
The memory works by tracking two kinds of "surprise" - immediate surprise (how unexpected is the current token?) and accumulated surprise (what patterns of unexpected things have we been seeing?). It uses these to decide what's worth remembering and what can be forgotten. What's clever is that they formulated this as a gradient-descent problem that can run efficiently in parallel despite being inherently sequential.
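To make that concrete, here is a rough sketch of one memory-update step as I understand it (my own PyTorch sketch, not the authors' code; the tiny two-layer MLP and the fixed gate values eta/theta/alpha are assumptions, since in the paper those gates are data-dependent):

    import torch

    def mlp(params, x):                                   # the long-term memory M
        W1, b1, W2, b2 = params
        return torch.relu(x @ W1 + b1) @ W2 + b2

    def memory_step(params, momentum, x_t, W_K, W_V, eta=0.9, theta=0.1, alpha=0.01):
        params = [p.detach().requires_grad_(True) for p in params]
        k_t, v_t = x_t @ W_K, x_t @ W_V                   # pre-learned key/value projections
        loss = ((mlp(params, k_t) - v_t) ** 2).sum()      # how badly does memory map k_t -> v_t?
        grads = torch.autograd.grad(loss, params)         # its gradient = the "immediate surprise"
        momentum = [eta * m - theta * g                   # "accumulated surprise": decayed running
                    for m, g in zip(momentum, grads)]     # mix of past surprise plus the new one
        params = [(1 - alpha) * p.detach() + m            # forget gate slowly erases old memory,
                  for p, m in zip(params, momentum)]      # then the surprise gets written in
        return params, momentum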
The really interesting part is how it integrates with the main model. They tried three approaches, but the most effective was using the memory as additional context tokens alongside the input, which lets the attention mechanism figure out for itself when to use the memory versus the immediate context. And because the memory tokens are injected both at test time and during training, the memory module and the main model are trained together, even though the memory module's weights keep changing ("unfrozen") during each inference.
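And a sketch of that "memory as additional context" wiring, continuing the hypothetical code above (again my own guess at the shape of it; attend stands for any causal self-attention over a (tokens, dim) tensor, persistent is a small set of learned task tokens, and all dimensions are assumed to line up):

    def mac_block(segment, params, momentum, persistent, attend, W_Q, W_K, W_V):
        h = mlp(params, segment @ W_Q)                    # read from long-term memory
        extended = torch.cat([persistent, h, segment])    # memory tokens sit next to the input
        y = attend(extended)[-segment.shape[0]:]          # attention decides what to actually use
        params, momentum = memory_step(params, momentum, segment, W_K, W_V)   # then memorize
        return y, params, momentum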
In practice, this lets them handle sequences over 2M tokens long while outperforming traditional transformers, even matching GPT-4 on some long-context reasoning tasks with far fewer parameters. It's a neat example of combining classical ideas about online learning with modern deep learning architectures.
The code isn't released yet, but the paper suggests the implementation is relatively straightforward since it builds on standard gradient-descent mechanics. It'll be interesting to see if this approach influences the next generation of open-source LLMs. I'm sure we will see implementations very soon, even though it may take some time for open-source models to be trained with this new architecture.
I'm very excited to know whether Gemini 2.0 1206 Experimental is using this new architecture. I suspect it is.
timlarshansona day ago
I doubt it. This does not seem to be a particularly well-written or well-thought-out paper -- e.g. equations 6 and 7 contradict their descriptions in the sentence below them, and the 'theorem' is just an assertion.
After reading it a few times, I gather that, rather than kernelizing or linearizing attention (which has been thoroughly explored in the literature), they are using an MLP to do run-time modelling of the attention operation. If that's the case (?), which is interesting, sure: 1 -- Why did they not say this plainly? 2 -- Why does eq. 12 show the memory MLP being indexed by the key, whereas eq. 15 shows it indexed by the query? 3 -- What's with all the extra LSTM-esque forget and remember gates? Meh. I wouldn't trust it without ablations.
I guess if an MLP can model a radiance field (NeRF) well, it stands to reason it can approximate attention too. The Q, K, V projection matrices will need to be learned beforehand using standard training.
While the memory and compute savings are clear, it's uncertain whether this helps with reasoning, or with generalization thereof. I doubt that too.
andy12_11 hours ago
Eq. 12 is a loss function used to associate a given key and value in the memory MLP via test-time training with gradient descent.
Eq. 15 is simply the operation that queries a value previously inserted by earlier tokens via eq. 12.
Basically, for each autoregressively processed segment you do:
1) Test-time inference: query values from memory with eq. 15.
2) Test-time training: associate new keys and values into the memory with the loss from eq. 12.
The forget and remember gates are there because... well, the architecture in general is very similar to an LSTM, but it uses test-time gradient descent to decide what to insert into the long-term memory.
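A toy version of those two steps, stripped down to a plain linear memory instead of the paper's MLP (my own illustration, not code from the paper):

    import numpy as np

    d, lr = 16, 0.5
    M = np.zeros((d, d))                          # the memory "weights"
    rng = np.random.default_rng(0)
    k = rng.standard_normal(d); k /= np.linalg.norm(k)
    v = rng.standard_normal(d)

    for _ in range(100):                          # step 2: test-time training on the
        grad = np.outer(M @ k - v, k)             # eq. 12-style loss ||M(k) - v||^2 / 2
        M -= lr * grad

    q = k                                         # step 1, for some later segment:
    print(np.allclose(M @ q, v))                  # eq. 15-style read; prints True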
timlarshanson9 hours ago
Ok, thanks for the clarification.
Seems the implicit assumption, then, is that M(q) -> v 'looks like' or 'is smooth like' the dot product, otherwise 'train on keys, inference on queries' wouldn't work? (Safe assumption imo with that l2 norm, and in general; unsafe if q and k are from different distributions.)
Correct me if I'm wrong, but typically k and v are generated via affine projections K, V of the tokens; so if M is matrix-valued and there are no forget and remember gates (to somehow approximate the softmax?), then M = V K^-1.
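Spelling out the linear case being described here (my rendering of this comment's reasoning, not something from the paper): if the memory were a plain matrix M trained to perfection on keys k_t = K x_t and values v_t = V x_t, then

    M k_t = v_t \ \forall t
    \;\Rightarrow\; M K x = V x \ \text{for all tokens } x
    \;\Rightarrow\; M = V K^{-1} \quad \text{(when } K \text{ is square and invertible)}

which would just be a fixed linear map; presumably the MLP and the gates are what make the memory more expressive than that.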
ljlolela day ago
The paper has ablations
anon373839a day ago
The needle in the haystack test is unfortunately not a very good measure of long-context performance. It turns out that searching for isolated pieces of information is a lot less demanding than synthesizing the entire context to solve a task.
iamnotageniusa day ago
Sadly, the LLM itself is not impressive. Not creative, not exactly the smartest. Good, but not great.
gwerna day ago
Their creative-writing dataset seems to have been literally written by ChatGPT: https://www.reddit.com/r/LocalLLaMA/comments/1i1a88y/minimax...
marmaduke2 days ago
Similar to RWKV-7's new (sub-quadratic) attention mechanism, which models key-values as v ≈ kS' and does an in-context descent on ||v - kS'||^2/2 (where the state matrix S is one attention head), explained more by the author here: https://raw.githubusercontent.com/BlinkDL/RWKV-LM/main/RWKV-...
And I tried to unpack it a bit here: https://wdmn.fr/rank-1-take-on-rwkv7s-in-context-learning/
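For the curious, a tiny numpy sketch of that in-context descent step in the same notation (my own; the learning rate is made up, and the real RWKV-7 update also layers decay/gating on top):

    import numpy as np

    def state_update(S, k, v, lr=1.0):
        err = v - k @ S.T                         # residual of the v ≈ k S' model
        return S + lr * np.outer(err, k)          # rank-1 gradient step on ||err||^2 / 2

    S = np.zeros((4, 4))
    k = np.array([1.0, 0.0, 0.0, 0.0])
    v = np.array([1.0, 2.0, 3.0, 4.0])
    S = state_update(S, k, v)
    print(np.allclose(k @ S.T, v))                # True: the state now maps k to v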
amai2 days ago
I wonder why the authors felt they need to use drop caps in this paper. It is a distraction and seems to value style over content.
fancyfredbot2 days ago
The drop caps spell titan, which they probably found entertaining. It made me smile anyway.
331c8c712 days ago
They were not-so-secretly hoping their paper would go straight into the history books :) One could check the authors' other papers to verify.
NotAnOttera day ago
This was the impression I gathered as well. It seemed the authors had already decided this would revolutionize NNs before it had even gone through review.
They have several other "We took the best of both worlds" type papers.
bansuian2 days ago
From the title I thought this was talking about cramming the night before an exam. ;-) Or if it’s an open book exam learning during the exam as one goes through the textbook.
suninsight2 days ago
Key questions:
1. The key data point seems to be Figure 6a, where it compares performance on BABILong and claims Titans is at ~62%, compared to GPT-4o-mini at ~42%, at 100k sequence length.
However, GPT-4o and Claude are missing from this comparison - maybe because they perform better?
2. There is no example provided of the Neural Memory Module in action. That is the first question I would ask of this paper.
ttula day ago
This paper was written by a very small team at Google. It strikes me as similar in that regard to the original Transformer paper. If this technique scales well, Google is no doubt already exploiting it for their next-generation models -- and I think there are signs that the Gemini 2.0 models already do.
tigershark2 days ago
The biggest model they used has only 760M parameters, and it outperforms models an order of magnitude larger.
NotAnOttera day ago
Gah dmn
groceryheist2 days ago
Is it just me, or does this seem like big news?
[deleted]2 days ago
331c8c712 days ago
Same here (saw it yesterday), but I haven't parsed the technical details so far, tbh.
quotemstr2 days ago
A lot of ML papers that sound revolutionary end up being duds
casey219 hours ago
all*
ttula day ago
This is big news.
OutOfHere2 days ago
What irks me is when authors only use a needle-in-a-haystack test to assess long-context performance. Humans do a lot more than this when working with a large context: they repeatedly go back and forth over parts of it; it's not a simple single pass.
PunchTornado2 days ago
If this was that good, why would Google release it?
antin0dea day ago
You won't believe how many groundbreaking papers Google has released for AI research in the past 10-20 years.
We just take it for granted.
sangnoira day ago
What Has Google Ever Done for AI Research?
NotAnOttera day ago
Huh? Satire?
"Attention Is All You Need" ring any bells?
sangnoira day ago
It's a reference to a Monty Python sketch - What Have the Romans Ever Done For Us? https://youtu.be/Qc7HmhrgTuQ?si=M_0VYwJBXwkzL1qV
pizzaa day ago
They want to find more researchers who are motivated to build upon it for the much-improved, private next version.
cubefox15 hours ago
I assume: 1) They work at Google Research, not Google DeepMind. The latter seems to be more focused on developing AI products now, while Google Research still seems to be quite research-oriented. 2) Google is overall much more liberal than OpenAI about letting its researchers publish their results. I don't think this is necessarily rational from a business perspective, where you get a disproportionate payoff if you are a step ahead of everyone else; there is a reason why OpenAI has outpaced Google in terms of AI. Maybe those Google researchers were hired under the condition that they would be allowed to publish their results?