Tensor Product Attention Is All You Need
(trying to move the critique beyond the title...)
When trying to deploy LLMs with larger context windows in constrained environments, two things start to hurt: a) the increased memory footprint of the longer KV cache, and b) slower decoding due to the longer context. This paper addresses a) only, which is useful, but we are still left with b) (right?)
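To put a rough number on a), here's a back-of-the-envelope KV-cache estimate for a vanilla multi-head attention model. The hyperparameters are assumed (LLaMA-7B-ish), not taken from the paper:

    # KV cache size ~= 2 (K and V) * layers * kv_heads * head_dim
    #                  * seq_len * batch * bytes per element.
    # Hyperparameters below are assumed, LLaMA-7B-like, fp16 cache.
    def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                       batch=1, dtype_bytes=2):
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

    for ctx in (4_096, 32_768, 128_000):
        print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
    # ->   4096 tokens ->  2.0 GiB
    # ->  32768 tokens -> 16.0 GiB
    # -> 128000 tokens -> 62.5 GiB

As I understand the paper, TPA shrinks the per-token term by caching low-rank factors of K and V instead of the full per-head vectors, which is where the claimed memory saving comes from.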
I really can't with these paper titles anymore, man.
For those of us who are lay people outside of machine learning and AI, what was the critical insight that made attention "all you need" in the original Transformer paper?
Tensor decomposition has traditionally suffered from high computational complexity. Is it an issue here?
If you don’t pay to read papers, you don’t get to complain about the titles, imo.
I hate ads, but I’m not paying for YouTube Premium either. That’s how it goes. I get ads.
> a novel attention mechanism
Why does every paper have to use the word "novel"? And the titles are getting crazier by the day.
I'm sorry but can people please stop naming their papers "X is all you need"? It's super annoying.
My kingdom for renaming this paper to something like "Tensor Product Attention is a Memory-Efficient Approach for Long-Sequence Language Modeling"