Firstly: Congrats on your Cottention preprint!
Thanks for covering this paper. It flew under my radar, but I've been looking for an analysis just like this. I've long suspected that most attention heads care more about semantic than positional information, and that RoPE has only been successful because bad handling of positional information hurts more than bad handling of semantic information. At worst, RoPE forces purely-semantic key/query features to be split over 2 channels instead of 1 so that they stay rotationally invariant. Unfortunately this wasn't tested. With Gemma-2B's enormous head dim of 256 (vs. 128 in Llama, 96 in phi-3.5-mini, and 64 in Falcon-7B and TinyLlama-1.1B), it's one of the least sensitive models to this channel-redundancy effect.
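To make the channel-pairing point concrete, here's a toy numpy sketch of my own (arbitrary frequency and vectors, nothing from the paper). The post-RoPE score inside a single 2D pair depends on the relative offset and on the joint geometry of that pair, so a match can't live in one channel alone:

```python
import numpy as np

def rot(theta):
    # 2D rotation matrix, i.e. what RoPE applies to one channel pair at angle theta
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

freq = 0.1                        # one RoPE frequency (arbitrary value)
q = np.array([1.0, 0.5])          # query features for this channel pair
k = np.array([0.8, -0.2])         # key features for this channel pair

for m, n in [(10, 3), (110, 103), (50, 3)]:
    score = (rot(freq * m) @ q) @ (rot(freq * n) @ k)
    print(m - n, round(score, 4))

# The two cases with m - n = 7 give identical scores; the third differs even
# though q and k never changed, so position is entangled with the content match.
```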
The p-RoPE test has an unfortunate flaw: they truncate frequencies instead of redistributing them. The 0.25-RoPE test has a maximum wavelength of 2500, so even the lowest-frequency component loops before the end of the 8k context! Positional confusion from this might explain the worse result.
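For anyone who wants to check the arithmetic, a quick back-of-envelope script (my numbers, assuming the usual θ_i = base^(-2i/d) schedule with base 10000 and Gemma's head dim of 256; the paper's exact p-RoPE cut may differ):

```python
import math

base, head_dim, ctx = 10_000.0, 256, 8_192
wavelengths = [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

looping = sum(w < ctx for w in wavelengths)
print(f"{looping}/{len(wavelengths)} channel pairs complete a full cycle within {ctx} tokens")
print(f"longest wavelength: {wavelengths[-1]:.0f} tokens")

# Truncating frequencies only removes entries from this list; redistributing them
# could instead keep at least one component whose wavelength exceeds the context.
```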
I'd also love to see an analysis of a network with learnable RoPE frequencies. However, another perspective is that hybrid SSM+Attention architectures get almost no benefit from RoPE. Jamba didn't need it at all, and GoldFinch only needed it to work around a post-training position extrapolation issue in RWKV. RoPE might be obviated soon if hybrid architectures catch on.
Thank you! Happy to finally have released the paper.
Looks like your intuition was spot on! I think it makes a lot of sense for heads to use either semantic or positional information. It does seem like RoPE was more or less just thrown onto the model without any real testing. This paper is the first I've seen that actually analyzes it.
On a first read-through of the paper, I was also thinking p-RoPE would run into issues, since truncation is usually not the best option. My guess is the authors weren't really going for a novel positional embedding technique; they just wanted to explore RoPE a little and threw p-RoPE in to see if it would work. There's probably a much better way to do PEs given that heads are either mostly semantic or mostly positional.
In general, I think the idea of PEs is a bit odd. At first it makes a lot of sense: you need to turn a set into a sequence. But the way it's done is odd, and I've always disliked PEs because they kind of force the model to overfit to a specific context window, which leads to bad extrapolation outside that window. Models like Mamba have positional encoding baked in since they are RNNs, which inherently have position. I like the direction models are going with hybrid architectures, and I agree: I think RoPE is no longer going to be needed, given that these architectures don't rely on it at all. A study into how positional information is utilized in hybrid models could be interesting though. Do these hybrid models learn similar heads that are semantic/positional?
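As a tiny illustration of that set-vs-sequence point (a toy check of my own, nothing from the paper): unmasked attention with no PE doesn't care about token order at all, while even a plain RNN does, just from the order it consumes tokens in.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 6, 16)                 # (batch, seq, dim)
perm = torch.randperm(6)                  # shuffle the "sequence"

attn = torch.nn.MultiheadAttention(16, num_heads=2, batch_first=True)
rnn = torch.nn.GRU(16, 16, batch_first=True)

with torch.no_grad():
    a_out, _ = attn(x, x, x)              # no mask, no positional encoding
    a_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])
    r_out, _ = rnn(x)
    r_perm, _ = rnn(x[:, perm])

# Mean-pool over the sequence so the comparison ignores output ordering.
print(torch.allclose(a_out.mean(1), a_perm.mean(1), atol=1e-5))  # True: order-blind
print(torch.allclose(r_out.mean(1), r_perm.mean(1), atol=1e-5))  # False: order-aware
```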
I tried some tests with learnable RoPE this weekend. Didn't get much out of it, but throwing it here nonetheless: github.com/gmongaras/Learnable_Rotary_Embeddings
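For reference, a minimal sketch of the kind of thing I mean by learnable frequencies (not necessarily how that repo implements it): the inverse frequencies start at the standard log-spaced RoPE values but are registered as parameters, so gradients can move them.

```python
import torch

class LearnableRoPE(torch.nn.Module):
    def __init__(self, head_dim: int, base: float = 10_000.0):
        super().__init__()
        inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
        self.inv_freq = torch.nn.Parameter(inv_freq)   # learnable instead of a fixed buffer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, heads, head_dim); applied to q and k before the dot product.
        pos = torch.arange(x.shape[1], device=x.device).float()
        ang = pos[:, None] * self.inv_freq[None, :]             # (seq, head_dim/2)
        cos = ang.cos()[None, :, None, :]
        sin = ang.sin()[None, :, None, :]
        x1, x2 = x.chunk(2, dim=-1)                             # rotate-half convention
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```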
@gabrielmongaras I was thinking something along the lines of a learnable Gaussian PE.
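Something like each head getting a Gaussian bump over relative offsets, with a learnable centre and width, added to the attention logits. Rough sketch of what I have in mind (just the idea, untested):

```python
import torch

class GaussianPosBias(torch.nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(num_heads))         # preferred relative offset
        self.log_sigma = torch.nn.Parameter(torch.zeros(num_heads))  # log of the bump width

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len).float()
        rel = pos[None, :] - pos[:, None]                            # (seq, seq) relative offsets
        mu = self.mu[:, None, None]
        sigma = self.log_sigma.exp()[:, None, None]
        return -((rel[None] - mu) ** 2) / (2 * sigma ** 2)           # (heads, seq, seq) logit bias

# Usage: scores = q @ k.transpose(-2, -1) / head_dim**0.5 + bias(seq_len)
```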