Round and Round We Go! What makes Rotary Positional Encodings useful?

  • Published Dec 2, 2024

Comments • 3

  • @BinarySplit • several months ago • +3

    Firstly: Congrats on your Cottention preprint!
    Thanks for covering this paper. It flew under my radar, but I've been looking for an analysis just like this. I've long suspected that most attention heads care more about semantic than positional information, and that RoPE has only been successful because bad handling of positional information hurts more than bad handling of semantic information. At worst, RoPE forces purely semantic key/query features to be split over 2 channels instead of 1 so that they can be made rotationally invariant. Unfortunately, this wasn't tested. With Gemma-2B's enormous head dim of 256 (vs 128 in Llama, 96 in Phi-3.5-mini, and 64 in Falcon-7B and TinyLlama-1.1B), it's one of the models least sensitive to this channel-redundancy effect.
    The p-RoPE test has an unfortunate flaw: they truncate frequencies instead of redistributing them. The 0.25-RoPE test ends up with a maximum remaining wavelength of around 2500, so even its lowest-frequency component loops before the end of the 8k context! Positional confusion from this might explain the worse result (see the wavelength sketch below this comment).
    I'd also love to see an analysis of a network with learnable RoPE frequencies. However, another perspective is that hybrid SSM+Attention architectures get almost no benefit from RoPE. Jamba didn't need it at all, and GoldFinch only needed it to work around a post-training position extrapolation issue in RWKV. RoPE might be obviated soon if hybrid architectures catch on.
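
A quick numerical sketch of the wavelength argument above. The constants and the truncation scheme are illustrative assumptions (standard RoPE base of 10000, Gemma's head dim of 256, and a 0.25-style split that simply drops the lowest frequencies), not the paper's exact setup:

    import numpy as np

    HEAD_DIM = 256    # Gemma-2B's head dimension, as mentioned above
    BASE = 10000.0    # standard RoPE base
    CONTEXT = 8192    # 8k context length

    # Rotation pair i turns with angular frequency BASE**(-2i/HEAD_DIM),
    # so it needs 2*pi / freq positions to complete one full revolution.
    pair_idx = np.arange(HEAD_DIM // 2)
    freqs = BASE ** (-2.0 * pair_idx / HEAD_DIM)
    wavelengths = 2.0 * np.pi / freqs

    # Pairs whose wavelength exceeds the context never wrap around, so they can
    # serve as stable, position-insensitive ("semantic") channels.
    print(f"{(wavelengths > CONTEXT).sum()} of {len(wavelengths)} pairs never complete a full turn in {CONTEXT} tokens")

    # Hypothetical 0.25-style truncation: keep only the top quarter of frequencies
    # (drop the low-frequency pairs) without redistributing the band.
    kept = wavelengths[: HEAD_DIM // 8]
    print(f"longest surviving wavelength: {kept.max():.0f} tokens, "
          f"so even the slowest kept pair wraps many times within {CONTEXT} tokens")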

    • @gabrielmongaras • several months ago • +1

      Thank you! Happy to finally have released the paper.
      Looks like your intuition was spot on! I think it makes a lot of sense for heads to use either semantic or positional information. It does seem like RoPE was kind of just thrown on the model without any tests at all. This paper is kind of the first I've seen that analyzes it.
      On a first read-through of the paper, I was thinking p-RoPE would run into issues, since truncation is usually not the best option. I'm thinking the authors weren't really going for a novel positional-embedding technique; rather, they just wanted to explore RoPE a little and threw p-RoPE in to see if it would work. There's probably a much better way to do PEs given that heads are either mostly semantic or mostly positional.
      In general, I think the idea of PEs is a bit odd. At first it makes a lot of sense: you need to turn a set into a sequence. But the way it's done is odd, and I've always disliked PEs because they kind of force the model to overfit to a specific context window, which leads to bad extrapolation outside that window. Models like Mamba have positional encoding baked in since they are RNNs, which inherently have a notion of position. I like the direction models are going with hybrid architectures, and I agree: I think RoPE is going to no longer be needed, given these architectures don't use it at all. A study into how positional information is utilized in hybrid models could be interesting, though. Do these hybrid models learn similar heads that are semantic/positional?
      Tried some tests with learnable RoPE this weekend. Didn't get much out of it, but throwing it here nonetheless: github.com/gmongaras/Learnable_Rotary_Embeddings
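
For readers following along, here is a minimal sketch of one way to make RoPE frequencies learnable: store the per-pair log inverse frequencies as a trainable parameter initialized at the standard spectrum. The `LearnableRoPE` class, shapes, and parameterization are illustrative assumptions, not the code in the linked repo:

    import torch
    import torch.nn as nn

    class LearnableRoPE(nn.Module):
        """Rotary embedding whose per-pair frequencies are trainable."""

        def __init__(self, head_dim: int, base: float = 10000.0):
            super().__init__()
            pair_idx = torch.arange(head_dim // 2, dtype=torch.float32)
            # Learn the frequencies in log space (so they stay positive),
            # initialized at the standard RoPE spectrum base**(-2i/head_dim).
            init = -2.0 * pair_idx / head_dim * torch.log(torch.tensor(base))
            self.log_inv_freq = nn.Parameter(init)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: queries or keys of shape (batch, seq_len, n_heads, head_dim)
            seq_len = x.shape[1]
            pos = torch.arange(seq_len, device=x.device, dtype=torch.float32)
            angles = pos[:, None] * torch.exp(self.log_inv_freq)[None, :]  # (seq_len, head_dim // 2)
            cos = angles.cos()[None, :, None, :]
            sin = angles.sin()[None, :, None, :]
            x1, x2 = x[..., 0::2], x[..., 1::2]  # channel pairs rotated together
            out = torch.empty_like(x)
            out[..., 0::2] = x1 * cos - x2 * sin
            out[..., 1::2] = x1 * sin + x2 * cos
            return out

Both queries and keys would pass through the same rotation before the attention scores are computed, so relative position still enters only through the difference in rotation angles.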

    • @minecraftermad • several months ago

      @gabrielmongaras I was thinking something along the lines of a learnable Gaussian PE.