I want to point out that after training, the inner loop in the RNN still needs to be "optimized", since the gradient update rule is the RNN update rule itself. The normal NLL training procedure just trains the model on how to "teach" the internal RNN layers. This isn't normal optimization, though: it doesn't require an optimizer, since we know the functional form of the gradient w.r.t. the hidden state, so it's very efficient.
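For anyone who wants to see that concretely, here's a minimal sketch of the inner-loop update for a TTT-Linear-style layer (the dimensions, learning rate eta, and variable names are my own illustration, not the paper's code). The point is that the reconstruction loss has a closed-form gradient, so the test-time "optimization" is just a fixed arithmetic update per token:

```python
# Minimal sketch of the inner-loop update for a TTT-Linear-style layer.
# All names and sizes here are illustrative assumptions, not the paper's code.
import torch

d = 16                                 # feature dimension (assumed)
eta = 0.1                              # inner-loop learning rate (assumed)
theta_K = torch.randn(d, d) / d**0.5   # outer-loop ("meta") parameters
theta_V = torch.randn(d, d) / d**0.5
theta_Q = torch.randn(d, d) / d**0.5

W = torch.zeros(d, d)                  # hidden state = weights of the inner linear model
xs = torch.randn(10, d)                # a toy token sequence

outputs = []
for x_t in xs:
    k_t = theta_K @ x_t                # inner-loop "training input"
    v_t = theta_V @ x_t                # inner-loop "training label"
    # Reconstruction loss l(W; x_t) = ||W k_t - v_t||^2 has a closed-form gradient,
    # so no optimizer or autograd is needed at test time:
    grad_W = 2 * torch.outer(W @ k_t - v_t, k_t)
    W = W - eta * grad_W               # one gradient step = one RNN state update
    q_t = theta_Q @ x_t
    outputs.append(W @ q_t)            # output uses the freshly updated hidden state
```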
I really like seeing videos like this that are actually willing to explore the math and break it down to make it easier to catch onto how it actually works. Too many just discuss the abstract or skip past the math because it hampers audience maximization.
I understand that the inner loop updates W_t with the modified reconstruction loss, but how are theta_K, theta_V, and theta_Q updated in the outer loop? Specifically, what loss is related to these three parameters?
The "outer loop" uses the normal training strategy where we have the negative log likelihood or cross entropy loss for next token prediction. This gradient of this loss backpropogated to all layers like normal. So the "outer loop" is just the rest of the network.
@@gabrielmongaras But shouldn't the normal cross entropy loss for next token prediction produce no gradient flow to theta_K and theta_V? They are used in the reconstruction loss, so they are not directly involved in token prediction.
The theta params are trained via the next token prediction loss. This can be thought of as the outer model querying the inner model for information by changing the loss function with these theta params. I think the outer loop actually differentiates through the inner loop (including the gradient of the loss), so the theta_K and theta_V params are updated by the outer loop in this way. The only "param" that's trained using the inner loop is the hidden state.
@@gabrielmongaras Maybe a similar question to this; please help clarify. The inner loop updates W using the loss function in (4). Does that mean theta_K and theta_V are *not* updated based on the loss in (4)? I guess they aren't; otherwise, how would theta_Q be updated, since it doesn't appear in (4)? I guess the thetas are updated based on the other loss function, but this isn't written precisely in the paper.
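To make that gradient flow concrete, here's a toy sketch (all names and sizes are illustrative assumptions, not from the paper) showing that the next-token-prediction loss does reach theta_K and theta_V, because the outer loop backpropagates through the whole chain of inner-loop updates:

```python
# Sketch of why theta_K / theta_V / theta_Q receive gradients from the
# next-token-prediction loss: the outer loop backpropagates *through* the
# inner-loop updates. Everything here is an illustrative assumption.
import torch

d, T, vocab = 16, 10, 100
theta_K = torch.randn(d, d, requires_grad=True)
theta_V = torch.randn(d, d, requires_grad=True)
theta_Q = torch.randn(d, d, requires_grad=True)
head = torch.randn(vocab, d, requires_grad=True)   # toy output head
eta = 0.1

xs = torch.randn(T, d)                  # toy token embeddings
targets = torch.randint(0, vocab, (T,)) # toy next-token targets

W = torch.zeros(d, d)                   # hidden state (not a trained parameter)
logits = []
for x_t in xs:
    k_t, v_t, q_t = theta_K @ x_t, theta_V @ x_t, theta_Q @ x_t
    # The inner-loop gradient step is written in differentiable ops, so it
    # becomes part of the outer computation graph:
    W = W - eta * 2 * torch.outer(W @ k_t - v_t, k_t)
    logits.append(head @ (W @ q_t))

loss = torch.nn.functional.cross_entropy(torch.stack(logits), targets)
loss.backward()
# theta_K.grad and theta_V.grad are nonzero: they shaped every W_t, and every
# W_t shaped the logits, so the NTP loss reaches them through the inner loop.
print(theta_K.grad.abs().sum(), theta_V.grad.abs().sum(), theta_Q.grad.abs().sum())
```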
What is the application you're using to annotate or take notes?