TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

  • Published 2 Dec 2024

Comments • 9

  • @SiD-hq2fo
    @SiD-hq2fo 23 days ago +1

    Really appreciate you explaining these papers, please keep them coming

  • @AM-yk5yd
    @AM-yk5yd 25 days ago +1

    8:00 "It's just MLP" yeah, I felt bamboolzed. MLP originally was upscale, relu, downscale; later got beautiful gradient-bearing curves of ReLU replacement and additional multiplication(eg GeGLU instead of GELU). Here during initial explanation I felt that up proj and down proj got renamed to K, V.
    It doesn't feel like cross attention at all as their K V don't depend on any part of the input.
    23:01 It may be because they moved lots parms around: finetuning MLP layers works so much worse than finetuning QKVO. By having MLP inside they have more parms to play a role on multi head attention which is still used.
    It also might be training. Their results at image classification feel feel more revealing than NLP tasks: dataset is way more fixed than "pick your favorite part of pile, filter bad parts" and models have the same weight class, and there is no different tokenizers(they use same with pythia so comparing ppl against it doesn't raise eyebrows)
    In ideal world they would also train at least pythia 160M from scratch on the same training data, but considering it's "up to 300B tokens", it's not exactly surprising they didn't. Also in ideal world they would put in ablation study "hey, let's not add non-linearity to Q K V proj and instead of new weights for 3 P-attentions we simply make dimension of heads bigger keeping around the same parm count"
    Also Pythia has an outdated MLP architecture: it use gelu, not geglu. P attention doesn't use GeGLU as well, but their architecture replaces MLP which already is replaced in modern LLMs.
    Speaking of their results in images: Their winning model in table 2 image classification has more parameters that ViT. If it's excluded, their small 86M model loses to another 86M model, though their 307M model still wins.
    (Also I remember trying somewhat similar but more naive idea on imdb dataset: replace Q,K,Vs proj with self attentions. Idea was walk toward super hierarchical nested attention: if you replace one Q projection with attention, you can go deeper and replace projection inside nested self-attn of Q with another attention and once walking over this inception is done the initial layer knows so well what token it has to attend to other token.
    Worked awful. Results were worse, number of parms went through the roof(as single proj got replaced with several, though will not be surprised if better hyper parms can make it at least bad rather than awful)
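
    A minimal sketch (assuming standard PyTorch; not from the paper or the video) of the two MLP variants mentioned above: the classic GELU MLP that Pythia uses, versus the GeGLU MLP common in modern LLMs, which adds a second up projection and an elementwise gating multiplication.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeluMLP(nn.Module):
    """Classic MLP: up projection, GELU, down projection."""
    def __init__(self, d: int, hidden: int):
        super().__init__()
        self.up = nn.Linear(d, hidden)
        self.down = nn.Linear(hidden, d)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class GegluMLP(nn.Module):
    """GeGLU MLP: two up projections, one passed through GELU and used as an elementwise gate."""
    def __init__(self, d: int, hidden: int):
        super().__init__()
        self.up = nn.Linear(d, hidden)
        self.gate = nn.Linear(d, hidden)
        self.down = nn.Linear(hidden, d)

    def forward(self, x):
        return self.down(self.up(x) * F.gelu(self.gate(x)))

x = torch.randn(4, 512)
print(GeluMLP(512, 2048)(x).shape, GegluMLP(512, 2048)(x).shape)  # both torch.Size([4, 512])
```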

    • @AM-yk5yd
      @AM-yk5yd 24 days ago +1

      Oh, they actually did train a transformer from scratch: "Table 6 compares Transformer models trained from scratch and Tokenformer trained by parameter reusing with varying amounts of seen tokens during training. It is evident that Transformers trained with the same number of seen tokens do not reach the performance level of Tokenformer with parameter reusing."
      Even if a modern MLP would improve the transformer's results, the training cost still seems to be much better, which is really impressive.

    • @gabrielmongaras
      @gabrielmongaras  20 days ago +1

      I think most of the results only look impressive because they slightly scale the model above the baseline. From the experiments I've done, it seems like the activation they develop is worse than a normal GeLU, and moving params from the MLP into the Q, K, V, O projections performs the same given the same # of params.
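
      A rough back-of-the-envelope (my own arithmetic, not Gabriel's experiments) of what "the same # of params" can mean when the MLP parameters of one transformer block are folded into wider Q/K/V/O projections:

```python
def standard_block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """QKVO projections plus a 2-layer MLP with the usual 4x expansion (biases ignored)."""
    attn = 4 * d_model * d_model                # W_Q, W_K, W_V, W_O
    mlp = 2 * d_model * (mlp_ratio * d_model)   # up projection + down projection
    return attn + mlp

def attention_only_block_params(d_model: int, proj_dim: int) -> int:
    """No MLP; Q/K/V widened to proj_dim, plus an output projection back to d_model."""
    return 3 * d_model * proj_dim + proj_dim * d_model

d = 768                                  # illustrative width, roughly Pythia-160M-sized
target = standard_block_params(d)        # 7,077,888 params per block
proj_dim = target // (4 * d)             # widen Q/K/V/O until the counts match -> 2304
print(target, attention_only_block_params(d, proj_dim))  # equal counts
```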

  • @ak-cm5eu
    @ak-cm5eu 12 days ago +1

    Can I ask what tablet/app you use to produce these videos? Thank you!

  • @Ryu-ix8qs
    @Ryu-ix8qs 27 days ago +3

    Thanks

  • @rohollahhosseyni8564
    @rohollahhosseyni8564 26 days ago

    I think in pattention (cross-attention), the tokens can kinda communicate with each other, so it does not act like an MLP layer.

    • @gabrielmongaras
      @gabrielmongaras  26 days ago +3

      Pattention is mathematically the same as an MLP though. Pattention has the following formulation:
      O = f(X K^T) V
      and a two-layer MLP with an intermediate activation looks like the following:
      O = f(X W_1) W_2
      Since we are doing an inner product between the input and the keys, "token interaction" is just a function applied on the input space R^d, of which there are n tokens, or rather n functions, if W_1 = K^T is of shape (d x n). Basically, cross-attention with learnable K, V is an MLP where the up projection is renamed to K, the down projection is renamed to V, and the activation is changed from GeLU to GeLU(Norm()).
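
      A minimal numerical sketch of that equivalence (my own code, not the authors'; the exact normalization inside the modified softmax is an assumption here): Pattention over learnable key/value parameter tokens computes exactly what a two-layer MLP with W_1 = K^T and W_2 = V computes, so input tokens never mix with each other.

```python
import torch
import torch.nn.functional as F

d, n, num_tokens = 16, 64, 5          # model dim, number of parameter tokens, sequence length
X = torch.randn(num_tokens, d)        # input tokens
K = torch.randn(n, d)                 # learnable "key" parameter tokens
V = torch.randn(n, d)                 # learnable "value" parameter tokens

def modified_act(scores):
    # Stand-in for the modified softmax f: a row-wise norm followed by GeLU. It acts on each
    # input token's score row independently, so it cannot mix different input tokens.
    return F.gelu(F.normalize(scores, dim=-1))

# "Pattention": each input token attends to the parameter tokens.
pattention_out = modified_act(X @ K.T) @ V

# The identical computation written as a two-layer MLP with W1 = K^T and W2 = V.
W1, W2 = K.T, V
mlp_out = modified_act(X @ W1) @ W2

print(torch.allclose(pattention_out, mlp_out))  # True
```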

  • @f6y7t5
    @f6y7t5 24 days ago

    Diffusion forcing? Please