Sigma-GPTs: A New Approach to Autoregressive Models

  • Published Oct 20, 2024

Comments • 19

  • @The9thDoctor · 2 months ago · +5

    I love seeing papers that just take these out-there ideas and see what's there to find

  • @alexanderbrown-dg3sy · 2 months ago · +9

    Chill bro 😂. You're making it too hot. Literally using this architecture at my startup. The importance of this can't be overstated. It solves the factorization curse/reversal curse. Greatly expands potential context utilization. The ultimate architecture for multimodal generation, due to any-order support. And this is not even considering the compelling sampling options (imagine mixture-of-agents with CDE sampling, would be nutty). You still need a better positional encoding method because RoPE is trash for modern LMs (too much uniformity in high-dimensional space); optimize that, along with this architecture, and you can extrapolate to much longer reasoning lengths.
    This architecture solves several core transformer limitations. Did I mention context utilization 😂.
    Things like this, neural-symbolic architectures, and synthetic data will, I'm sure, lead to models that can be as good as, or even exceed, whatever GPT-5 will be (since they can reason backwards, unlike GPT-5, which is why those claims are highly subjective; GPT-4o mini is a GPT-5 artifact, meaning it has the same limitations as GPT-4), at a fraction of its size.
    This architecture is sooooo important. Stay woke. Great video. Although I really don't like everyone being aware of this 😂.
    You're becoming one of my favorite YouTubers though, bro. Great vid.

    • @Tunadorable · 2 months ago · +1

      haha thanks no secrets are safe with me

    • @wwkk4964 · 2 months ago

      This was awesome! AGI Solved.
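The "any-order support" praised in the thread above is, as I understand the paper, achieved by training on randomly permuted sequences with a double positional encoding: each input token carries both its own original position and the original position of the token to be predicted next. A minimal sketch of that data-preparation step; the function name, shapes, and masking are my own assumptions, not the authors' code:

    import torch

    def permute_with_double_positions(tokens: torch.Tensor):
        # tokens: (seq_len,) token ids in their natural order
        seq_len = tokens.shape[0]
        sigma = torch.randperm(seq_len)      # random generation order for this sample
        inputs = tokens[sigma]               # tokens fed to the model in permuted order
        current_pos = sigma                  # original position of each input token
        target_pos = torch.roll(sigma, -1)   # original position of the token to predict
        targets = torch.roll(inputs, -1)     # next-token targets along the permuted order
        loss_mask = torch.ones(seq_len, dtype=torch.bool)
        loss_mask[-1] = False                # last step has no successor to predict
        return inputs, current_pos, target_pos, targets, loss_mask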

  • @islandfireballkill · 2 months ago · +3

    If you look at "What Algorithms Can Transformers Learn?" by Hattie Zhou, you will find that some tasks like addition are vastly improved in generalization just by generating output tokens in reverse order (because standard addition with carry is actually a right-to-left algorithm).
    This could have implications for reasoning and code generation skills.
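To make the addition point concrete (this toy snippet is my own illustration, not from the paper or the video): when the sum is emitted least-significant-digit first, each output digit depends only on the two input digits at that place plus the carry from digits already produced, which is exactly the left-to-right dependency an autoregressive decoder handles well.

    def add_reversed(a: str, b: str) -> str:
        # Return the digits of a + b least-significant-first, mirroring the
        # order a model would emit them if trained to generate answers reversed.
        a, b = a[::-1], b[::-1]
        digits, carry = [], 0
        for i in range(max(len(a), len(b))):
            d = carry
            d += int(a[i]) if i < len(a) else 0
            d += int(b[i]) if i < len(b) else 0
            digits.append(str(d % 10))
            carry = d // 10
        if carry:
            digits.append(str(carry))
        return "".join(digits)

    assert add_reversed("957", "68") == "5201"   # 957 + 68 = 1025, digits reversed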

  • @revimfadli4666 · 2 months ago

    Reminds me of the transformer/attention permutation invariance that David Ha used for reinforcement learning

  • @kevon217 · 2 months ago · +1

    Love the Pac-Man-esque figure. Cool method here.

  • @Dedjkeorrn42 · 2 months ago · +5

    What the sigma?

    • @User-vy2py · 2 months ago

      It represents the permutation (i.e., the order in which you consider the tokens)
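Spelled out, the σ in σ-GPT is a permutation of the sequence indices, and the model is (roughly) trained on the shuffled-order factorization of the likelihood, with a fresh σ drawn per training sequence:

    p_\theta(x) = \prod_{i=1}^{N} p_\theta\!\left(x_{\sigma(i)} \mid x_{\sigma(1)}, \ldots, x_{\sigma(i-1)}\right), \qquad \sigma \sim \mathrm{Uniform}(S_N)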

  • @tornyu · 2 months ago · +1

    Did they mention whether their technique reduces information squashing, like you covered in that other paper this week?

    • @Tunadorable · 2 months ago · +1

      They did not, and I believe I recorded this video before that one. My memory is a bit lacking, but I think that whenever you ask this one to do regular autoregressive decoding it would have the same issue. However, I seem to remember this one being able to do a more diffusion-style decoding, in which case I don't think it would have the same over-squashing issue. Again, I read/recorded this paper a while ago, so I could be wrong.

  • @thorvaldspear · 2 months ago · +2

    They are messing with us at this point with names like that

    • @Yobs2K · 2 months ago

      No way we've got Sigma-GPT before GTA 6

  • @nomadicsynth · 2 months ago

    PURPLE!!!

  • @narutouzumaki2157 · 2 months ago · +2

    Noice😊

  • @youngman5890 · 2 months ago

    what the sigma

  • @GNARGNARHEAD · 2 months ago

    I'm really struggling to appreciate how obfuscating the information makes it more effective at modeling the global view 🤔 Why not just train against a hidden A* path?
    Oh, and the code videos sound fun.