Gabriel Mongaras
United States
Joined Jan 14, 2021
Just some guy exploring and making videos about current AI topics.
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Paper here: arxiv.org/abs/2410.23168
Code: github.com/haiyang-w/tokenformer
Notes: drive.google.com/file/d/17PsGwefQJoSQxBHykoSFeMrKZhPDFx-E/view?usp=sharing
00:00 Intro
02:48 Methodology
07:54 This is an MLP
10:18 How they change the transformer
16:00 Model scaling
20:48 Results
Views: 1,252
Videos
Round and Round We Go! What makes Rotary Positional Encodings useful?
750 views • several months ago
Paper here: arxiv.org/abs/2410.06205 Notes: drive.google.com/file/d/152NPPyNjo-N6MMIaupXacS41BUJgjE5l/view?usp=drive_link 00:00 Intro 01:09 RoPE: Rotary Positional Embeddings 10:37 Notes on RoPE 12:04 Does RoPE decay with distance? 14:14 How are different frequencies used? 17:02 High frequencies: positional attention 21:29 Low frequencies: semantic attention 28:00 p-RoPE 30:36 Thoughts on this ...
Deterministic Image Editing with DDPM Inversion, DDIM Inversion, Null Inversion and Prompt-to-Prompt
1.5K views • 4 months ago
Null-text Inversion for Editing Real Images using Guided Diffusion Models: arxiv.org/abs/2211.09794 An Edit Friendly DDPM Noise Space: Inversion and Manipulations: arxiv.org/abs/2304.06140 Prompt-to-Prompt Image Editing with Cross Attention Control: arxiv.org/abs/2208.01626 00:00 Intro 01:24 Current image editing techniques 11:42 Deriving DDPM and DDIM 23:08 DDIM inversion 32:46 Null inversion ...
Attending to Topological Spaces: The Cellular Transformer
705 views • 4 months ago
Paper here: arxiv.org/abs/2405.14094 Notes: drive.google.com/file/d/12g_KkHqXD6mEDILJzYbCC08i8cDHITfC/view?usp=drive_link 00:00 Intro 01:39 Cellular complexes 07:26 K-cochain 13:26 Defining structure on the cell 20:28 Cellular transformer 34:18 Positional encodings and outro
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
2.9K views • 4 months ago
Paper here: arxiv.org/abs/2407.04620 Code!: github.com/test-time-training/ttt-lm-pytorch Notes: drive.google.com/file/d/127a1UBm_IN_WMKG-DmEvfJ8Pja-9BwDk/view?usp=drive_link 00:00 Intro 04:40 Problem with RNNs 06:38 Meta learning and method idea 09:13 Update rule and RNN inner loop 15:07 Learning the loss function outer loop 21:21 Parallelizing training 30:05 Results
WARP: On the Benefits of Weight Averaged Rewarded Policies
748 views • 4 months ago
Paper here: arxiv.org/abs/2406.16768 Notes: drive.google.com/file/d/11UK7mEZwNVUMYuXwvOTfaqHhN8zSYm5M/view?usp=drive_link 00:00 Intro and RLHF 17:30 Problems with RLHF 21:08 Overview of their method 23:47 EMA 28:00 Combining policies with SLERP 37:34 Linear interpolation towards initialization 40:32 Code 44:16 Results
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
788 views • 5 months ago
Paper: arxiv.org/abs/2308.07926 Paper page: qiuyu96.github.io/CoDeF/ Code: github.com/qiuyu96/CoDeF My notes: drive.google.com/file/d/10PMKdd5XBd6Y60HlRB9IW9naR2bWziDT/view?usp=drive_link 00:00 Intro 03:00 Method overview 08:40 Method details 15:24 Tricks done for training and how to actually train this thing 19:24 Flow loss and masking 25:10 Conclusion
Mamba 2 - Transformers are SSMs: Generalized Models and Efficient Algorithms Through SSS Duality
9K views • 5 months ago
Paper here: arxiv.org/abs/2405.21060 Code!: github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py Notes: drive.google.com/file/d/1XGPFeXQyx4CPxgYjzR4qrLd-baLWQC/view?usp=sharing 00:00 Intro 01:45 SSMs 08:00 Quadratic form of an SSM 15:02 Expanded form of an SSM 24:00 Attention - it's all you need?? 29:55 Kernel attention 32:50 Linear attention 34:32 Relating attention to SSMs 38:...
CoPE - Contextual Position Encoding: Learning to Count What's Important
1.4K views • 5 months ago
Paper: arxiv.org/abs/2405.18719 My notes: drive.google.com/file/d/1y9VHZc7MLqc6t2SHHdlVTYeW3czmmRbl/view?usp=sharing 00:00 Intro 02:44 Background 09:58 CoPE 24:50 Code 32:16 Results
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
871 views • 6 months ago
Paper: arxiv.org/abs/2403.03100 Demo: speechresearch.github.io/naturalspeech3/ Code: huggingface.co/spaces/amphion/naturalspeech3_facodec My notes: drive.google.com/file/d/1xnzErd_86B6eLwqpLckhoEQKqkxFPyM_/view?usp=drive_link 00:00 Intro 05:34 Architecture overview 18:45 GRL and subspace independence 24:45 Discrete diffusion model 41:00 Factorized diffusion model 44:00 Conclusion and results
xLSTM: Extended Long Short-Term Memory
2K views • 6 months ago
Paper: arxiv.org/abs/2405.04517 My notes: drive.google.com/file/d/1wFYvU_1oUWcCNuQ91zTpSGAeNUsPjlt3/view?usp=drive_link 00:00 Intro 05:44 LSTM 13:38 Problems paper addresses 14:12 sLSTM 23:00 sLSTM Memory mixing 27:08 mLSTM 35:14 Results and stuff
KAN: Kolmogorov-Arnold Networks
56K views • 6 months ago
Paper: arxiv.org/abs/2404.19756 Spline Video: m.th-cam.com/video/qhQrRCJ-mVg/w-d-xo.html My notes: drive.google.com/file/d/1twcIF13nG8Qc10_qeDqCZ4NaUh9tFsAH/view?usp=drive_link 00:00 Intro 00:45 MLPs and Intuition 05:12 Splines 19:02 KAN Formulation 28:00 Potential Downsides to KANs 32:09 Results
LADD: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
1K views • 7 months ago
Paper: arxiv.org/abs/2403.12015 My notes: drive.google.com/file/d/1s1-nnWR_ZR26PNSAoZR1Xj3nuD9UZlvR/view?usp=sharing 00:00 Intro 01:31 Diffusion Models 08:08 Latent Diffusion Models 10:04 Distillation 12:02 Adversarial Diffusion Distillation (ADD) 17:06 Latent Adversarial Diffusion Distillation (LADD) 22:20 Results
Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction
2.2K views • 7 months ago
Paper: arxiv.org/abs/2404.02905 Demo: var.vision/ Code: github.com/FoundationVision/VAR My notes: drive.google.com/file/d/1qym3JG-0xqEgQhdvkt9N17o-ZzUWy2sn/view?usp=drive_link 00:00 Intro 00:53 DiTs 04:06 Autoregressive Image Transformers 06:23 Tokenization problem with AR ViTs 08:43 VAE 10:47 Discrete Quantization - VQGAN 16:42 Visual Autoregressive Modeling 21:31 Causal Inference with VAR 24:...
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
3.8K views • 7 months ago
Paper: arxiv.org/abs/2404.07143 My notes: drive.google.com/file/d/1plWJDwHTZkRK9PDdvaLMnZjFR6fVvNLH/view?usp=drive_link 00:00 Intro 07:17 Model intuition 11:00 Memory retrieval operation 16:29 Hidden state updates 21:58 Delta update 24:10 Is it causal? 25:26 Combining local attention and RNN 27:26 Results 30:25 Sampling and Conclusion
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
2.1K views • 7 months ago
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
4.7K views • 8 months ago
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
1.1K views • 8 months ago
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits and BitNet
6K views • 8 months ago
DoRA: Weight-Decomposed Low-Rank Adaptation
2.1K views • 9 months ago
OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers
12K views • 9 months ago
A Decoder-only Foundation Model For Time-series Forecasting
4.5K views • 9 months ago
Lumiere: A Space-Time Diffusion Model for Video Generation
678 views • 10 months ago
Exphormer: Sparse Transformers for Graphs
450 views • 10 months ago
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
2K views • 10 months ago
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution
476 views • 10 months ago
Cached Transformers: Improving Transformers with Differentiable Memory Cache
867 views • 10 months ago
Translatotron 3: Speech to Speech Translation with Monolingual Data
919 views • 11 months ago
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
10K views • 11 months ago
best
Can I ask what tablet/app you use to produce these videos? Thank you!
Thanks. These are quality videos. The more you go into the details and math behind the main algorithm, the better.
I tried running this but nothing happens when I run jupyter lab. Gives me the "not a command" message.
Really appreciate you explaining these papers, please keep them coming <3
Diffusion forcing? Please
8:00 "It's just MLP" yeah, I felt bamboolzed. MLP originally was upscale, relu, downscale; later got beautiful gradient-bearing curves of ReLU replacement and additional multiplication(eg GeGLU instead of GELU). Here during initial explanation I felt that up proj and down proj got renamed to K, V. It doesn't feel like cross attention at all as their K V don't depend on any part of the input. 23:01 It may be because they moved lots parms around: finetuning MLP layers works so much worse than finetuning QKVO. By having MLP inside they have more parms to play a role on multi head attention which is still used. It also might be training. Their results at image classification feel feel more revealing than NLP tasks: dataset is way more fixed than "pick your favorite part of pile, filter bad parts" and models have the same weight class, and there is no different tokenizers(they use same with pythia so comparing ppl against it doesn't raise eyebrows) In ideal world they would also train at least pythia 160M from scratch on the same training data, but considering it's "up to 300B tokens", it's not exactly surprising they didn't. Also in ideal world they would put in ablation study "hey, let's not add non-linearity to Q K V proj and instead of new weights for 3 P-attentions we simply make dimension of heads bigger keeping around the same parm count" Also Pythia has an outdated MLP architecture: it use gelu, not geglu. P attention doesn't use GeGLU as well, but their architecture replaces MLP which already is replaced in modern LLMs. Speaking of their results in images: Their winning model in table 2 image classification has more parameters that ViT. If it's excluded, their small 86M model loses to another 86M model, though their 307M model still wins. (Also I remember trying somewhat similar but more naive idea on imdb dataset: replace Q,K,Vs proj with self attentions. Idea was walk toward super hierarchical nested attention: if you replace one Q projection with attention, you can go deeper and replace projection inside nested self-attn of Q with another attention and once walking over this inception is done the initial layer knows so well what token it has to attend to other token. Worked awful. Results were worse, number of parms went through the roof(as single proj got replaced with several, though will not be surprised if better hyper parms can make it at least bad rather than awful)
Oh, they actually did train a transformer from scratch: "Table 6 compares Transformer models trained from scratch and Tokenformer trained by parameter reusing with varying amounts of seen tokens during training. It is evident that Transformers trained with the same number of seen tokens do not reach the performance level of Tokenformer with parameter reusing." Even if a modern MLP would improve the transformer results, the training cost still seems much better, which is really impressive.
I think most of the results only look impressive because they slightly scale the model above the baseline. From the experiments I've done, it seems like the activation they develop is worse than normal GeLU and moving params from the MLP into the Q K V O projections performs the same given the same # of params.
I think in pattention (cross-attention), the tokens can kinda communicate with each other, so it does not act like an MLP layer.
Pattention is mathematically the same as an MLP though. Pattention has the formulation O = f(X K^T) V, and a two-layer MLP with an intermediate activation looks like O = f(X W_1) W_2. Since we are taking an inner product between the input and the keys, "token interaction" is just a function applied on the input space R^d; with W_1 = K^T of shape (d x n), there are n parameter tokens, or rather n such functions. Basically, cross attention with learnable K, V is an MLP where the up projection is renamed to K, the down projection is renamed to V, and the activation is changed from GeLU to GeLU(Norm()).
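A quick numerical check of the equivalence described above (my own sketch, not code from the TokenFormer repo; the Norm inside the activation and the paper's scaling details are left out):

import torch
import torch.nn.functional as F

d, n, T = 16, 64, 8                    # model dim, number of parameter tokens, sequence length
X = torch.randn(T, d)                  # input tokens
K = torch.randn(n, d)                  # learnable "key" parameter tokens
V = torch.randn(n, d)                  # learnable "value" parameter tokens

# Pattention without the internal norm: O = f(X K^T) V
pattn_out = F.gelu(X @ K.T) @ V

# The same computation written as a two-layer MLP with W_1 = K^T (d x n) and W_2 = V (n x d)
W1, W2 = K.T, V
mlp_out = F.gelu(X @ W1) @ W2

print(torch.allclose(pattn_out, mlp_out))   # True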
Thanks
Great vids! But I couldn't figure out how the training process is accomplished.
Thank you for this great video. I was wondering if you could please also share the notes you wrote.
Thanks for making this video! These are very helpful for cold-start students like me - coming back to reading papers after a long time :)
Lol
noice
Thank you!
Thank you bro. Great explanation!
Firstly: Congrats on your Cottention preprint!

Thanks for covering this paper. It flew under my radar, but I've been looking for an analysis just like this. I've long suspected that most attention heads care more about semantic than positional information, and that RoPE has only been successful because bad positional information handling hurts more than bad semantic handling. At worst, RoPE forces purely-semantic key/query features to be split over 2 channels instead of 1, so that they are rotationally invariant. Unfortunately this wasn't tested. With Gemma-2B's enormous head dim of 256 (vs 128 in Llama, 96 in phi-3.5-mini, and 64 in Falcon-7B and TinyLlama-1.1B), it's one of the least sensitive models to this channel redundancy effect.

The p-RoPE test has an unfortunate flaw: they truncate frequencies instead of redistributing them. The 0.25-RoPE test has a maximum frequency of 2500, so the lowest frequency component actually loops before the end of the 8k context! Positional confusion from this might explain the worse result. I'd also love to see an analysis of a network with learnable RoPE frequencies.

However, another perspective is that hybrid SSM+Attention architectures get almost no benefit from RoPE. Jamba didn't need it at all, and GoldFinch only needed it to work around a post-training position extrapolation issue in RWKV. RoPE might be obviated soon if hybrid architectures catch on.
Thank you! Happy to finally have released the paper. Looks like your intuition was spot on! I think it makes a lot of sense for heads to use either semantic or positional information. It does seem like RoPE was kind of just thrown onto the model without any tests at all; this paper is the first I've seen that analyzes it. On a first read-through of the paper I was thinking p-RoPE would run into issues, since truncation is usually not the best option. I'm thinking the authors weren't really going for a novel positional embedding technique and rather just wanted to explore RoPE a little, then threw p-RoPE in to see if it would work. There's probably a much better way to do PEs given that heads are either mostly semantic or mostly positional.

In general, I think the idea of PEs is a bit odd. At first it makes a lot of sense: you need to turn a set into a sequence. But the way it's done is odd, and I've always disliked PEs since they kind of force the model to overfit to a specific context window, leading to bad extrapolation outside that window. Models like Mamba have positional encoding baked in since they are RNNs, which inherently have position. I like the direction models are going with hybrid architectures, and I agree, I think RoPE is no longer going to be needed given that these architectures don't utilize it at all. A study into how positional information is utilized in hybrid models could be interesting though. Do these hybrid models learn similar semantic/positional heads?

Tried some tests on some learnable RoPE this weekend. Didn't get much out of it, but throwing it here nonetheless: github.com/gmongaras/Learnable_Rotary_Embeddings
@gabrielmongaras I was thinking something along the lines of a learnable Gaussian PE.
So well explained. You saved my day. Thank you so much, and keep posting stuff like that.
Thank you a lot for sharing this! It helps me a lot.
Congrats on writing a paper! I notice that another recent paper from NVIDIA uses unit vectors for attention (nGPT), where the dot product is naturally equal to the cosine since the lengths are one. Are these two works related to each other in any way?
Thanks!! I only read through the nGPT paper briefly, but I think nGPT was trying to make softmax attention/transformers more expressive and efficient by changing a few things. They do normalize before applying the softmax function, making the logits a cosine similarity between -1 and 1. However, they keep the softmax operation, which forces the model to stay quadratic in complexity. The paper I worked on removes the softmax function, which allows the attention mechanism to be changed into an RNN that is linear in complexity.
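A tiny illustration of the distinction being drawn here (my own sketch, not code from either paper): unit-normalizing queries and keys bounds the logits in [-1, 1] as cosine similarities, but keeping the softmax still means forming the full T x T attention matrix.

import torch
import torch.nn.functional as F

T, d = 6, 8
q = torch.randn(T, d)
k = torch.randn(T, d)

q_hat = F.normalize(q, dim=-1)         # unit-length queries
k_hat = F.normalize(k, dim=-1)         # unit-length keys
logits = q_hat @ k_hat.T               # cosine similarities, bounded in [-1, 1]
attn = logits.softmax(dim=-1)          # still a full (T, T) matrix -> quadratic in sequence length

print(logits.min().item(), logits.max().item())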
I wish you had left your comfort zone and talked more about the empirical results and conclusions.
@gabrielmongaras: Hi, thanks very much for the very detailed and easy-to-understand explanation. Just one thing: it isn't super clear to me from the video why the error compounds with the parallel pattern. Thanks again!
Thanks so much!
Thanks for sharing! Struggled a bit with understanding flows, but you explained everything really nicely
Thanks a ton mr random internet guy, wonderful video
best video out there!
I stopped watching when you named the image height "L", BLASPHEMY!
Does the acceptance mean choosing n tokens from n heads, where each head has k tokens to offer? How is the sequence decided then? Is head1 = position1, head2 = position2, and so on?
Thanks for your helpful explanation <3
Am I correct in thinking that the rotational embedding goes from 0 to 360 degrees? In that case, won't the first word of the sequence be very close to the last word in the sequence? Did they account for this?
I've done some more digging on this for those interested. So yes, the theta values do indeed loop back around. However, this is why they have multiple values of theta in equation 15: up to d/2 unique values. Theta_i is defined as 10000^(-2(i-1)/d), so this set of angles varies logarithmically across the dimensions of the embedding vector. Because the exponential term scales with d and 10,000 is a large base, it would have to be an extremely long sequence before things start to 'loop back around' as a whole. What exactly that sequence length is, I'm not sure.
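To put rough numbers on this (my own sketch; d = 128 is just an example head dimension, not one taken from the video): the pair of channels using theta_i completes a full rotation every 2*pi / theta_i positions, so the lowest-frequency pair only wraps after tens of thousands of tokens.

import math

d = 128                                            # example head dimension (assumption)
base = 10000.0
thetas = [base ** (-2 * (i - 1) / d) for i in range(1, d // 2 + 1)]
wavelengths = [2 * math.pi / t for t in thetas]    # positions per full rotation

print(round(wavelengths[0], 1))    # highest frequency: wraps roughly every 6.3 positions
print(round(wavelengths[-1]))      # lowest frequency: wraps only after ~54,000 positions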
great piece of content, thanks for sharing! Cleared up the understanding for me
The video clearly describes the ideas in the paper, which is great. However, I'm a bit confused about one part. It seems there is no distinction between L∗(C@B^T)@X and L∗(B^T@X)@C.
What is the application you're using to annotate or take notes?
I'm at the step where I am running the second cell in the infer-interface file. It keeps giving me an error that says something along the lines of: ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`. However, I then ran that command in the terminal and in the notebook, and it still gives the error. Any suggestions? Also thanks, great video for my use case in general; I hope to be able to do some good work once I get this practice down.
Technically impressive that it's possible, however I only see limited application for this. Practically, most models below 8-bit quantization are way less "aware" of the input and context. If I alter a situation, an 8-bit model can adjust its output accordingly; any model below that is very rigid. That being said, maybe you can mitigate those effects if you train them at that quantization to begin with, instead of compressing a model that was trained at higher precision, if what I'm saying makes sense...
34:00 Wait, how? I don't think (L ∘ QK)V is equal to Q*cumsum(KV).
If L is just the causal mask, then we can write the equation as Q @ cumsum(K^T @ V). This is because at each time t, the query is unique and is multiplied by the sum of all the past keys and values. For example:
O_1 = Q_1 @ K_1^T @ V_1 = Q_1 @ [ K_1^T @ V_1 ]
O_2 = Q_2 @ K_1^T @ V_1 + Q_2 @ K_2^T @ V_2 = Q_2 @ [ K_1^T @ V_1 + K_2^T @ V_2 ]
The cumsum operation sums all the previous elements from timestep 1 to t, which is the relationship above. It is kind of just a compact way for me to write this recurrence relationship in one equation.
@gabrielmongaras Got it, and I can kinda see the relationship between the causal mask and cumsum(). Thank you :D
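For anyone who wants to see the identity from this thread numerically (my own sketch, using the plain non-normalized linear-attention form):

import torch

T, d = 5, 4
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

# masked form: (L ∘ Q K^T) V with L the lower-triangular causal mask
L = torch.tril(torch.ones(T, T))
out_masked = (L * (Q @ K.T)) @ V

# cumsum form: o_t = q_t @ sum_{s<=t} k_s^T v_s
S = torch.cumsum(K.unsqueeze(-1) * V.unsqueeze(1), dim=0)   # running sum of outer products, shape (T, d, d)
out_cumsum = torch.einsum('td,tde->te', Q, S)

print(torch.allclose(out_masked, out_cumsum, atol=1e-5))    # True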
How is the equality in DDPM established at 17:49?
Looks like I forgot to write out the square root over the first term. As for the inner term that got turned into a fraction, I just multiplied sqrt{1-a_t} by the fraction sqrt{1-a_t}/sqrt{1-a_t}.
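Written out, the manipulation described above is just multiplying by 1 (my reconstruction of the algebra step; z stands for whatever inner term appears at that point in the derivation):

\sqrt{1-a_t}\, z \;=\; \sqrt{1-a_t}\cdot\frac{\sqrt{1-a_t}}{\sqrt{1-a_t}}\, z \;=\; \frac{1-a_t}{\sqrt{1-a_t}}\, z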
Isn't the recurrence A_t @ h_(t-1)? So a_1 should be multiplied with h_0, and so on?
It depends on how you define the hidden state. Since I'm defining it as h_t = h_{t-1} + K^T @ V, the output is defined at timestep t, not t-1. This means that the output at time t takes into account all past information plus the current token. If we had a relation on t-1, it would only consider previous information.
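A small sketch of the indexing convention described above (my own illustration, with no decay/A_t term): the state is updated with the current token first and then read out, so the output at time t includes token t as well as everything before it.

import torch

T, d = 5, 4
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

h = torch.zeros(d, d)                      # hidden state h_0
outs = []
for t in range(T):
    h = h + torch.outer(K[t], V[t])        # h_t = h_{t-1} + k_t^T v_t
    outs.append(Q[t] @ h)                  # o_t = q_t h_t: current token plus all past tokens
O = torch.stack(outs)                      # matches the masked/cumsum forms in the sketch above

print(O.shape)                             # (5, 4)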
Good video, we need these kinds of advanced paper explanations.
You did an excellent job with your explanations! I can tell you had to grasp the topic recently since it's pretty new. I'm curious how you think the architecture should be expanded to larger problems.
Great one, really liked it, thanks!
My only topology knowledge comes from "Experiments in Topology" by Barr, where one chapter was something like a court case about making holes, using them, and repairing them, or something like that. The book was written in the 1960s and I read it as a teen in the early 2000s. It covered the most basic topology and was quite fun even for teenage me, as it had fun exercises.
Huge thanks dude, this explanation has saved me much agony.