Gabriel Mongaras
  • 70
  • 234 728
Deterministic Image Editing with DDPM Inversion, DDIM Inversion, Null Inversion and Prompt-to-Prompt
Null-text Inversion for Editing Real Images using Guided Diffusion Models: arxiv.org/abs/2211.09794
An Edit Friendly DDPM Noise Space: Inversion and Manipulations: arxiv.org/abs/2304.06140
Prompt-to-Prompt Image Editing with Cross Attention Control: arxiv.org/abs/2208.01626
00:00 Intro
01:24 Current image editing techniques
11:42 Deriving DDPM and DDIM
23:08 DDIM inversion
32:46 Null inversion
47:15 DDPM inversion
1:01:18 Prompt-to-prompt
1:10:52 Conclusion
Views: 833

Videos

Attending to Topological Spaces: The Cellular Transformer
615 views • months ago
Paper here: arxiv.org/abs/2405.14094 Notes: drive.google.com/file/d/12g_KkHqXD6mEDILJzYbCC08i8cDHITfC/view?usp=drive_link 00:00 Intro 01:39 Cellular complexes 07:26 K-cochain 13:26 Defining structure on the cell 20:28 Cellular transformer 34:18 Positional encodings and outro
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
2.3K views • 2 months ago
Paper here: arxiv.org/abs/2407.04620 Code!: github.com/test-time-training/ttt-lm-pytorch Notes: drive.google.com/file/d/127a1UBm_IN_WMKG-DmEvfJ8Pja-9BwDk/view?usp=drive_link 00:00 Intro 04:40 Problem with RNNs 06:38 Meta learning and method idea 09:13 Update rule and RNN inner loop 15:07 Learning the loss function outer loop 21:21 Parallelizing training 30:05 Results
WARP: On the Benefits of Weight Averaged Rewarded Policies
713 views • 2 months ago
Paper here: arxiv.org/abs/2406.16768 Notes: drive.google.com/file/d/11UK7mEZwNVUMYuXwvOTfaqHhN8zSYm5M/view?usp=drive_link 00:00 Intro and RLHF 17:30 Problems with RLHF 21:08 Overview of their method 23:47 EMA 28:00 Combining policies with SLERP 37:34 Linear interpolation towards initialization 40:32 Code 44:16 Results
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
741 views • 2 months ago
Paper: arxiv.org/abs/2308.07926 Paper page: qiuyu96.github.io/CoDeF/ Code: github.com/qiuyu96/CoDeF My notes: drive.google.com/file/d/10PMKdd5XBd6Y60HlRB9IW9naR2bWziDT/view?usp=drive_link 00:00 Intro 03:00 Method overview 08:40 Method details 15:24 Tricks done for training and how to actually train this thing 19:24 Flow loss and masking 25:10 Conclusion
Mamba 2 - Transformers are SSMs: Generalized Models and Efficient Algorithms Through SSS Duality
7K views • 3 months ago
Paper here: arxiv.org/abs/2405.21060 Code!: github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py Notes: drive.google.com/file/d/1 XGPFeXQyx4CPxgYjzR4qrLd-baLWQC/view?usp=sharing 00:00 Intro 01:45 SSMs 08:00 Quadratic form of an SSM 15:02 Expanded form of an SSM 24:00 Attention - it's all you need?? 29:55 Kernel attention 32:50 Linear attention 34:32 Relating attention to SSMs 38:...
CoPE - Contextual Position Encoding: Learning to Count What's Important
1.3K views • 3 months ago
Paper: arxiv.org/abs/2405.18719 My notes: drive.google.com/file/d/1y9VHZc7MLqc6t2SHHdlVTYeW3czmmRbl/view?usp=sharing 00:00 Intro 02:44 Background 09:58 CoPE 24:50 Code 32:16 Results
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
779 views • 3 months ago
Paper: arxiv.org/abs/2403.03100 Demo: speechresearch.github.io/naturalspeech3/ Code: huggingface.co/spaces/amphion/naturalspeech3_facodec My notes: drive.google.com/file/d/1xnzErd_86B6eLwqpLckhoEQKqkxFPyM_/view?usp=drive_link 00:00 Intro 05:34 Architecture overview 18:45 GRL and subspace independence 24:45 Discrete diffusion Model 41:00 factorized diffusion model 44:00 Conclusion and results
xLSTM: Extended Long Short-Term Memory
1.9K views • 4 months ago
Paper: arxiv.org/abs/2405.04517 My notes: drive.google.com/file/d/1wFYvU_1oUWcCNuQ91zTpSGAeNUsPjlt3/view?usp=drive_link 00:00 Intro 05:44 LSTM 13:38 Problems paper addresses 14:12 sLSTM 23:00 sLSTM Memory mixing 27:08 mLSTM 35:14 Results and stuff
KAN: Kolmogorov-Arnold Networks
55K views • 4 months ago
Paper: arxiv.org/abs/2404.19756 Spline Video: m.th-cam.com/video/qhQrRCJ-mVg/w-d-xo.html My notes: drive.google.com/file/d/1twcIF13nG8Qc10_qeDqCZ4NaUh9tFsAH/view?usp=drive_link 00:00 Intro 00:45 MLPs and Intuition 05:12 Splines 19:02 KAN Formulation 28:00 Potential Downsides to KANs 32:09 Results
LADD: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
849 views • 4 months ago
Paper: arxiv.org/abs/2403.12015 My notes: drive.google.com/file/d/1s1-nnWR_ZR26PNSAoZR1Xj3nuD9UZlvR/view?usp=sharing 00:00 Intro 01:31 Diffusion Models 08:08 Latent Diffusion Models 10:04 Distillation 12:02 Adversarial Diffusion Distillation (ADD) 17:06 Latent Adversarial Diffusion Distillation (LADD) 22:20 Results
Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction
1.8K views • 4 months ago
Paper: arxiv.org/abs/2404.02905 Demo: var.vision/ Code: github.com/FoundationVision/VAR My notes: drive.google.com/file/d/1qym3JG-0xqEgQhdvkt9N17o-ZzUWy2sn/view?usp=drive_link 00:00 Intro 00:53 DiTs 04:06 Autoregressive Image Transformers 06:23 Tokenization problem with AR ViTs 08:43 VAE 10:47 Discrete Quantization - VQGAN 16:42 Visual Autoregressive Modeling 21:31 Causal Inference with VAR 24:...
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
3.6K views • 5 months ago
Paper: arxiv.org/abs/2404.07143 My notes: drive.google.com/file/d/1plWJDwHTZkRK9PDdvaLMnZjFR6fVvNLH/view?usp=drive_link 00:00 Intro 07:17 Model intuition 11:00 Memory retrieval operation 16:29 Hidden state updates 21:58 Delta update 24:10 Is it causal? 25:26 Combining local attention and RNN 27:26 Results 30:25 Sampling and Conclusion
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
2K views • 5 months ago
Paper: arxiv.org/abs/2404.02258 My notes: drive.google.com/file/d/1o4v5te1yfuK_FQPvvS8SR55Sysg04dYK/view?usp=drive_link 00:00 Intro 06:02 Mixture of Experts (MoE) 15:12 Mixture of Depths (MoD) 17:04 The gradients must flow! 22:40 Autoregressive Sampling 33:58 Results
Q* AGI Achieved (Apr Fools)
781 views • 5 months ago
Q* paper link: link.springer.com/content/pdf/10.1007/BF00992698.pdf April fools 😏
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
3.7K views • 5 months ago
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
997 views • 5 months ago
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits and BitNet
5K views • 6 months ago
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits and BitNet
DoRA: Weight-Decomposed Low-Rank Adaptation
1.9K views • 6 months ago
DoRA: Weight-Decomposed Low-Rank Adaptation
OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers
11K views • 7 months ago
OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers
A Decoder-only Foundation Model For Time-series Forecasting
3.8K views • 7 months ago
A Decoder-only Foundation Model For Time-series Forecasting
Lumiere: A Space-Time Diffusion Model for Video Generation
655 views • 7 months ago
Lumiere: A Space-Time Diffusion Model for Video Generation
Exphormer: Sparse Transformers for Graphs
423 views • 7 months ago
Exphormer: Sparse Transformers for Graphs
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
1.6K views • 7 months ago
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution
461 views • 8 months ago
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution
Cached Transformers: Improving Transformers with Differentiable Memory Cache
844 views • 8 months ago
Cached Transformers: Improving Transformers with Differentiable Memory Cache
Translatotron 3: Speech to Speech Translation with Monolingual Data
812 views • 8 months ago
Translatotron 3: Speech to Speech Translation with Monolingual Data
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
9K views • 9 months ago
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
2K views • 9 months ago
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Adversarial Diffusion Distillation
1.8K views • 9 months ago
Adversarial Diffusion Distillation

Comments

  • @afsarabenazir8558
    @afsarabenazir8558 2 days ago

    does the acceptance mean choosing n tokens from n heads, where each head has k tokens to offer? How is the sequence decided then? Is head1=position1, head2=position2, and so on?

  • @NockyLucky
    @NockyLucky 6 days ago

    Thanks for your helpful explanation <3

  • @LewisHughes-d2o
    @LewisHughes-d2o 8 days ago

    Thompson Melissa Rodriguez Carol Allen Frank

    • @gabrielmongaras
      @gabrielmongaras 8 days ago

      @@LewisHughes-d2o what's up with you?

  • @gunnerstone120
    @gunnerstone120 8 days ago

    Am I correct in thinking that the rotational embedding goes from 0 to 360 degrees? In that case, won't the first word of the sequence be very close to the last word in the sequence? Did they account for this?

    • @gunnerstone120
      @gunnerstone120 8 days ago

      I've done some more digging on this for those interested. So yes, the theta values do indeed loop back around. However, this is why they have multiple values of theta in equation 15, up to d/2 unique values. Theta_i is defined as 10,000^(-2(i-1)/d), so this set of angles varies logarithmically across the dimensions of the embedding vector. Because the exponent scales with d and 10,000 is a large base, it would have to be an extremely long sequence of values before things start to 'loop back around' as a whole. What exactly that sequence length is? Not sure.
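
For anyone who wants to check this numerically, here is a small NumPy sketch (my own toy example, assuming the standard RoPE definition with d as the head dimension) that computes the theta values from equation 15 and the wavelength, in tokens, at which each rotating pair loops back around:

```python
import numpy as np

d = 128  # head dimension (an assumed example value)
i = np.arange(1, d // 2 + 1)             # i = 1 .. d/2, as in equation 15
theta = 10000.0 ** (-2 * (i - 1) / d)    # theta_i = 10000^(-2(i-1)/d)

# A rotating pair "loops back around" once position * theta_i passes 2*pi,
# so its wavelength in tokens is 2*pi / theta_i.
wavelength = 2 * np.pi / theta
print(wavelength[0])    # ~6.28 tokens for the fastest pair (theta_1 = 1)
print(wavelength[-1])   # ~54,000 tokens for the slowest pair when d = 128
```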

  • @vladandronik5711
    @vladandronik5711 9 days ago

    great piece of content, thanks for sharing! Cleared up the understanding for me

  • @TonyaMartin-b7c
    @TonyaMartin-b7c 14 days ago

    Lee Larry Martin Donald Lee Cynthia

  • @YunliWu
    @YunliWu 23 days ago

    The video clearly describes the ideas in the paper, which is great. However, I'm a bit confused about one part. It seems there is no distinction between L∗(C@B^T)@X and L∗(B^T@X)@C.

  • @RazhanHameed
    @RazhanHameed 25 days ago

    What is the application you're using to annotate or take notes?

  • @DSHeroX
    @DSHeroX 26 days ago

    I'm at the step where I am running the second cell in the infer-interface file. It keeps giving me an error that says something along the lines of: ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`. However, I then ran that command in the terminal and in the notebook, and it still gives the error. Any suggestions? Also thanks, great video for my use case in general; I hope to be able to do some good work after I can get this practice down.

  • @szebike
    @szebike months ago

    Technically impressive that it's possible, however I only see limited application for this. Practically, most models below 8-bit quantization are way less "aware" of input and context. If I alter a situation in an 8b model, it can adjust its output accordingly; any model below that is very rigid. That being said, maybe you can mitigate those effects if you train them at that quantization to begin with rather than compressing a model that was trained at higher precision, if what I'm saying makes sense...

  • @HenryDeng-o2x
    @HenryDeng-o2x months ago

    34:00 wait, how? I don't think "(L∘QK)V" is equal to "Q*cumsum(KV)"

    • @gabrielmongaras
      @gabrielmongaras months ago

      If L is just the causal mask, then we can write the equation as Q@cumsum(K^T @ V). This is because at each time t, the query is unique and is multiplied by the sum of all the past keys and values. For example:
      O1 = Q1 @ K1^T @ V1 = Q1 @ [ K1^T @ V1 ]
      O2 = Q2 @ K1^T @ V1 + Q2 @ K2^T @ V2 = Q2 @ [ K1^T @ V1 + K2^T @ V2 ]
      The cumsum operation sums all the previous elements from timestep 1 to t, which is the relationship above. It is kind of just a compact way for me to write this recurrence relationship in one equation.

    • @HenryDeng-o2x
      @HenryDeng-o2x months ago

      @@gabrielmongaras Got it, and I kinda feel the relationship between causal map and cumsum(). Thank you :D
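
For anyone else wondering about this step, here is a quick numerical check of the two forms discussed above (a toy single-head example with random Q, K, V; not code from the paper or video):

```python
import torch

S, d = 8, 4                                   # toy sequence length and head dim
Q, K, V = torch.randn(3, S, d).unbind(0)

# Quadratic form: causal mask L applied elementwise to the (unnormalized) scores.
L = torch.tril(torch.ones(S, S))
out_masked = (L * (Q @ K.T)) @ V

# Cumsum form: state_t = sum_{s<=t} K_s^T V_s, then O_t = Q_t @ state_t.
KV = torch.einsum('sd,se->sde', K, V)         # per-step outer products K_t^T V_t
state = torch.cumsum(KV, dim=0)               # running sum over time
out_cumsum = torch.einsum('sd,sde->se', Q, state)

print(torch.allclose(out_masked, out_cumsum, atol=1e-5))   # True
```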

  • @陈兆伟-s5w
    @陈兆伟-s5w months ago

    How is the equality in DDPM established in 17:49?

    • @gabrielmongaras
      @gabrielmongaras months ago

      Looks like I forgot to write out the square root over the first term. As for the inner term that got turned into a fraction, I just multiplied sqrt{1-a_t} by the fraction sqrt{1-a_t}/sqrt{1-a_t}.
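
Written out, the manipulation being described is just the identity below (with a_t the cumulative alpha and ε standing in for whatever term the coefficient multiplies at that step):

```latex
\sqrt{1-a_t}\;\epsilon
  \;=\; \sqrt{1-a_t}\cdot\frac{\sqrt{1-a_t}}{\sqrt{1-a_t}}\;\epsilon
  \;=\; \frac{1-a_t}{\sqrt{1-a_t}}\;\epsilon
```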

  • @AarreLisakki
    @AarreLisakki months ago

    a tiny correction: it's phOnEme, not, as you say, phEnOme

    • @gabrielmongaras
      @gabrielmongaras months ago

      Thanks for letting me know! I have no idea where I got phOnEme from ha!

  • @TheCrmagic
    @TheCrmagic months ago

    Isn't the recurrence A_t @ h_(t-1)? So a_1 should be multiplied with h_0, and so on?

    • @gabrielmongaras
      @gabrielmongaras months ago

      It depends on how you define the hidden state. Since I'm defining it as h_t = h_{t-1} + K_t^T @ V_t, the output is defined at timestep t, not t-1. This means that the output at time t takes into account all past information and the current token. If we had a relation on t-1, then it would only consider previous information.

  • @coc2912
    @coc2912 months ago

    Good video, we need these kind of advanced paper explanation.

  • @WesleyWilliams-x9y
    @WesleyWilliams-x9y months ago

    You did an excellent job in your explanations! I can tell you recently had to grasp the topic since it's pretty new! I'm curious as to how you think the architecture should be expanded to larger problems?

  • @EkShunya
    @EkShunya months ago

    great one, really liked it thanks

  • @AM-yk5yd
    @AM-yk5yd months ago

    My only topology knowledge comes from "Experiments in Topology" by Barr, where one chapter was something like a court case about making holes and using them and repairing them, or something like that. The book was written in the 1960s and I read it when I was a teen in the early 2000s. The book covered the very basics of topology and was quite fun even for teenager me, as it had fun exercises.

  • @jahovajenkins5947
    @jahovajenkins5947 months ago

    Huge thanks dude, this explanation has saved me much agony.

  • @farafr46
    @farafr46 months ago

    Thanks for the video, I didn't understand anything

  • @thisismambonumber5
    @thisismambonumber5 months ago

    this video was so cool, thanks ❤ I know it's diffusion-model specific, but it would be cool if you could explain LyCORIS like this

  • @palfers1
    @palfers1 2 months ago

    I cannot listen to you because you continually interrupt yourself. Practise speaking.

  • @nozulani
    @nozulani 2 months ago

    eli5?

    • @gabrielmongaras
      @gabrielmongaras 2 months ago

      Sure! Typical graph data usually takes on a node-edge form, where the data is assigned to each node with connections via edges. This notion of a graph can be generalized by considering a graph with faces formed through a connection of edges, for example a triangle or square. These graphs are called "cell complexes" as each face kind of looks like a cell. Instead of data being put only on vertices, we can also assign data to edges and faces. This makes the data representation and model more expressive since we look at higher-order connections on faces rather than only on edges. The authors form a transformer utilizing the properties of these higher-order structures and find good results on a few datasets.

    • @nozulani
      @nozulani 2 months ago

      @@gabrielmongaras thanks, wasn't exactly eli5 but i understand this is a difficult topic

    • @gabrielmongaras
      @gabrielmongaras 2 months ago

      Is there anything that's particularly confusing?

    • @nozulani
      @nozulani 2 months ago

      @@gabrielmongaras thanks i don't want to take your time, i'm going to study this myself

  • @KaiSyunHou
    @KaiSyunHou 2 months ago

    I understand that the inner loop will update W_t with the modified reconstruction loss, but then how are theta_K, theta_V, theta_Q updated in the outer loop? Specifically, what loss is related to these three parameters?

    • @gabrielmongaras
      @gabrielmongaras 2 months ago

      The "outer loop" uses the normal training strategy where we have the negative log likelihood or cross entropy loss for next token prediction. This gradient of this loss backpropogated to all layers like normal. So the "outer loop" is just the rest of the network.

    • @KaiSyunHou
      @KaiSyunHou 2 months ago

      @@gabrielmongaras but the normal cross entropy loss for next token prediction should not incur any gradient flow to theta_K and theta_V? They are for reconstruction loss, so they are not involved in token prediction

    • @gabrielmongaras
      @gabrielmongaras 2 months ago

      The theta params are trained via the next token prediction loss. This can be thought of as the outer model querying the inner model for information by changing the loss function with these theta params. I think the outer loop actually differentiates through the inner loop (including the gradient of the loss), so the theta_K and theta_V params are updated by the outer loop in this way. The only param that's "trained" using the inner loop is the hidden state.

    • @po-yupaulchen166
      @po-yupaulchen166 22 days ago

      @@gabrielmongaras Maybe a similar question to this; please help clarify. The inner loop updates W using the loss function in (4). Does it mean that theta_K and theta_V are 'not' updated based on the loss in (4)? I guess they are not. Otherwise, how would theta_Q, which is not shown in (4), be updated? I guess the thetas are updated based on the other loss function, but it is not written precisely in the paper.
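
To make the inner/outer split discussed in this thread concrete, here is a heavily simplified, self-contained sketch (my own toy construction, not the paper's code or notation): the hidden state W is updated by an inner reconstruction-style gradient step, and theta_K/theta_V/theta_Q receive gradients only because the outer loss is backpropagated through that inner step.

```python
import torch

d = 16
theta_K = torch.nn.Parameter(torch.randn(d, d) * 0.02)
theta_V = torch.nn.Parameter(torch.randn(d, d) * 0.02)
theta_Q = torch.nn.Parameter(torch.randn(d, d) * 0.02)

def ttt_layer(x_seq, lr=0.1):
    W = torch.zeros(d, d)                 # inner "hidden state", reset per sequence
    outputs = []
    for x in x_seq:                       # x: (d,)
        k, v, q = x @ theta_K, x @ theta_V, x @ theta_Q
        # Inner loss: reconstruct v from k through W. The gradient of
        # 0.5 * ||W @ k - v||^2 wrt W is outer(err, k), so the update below is
        # one inner gradient step and stays differentiable wrt theta_K/theta_V.
        err = W @ k - v
        W = W - lr * torch.outer(err, k)
        outputs.append(W @ q)             # query the updated state
    return torch.stack(outputs)

x_seq = torch.randn(8, d)
target = torch.randn(8, d)
outer_loss = ((ttt_layer(x_seq) - target) ** 2).mean()   # stands in for the NLL
outer_loss.backward()                     # gradients flow into all three thetas
print(theta_Q.grad is not None, theta_K.grad is not None)  # True True
```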

  • @adidevbhattacharya9220
    @adidevbhattacharya9220 2 months ago

    That was indeed a gr8 explanation. Can you please explain how we get the hidden dimension d @28:19? E.g. if the img in latent space is 128x128x3 and we consider a patch size of 32, then no. of tokens = (128/32)^2 = 16. Is the number of dimensions (d) then = p^2 = 32^2? Please clarify this

  • @lexer_
    @lexer_ 2 months ago

    I really like seeing videos like this that are actually willing to explore the math and break it down to make it easier to catch onto how it actually works. Too many just discuss the abstract or skip past the math because it hampers audience maximization.

  • @gabrielmongaras
    @gabrielmongaras 2 months ago

    I want to point out that after training, the inner loop in the RNN still needs to be "optimized" since the gradient update rule is the RNN update rule itself. The normal NLL training procedure just trains the model on how to "teach" the internal RNN layers. This isn't normal optimization though. It doesn't require an optimizer since we know the functional form of the gradient wrt the hidden state, so it's very efficient.

  • @hcp-s4k
    @hcp-s4k 2 months ago

    Does each pixel in the image have a box? For a pixel, the paper says that the vertex is the center of the pixel, but the video says that the vertex can be outside the box. I don't quite understand. I hope there will be a more detailed description of the training details.

  • @hi_6546
    @hi_6546 2 months ago

    Hey, if the frame rate is equal to 50*64, does that mean that each second the model needs to predict 50*64 values? (without counting the dim)

    • @hi_6546
      @hi_6546 2 months ago

      the model works with the index of the codebook no?

    • @nickackerman8755
      @nickackerman8755 months ago

      Where do you get the 64? If we have e.g. 4 codebooks, and there are 50 time steps per one second of audio, we need to predict 4 * 50 values per second (4 codebook indices per time step), I believe.

  • @lexer_
    @lexer_ 2 months ago

    There is a strange disconnect between the kinds of concepts you explain here. Some of them are absolute beginner level, as if you have never coded before and don't know anything about machine learning, but the majority of the video requires at least multiple months of crash course, if not years of experience in the field, to understand. I think your videos could benefit from a more consistent assumption of competence of the audience. Or maybe approach it in sections for different audience competencies? I really appreciate these walk-throughs through papers. It makes it so much easier for me to concentrate than only reading on my own, so I really want anyone who does this to succeed.

    • @gabrielmongaras
      @gabrielmongaras 2 months ago

      Thanks for the feedback!! Usually I try to keep most papers I read a little more technical as I'll find myself explaining things like transformers over and over again, which could lead to videos being unnecessarily long (they already feel way too long, been trying to reduce lengths). Some videos, like the stable diffusion one, I try to make a little more beginner friendly assuming knowledge of CNNs, MLPs and basic training. I think I should probably communicate this better and make a clear distinction between the two perhaps in the title somewhere. I think with this video, I wanted it to be somewhere in the middle which led to problems. I like the idea of approaching in sections based on knowledge level. For example, I could've explained REINFORCE a little or point to a resource about it. While those who know it could skip this part, those who don't could watch it or go to the resource. Will think about this some more for future videos!

    • @lexer_
      @lexer_ 2 months ago

      @@gabrielmongaras I personally got annoyed at the very beginner level explanations early on, which felt like they took forever, and I feel lucky that I stuck it out long enough to actually make it to the interesting parts later on. For a while I thought that was really all there would be in this video: explanations of some very basic concepts. But that is of course very different depending on where one might be coming from in terms of familiarity with these topics. I will watch out for future videos of yours for sure.

    • @gabrielmongaras
      @gabrielmongaras 2 months ago

      @@lexer_ Got it! I didn't realize the intros were annoying 😅 Thanks again for your feedback, it's really helpful! From now on, I think I will directly mention when I am explaining a basic concept, say to skip it if one already knows said concept, and provide resources instead of explaining concepts I assume most should know. In this video, I could've probably just skipped the RLHF explanation altogether and pointed to one of the many explanations of RLHF. I'm going to experiment with this a little in future videos!

    • @codylane2104
      @codylane2104 2 months ago

      @@gabrielmongaras Yeah, pointing to good basic level explanatory material is a great idea! There's a ton of materials on the net, yet only a fraction is really worth studying. 🙂

  • @EngineeredFemale
    @EngineeredFemale 2 months ago

    Great video. Thank you for covering this! Could you also have a look at 'Terminator' paper and make a video on that? It looks like it's a new architecture.

    • @gabrielmongaras
      @gabrielmongaras 2 months ago

      Will take a look at it! Not sure if I'll make a video on it yet though. I usually do if the model proposed something really interesting or has really good results!

  • @nicEDITS_
    @nicEDITS_ 2 months ago

    Really helpful, thanks!

  • @hakikatsingh6254
    @hakikatsingh6254 2 months ago

    dayummmmmmmm , pretty awesome dude

  • @yadav-r
    @yadav-r 2 months ago

    I do not understand it, but can we use it to do stock market forecasting or sports forecasting? How do I use the tool, and is there a tutorial for it? Thank you.

  • @PhucNguyenXuan-cn8rg
    @PhucNguyenXuan-cn8rg 3 months ago

    thank you so much for your clear explanation!

  • @danieloh4511
    @danieloh4511 3 months ago

    Great video. BTW, what is the note-taking app in the video?

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      Thanks! I'm using the Samsung notes app. Pretty standard, but I like the style and ease of use of the app.

  • @acasualviewer5861
    @acasualviewer5861 3 months ago

    If they can convert between Transformers and Mamba2, why not convert GPT2 (which has published weights) to Mamba2 and prove that the converted performs as well as the original? Or do the same with a bigger model such as Llama? Or does their method not allow importing Transformer weights into the equivalent Mamba2?

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      GPT2 and models that use the original transformer architecture use softmax attention, which cannot be decomposed into an RNN or SSM due to the nonlinear dependence of the softmax function on the entire row. This forces you to compute an entire row in memory to get the output of that row. However, if you remove that dependence with the kernel trick or a similar method, such as using ReLU on the queries and keys, then you can decompose the attention mechanism into an RNN/SSM. This is because you no longer have to compute an entire row of the attention matrix to get the output value; rather it's just a sum of values along that row, giving the RNN formulation. So, original softmax transformer models can't be changed into an RNN/SSM, meaning they have to stay quadratic (unless flash attention is used, but this is just a CUDA trick). This is why I think we should move away from softmax attention. Changing to an SSM architecture or linear attention is way more efficient, and papers such as this show that linear methods are comparable to naive transformers in terms of accuracy.

    • @acasualviewer5861
      @acasualviewer5861 3 months ago

      @@gabrielmongaras thank you for responding and for your very valuable videos. You are by far my favorite "paper explainer" channel. Back to the question: given that the Transformers cannot be converted to Mamba2, I'm not sure it's convincing to say that Mamba2 is equivalent to GPT as the paper claims. If it is, then the proof should be in the results. And it should be able to compete with the state of the art. Though I get your point about softmax. Frankly I find it confusing that softmax counts as a non-linearity. But in the end, results should speak for themselves.

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      Thank you! Glad you're finding my videos helpful! I think it may be better to say that SSMs, GPT, and RNNs are very similar in that changing the nonlinearity in GPT and changing the A block in SSMs result in an equivalent RNN. This RNN formulation is what Mamba 2 is. The softmax "nonlinearity" also confuses me. If you try replacing it with other nonlinearities, you'll find that for some reason it has crazy good properties when used on the attention matrix that other nonlinearities don't have, making it get very good accuracy. It also has the fact that the values sum up to 1, which is very nice for stability reasons. Very happy to see something replace this thing as it feels very naive to just slap a softmax onto a matrix. I think the authors make a convincing argument with their results. Since other papers have shown that using a linear attention block can be comparable to softmax attention in terms of accuracy, Mamba 2 is pretty believable. Also they form a lot of theory and release code, making me believe this paper's claims more. I guess we'll see how Mamba 2 compares when tested on large scale LLM cases.

    • @acasualviewer5861
      @acasualviewer5861 3 months ago

      @@gabrielmongaras Softmax seems to compare values, though layer norm does something similar. Maybe the comparison is what makes it special. With the output of softmax we're always creating a weighted sum of sorts. Like an interpolation of vectors. Relu seems incapable of doing the same.
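
On the factorization point in this thread, a small toy check of why the kernelized/ReLU form can avoid materializing the attention matrix while softmax cannot (my own example; the causal mask is omitted for brevity):

```python
import torch

S, d = 6, 4
Q, K, V = torch.randn(3, S, d).unbind(0)
phi = torch.relu                            # the feature map on queries/keys

# Kernelized/linear attention: the product can be re-associated, so the
# (S x S) attention matrix never has to be materialized.
left = (phi(Q) @ phi(K).T) @ V              # quadratic in S
right = phi(Q) @ (phi(K).T @ V)             # only a (d x d) "state" is needed
print(torch.allclose(left, right, atol=1e-5))   # True

# Softmax attention has no such factorization: the row-wise normalizer couples
# every entry of a row, so the full row must exist before multiplying by V.
attn = torch.softmax(Q @ K.T, dim=-1)
out_softmax = attn @ V
print(out_softmax.shape)                    # (6, 4)
```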

  • @elon-69-musk
    @elon-69-musk 3 months ago

    awesome 👍😎

  • @Ryu-ix8qs
    @Ryu-ix8qs 3 months ago

    Thanks for the video. Very helpful to understand how it works.

  • @gabrielmongaras
    @gabrielmongaras 3 months ago

    Sorry the audio got messed up on this one. Got a new mic and it acts a bit weird with my tablet sometimes. Tried to fix it in post processing, but wasn't able to get the wobbling sound out. Maybe I should just train an audio algorithm that takes my garbage audio in and spits out professional audio via consistency loss, ha!

  • @noadsensehere9195
    @noadsensehere9195 3 months ago

    How can I implement this paper?

  • @einsteinsapples2909
    @einsteinsapples2909 3 months ago

    Don't you have the dimensions wrong at 2:30? Shouldn't the Q, K, V matrices be Tokens x Channels, with each row representing a single token?

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      I usually like to transpose the matrices when drawing the QKV matrices out for attention. I feel like a sequence going left to right rather than up to down is more intuitive, but idk.

    • @einsteinsapples2909
      @einsteinsapples2909 3 months ago

      @@gabrielmongaras Interesting for me its more intuitive to have a token per line. Probably because I think of it as a Python list. Like a matrix to me, is a list of lists.

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      @@einsteinsapples2909 I'll keep this in mind for future videos! Idk why, but seeing a token be vertical is the way I've always drawn it. I guess as long as the shape is written out correctly, it's fine.

    • @einsteinsapples2909
      @einsteinsapples2909 3 months ago

      @@gabrielmongaras Hi Gabriel, I don't know about your background, if you're a student or not. I personally, have never formally studied any of these topics, so I don't know what the conventions are (for all I know, what you're doing is the norm). I know in the "Attention is All You Need" paper they write the function as QK^t (they transpose the Keys matrix), the same way you write it in your video. The only way you can do QK^t and end with an "S by S" matrix is if the Q and K matrices are S x d (each row represents a token). If you want the Q,K,V matrices to be "d x S" (columns representing tokens) then you should do K^tQ instead to get the attention scores.

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      @@einsteinsapples2909 I don't have a degree in ML, mostly just self-study. I think the notation that you're suggesting would be correct from a linear algebra perspective, meaning I draw the keys right but need to flip the queries/values. Nonetheless, Q, K, and V are (S, d) regardless of how they're drawn out, or else the attention formula wouldn't work. I guess it's confusing when I draw the tokens as column vectors 😅
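
A tiny shape check of the two drawing conventions discussed in this thread (toy tensors; either convention gives the same S x S score matrix as long as the transpose is placed correctly):

```python
import torch

S, d = 5, 8
Q = torch.randn(S, d); K = torch.randn(S, d)    # tokens as rows: (S, d)
print((Q @ K.T).shape)                           # (S, S), the QK^T form from the paper

Qc = torch.randn(d, S); Kc = torch.randn(d, S)   # tokens as columns: (d, S), as drawn
print((Kc.T @ Qc).shape)                         # (S, S) again, written as K^T Q instead
```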

  • @terenceAAA
    @terenceAAA 3 months ago

    The paper says that they get the N next tokens from N heads using the hidden state h_t at position t, but I just don't understand the arrow from Head1 to Head2 in Figure 2. The distribution of tokens from head2 will never be determined by the output from head1. I just cannot see the connection from head1 to head2.

  • @KevinInPhoenix
    @KevinInPhoenix 3 months ago

    Does this technology make Nvidia's tech and NPUs obsolete?

  • @jonatan01i
    @jonatan01i 3 months ago

    The .log on the mask is there to make the attn_logits' masked values negative infinity; that's how you make them disappear to 0 in the attn

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      ah makes sense, turns a binary mask into a mask of -inf and 0. Usually I just pass a mask of -inf and 0 through the entire model.

    • @jonatan01i
      @jonatan01i 3 months ago

      @@gabrielmongaras btw, thanks for the upload!:)
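
A minimal sketch of the trick being described (my own toy example):

```python
import torch

S = 4
mask = torch.tril(torch.ones(S, S))   # binary causal mask: 1 = keep, 0 = mask out
logits = torch.randn(S, S)

# log(1) = 0 and log(0) = -inf, so adding mask.log() to the logits sends the
# masked positions to -inf, and the softmax turns them into exact zeros.
attn = torch.softmax(logits + mask.log(), dim=-1)
print(attn)                            # upper-triangular entries are 0; each row still sums to 1
```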

  • @jonatan01i
    @jonatan01i 3 months ago

    It's not true that they don't have positional encoding "at all", it's very far from true. Every token only gets access to what came before, and that's a lot of info available to rely on.

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      It still doesn't have information of position though. Position tells the model where in the sequence a certain token is, differentiating the same token in different positions in the sequence. Without positional encodings, the model operates on sets of information, not sequences.

    • @jonatan01i
      @jonatan01i 3 months ago

      @@gabrielmongaras yes, I know what you mean, but the elements of it are not discrete; they are very dependent on their positions (what is there before me) in the sequence. If you switch two tokens' places in it, the few tokens that come before the leftmost of the two switched tokens will remain the same, but from that point on in the sequence everything changes. We could say they have contextual positional encodings built into them in some sense (given you have the future masking; for a vanilla encoder it doesn't work, that one has no position information at all, I agree with that, but not for the future-masked one)

    • @jonatan01i
      @jonatan01i 3 months ago

      @@gabrielmongaras if we have a transformer model with, let's say, 5 tokens and without additional positional encodings, and its sole task would be to have a random permutation of these 5 tokens given to it two times (same permutation twice), essentially its task is to copy the first five tokens one after the other in the same order.. if you believe a transformer (with future mask!) has no positional information at all, then you realise that you are suggesting that this task is impossible for the transformer to learn. Do you agree with this or are we talking about two different things?

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      oh yea I think I see what you're saying. Without PEs, the transformer cannot distinguish position of words, however it can still get a sense of the position of the word it is currently generating based on the context of the tokens. I suppose it may be easier to take a look at this through the idea of sets. A normal transformer with PEs is essentially operating on an ordered set where each element is unique due to the ordering. Without PEs, the transformer operates on an unordered set of data, but it can still get an idea of the size of the set and maybe also an idea of the "count" of the number of duplicated items in the set.

    • @jonatan01i
      @jonatan01i 3 months ago

      @@gabrielmongaras and also an idea of [2,3,(1,5,4)][2,3, and now what comes next? How would it know if it was (1,4,5),(1,5,4),(4,1,5),…or(5,4,1)? I get it that the addition operation is commutative and all, but the model will eventually rely on context of previous tokens to get to come up with a useful (relative)positional information for itself because it is possible to do given there is the future mask given to it (if no mask, then the transformer yes becomes commutative truly, you can change the order of the tokens and it won’t make any difference, (given you don’t use convolutions with kernels bigger than 1 and positional embeddings neither)

  • @lienlrac7644
    @lienlrac7644 3 months ago

    When Gabriel drops I stop everything and listen

  • @marinepower
    @marinepower 3 months ago

    I wonder if this method could be improved by having a new projection matrix of size [hidden_dim x 1] that computes the width of each token: we take the sigmoid, then the cumulative sum, we do the interpolation as described, but we add it to the queries and keys, then do normal attention. This requires a new ((small)) matrix but would allow us to use flash attention directly without needing a new CoPE kernel.

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      This makes perfect sense to work just as well as the CoPE method to me! The only drawback I see is that CoPE recalculates all positions for each token while positions in this case would not change for past tokens. However, I'm guessing this recomputation is unneeded and that you will get similar performance as CoPE, but it's a lot more efficient computationally. Are you thinking of implementing this?

    • @marinepower
      @marinepower 3 months ago

      @@gabrielmongaras All positions here would be recalculated every layer, so they would affect past tokens. For example, the width for token 1 might be computed as 0.8 in layer 1 and 0.5 for layer 2, etc. Since we're doing the cumulative sum all positions are updated, since every token relies on token 1 (but this applies to every token in the series). As far as actually implementing this, however, I ran into a small issue. What this method uses (floor, ceil) is non-differentiable. Instead, the embeddings would need to be calculated directly from the cumulative embedding position, which I think is fairly expensive since these embeddings use a lot of sines and cosines, etc. It seems possible, it's just that we might want to simplify the embedding function a bit.

    • @Anonn724
      @Anonn724 3 months ago

      Sounds interesting. Definitely will give it a try to implement this. I quickly implemented some CUDA as Gabriel mentioned; if somebody is interested: gh juvi21/CoPE-cuda

    • @gabrielmongaras
      @gabrielmongaras 3 months ago

      @@marinepower That makes sense, though at the same layer the positions are calculated once; in CoPE, they're recalculated for each token. But I don't think this difference will be very beneficial, meaning this method can probably work just as well! I think a detach can be used on the floor/ceil, while the weight will be the differentiable part: for a given token, we have a positional value of say p, which is the sum of all positional values up to that token. The positional embedding for that token would then be e = (1 - (p - floor(p)))*PE[floor(p)] + (p - floor(p))*PE[ceil(p)], where PE[i] is the ith absolute positional encoding. In this case, floor(p) and ceil(p) are not differentiable, but p is, so there's still gradient flow. So I think this idea should still work!

    • @marinepower
      @marinepower 3 months ago

      @@gabrielmongaras Hmm, yeah, I think that works! I am not really training any transformers right now that need variable positional encodings so I probably won't test this method, but the next time I look into llms I'll try this out! (Although... training LLMs from scratch requires a tremendous amount of compute, so I moreso hope someone else notices this comment chain and tries it out lol)
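
For anyone who wants to try the idea from this thread, here is a minimal sketch of the interpolation step written out above (my own simplification; `widths` stands in for the hypothetical [hidden_dim x 1] projection output, and `PE` for whatever absolute encodings are used):

```python
import torch

def interp_pos_embedding(widths, PE):
    # widths: (S,) raw per-token width logits; PE: (max_pos, d) absolute encodings.
    p = torch.sigmoid(widths).cumsum(dim=0)      # fractional position of each token
    p_idx = p.detach()                           # indices carry no gradient...
    lo = p_idx.floor().long()
    hi = p_idx.ceil().long().clamp(max=PE.shape[0] - 1)
    w = (p - p.floor()).unsqueeze(-1)            # ...but the mixing weight does
    return (1 - w) * PE[lo] + w * PE[hi]

S, d, max_pos = 10, 16, 64
PE = torch.randn(max_pos, d)                     # stand-in for sin/cos encodings
widths = torch.randn(S, requires_grad=True)
e = interp_pos_embedding(widths, PE)
e.sum().backward()
print(widths.grad.shape)                         # torch.Size([10]) -- gradient flows through p
```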

  • @DataScienceGuy
    @DataScienceGuy 3 months ago

    great job! Very clear explanation indeed. What do you think: why does the ControlNet approach work worse with the SDXL model?