Were RNNs All We Needed? (Paper Explained)

  • Published on Dec 25, 2024

Comments • 76

  • @fireinthehole2272
    @fireinthehole2272 2 months ago +233

    Next paper: were NAND gates and registers all we needed?

    • @ickorling7328
      @ickorling7328 2 months ago +12

      Wait, literally...

    • @yurona5155
      @yurona5155 2 months ago

      Shame on you, NERF (NOR-exclusionary reductive functionalist)!

    • @achunaryan3418
      @achunaryan3418 2 months ago +6

      Was Newton all we needed?

    • @Cereal.interface
      @Cereal.interface 2 months ago +9

      are organic molecules and nucleotides all we needed?

    • @ickorling7328
      @ickorling7328 2 months ago +2

      @@Cereal.interface was DNA as central dogma all we needed?

  • @Bikameral
    @Bikameral 2 months ago +130

    It's great having you back!! Thank you, and please don't leave us again.

  • @wolpumba4099
    @wolpumba4099 2 months ago +44

    *Were RNNs All We Needed? Revisiting the Power of Minimal Recurrent Networks*
    * *0:00** Introduction:* The video explores a paper questioning the necessity of complex recurrent neural network (RNN) architectures like S4 and Mamba, suggesting that simpler RNNs might achieve comparable performance.
    * *0:16** RNNs vs. Transformers:* RNNs handle sequences efficiently with constant memory requirements, compared to Transformers' quadratic memory needs, but they must be trained with backpropagation through time (BPTT).
    * *3:52** BPTT Limitations:* BPTT requires backpropagating gradients through all intermediate steps, limiting the length of sequences RNNs can effectively handle.
    * *5:30** State Space Models:* Newer models like S4 and Mamba sidestep the BPTT bottleneck by removing the previous hidden state from the gate and input computations, which allows training to be parallelized across the sequence.
    * *9:06** Minimal RNNs (minGRU, minLSTM):* The paper introduces minimal versions of GRUs and LSTMs that eliminate hidden state dependencies in gating mechanisms, further simplifying computation.
    * *12:54** Parallel Scan:* These minimal RNNs can be trained efficiently using a parallel scan algorithm, similar to S4 and Mamba (a minimal code sketch follows this summary).
    * *14:56** Trade-offs:* While simpler, minimal RNNs are less powerful than traditional RNNs in a single layer. However, this can be mitigated by using multiple layers.
    * *19:55** Experimental Results:*
    * *19:57** Selective Copying Task:* Minimal RNNs struggle with long-range dependencies in a single layer, but improve significantly with multiple layers.
    * *21:02** Reinforcement Learning Benchmarks:* Minimal RNNs perform well, but the benchmarks are considered too simple to draw strong conclusions.
    * *23:59** Language Modeling (Shakespeare):* Minimal RNNs perform comparably to Mamba on this small character-level dataset, where Transformers struggle due to the task's local nature.
    * *26:45** Conclusion:* The paper's hypothesis that minimal RNNs can achieve comparable performance to complex state-space models is plausible, but requires stronger experimental evidence. However, the potential for scalability and efficiency makes them promising candidates for future research.
    I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript.
    Cost (if I didn't use the free tier): $0.03
    Input tokens: 21161
    Output tokens: 467
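
    For concreteness, here is a minimal NumPy sketch of the minGRU recurrence summarized above: z_t = sigmoid(W_z x_t), candidate_t = W_h x_t, h_t = (1 - z_t) * h_{t-1} + z_t * candidate_t. Because the gate and candidate depend only on the input, the recurrence is linear in h and does not need step-by-step BPTT. The weight names and the cumulative-product shortcut below are illustrative assumptions, not the paper's reference implementation, which uses a log-space parallel scan.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def min_gru_sequential(x, Wz, Wh, h0):
        # Step-by-step recurrence: h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t,
        # where z_t and h_tilde_t are computed from the input x_t only.
        hs = np.empty((x.shape[0], h0.shape[0]))
        h = h0
        for t in range(x.shape[0]):
            z = sigmoid(x[t] @ Wz)       # update gate (no h_{t-1} inside)
            h_tilde = x[t] @ Wh          # candidate state (no h_{t-1} inside)
            h = (1.0 - z) * h + z * h_tilde
            hs[t] = h
        return hs

    def min_gru_closed_form(x, Wz, Wh, h0):
        # The same recurrence, viewed as h_t = a_t * h_{t-1} + b_t and unrolled
        # in closed form. A practical implementation would use a log-space
        # parallel (Blelloch) scan for speed and numerical stability.
        z = sigmoid(x @ Wz)
        a = 1.0 - z                      # per-step decay
        b = z * (x @ Wh)                 # per-step input contribution
        A = np.cumprod(a, axis=0)        # A_t = a_1 * ... * a_t
        return A * (h0 + np.cumsum(b / A, axis=0))

    # Toy check that both formulations agree (random weights, hypothetical sizes).
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 4))
    Wz, Wh = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
    assert np.allclose(min_gru_sequential(x, Wz, Wh, np.zeros(3)),
                       min_gru_closed_form(x, Wz, Wh, np.zeros(3)))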

    • @onlyms4693
      @onlyms4693 2 months ago +1

      Is their answer to the attention head just adding more layers? If I'm not wrong, even Mamba has limitations and uses Transformer multi-head attention to mitigate them.
      What we need to find is a replacement formula for the attention head, because I feel it's the biggest compute cost: the bigger the context, the more has to be processed in the attention head, since every token is compared against every other token.

  • @novantha1
    @novantha1 2 months ago +50

    Imagine how influential this paper could have been if it had been released in 2014, lol. It would have been revolutionary.

  • @Neomadra
    @Neomadra 2 months ago +2

    Excellent analysis of the benchmarks. Especially the analysis of character level tasks makes so much sense.

  • @HoriaCristescu
    @HoriaCristescu 2 months ago +1

    Great explanation of the distinction between SSM and RNN at 5:30

  • @maccloud8526
    @maccloud8526 2 months ago +84

    Use a dark theme, then you won't have to wear sunglasses.

    • @achunaryan3418
      @achunaryan3418 2 months ago +9

      Even then, how is he going to enter the Matrix, Neo?

    • @ikartikthakur
      @ikartikthakur 2 months ago +2

      dang

    • @apncahere137
      @apncahere137 2 months ago

      Lmao

  • @lizardy2867
    @lizardy2867 2 months ago +10

    TLDR: It would have been more experimentally interesting to see results on an ensemble of minGRUs.
    It is hard for me to say there is much takeaway here besides confirmation of the Mamba architecture's success. Perhaps they were so excited to release the paper that they decided not to focus on its stronger aspects: the minGRU itself and the concept of ensembling that Mamba also relies on.

  • @GNARGNARHEAD
    @GNARGNARHEAD 2 months ago +3

    I was looking at doing something similar last week, but compressing the layers of a transformer into the weights of an RNN gets around the training inefficiencies.

  • @r.alexander9075
    @r.alexander9075 2 months ago +3

    Why were the benchmarks chosen to be RL tasks instead of sequence modelling tasks? And why would we then compare them to Decision Transformers?

  • @creepi.
      @creepi. 2 months ago +2

    Why do most papers concerning SSMs and RNNs not include RWKV in their benchmarks? It would've been interesting to see how it fares against Mamba (S4/S6) and minGRU/minLSTM.

  • @elpepemandioca
    @elpepemandioca 2 months ago +3

    In spite of not getting good results right now, I'd like more research to go this way, attempting to synthesize the plethora of models

  • @PaganPegasus
    @PaganPegasus 2 months ago +1

    To me this paper highlights that RNNs actually aren't all we need and how powerful the transformer really is. A two layer transformer alone is capable of solving a bunch of tasks such as copying, sorting or other sorts of linear classification and reasoning thanks to the QK/OV circuits.

  • @the_primal_instinct
    @the_primal_instinct 2 months ago +9

    Next paper: "Can multiplications be replaced with multiple additions?"

    • @dougrattmann1
      @dougrattmann1 2 months ago +1

      *cough* AutoML-Zero moment

    • @quasimodo1914
      @quasimodo1914 2 months ago

      I don't know, KAN they?

  • @MohamedMagdy-u2k
    @MohamedMagdy-u2k 2 months ago

    24:00 Correct me if I am wrong, but what I see is that the Transformer is a more general architecture that requires more training time, while on the other side Mamba, minLSTM, and minGRU have an inductive bias that makes these architectures converge very quickly on that dataset.

  • @lifeofcode
    @lifeofcode 2 months ago +1

    Wonderful overview, thanks!

  • @DanFrederiksen
    @DanFrederiksen 2 months ago +1

    Wouldn't it be straightforward to try it on the GPT-2 training set and compare? Or is that inconvenient?

  • @danielsautot4521
    @danielsautot4521 2 months ago +1

    Welcome back. Can you make a video on the architecture of the Liquid Foundation Models?

  • @Mordenor
    @Mordenor 2 months ago +5

    Thank you Mr Yannic for discussing whether RNNs are all we needed.

  • @Timotheeee1
    @Timotheeee1 2 months ago +1

    can you review the nGPT paper?

  • @shikhars4816
    @shikhars4816 2 months ago +2

    iiuc, selective copying of a token depends on the current input token alone (?). In that case, why does a single layer perform so badly on the task?

    • @achunaryan3418
      @achunaryan3418 2 months ago

      Single-layer selection cannot be used optimally in an RNN for selective token copying based on the current input token alone. Recurrence needs more than one layer to create outputs with less error, even when the gating is input-dependent. Maybe a CNN or an MNN can produce a better result.

  • @yorth8154
    @yorth8154 2 months ago

    Hey Yan, I was wondering if you've seen the new Google paper on Selective Attention. It looks good.

  • @box-mt3xv
    @box-mt3xv 2 months ago +2

    Missed your videos

  • @Metalhead121396
    @Metalhead121396 2 months ago

    Comment on Table 2 -- I was under the impression that the S6 configs were generally the "best" for S4/H3/Mamba? Or does someone know of cases where the S4 or Hyena layers are better-suited?

  • @xelaxander
    @xelaxander 2 months ago +1

    25:55 Constant gate decay might actually be interesting for surrogate models of physical systems. Ignoring damage accumulation, a system's response is independent of its history.

    • @xelaxander
      @xelaxander 2 months ago +2

      You need knowledge of the past, though, since you can't include the entire phase space in your input, making you lose higher-order information.

    • @nias2631
      @nias2631 2 months ago

      This kind of happens in Echo State Networks, since the previous signals ring down due to the eigenvalues of the reservoir matrix.

  • @marcotito9873
    @marcotito9873 2 months ago

    Really great content

  • @black-snow
    @black-snow 2 months ago +5

    5th!
    Finally able to leave a high-quality comment.

  • @alekseyburrovets4747
    @alekseyburrovets4747 2 months ago +1

    Randomly stumbled. Subscribed.

  • @ensabinha
    @ensabinha 2 months ago

    The fact that one of the experiments is run on a simple benchmark is not the issue. As long as all architectures were run on it, that is not an argument against using it as a benchmark. Good architectures should perform well on simple problems as well. However, they should be run on hard problems too.

  • @testboga5991
    @testboga5991 2 months ago

    I think they're onto something, but I also think that in the strict sense it is impossible to demonstrate if it can't be mathematically proven (likely not possible for a human anyway, if at all). They're basically trying to prove a negative, which strictly doesn't work.

  • @2dapoint424
    @2dapoint424 2 months ago +1

    Next paper: Electricity is all we need.

  • @makhalid1999
    @makhalid1999 2 months ago

    GPRNN when?

  • @mrpocock
    @mrpocock 2 months ago +1

    RNNs are effectively map-reduce.
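
    In the fold/scan sense, at least: unrolling an RNN is a left fold of a step function over the input sequence. A toy Python sketch, with a made-up scalar step function purely for illustration:

    from functools import reduce

    def rnn_step(h, x, w_h=0.5, w_x=1.0):
        # One toy recurrent step: next hidden state from previous state and input.
        # A real RNN would use weight matrices and a nonlinearity.
        return w_h * h + w_x * x

    h_final = reduce(rnn_step, [1.0, 2.0, 3.0], 0.0)   # left fold, h0 = 0.0
    print(h_final)  # 4.25

    The reduce only parallelizes when the step is associative, which is exactly what the linear, input-only gating in minGRU/Mamba-style recurrences buys you.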

  • @andytroo
    @andytroo 2 months ago

    How many layers are in minGRU? Current transformers have >20 complex layers ....

  • @andrewaverbah4809
    @andrewaverbah4809 2 months ago

    Please review the REPA paper.

  • @-E42-
    @-E42- 2 months ago +2

    damn I wish I was at flying altitude to fly with you through these papers ahah :D

  • @tresuvesdobles
    @tresuvesdobles 2 months ago +1

    I doubt it (answering the question in the title)

  • @davidlearnforus
    @davidlearnforus 2 months ago

    But I do not get who decided that better performance of a compositional bare-bones element means anything. An amoeba has far more capabilities than a single human neuron, but there is a big "BUT".

  • @MarceloTeixeiraa
    @MarceloTeixeiraa 2 months ago

    What's the next ALL WE NEEDED?

  • @tallwaters9708
    @tallwaters9708 2 months ago

    What do people use RNNs for these days? I thought they went the way of GANs.

    • @chickenp7038
      @chickenp7038 2 months ago +3

      GANs are definitely still very much used. All of the VAEs in the LDMs use a GAN loss.

    • @novantha1
      @novantha1 2 months ago +2

      Well, the problem with deep learning seems to be that you can do most tasks with most architectures given enough scale, data, and training compute. RNNs are kind of nice in that paradigm because they have stable memory allocation with large sequences compared to Transformers, but they're also a lot easier to optimize, because you effectively just need efficient kernels for the linear transformations, activation functions, and parallel scan algorithms, which is quite a bit simpler than, for instance, a full Transformer.
      As for what you'd use them for? Presumably the same things you could use a Transformer for, essentially. It appears that for a lot of the things you would use a smaller LLM for (i.e. 1.3B and below), it actually really doesn't matter which architecture you have.
      I've also thought about extending the context length of a Transformer LLM with some sort of RNN adapter for the ultra-long-range dependencies, but I'm not even sure what that would look like exactly.

    • @AM-yk5yd
      @AM-yk5yd 2 months ago +2

      I think Translatotron still uses an LSTM.
      It's mentioned in the Translatotron 2 paper; iirc the Translatotron 3 paper doesn't explicitly go into what the decoder is, only vaguely, and says that its backbone is Translatotron 2.
      There was also xLSTM. I think Yannic covered it.
      RWKV is still alive and being developed. Still weak, but one day....
      I would not be surprised if it is used in time-series prediction. Mamba would fit perfectly, and an RNN is generally the first thing I'd try for modelling a time series from scratch.

    • @AM-yk5yd
      @AM-yk5yd 2 months ago

      The simplest form of adapter would probably be to just insert extra RNN layers.
      Memory Transformers (or RETRO, I don't remember which) found that inserting a KV lookup near the end, around layer 10 in a 12-layer network, gives very good results. We could replace the KV lookup with RNN layers and either add their output as prefix tokens, RMT-style, or just add the values.

  • @4thpdespanolo
    @4thpdespanolo 2 months ago

    Were Transistors All We Needed?

  • @crassflam8830
    @crassflam8830 2 months ago +1

    yes

  • @zrmsraggot
    @zrmsraggot 2 months ago

    I just saw the title and I laughed.

  • @MartinDxt
    @MartinDxt 2 months ago

    Say whaaaat?

    • @lanessarosel
      @lanessarosel 2 months ago

      Right, I'm still wondering what Borealis AI is; it sounds like a reset machine.