Fast and Slow Learning of Recurrent Independent Mechanisms (Machine Learning Paper Explained)

  • Published 2 Aug 2024
  • #metarim #deeprl #catastrophicforgetting
    Reinforcement Learning is very tricky in environments where the objective shifts over time. This paper explores agents in multi-task environments that are usually subject to catastrophic forgetting. Building on the concept of Recurrent Independent Mechanisms (RIM), the authors propose to separate the learning procedures for the mechanism parameters (fast) and the attention parameters (slow), achieving superior results, more stability, and even better zero-shot transfer performance.
    OUTLINE:
    0:00 - Intro & Overview
    3:30 - Recombining pieces of knowledge
    11:30 - Controllers as recurrent neural networks
    14:20 - Recurrent Independent Mechanisms
    21:20 - Learning at different time scales
    28:40 - Experimental Results & My Criticism
    44:20 - Conclusion & Comments
    Paper: arxiv.org/abs/2105.08710
    RIM Paper: arxiv.org/abs/1909.10893
    Abstract:
    Decomposing knowledge into interchangeable pieces promises a generalization advantage when there are changes in distribution. A learning agent interacting with its environment is likely to be faced with situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for being able to generalize in a systematic manner to out-of-distribution changes. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs and its reward function are stationary and can be re-used across tasks. An attention mechanism dynamically selects which modules can be adapted to the current task, and the parameters of the selected modules are allowed to change quickly as the learner is confronted with variations in what it experiences, while the parameters of the attention mechanisms act as stable, slowly changing, meta-parameters. We focus on pieces of knowledge captured by an ensemble of modules sparsely communicating with each other via a bottleneck of attention. We find that meta-learning the modular aspects of the proposed system greatly helps in achieving faster adaptation in a reinforcement learning setup involving navigation in a partially observed grid world with image-level input. We also find that reversing the role of parameters and meta-parameters does not work nearly as well, suggesting a particular role for fast adaptation of the dynamically selected modules.
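    The fast/slow split described in the abstract amounts to optimizing two disjoint parameter groups at different rates: the module ("mechanism") parameters quickly, the attention parameters slowly. A minimal PyTorch sketch of that idea (an illustration with made-up sizes and a made-up loss, not the authors' code, which additionally uses top-k hard selection and a meta-learning outer loop):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the paper's components; all sizes are made up.
n_modules, dim = 4, 32
mechanisms = nn.ModuleList([nn.GRUCell(dim, dim) for _ in range(n_modules)])
attention = nn.Linear(dim, n_modules)  # scores which modules to activate

# Fast loop: mechanism parameters adapt quickly to the task at hand.
fast_opt = torch.optim.SGD(mechanisms.parameters(), lr=1e-2)
# Slow loop: attention parameters act as slowly changing meta-parameters.
slow_opt = torch.optim.SGD(attention.parameters(), lr=1e-4)

x, h, target = torch.randn(8, dim), torch.zeros(8, dim), torch.randn(8, dim)

for _ in range(100):
    weights = torch.softmax(attention(h), dim=-1)  # soft module selection
    h_next = sum(w[:, None] * m(x, h) for w, m in zip(weights.T, mechanisms))
    loss = ((h_next - target) ** 2).mean()

    fast_opt.zero_grad()
    slow_opt.zero_grad()
    loss.backward()
    fast_opt.step()  # mechanisms: full-speed update every step
    slow_opt.step()  # attention: same gradients, 100x smaller steps
```

    Reversing the two learning rates is the ablation the abstract reports as working much worse.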
    Authors: Kanika Madan, Nan Rosemary Ke, Anirudh Goyal, Bernhard Schölkopf, Yoshua Bengio
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments • 30

  • @nx6803
    @nx6803 3 years ago +33

    It’s kind of funny that the authors (or a subset of them) ended up proving that their previous paper (modular) is worse than a vanilla LSTM. That paper was an ICLR spotlight. So did the authors not test their own method enough, did they cherry-pick metrics/results for the last paper (like they do here), or do you only need one established "coauthor" to pass muster at ICLR/ICML etc.?

    • @priyamdey3298
      @priyamdey3298 3 years ago +2

      Well said😂

    • @YannicKilcher
      @YannicKilcher 3 years ago +9

      I think this paper really focuses on very specific situations and environments where this two-loop solution shines.

    • @abcd-gj6ru
      @abcd-gj6ru 3 years ago +3

      Like Yannic said, the paper (like all other papers imo) shows where it does better than the existing method (RIMs). The environment and metrics used seemed fair, but the reader's curiosity about performance in other environments and with alternate metrics is natural. As the proposed approach seems interesting, I would not be surprised to see new papers popping up with testing and experiments in "other" environments.
      In regards to the co-authors, we all know that reviews are blind, and the paper was not released on arXiv before the results, so the comments are not really fair to the reviewers lol.

    • @nx6803
      @nx6803 3 years ago +5

      @@abcd-gj6ru Reviews being blind or double-blind does not guarantee that you can't identify where the paper originated from. Yannic himself showed an example of this with regard to Google Brain.
      The focus of my questions was not this paper, but the previous one that this was built on. If vanilla LSTMs perform significantly better than it, then the experimental section of that paper was lacking. I have seen thousands of papers rejected based on unfair/insufficient comparisons, so it begs the question of how that did not happen to the RIMs paper.

    • @Anirudhgoyal
      @Anirudhgoyal 3 years ago +3

      @@nx6803 Thanks for your comment.
      In the original RIMs paper, we did not explore BabyAI RL problems. We did explore Bouncing Balls video prediction, sequential prediction tasks, the entire suite of Atari RL games, and imitation learning problems, which I think is a sufficient number of tasks for a scientific study. :)
      For the MetaRIMs paper, we did a systematic study using RIMs for BabyAI, where we did not tune the top-k number of active RIMs for each environment specifically (we also did not do this in the RIMs paper). We found that in some of the environments RIMs actually perform significantly worse than the LSTM baseline (if we keep the top-k factor constant). If we vary the top-k for each environment separately, the performance of RIMs improves and in fact exceeds the vanilla LSTM baseline (also note that RIMs/MetaRIMs use significantly fewer parameters than the LSTM baseline). For Meta-RIMs, we used the same top-k for all the environments, and the performance was significantly better than the vanilla LSTM model across all the environments.
      Let me know if you have any other feedback or criticism. :)

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago +2

    The Sokoban example with intentional reintroduction is really fragile; a better example would be a cup-shuffle game with one key hidden under one cup.

  • @egparker5
    @egparker5 3 years ago +1

    Two things.
    1) The learning rate on Line 9 in Algorithm 1 should probably be beta instead of alpha.
    2) This idea of a network of independent modules is reminiscent of persistent (or immutable) data structures as exemplified by functional languages like Haskell.

  • @user-xs9ey2rd5h
    @user-xs9ey2rd5h 3 years ago +3

    Wow, first? First time I've seen a paper before a video was made about it

  • @AICoffeeBreak
    @AICoffeeBreak 3 years ago +6

    👏 -- ☕

  • @drdca8263
    @drdca8263 3 years ago +1

    42:00: "we simply give the attention a lower learning rate": Oh! After hearing the comparison to GANs, I was just thinking of asking what happens if you do that (but for GANs, not for this)!
    You say that it isn't surprising that this doesn't work. Ok, yeah I guess it isn't the same thing, but like, it seems similar?
    Or, it seems that way to me who has no practical experience with this sort of thing. (I was a little surprised it didn't work.)
    If, in a simpler network (say, some fully connected network with a dozen inputs, a few outputs, a handful of hidden layers, and an ordinary fixed dataset of input/output pairs), you divided the learning rate by, say, 5 and trained for 5 times as many steps, compared to the ideal way of training it,
    I would have imagined that this would result in similar performance, only taking about 5 times as long to train.
    Is this wrong?
    I've thought that gradient descent was an approximation to doing something like this except taking the limit as the step size goes to zero and the number of steps goes to infinity (holding their product constant), and that the step size was mostly a trade-off between how close of an approximation it is, and how long it takes to train. I guess maybe if the step size is too small one is more likely to get stuck in a local minimum?
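    The toy version of this question is easy to check. A quick sketch (purely illustrative, unrelated to the paper's code): plain gradient descent on a noiseless convex problem, once with a baseline step size and once with the step size divided by 5 and 5 times the steps.

```python
def train(lr: float, steps: int) -> float:
    # Plain gradient descent on f(w) = (w - 1)^2, a noiseless convex toy.
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - 1.0)  # w <- w - lr * f'(w)
    return (w - 1.0) ** 2

print(train(lr=0.10, steps=100))  # baseline: loss ~ 0
print(train(lr=0.02, steps=500))  # lr / 5, 5x the steps: also ~ 0
```

    In this regime the intuition holds. With stochastic gradients and a non-stationary objective (the paper's setting), the step size also controls how fast the parameters can track a shifting target, so the equivalence breaks down.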

    • @joshuasmith2450
      @joshuasmith2450 3 years ago +1

      In high-dimensional spaces the probability of getting stuck in a local optimum isn't much of a concern, because it approaches 0 as the number of dimensions increases. This is more of an intuitive argument, as I believe correlation can muddy it, but that is the general intuition, and it seems to be validated in practice, where getting stuck in local optima isn't much of an issue. The basic idea is: if you are stuck in a local minimum according to 1 trainable parameter and then introduce another trainable parameter, the odds of that one also being at a local minimum at that point are less than 1. You are only at a local minimum if all trainable parameters are at local minima, so the odds of being at a local minimum are the product of the per-parameter odds, which shrinks as the number of parameters grows.
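      In symbols (one way to formalize the comment's argument, assuming independence across parameters):

```latex
% If each of the n trainable parameters is independently at a local
% minimum with probability x < 1, all n must be at one simultaneously:
P(\text{local minimum}) = x^{n} \to 0 \quad \text{as } n \to \infty
```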

    • @joshuasmith2450
      @joshuasmith2450 3 years ago +1

      I do agree though that the reduced learning rate experiment was surprising, and I don't think it was given fair credit.

  • @akashraut3581
    @akashraut3581 3 years ago +4

    "Why this? That's just because feelings."
    Almost died laughing

  • @004307ec
    @004307ec 3 years ago +1

    This method looks like a transformer with routing over the attention matrices.

  • @RobertWeikel
    @RobertWeikel 3 years ago +1

    So the "meta" learning issue:
    I personally feel it is conflated and almost approaching a word soup of buzzwords, however...
    Is a generalization technique based around classifying tasks with another RL policy, and then learning aspects of that 2nd policy, considered meta-learning?
    Would you say an RL algorithm that is trained to identify which RL agent to use at any given time would be meta-learning?

    • @sheggle
      @sheggle 3 years ago

      When you just pick one agent I would argue no, as you're just throwing more compute at the problem in the hopes of it working. When your meta-learning picks a composite of functions/agents, I would say yes. That is to say, you want to learn the composite that optimizes the gradient flow (or learning) through that composite, hence I would consider it meta-learning.

    • @RobertWeikel
      @RobertWeikel 3 years ago

      @@sheggle so by your definition, meta-learning goes beyond "learning to learn".

  • @abcd-gj6ru
    @abcd-gj6ru 3 years ago +4

    This paper seemed pretty cool in terms of proposing a new approach for overcoming the challenge of catastrophic forgetting in certain multi-task environments. I would have liked to see a bit more explanation in the video (like we are usually treated to), with perhaps a more positive approach. The video seemed harsh (rant-y), especially towards the end, but I can understand where Yannic is coming from.

    • @Anirudhgoyal
      @Anirudhgoyal 3 years ago

      Thanks for your comment, and for liking the method.
      To be fair to Yannic, he made a good point for improvement: we don't discuss the size of the context. I think it's an important point, which is not well discussed as of now. Regarding the slow/fast training of parameters, I think what we did is indeed fair.
      During our ICLR submission, all the reviewers mostly cared about
      (a) what are the different modules learning?
      (b) are different modules getting activated at different parts of the state space?
      So, after the rebuttal, these ablation results made it into the paper. The point being: training the parameters at different time-scales IMPROVES the specialization of different modules. These ablations were supposed to highlight this point. :)

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    And what is this "catastrophic forgetting"? Is it some kind of meta-analytical concern fallacy?

  • @TudorSabin
    @TudorSabin 3 years ago

    Because feelings 🤣👍

  • @mranandtrex8436
    @mranandtrex8436 2 years ago

    Great video! But your pronunciation of the first and third authors' names is completely off :-(

  • @pensiveintrovert4318
    @pensiveintrovert4318 3 years ago +2

    This is ResNets on steroids, only the shortcut connections are not differentiable.

  • @RS-cz8kt
    @RS-cz8kt 3 years ago

    - What is my purpose?
    - You locate the oranges.
    - ... Oh my God...