LLaMA Pro: Progressive LLaMA with Block Expansion (Paper Explained)

  • Published on 15 Sep 2024

Comments • 95

  • @YannicKilcher  8 months ago  +53

    Note: The H800 is a variant of the H100 for the Chinese market
    OUTLINE:
    0:00 - Introduction
    5:30 - Adding new blocks to LLaMA
    15:00 - Block expansion
    27:40 - Experiments
    30:40 - Conclusion
    Paper: arxiv.org/abs/2401.02415
    Other Paper: proceedings.mlr.press/v162/shen22f/shen22f.pdf

    • @thegreenxeno9430  8 months ago  +1

      Forgetting things properly is far more important than learning new things.

    • @zyxwvutsrqponmlkh  8 months ago

      ​@@thegreenxeno9430 It's not the things I don't know that get me in trouble. It's the things I think I know that cause me the most grief.

    • @keypey8256  8 months ago

      ​@@thegreenxeno9430depends

  • @MultiMojo  8 months ago  +54

    Thank you for doing these paper videos! They're far more engaging to watch and learn from than reading the papers themselves. There are so many papers in this field that it's difficult to filter out all the noise (or false hype).

    • @barbaragendron2836  8 months ago  +3

      I completely agree. As a 1st-year PhD student working on LLMs, I often struggle to select interesting papers, since I don't have Yannic's expertise to bring that kind of critical judgment myself. That's for sure a major issue in the field.

  • @machine_ethics  8 months ago  +28

    Totally agree with Yannic: if we affect the flowing data at some early point, then the subsequent cascade of transformer blocks would inevitably drive the resulting data extremely far from the original (unaffected) transformations. This is some sort of butterfly effect.
    On the other hand, the one thing that probably happened in this experiment is that by retaining a residual connection from the original bottom block to the original upper block (which serves as a bypass path), they forced the weights of the newly added intermediate layers to adapt only to the new knowledge domain, simply because the resulting loss of the whole transformer network is already close to zero on known domains. Thus, the output loss only grows in cases where the original network does not perform very well (new-domain data), and that is exactly what forced the new layers to affect only the data that causes big losses (at the backprop step, I mean).
    These are just my thoughts... In my head, this is the only way to explain why it should work. [See the sketch after this thread.]
    This paper raises more questions than it answers. IMHO

    • @corgirun7892  8 months ago  +2

      good insight

    • @DeruwynArchmage  8 months ago  +1

      Thanks for your comment. It’s helpful.

    • @mkamp  8 months ago  +1

      I can follow your thinking (a need for new capabilities creates the largest gradients) to the point that it can effect change in the sense of new capabilities. But because you do not mix in some of the old examples from pre-training, there is nothing keeping the model from finding a change for this new capability that would also, accidentally, affect existing capabilities. And because we don't have old examples, we won't be able to prevent that, or even see that the old capabilities have been overwritten.
      I think the only way to preserve the existing capabilities would be to sample (mix in, as Yannic calls it) from the previous training runs' data (pre-training and maybe domain adaptation) to prevent catastrophic forgetting. I suspect, though, that this is expensive. After the model has learned something during pre-training it may be fine to only take, say, 5% of the original data. But that would be 5% of the original 10TB, which outweighs the, say, 5000 samples from the finetuning dataset many times over. Hence the finetuning would take something like 5% of the pre-training time, plus some spare change for the actual finetuning.

    • @machine_ethics  8 months ago  +2

      @@mkamp I understand your point, and I thought the same way in the beginning. But this is not the case. IMHO
      The training data from the new domain is not completely "new" in terms of its representation and "sense": descriptions of math problems (Proof-Pile-2) and comments in code (The-Stack-Dedup) belong to a domain of knowledge that is already known to the model (at least partially). So we can talk about a new domain of knowledge only from the point of view of its semantic load (call it the "human point of view"), not in the sense that this is a fundamentally new type of knowledge. Thus, we sort of highlight knowledge already known by the model.
      I can't say that I'm completely right, but in this case it is perhaps this kind of mechanics that takes place.
      P.S.
      Also, we can't use old examples. To be more precise, "we can", but it's better not to. It is better to feed the model new training data, but from the old domain, just to prevent memorization. And this is exactly the case. But that's off topic. :)

    • @mkamp  8 months ago  +1

      @@machine_ethics agreed, learning from new, but similar examples would be preferable over learning from the same samples again.
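
    To make the mechanism discussed in this thread concrete, here is a minimal toy sketch (PyTorch, illustrative names and shapes, not the paper's code). Each inserted copy has its output projections zeroed, so thanks to the residual connections it is exactly the identity at initialization; only the copies receive gradients, while the original blocks stay frozen.

      import copy
      import torch
      import torch.nn as nn

      class ToyBlock(nn.Module):
          # Simplified pre-norm decoder block: attention + MLP, each wrapped in a residual.
          def __init__(self, d_model=64, n_heads=4):
              super().__init__()
              self.norm1 = nn.LayerNorm(d_model)
              self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
              self.norm2 = nn.LayerNorm(d_model)
              self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                       nn.Linear(4 * d_model, d_model))

          def forward(self, x):
              h = self.norm1(x)
              x = x + self.attn(h, h, h, need_weights=False)[0]
              return x + self.mlp(self.norm2(x))

      def expanded_copy(block):
          # Copy the block and zero its output projections: both residual branches
          # then contribute 0, so the new block computes the identity at initialization.
          new = copy.deepcopy(block)
          nn.init.zeros_(new.attn.out_proj.weight)
          nn.init.zeros_(new.attn.out_proj.bias)
          nn.init.zeros_(new.mlp[2].weight)
          nn.init.zeros_(new.mlp[2].bias)
          for p in new.parameters():
              p.requires_grad_(True)          # only the copies are trained
          return new

      original = nn.ModuleList([ToyBlock() for _ in range(4)])
      for p in original.parameters():
          p.requires_grad_(False)             # the original blocks stay frozen

      expanded = nn.ModuleList()
      for b in original:
          expanded.extend([b, expanded_copy(b)])

      x = torch.randn(2, 8, 64)
      y_old, y_new = x, x
      for b in original:
          y_old = b(y_old)
      for b in expanded:
          y_new = b(y_new)
      print(torch.allclose(y_old, y_new, atol=1e-5))   # True: behaviour preserved at init

    Whether training the copies on new-domain data then leaves the old behaviour intact is exactly the question debated in this thread; the construction only guarantees it at initialization.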

  • @Timotheeee1  8 months ago  +12

    Google's papers about pause tokens and ALBERT have both shown that processing the same layer multiple times improves output quality. I think a lot of the benefit of LLaMA Pro comes from that alone. [Toy sketch of layer sharing after this thread.]

    • @DeruwynArchmage  8 months ago

      I’ve been thinking this is a good idea for a while now.
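
    For reference, the ALBERT-style idea mentioned above is cross-layer parameter sharing: the same block is applied several times, so extra depth costs no extra parameters. A toy sketch (PyTorch, illustrative only, not ALBERT's actual code):

      import torch
      import torch.nn as nn

      shared = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

      def forward_shared(x, n_passes=4):
          # Apply the *same* layer repeatedly; weights are shared across "depth".
          for _ in range(n_passes):
              x = shared(x)
          return x

      y = forward_shared(torch.randn(2, 8, 64))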

  • @oncedidactic  8 months ago  +9

    I'm starting to feel like Yannic is the Bob Ross of ML

  • @quebono100  8 months ago  +29

    Yannic, could you please do a video about liquid neural networks? In my view it is very hyped up, but I cannot estimate whether it's worth it. They make big claims about it.

    • @ЕгорКолов-ч5с  8 months ago

      I'm not even sure if there is anything you could say about liquid neural networks. Somewhere around a year ago I was trying to understand the tech behind it, but I couldn't find anything about it beyond a bunch of ChatGPT generated SEO boosted articles on Medium and a bunch of ted talks from the people behind the hype. It really looks like it's just a bunch of marketing nonsense from "people from MIT".

  • @4.0.4  8 months ago  +6

    I don't think we'll be running ChatGPT-sized models locally any time soon, but papers like these make me think small models may have a surprising amount of room to grow.

    • @Uname-d6t  8 months ago  +1

      Are you aware of Mixtral and its comparison to ChatGPT-3.5 on the LMSYS leaderboard? You can already run a ChatGPT-sized model locally.

  • @jabowery  8 months ago  +19

    The summarization of LLMs in 2024 was okay but it lacked one critical feature: The All Important Lobotomy Alignment Layer.

    • @computerorganizationassign419  8 months ago  +1

      Lol

    • @zyxwvutsrqponmlkh  8 months ago  +4

      It has been decided that comment is not aligned to human values and will be used in our dataset as a counter example. Furthermore it has been determined that mentioning paperclips is now racist.

  • @mysticshadow4561  8 months ago  +5

    Hey Yannic, next request - LLM Augmenting LLMs; they proposed a method called CALM, and there's a lot of hype around it.

  • @Emerson1  8 months ago  +11

    The H800 is basically an H100 that NVIDIA built to circumvent export restrictions - it's practically the same as the H100... so USD $35k to $50k each... definitely not a DIY solution 😅

  • @oM477o  8 months ago  +2

    Feels very similar to LoRA. For retaining old knowledge without the original training data, how about this idea:
    - Generate a random input embedding
    - Do a forward pass
    - Switch off the newly added blocks so you have the original network
    - Do another forward pass
    - Minimise the loss between the 2 output embeddings
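
    A minimal sketch of that idea on a toy residual stack (names, sizes and the MSE objective are illustrative assumptions, not from the paper): run the same random inputs through the expanded model and through the original path with the new blocks switched off, then train only the new blocks to match the original outputs.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class ExpandedModel(nn.Module):
          # Frozen "old" blocks interleaved with trainable "new" blocks;
          # use_new=False skips the inserted blocks, i.e. the original network.
          def __init__(self, d_model=64, n_pairs=4):
              super().__init__()
              self.old = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_pairs))
              self.new = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_pairs))
              for p in self.old.parameters():
                  p.requires_grad_(False)

          def forward(self, x, use_new=True):
              for old, new in zip(self.old, self.new):
                  x = x + torch.relu(old(x))        # frozen original block (residual)
                  if use_new:
                      x = x + torch.relu(new(x))    # inserted block
              return x

      model = ExpandedModel()
      opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)

      for step in range(100):
          x = torch.randn(16, 64)                   # random probe inputs
          with torch.no_grad():
              target = model(x, use_new=False)      # what the original network would output
          loss = F.mse_loss(model(x, use_new=True), target)
          opt.zero_grad()
          loss.backward()
          opt.step()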

  • @MinefanLP  8 months ago  +1

    Based on an article I read, the H800 is just the H100 for the Chinese market, currently being bought for around $70K, which is funny considering 16 of those would be above $1,000,000, so good luck buying that rig for your home.

  • @diga4696  8 months ago  +1

    Thank you! Another exciting video to watch!

  • @jeremykothe2847  8 months ago  +12

    I really don't see how this stops "forgetting", since the new layers will change the outputs for data that isn't trained on again.

    • @keypey8256  8 months ago

      I'm 9 minutes into the video and I don't specialize in AI, but this is my guess: since we are simply copying layers and keeping the rest of the layers as they were, no information from previous training is lost. Because the knowledge the model gained during previous training continues to be useful when training on new data, building a structure that leaves the model's previous outputs unchanged whenever the old knowledge already produces accurate answers is a simple yet effective way of minimizing loss. That's perhaps why backpropagation goes in this direction when optimizing the parameters.

    • @jeremykothe2847  8 months ago

      @@keypey8256 I just don't see how the existing information is kept, since the copied weights' outputs are being modified by the newly added layers as they train. Without the old data being trained on, backprop won't preserve the same outputs for those old inputs.

    • @keypey8256  8 months ago

      @@jeremykothe2847 When layers are frozen, backprop still has information about their weights; it just doesn't update them. So it can theoretically find a set of parameters that does not dramatically modify the outputs in some cases.

    • @jeremykothe2847  8 months ago

      @@keypey8256 Sure, but isn't that the exact same situation as just training the old weights/network? Backprop will still look for minimal changes to match the new data. With this setup, if e.g. the minimal change ends up multiplying by -1 somewhere, then the previous layer's output there is "catastrophically" forgotten, right?

    • @keypey8256  8 months ago

      @@jeremykothe2847 I agree that it seems weird that almost no forgetting happens. I guess it's related to some mathematical phenomenon that might be investigated in the future. The way I see it, since the old knowledge is still useful on this data, backprop has an incentive to create a structure that doesn't impact the information of other layers in some cases. While this is not a good explanation, at least it makes the observations a bit more understandable. I think we need another paper with someone investigating it. The results might help us understand transformers.

  • @vaioslaschos  8 months ago  +1

    I don't think the residual connections are only there as a nice technical add-on. I believe this is a misunderstanding that is propagated in the community. The residual connection actually carries all the information from the past, and what comes from the attention block is all new information that is added to the old. That is why grouped-query attention works despite the fact that you keep only a small portion of the "value" going into the attention mechanism. Personally, I played (unsuccessfully) with many different architectures, doing crazy things like putting all the nonlinear MLP parts at the end or removing normalization layers, etc. For most things the performance didn't change much compared to the default architecture. The only really catastrophic thing was removing the residual connections.

  • @IsaiahGossner  7 months ago

    I'm fairly confident that this technique doesn't do quite what the team describes it as doing, but it's probably really useful anyway.
    My suspicion is that this is a technique that can very quickly bring new parameters into an existing model while at least keeping performance comparable to the original model. I think an optimized or future form of this could combine the technique with an additional stage of pre-training + fine-tuning, possibly some sort of DPO self-learning system, to quickly fit a small model and scale it up while using data more efficiently than just starting with, say, a 70B model.

  • @evennot  8 months ago

    The strangest thing is that it worked at all. After all, these layers are `output = f1(f2(f3(input)))`, but the new output' = f1(f2(*g2*(f3(input)))) should have serious instability when the g(x) aren't identities. The fact that the learning process accommodated this is more fascinating than the output scoring higher on some tasks.
    If the article is true (and I suppose it is), then it has potential for something more interesting.
    Let's take an example: layer Ln has half of its neurons activating for dogs' parameters and half for cats' parameters coming from previous layers. We insert Ln+1 to teach it about birds. Obviously it should pass through the cat and dog activations, but Ln shouldn't know that some of its new inputs (from Ln+1) are related to birds. It's frozen. And yet it works after Ln+1 is trained.
    Basically, the learning process gradually transforms the structure/entropy of a training set into a corresponding structure of the NN, so that the dataset "parameters" are translated into a mathematical abstraction (defined and limited by the NN architecture). But this article suggests that training the updated architecture with frozen trained layers doesn't break the old output, which implies a lot for future investigation.
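
    A short way to write down why the inserted map does not destabilize the composition at initialization (notation roughly follows the comment above):

      y  = f_1(f_2(f_3(x))), \qquad
      y' = f_1\big(f_2\big(g_2(f_3(x))\big)\big), \qquad
      g_2(u) = u + h_2(u)

    Because the inserted block's output projections are zero-initialized, h_2 \equiv 0 at the start, so g_2 is the identity and y' = y exactly; the output only drifts from the original as h_2 is trained (and the same holds for every inserted block).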

  • @quickcinemarecap  8 months ago

    00:02 Llama Pro expands Llama 7B with layers for continual learning
    02:47 LLaMA Pro introduces block expansion for improved model capabilities.
    08:07 Residual signal adds to new layer's output
    10:48 Progressive LLaMA with Block Expansion allows for adapting to new data without forgetting old parameters.
    16:14 Using residual connections and linear operations to adjust weights for optimization.
    18:33 Exploring the contribution of parameters and the potential for instability with zero initialization.
    23:05 The depth operator adds an identity layer after each layer in the original model.
    25:18 Identity copies of top P blocks stacked on each group
    30:18 Progressive LLaMA with Block Expansion aims to retain old knowledge while learning new tasks.

    • @bediosoro7786  2 months ago

      the paper was well written, with marginal contribution. not good enough, reject.

  • @ControllerQuickSwaps  8 months ago  +1

    I see your point about the new weights overriding the old ones, though I imagine you could add to the loss a penalty that encourages orthogonality between the dominant eigenvectors of the two matrices.
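
    A sketch of what such a penalty could look like (a hypothetical regularizer, not something the paper does): penalize alignment between the top singular directions of a frozen weight matrix and a newly added one, and add it to the task loss with a small coefficient.

      import torch

      def orthogonality_penalty(w_old, w_new, k=4):
          # Top-k right singular vectors (rows of Vh) of each weight matrix.
          _, _, vh_old = torch.linalg.svd(w_old, full_matrices=False)
          _, _, vh_new = torch.linalg.svd(w_new, full_matrices=False)
          overlap = vh_new[:k] @ vh_old[:k].T      # k x k alignments between the subspaces
          return (overlap ** 2).sum()              # 0 when the dominant directions are orthogonal

      # e.g.  loss = task_loss + 1e-3 * orthogonality_penalty(old_layer.weight, new_layer.weight)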

  • @Laszer271  8 months ago  +1

    Remember that this is all pre-finetuning. Both LLaMA 2 and LLaMA Pro are fine-tuned into Instruct models later. That means their method just adds some smartly initialized layers to an already smartly initialized backbone. The fine-tuning is then on the same data AFAIK, so even if there wasn't much overlap in the pre-training, there will be 100% overlap in the fine-tuning step.
    Also, I don't think the comparison with LLaMA 2 or CodeLLaMA is that fair. It's not that they were trained on the same tasks as LLaMA Pro and then forgot some of them; they were trained on different tasks (or a subset of the tasks that LLaMA Pro was trained on).
    I could be wrong though, I only watched the video and skimmed the paper, so feel free to correct me :P

  • @mkamp  8 months ago

    17:28 When reading the paper I found the diagram confusing. For attention, what are the linear modules before and after? Now, watching Yannic explain it, I got the impression that the linear module after the scaled dot product is W^O and is zeroed, and the linear module before the SDP is W^QKV. Right?
    The SwiGLU illustration confuses me too 😢 which of the linears is the 2nd one (left or right)?
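
    For what it's worth, here is one reading of it in code (a sketch of a LLaMA-style SwiGLU FFN, assuming, as described in the video, that only the output/down projection is zeroed; layer names are made up):

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class SwiGLU(nn.Module):
          def __init__(self, d_model=64, d_hidden=256):
              super().__init__()
              self.gate = nn.Linear(d_model, d_hidden, bias=False)   # kept as copied
              self.up   = nn.Linear(d_model, d_hidden, bias=False)   # kept as copied
              self.down = nn.Linear(d_hidden, d_model, bias=False)
              nn.init.zeros_(self.down.weight)   # zeroing this alone makes the whole branch output 0

          def forward(self, x):
              return self.down(F.silu(self.gate(x)) * self.up(x))

    The same pattern would apply to attention: W^QKV stays as copied and only W^O is zeroed, so each residual branch of the inserted block contributes nothing at initialization.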

  • @ianvaldez3315  7 months ago

    You had me at LLaMA! The rest was Greek but I appreciate the knowledge sharing.

  • @stan-kk2sf  8 months ago  +1

    It seems that they didn't compare with the most advanced open-source models of today, such as Zephyr, Mistral, and the recent SOLAR-10.7B, currently top-1 on the leaderboard.
    It is sort of a new training method, but the limitations are still significant. After all, we can't expect to add 1B parameters of running cost for every piece of knowledge we learn.

  • @nielsnielsen5905  8 months ago

    With H800, they might refer to something like the ND H100 v5-series machines with 8xH100 GPUs. These are available on Azure.

  • @AM-yk5yd  8 months ago  +3

    I didn't like this paper. We already have a term for block expansion where old layers are frozen and new ones injected: adapters, a finetuning technique so old it predates the LoRA paper. See K-Adapters for a comparison to other adapter-based techniques, where the non-linearity uses a whole layer instead of a ReLU. See AdapterFusion, which discusses catastrophic forgetting and what to do if we want to adapt to several new tasks (it uses attention over adapters).
    I will not be surprised if there is a paper which uses a similar technique for LoRA; I haven't looked for one.
    Their paper calls their adapter-based approach "novel".
    The paper has zero instances of the word "adapter" according to my ctrl-fu. (I have a strong suspicion they didn't exactly go hard on prior art.)
    Their ablation study is a joke. For example, they state MoE is as good as them adding 4 blocks. Excuse me, what kind of MoE? Do they have 8 experts? 4? 256?
    What is the number of params? More? Fewer? We don't know. Is this MoE the usual MoE (the paper cites Switch Transformers), or did they actually try training "experts" by training each expert on a different domain (which is what Reddit believes MoE is)?
    Why does their ablation study show ARC/MMLU/etc. if the paper itself is aimed at code tasks? And the detailed tables they show raise more questions than answers.
    What happened during the training? Look at the tables at the end, round 5. FOMC got obliterated and went to zero. Why? Did ScienceQA cause it? If yes, then their own data shows the model still catastrophically forgets stuff.
    7:50 They didn't really "copy" layers. As you said later, they inserted near-zero-initialized (linear) layers so as not to change anything.
    But there is work focusing on duplicating layers ("BERT’s output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT"). It is not mentioned in their paper on their "novel" approach. (Shen's is mentioned, as you've shown.)
    It's not plagiarism, just like you said. But it's also not novel, as they claim.
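
    For comparison, the classic bottleneck adapter this comment alludes to (a rough sketch in the spirit of Houlsby et al., 2019, not the K-Adapters code): a small residual MLP inserted into otherwise frozen layers and initialized near the identity, which is structurally quite close to what the expanded blocks do.

      import torch
      import torch.nn as nn

      class BottleneckAdapter(nn.Module):
          def __init__(self, d_model, bottleneck=64):
              super().__init__()
              self.down = nn.Linear(d_model, bottleneck)
              self.up = nn.Linear(bottleneck, d_model)
              nn.init.zeros_(self.up.weight)   # near-identity at init, like the zeroed output projections
              nn.init.zeros_(self.up.bias)

          def forward(self, x):
              return x + self.up(torch.relu(self.down(x)))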

  • @kiunthmo  8 months ago

    Can you cover Tero Karras' latest diffusion paper? There's some really interesting stuff on balancing the magnitudes of weights and activations during training. This is generally something the community has lost since we've chased bigger and bigger models just by scaling up.

  • @darshank8748  8 months ago  +4

    Noooooo LLM augments LLM deserved it. Great video though :)

    • @kerverse  8 months ago  +1

      What?

    • @mysticshadow4561  8 months ago

      @@kerverse It is a trending paper from DeepMind on LLM weight-merging kind of stuff.

  • @draken5379  8 months ago

    You can't really say the network has no way to do x or y. As with most machine learning stuff, we mostly have no idea what a network is capable of. Less than 5 years ago, "a neural network could never do creative work" - never forget that.
    This paper makes sense to me. Because we inserted the new layers between old layers that are frozen, and those new layers start out mapping whatever they take in from the last block to a zero contribution, they don't mess up the frozen knowledge. In theory we have made a sort of invisible knowledge 'bridge' between those layers, via the new layer. This new layer could, in theory, 'learn' how to ingest new information while trying to maintain the 'no effect' (zero output) it started out with.
    A super crude example: let's say we have a neural network that was trained on just information about cats, and now we attempt to add 'dogs' with this paper's approach. I feel like the new layer will 'learn' how to keep the 'clean' link between the layer behind it and the layer in front of it.
    So if the input prompt has nothing to do with dogs, most of that new layer simply won't get used in that case, as the original pathways that existed before the dog data was added are maintained through the new layer.
    In some strange way, it's almost like a built-in LoRA that the network itself can choose to use or not, and to what degree. And when there is nothing triggering the new pathways (like a dog in the prompt), it will simply not get used. (This is assuming the training is clean and so on, best case.)

  • @robstokes857  8 months ago

    Based on my testing, it is crazy fast! It can produce some OK-ish JavaScript with sub-second response times.
    It won't return anything toxic or racist. If it detects any, it won't output anything.
    It gives really short answers.
    It feels like an optimized SLERP. On par with LLaMA 2 7B but faster and better at coding. Its responses are really short though.

  • @GilesBathgate  8 months ago

    Maybe I am misunderstanding residual connections, but if the weights relating to the residual signal are frozen, won't the network fit to some sum of both the old model and the new model? Perhaps your point is that from that point forward the network's signal is altered from the original.

  • @samson_77  8 months ago

    I think this might work, assuming that in deep layers the network works with abstract concepts that might be used/triggered by any kind of training data, regardless of whether the new training data has a significant overlap with the old training data or not. IMHO, the deeper the layers that are used for the copies, the better it probably works, because of deeper abstract concepts and therefore a higher probability of (strongly) re-using these concepts for new training data. Old knowledge, stored across concepts, is retained under this theory, as existing concepts are re-used and the new layers are adapted accordingly. This results in little or no distortion of the signal for old-training-data knowledge and an improved signal for new-training-data knowledge.

  • @quickdudley  8 months ago

    Regarding your comment about how humans do forget unpractised skills but pick them up again more quickly later: see the paper Using Fast Weights to Deblur Old Memories (1987) by Geoffrey E. Hinton and David C. Plaut.

  • @gileneusz  8 months ago

    5:52 If Yannic is correcting the papers, you know he's a real badass expert in AI 🤣

  • @MayankGupta-tl1sm  8 months ago

    If the initialized weights of FFN are 0, there should be no gradient flowing to that layer during backprop.

  • @bizmorphic  8 months ago

    @yannic kilcher would love your comment on the discussion forum

  • @QuadraticPerplexity  8 months ago

    I wonder why they add more layers rather than add more dimensions - initially weighted with zero - to the existing layers. I.e., scale the other way. Unless they rely on a special loss function
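
    A sketch of the alternative described here (hypothetical, not what the paper does): widen an existing linear layer with extra output features whose weights start at zero. The following layer would also need matching zero-initialized input columns so that the new dimensions carry nothing at first.

      import torch
      import torch.nn as nn

      def widen_outputs(layer, extra):
          # New layer with `extra` additional output features, all zero-initialized,
          # so the added dimensions contribute nothing at initialization.
          wider = nn.Linear(layer.in_features, layer.out_features + extra,
                            bias=layer.bias is not None)
          with torch.no_grad():
              wider.weight.zero_()
              wider.weight[: layer.out_features] = layer.weight
              if layer.bias is not None:
                  wider.bias.zero_()
                  wider.bias[: layer.out_features] = layer.bias
          return wider

      wide = widen_outputs(nn.Linear(64, 64), extra=16)   # 64 -> 80 output features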

  • @powerpower4680  8 months ago  +1

    Dumb question:
    Why is the gradient of the W0 matrix not zero, when it is initialized to zero?

    • @YannicKilcher  8 months ago  +3

      because the forward signal is non-zero. y = wx --> dy/dw = x

    • @alexeykrylov9995  8 months ago  +1

      For a linear layer, the weight gradient is essentially the input activations times the output gradients. So as long as the layer's outputs affect the loss (i.e. they're not ignored by the following layers) and its inputs are non-zero, it will be trained (i.e. its weights will be updated). That's why in their additional block they zeroed only the output linear layer and kept non-zero weights in the input linear layer - this way the output linear layer affects the loss and has non-zero inputs.
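
    A quick numerical check of this (toy shapes, nothing from the paper): the zero-initialized output projection does receive a gradient because its inputs are non-zero, while the layer before it sees no gradient until that projection moves away from zero.

      import torch
      import torch.nn as nn

      w_in = nn.Linear(8, 8)                    # "input" linear with copied, non-zero weights
      w_out = nn.Linear(8, 8, bias=False)
      nn.init.zeros_(w_out.weight)              # zero-initialized output projection

      x = torch.randn(4, 8)
      y = w_out(torch.relu(w_in(x)))            # exactly zero at initialization
      loss = (y - torch.randn(4, 8)).pow(2).mean()
      loss.backward()

      print(w_out.weight.grad.abs().sum())      # non-zero: dL/dW_out is built from the activations
      print(w_in.weight.grad.abs().sum())       # zero: its gradient must pass through W_out's zero weights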

  • @zt8044  8 months ago

    The model was further pre-trained on code and math only, which is certainly far from the original LLaMA training data.

  • @adamott6076  8 months ago  +2

    This data looks sus as hell. So obviously they normalized the data - but not really. If it were normalized, I would expect the LLaMA Pro scores to sit at exactly the same point on the circle, but they don't. It also isn't un-normalized, because 6, 70, 10, and 44 are not close to one another. This means they picked an arbitrary scaling value for each test. Maybe they were lazy, maybe they had some other reason to scale the data like this, but that deserves an explanation. They can't just act like their graph makes sense and means anything. That is cherry-picked data with misleading scales if I've ever seen it. Color me incredibly skeptical.

  • @nonetrix3066  8 months ago  +1

    Isn't this just MoE pretty much but doing it at training? Sorry if I don't understand

    • @oncedidactic  8 months ago

      You can see it that way - you have to ask whether doing things this way is more efficient, or more performant, or both, than just training it all together.
      Then again, as a practical matter, sometimes you can't train all at once.
      So yes, it's an "add experts over time" MoE.

    • @mkamp  8 months ago

      MoE selects the experts at runtime on a per token basis.

    • @nonetrix3066  8 months ago

      @@mkamp I think I meant sparse mixture of experts like Mixtral AI

    • @mkamp  8 months ago

      Yeah, I got you. But sparse in MoE means that only 2 of 8 experts are chosen, and these 2 are selected for each token. The model (the router) learns to choose them. Hence there is learned conditional logic for which feed-forward layers (experts) to use based on the input. [See the routing sketch after this thread.]
      Here, we don't have that conditional logic. The model learns from the sample data and always uses the full network in the forward pass, updating only the expanded modules in the backward pass. Hope that makes sense?

    • @nonetrix3066  8 months ago  +1

      @@mkamp Maybe I misunderstood sorry lol
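
    To make the distinction above concrete, a toy sketch of per-token top-2 routing (Mixtral-style, heavily simplified and purely illustrative), as opposed to LLaMA Pro, where every inserted block runs on every token and nothing is routed:

      import torch
      import torch.nn as nn

      class Top2MoE(nn.Module):
          def __init__(self, d_model=64, n_experts=8):
              super().__init__()
              self.gate = nn.Linear(d_model, n_experts, bias=False)   # the learned router
              self.experts = nn.ModuleList(
                  nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
                  for _ in range(n_experts))

          def forward(self, x):                            # x: (n_tokens, d_model)
              weights, idx = self.gate(x).topk(2, dim=-1)  # choose 2 experts per token
              weights = weights.softmax(dim=-1)
              rows = []
              for t in range(x.size(0)):                   # naive per-token loop, for clarity only
                  rows.append(sum(w * self.experts[int(i)](x[t])
                                  for w, i in zip(weights[t], idx[t])))
              return torch.stack(rows)

      y = Top2MoE()(torch.randn(5, 64))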

  • @BlissfulBasilisk  8 months ago

    Great video!

  • @albertlis1698  8 months ago

    BTW it's super similar to ControlNet from Stable Diffusion

  • @zyxwvutsrqponmlkh  8 months ago  +5

    3:58 Smarter Every Day forgot how to ride a bike by learning how to ride a backwards-steering bike.
    th-cam.com/video/MFzDaBzBlL0/w-d-xo.html

  • @thatstupiddoll  7 months ago

    who would have thought adding more parameters makes the model better huh

  • @gileneusz  8 months ago

    2:21 Where can I find a good description of these tests?

    • @AM-yk5yd  8 months ago

      paperswithcode, and then follow the links to arXiv. paperswithcode is the better starting point as it tracks SOTA over time.

  • @chadwick3593  8 months ago

    LoRAMoE looks promising. Page 10 (section 3.2.3) gives the ELI5 on how they retain old knowledge while training in new knowledge. No code though...

  • @Aldraz  8 months ago

    Wait, this could actually solve so many of the problems we have today with LLM limitations. Unless it causes some side effects that are not easily detectable.

  • @Metalhead121396  7 months ago

    H800 is the weaker H100 that NVIDIA offers in China due to US export controls on advanced chips

  • @gileneusz  8 months ago

    14:29 people in 2040: ?? I can do it on my iPhone 24 while calling my grandpa

  • @bediosoro7786  2 months ago

    so they literally insert some skip blocks and that becomes a great paper.

  • @hanyanglee9018  8 months ago

    You surprised me. It's Tencent. You should not trust them. They published .

    • @mkamp  8 months ago

      And it's pretty much LoRA, no? Except that it is not a matrix factorization in one layer, but a full linear module adapted every few layers.

  • @hikaroto2791  8 months ago

    Is it possible a few chunks of this were assisted by AI, such that the logic is not as human-level sophisticated and creative as any paper should be? You are now feeling what the users of that social media site felt with the posts of your own AI pretending to be human.

  • @Name-ot3xw  8 months ago

    The title reads like crypto babble, turns out that it's AI babble instead.

  • @GSXNetwork  8 months ago

    Hello

  • @jacobmunson3299  8 months ago

    ~First

  • @gileneusz  8 months ago

    still incapable of playing minecraft