Open Pretrained Transformers - Susan Zhang | Stanford MLSys #77

  • Published on 25 Nov 2024

Comments • 23

  • @aixueer4ever · 1 year ago · +15

    This is the best video on large language model training! It covered all the details, pitfalls & tweaks. Thanks for sharing!

  • @grantyu4693 · 1 year ago · +3

    The best practice advice for training LLMs from scratch. Thanks for sharing!

    • @beagle989 · 1 year ago

      I was not expecting this level of excruciating pain and suffering. What a nice window into the reality of engineering solutions when we don't understand what we really ought to do yet.

  • @beagle989 · 1 year ago · +3

    this is incredible, it's the wild west out there
    makes me feel better about my home brew models :)

  • @nowithinkyouknowyourewrong8675 · 1 year ago · +8

    appreciate the openness

  • @stasbekman8852 · 1 year ago · +10

    You did it! Amazing work guys!

  • @u850159yeung · 1 year ago · +2

    Thanks for sharing. Really enjoyed watching the whole video.

  • @zeweichu550 · 1 year ago · +7

    Why would transformers also have the problem of gradient explosion? I thought that for a model of, say, 24 layers, the multiplicative effect is limited. So does the gradient explosion come from one particular neuron getting a surprisingly huge gradient?

    • @robertjflynn4206 · 1 year ago

      I think it is partially to do with earlier layers receiving larger gradient updates, which ends up causing problems for training stability. The NormFormer paper goes into this.
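
A minimal sketch, assuming a standard PyTorch model, of the per-layer gradient-norm check the reply above describes; the helper name and the clipping threshold are illustrative, not from the talk:

```python
import torch

def per_layer_grad_norms(model: torch.nn.Module) -> dict:
    """Return the L2 norm of the gradient for each named parameter."""
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            norms[name] = param.grad.detach().norm(2).item()
    return norms

# Typical use after loss.backward(): log which layers receive the largest
# updates, then clip the global norm before optimizer.step().
#   loss.backward()
#   norms = per_layer_grad_norms(model)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
```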

  • @brandomiranda6703 · 1 year ago · +4

    The changes in hyperparams seem random. Isn't it possible to diagnose the issue and change the architecture itself? Or to tackle the issue more systematically/less randomly?
    This is not a criticism -- it is hard, but I was curious.

  • @carsonwang2283 · 1 year ago

    Thanks for sharing! This is a great exercise. Was Ray used in OPT-175B training, as it was for ChatGPT? It would be good to take advantage of the flexible scheduling, scalability and reliability provided by Ray.

  • @RyanZJC · 1 year ago

    Could you share the slides of this talk?

  • @senx8758 · 11 months ago

    Can anyone shed more light on activation norm? Susan said it is the last layer's activation value for the softmax.

    • @senx8758 · 11 months ago

      From the metaseq repo, it is the last decoder layer's output. Between the last decoder layer and the final softmax, there is a linear h --> V projection.
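
A minimal sketch of measuring that activation norm, assuming a Hugging Face OPT checkpoint rather than the metaseq training code; the model name and the use of hidden_states[-1] as the "last decoder layer output" are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small OPT checkpoint used only for illustration.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[-1] is the last decoder layer's output, i.e. the hidden
# states fed into the final linear h -> V projection before the softmax.
last_hidden = out.hidden_states[-1]            # [batch, seq_len, hidden]
act_norm = last_hidden.norm(dim=-1).mean()     # mean per-token L2 norm
print(f"mean activation norm: {act_norm.item():.2f}")
```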

  • @扣脚晒太阳 · 1 year ago

    Very high quality.

  • @brandomiranda6703 · 1 year ago

    Is it true OPT-175B doesn't display emergence? Only closed models do?

  • @BoominGame · 10 months ago

    The stack is a cluster-fuck, pun intended.

  • @developer-uh9dh · 1 year ago

    That is open source

  • @karanbirchahal3268 · 1 year ago · +1

    I don't know why people want to work on this stuff. Very alchemic.

  • @brandomiranda6703 · 1 year ago

    what is ppl?

  • @brandomiranda6703 · 1 year ago

    More data, bfloat16 secrets. Why the latter?