This is the best video on big language model training! Covered all details, pitfalls & tweaks. Thanks for sharing!
The best practice advice for training LLMs from scratch. Thanks for sharing!
I was not expecting this level of excruciating pain and suffering. What a nice window into the reality of engineering solutions when we don't understand what we really ought to do yet
this is incredible, it's the wild west out there
makes me feel better about my home brew models :)
appreciate the openness
You did it! Amazing work guys!
Thanks for sharing. Really enjoy watching the whole video.
Why would transformers also have the problem of gradient explosion? I thought for a model of, say, 24 layers, the multiplicative effect is limited. So does the gradient explosion come from one particular neuron getting a surprisingly huge gradient?
I think it is partially to do with earlier layers receiving larger gradient updates, and this ends up causing problems for training stability; the NormFormer paper goes into this.
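For anyone who wants to poke at this, here is a minimal PyTorch sketch (illustrative only, not the OPT/metaseq training code): log per-layer gradient norms after backward and before global clipping, so you can see whether a blow-up is concentrated in the earlier layers.

```python
import torch
import torch.nn as nn

# Toy 24-layer transformer as a stand-in for a real LM.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=24,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def layer_grad_norms(model):
    """L2 gradient norm per named parameter (after backward, before clipping)."""
    return {
        name: p.grad.detach().norm(2).item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

def training_step(x):
    optimizer.zero_grad()
    out = model(x)
    loss = out.pow(2).mean()            # placeholder loss, just for the sketch
    loss.backward()
    norms = layer_grad_norms(model)     # inspect which layers have the largest gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item(), norms

loss, norms = training_step(torch.randn(2, 16, 512))
```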
The changes in hyperparams seem random. Isn't it possible to diagnose the issue and change the architecture itself? Or to tackle the issue more systematically/less randomly? This is not a criticism -- it is hard, but I was curious.
Thanks for sharing! This is a great exercise. Was Ray used in OPT-175B training like it was for ChatGPT? It would be good to take advantage of the flexible scheduling, scalability, and reliability provided by Ray.
Could you share the slides of this talk?
Can anyone shed more light on activation norm? Susan said it is the last layer's activation value for the softmax.
From the metaseq repo, it is the last decoder layer's output. Between the last decoder layer and the final softmax, there is a linear h --> V (hidden-to-vocabulary) projection.
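To make that concrete, a hedged toy sketch (module and variable names here are made up, not metaseq's actual classes): the "activation norm" is the norm of the last decoder layer's output, captured with a forward hook before the hidden-size --> vocab-size projection.

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, d_model=512, vocab_size=50272, num_layers=4):
        super().__init__()
        # Self-attention-only layers as a stand-in for real decoder blocks.
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
             for _ in range(num_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)  # the h --> V projection

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.lm_head(x)          # logits fed into the final softmax

model = TinyDecoderLM()
stats = {}

def record_activation_norm(module, inputs, output):
    # Output of the *last* decoder layer, i.e. the pre-projection activations.
    stats["activation_norm"] = output.detach().norm(2).item()

model.layers[-1].register_forward_hook(record_activation_norm)
logits = model(torch.randn(2, 16, 512))
print(stats["activation_norm"])
```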
Very high quality.
Is it true OPT-175B doesn't display emergence? Only closed models do?
The same team just released LLaMA, which certainly does.
The stack is a cluster-fuck, pun intended.
That is open source
I don't know why people want to work on this stuff. Very alchemical.
what is ppl?
perplexity
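A tiny sketch for reference (dummy tensors, nothing from the actual evaluation code): perplexity is just the exponential of the mean per-token cross-entropy loss.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 50272
logits = torch.randn(2, 16, vocab_size)          # (batch, seq, vocab), random stand-in
targets = torch.randint(0, vocab_size, (2, 16))  # random target token ids

loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(f"cross-entropy {loss.item():.3f} -> perplexity {math.exp(loss.item()):.1f}")
```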
More data, bfloat16, secrets. Why the latter?