Timestamps (on my phone so lazy descriptions)
Sarah Guo 20:52
Best of Vision 2024 (Roboflow x Moondream) 1:18:44
Loubna (HF) synthetic data and smol models 5:22:59
Added! Thank you
7:37:40 - The scaling/wall debate w/ Dylan Patel
7:45:26 - opening statements here
thanks
Added! Thank you
very cool
Will slides be made available? Thanks
Yes will likely go out on our newsletter: latent.space
anyone got some timestamps
we will recut the recordings dw
Dang, missed this at NeurIPS. Also, Jonathan's logic is quite flawed. He's saying scaling laws are all about a log-linear relationship between cost and log-loss. Two issues with this assertion (rough sketch of the usual scaling-law form after the two points):
(a) Who said that log-loss has a linear relationship with model quality? If the pretraining loss were instead parameterized as, say, a log-log cross-entropy, the loss would grow linearly with compute, so the apparent shape of the curve depends on how you choose to express the loss.
(b) The pretrained vs. post-trained GPT-4/Llama models differ by less than 2x in compute (assuming pretraining compute > post-training compute), yet the quality comparison is not even close.
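For reference, a minimal sketch of the power-law form usually meant by "scaling laws" (assuming Chinchilla-style fits; this is my framing, not necessarily Jonathan's exact wording):
L(C) ≈ L_inf + a * C^(-alpha)
log(L(C) - L_inf) ≈ log(a) - alpha * log(C)
i.e. log-loss is linear in log-compute, but nothing in that fit itself says how log-loss maps onto downstream quality.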