Understanding STaR and how it powers Claude and Gemini/Gemma 2 (and maybe OpenAI Q* or Strawberry)

  • Published 25 Aug 2024
  • Understanding STaR and how it powers Claude and Gemini/Gemma 2 (and maybe OpenAI's Q* or Strawberry). STaR is short for Self-Taught Reasoner. It is rumored to power OpenAI's Q* (now Strawberry), but definitely powers Claude 3.5 Sonnet and the Gemma/Gemini models. In this video Chris breaks down how self-taught reasoning works and how it is used in the fine-tuning phase of a model to improve training. Chris also shows how you can use NVIDIA's Nemotron reward model to judge the outputs for STaR. If you want to understand how to use the same techniques that frontier AI models such as Anthropic's Claude and Google's Gemini/Gemma use to improve their fine-tuning, then check out this video.
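
For those who want the mechanics, below is a minimal sketch of one STaR round, assuming hypothetical generate, judge, and known_answer callables that stand in for model sampling, a reward-model score (the role the video gives to NVIDIA's Nemotron), and a gold-label lookup. The loop follows the published STaR recipe: sample a rationale, keep it if judged correct, otherwise "rationalize" with the known answer as a hint, then fine-tune on what survived and repeat.

```python
def star_round(problems, generate, judge, known_answer, threshold=0.8):
    """One STaR iteration; all three callables are hypothetical stand-ins."""
    finetune_set = []
    for problem in problems:
        # 1. Ask the model for step-by-step reasoning followed by an answer.
        attempt = generate(f"{problem}\nThink step by step, then answer.")
        if judge(problem, attempt) >= threshold:
            # 2. Keep rationales whose answers the judge accepts.
            finetune_set.append((problem, attempt))
        else:
            # 3. Rationalization: reveal the correct answer as a hint and ask
            #    the model to reconstruct reasoning that reaches it.
            hinted = generate(
                f"{problem}\n(Hint: the correct answer is {known_answer(problem)}.)"
                "\nThink step by step, then answer."
            )
            if judge(problem, hinted) >= threshold:
                finetune_set.append((problem, hinted))
    # 4. Fine-tune on the accepted rationales, then repeat with the new model.
    return finetune_set
```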

Comments • 30

  • @kusanagi2501
    @kusanagi2501 several months ago +3

    I really liked the video. It was a mystery to me for a while.

  • @leeme179
    @leeme179 several months ago +7

    I believe you are correct that both Claude and Llama 3 are fine-tuned on a STaR-generated dataset, but this method still needs a ranker or a human to mark the correct answers. From what I have read online, OpenAI's Q* is instead a combination of the A* search algorithm and Q-learning from reinforcement learning, used to self-improve: the model generates 100 different answers and picks the best one, similar to AlphaCode 2 from DeepMind.
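
    The "generate many answers and pick the best" step described here is plain best-of-N sampling. A minimal sketch, with sample and reward as hypothetical stand-ins for the model and the ranker:

    ```python
    def best_of_n(prompt, sample, reward, n=100):
        # Draw n candidates, score each with the ranker/reward model,
        # and return only the highest-scoring answer.
        candidates = [sample(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: reward(prompt, c))
    ```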

    • @spoonikle
      @spoonikle several months ago +2

      It does not. No human marker is needed.
      For example, you can use a series of prompts plus the dataset to judge aspects of the answers with well-trained fine-tuned judge models. You can even train a model to predict the human evaluation; then you only need human evals in a given domain until an evaluator model is ready.
      In addition, this incentivizes further investment in synthetic datasets.
      Finally, the best argument for this: a big model prunes the dataset to make a small model, which prunes the dataset for the next big model, and so on ad infinitum.
      The smaller model is cheaper and faster, which means you can generate more data for the next big one, which in turn makes the next, improved small model.
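
      A minimal sketch of the "train a model to predict the human evaluation" idea: fit a linear head on a frozen text embedding to regress human scores, then use the fitted function instead of the human. Here embed is a hypothetical stand-in for any sentence-embedding model, and the training is plain gradient descent on mean-squared error.

      ```python
      import numpy as np

      def fit_evaluator(examples, embed, lr=0.01, epochs=200):
          """examples: list of ((prompt, answer), human_score) pairs."""
          X = np.array([embed(p + "\n" + a) for (p, a), _ in examples])
          y = np.array([score for _, score in examples])
          w = np.zeros(X.shape[1])
          for _ in range(epochs):
              # One gradient step on mean-squared error between X @ w and y.
              w -= lr * X.T @ (X @ w - y) / len(y)
          # The returned callable now scores answers with no human in the loop.
          return lambda prompt, answer: float(embed(prompt + "\n" + answer) @ w)
      ```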

    • @chrishayuk
      @chrishayuk  several months ago +1

      Some folks use human feedback with RL and some folks use synthetic feedback. At the end of the video I talk about how it could be done with a mixture of judges, and I show how you could use Nemotron for your reward model. I will do a video on RL for this soon to cover the Q part.

    • @testales
      @testales several months ago +1

      I still don't get how a pathfinding algorithm like A* can be utilized to find the best answer. It's not like navigating some terrain with exactly known properties. Maybe it's a thing in the latent space? So the explanation that this is a modified version of the STaR approach seems more plausible, but if so, then again it doesn't seem to be such a big thing.

    • @chrishayuk
      @chrishayuk  several months ago

      I’m only covering the STaR part for now. I’ll cover the RL part in a later video

    • @GodbornNoven
      @GodbornNoven several months ago +1

      @testales Q* (Q-star) is a concept from reinforcement learning, a type of machine learning. In simple terms, it's a way to measure the best possible future reward an agent can expect if it follows the optimal strategy from any given state. Think of it as a guide that tells you the best move to make in a game to maximize your chances of winning, based on all the possible future outcomes. Kind of like in chess.
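
      To make that concrete, here is a minimal tabular Q-learning sketch: repeated Bellman updates pull the table Q(s, a) toward the optimal value function Q* described above. The step callable is a hypothetical environment stand-in returning (next_state, reward, done).

      ```python
      import random
      from collections import defaultdict

      def q_learning(step, start_states, actions,
                     episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
          Q = defaultdict(float)  # Q[(state, action)] -> estimated future reward
          for _ in range(episodes):
              s, done = random.choice(start_states), False
              while not done:
                  # Epsilon-greedy: usually exploit the best-known move, sometimes explore.
                  a = (random.choice(actions) if random.random() < eps
                       else max(actions, key=lambda x: Q[(s, x)]))
                  s2, r, done = step(s, a)
                  # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
                  best_next = max(Q[(s2, x)] for x in actions)
                  Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                  s = s2
          return Q
      ```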

  • @venim1103
    @venim1103 several months ago +2

    You should check out the Claude 3.5 Sonnet system prompt leak and all the talk about “artifacts” and persisting data with LLMs.

    • @chrishayuk
      @chrishayuk  several months ago +1

      Oooh, persisting data with LLMs sounds interesting. I’ll find out about that

    • @venim1103
      @venim1103 several months ago +1

      @chrishayuk It seemed to me they are using clever prompt engineering with their “artifact” system in a way that resembles memory management and tool usage, with the help of the massive context window.
      They must have also fine-tuned their models to support this syntax. It's just crazy to think how the system message itself is able to help the AI with coherence and task management.
      All this seems fascinating, as I'm trying to figure out why Claude 3.5 Sonnet is so good at code-related tasks, especially re-editing and updating code, compared to most other models.
      I can't wait to see some open-source models reach this level! Maybe fine-tuning and clever prompt engineering is all that is needed for now 👍
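
      One way such an "artifact" system could plausibly work (a purely illustrative sketch; the tag format is invented here, not Claude's actual syntax): the system prompt defines a tag convention, and the client parses tagged blocks out of each reply into a persistent store, so a later turn can re-edit an artifact by id instead of regenerating everything.

      ```python
      import re

      ARTIFACT = re.compile(r'<artifact id="(.+?)">(.*?)</artifact>', re.S)

      def update_artifacts(reply, store):
          # Persist every tagged block by id so a later turn can update it in place.
          for artifact_id, body in ARTIFACT.findall(reply):
              store[artifact_id] = body.strip()
          # Show the user the reply with the raw artifact markup removed.
          return ARTIFACT.sub("", reply).strip()
      ```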

    • @chrishayuk
      @chrishayuk  several months ago

      @venim1103 i'll check out their system prompt... but i'm convinced they're using STaR backed by a reinforcement learning policy. the new Mistral NeMo model has followed this approach also. i've not checked out how they implemented artifacts yet, but i'm convinced this is all now in the fine-tune phase, hence these videos

  • @Mercury1234
    @Mercury1234 18 days ago

    Someone please correct me if I'm wrong here. I think that neither of the examples you showed comes from reasoning. The order is flipped: the model should first produce the reasoning and then the answer, not the other way around as in your examples. The model attends to all tokens from the input and from the output generated up to that point, so what gives the right answer a better chance is having the reasoning steps among the previously generated tokens. In your examples the preceding tokens did not contain the reasoning steps, as those were generated after the answer.
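
    The ordering point, made concrete: an autoregressive model can only condition a token on text generated before it, so a rationale emitted after the answer cannot have influenced that answer. Two contrasting, purely illustrative prompt templates:

    ```python
    ANSWER_FIRST = (
        "Q: {question}\n"
        "A: {answer}\n"           # answer is sampled before any reasoning exists
        "Reasoning: {rationale}"  # rationale can no longer influence the answer
    )

    REASON_FIRST = (
        "Q: {question}\n"
        "Reasoning: {rationale}\n"  # reasoning tokens are already in context...
        "A: {answer}"               # ...when the answer tokens are sampled
    )
    ```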

  • @omarei
    @omarei several months ago

    Great content 👍😁

  • @mrpocock
    @mrpocock several months ago +1

    The private scratchpad in Claude 3.5 explains why it seems to behave as if it had private state in addition to the text visible in the conversation.

    • @chrishayuk
      @chrishayuk  several months ago

      Yeah, it's a really nice technique for giving readable answers without losing the chain-of-thought reasoning
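
      A minimal sketch of that technique: let the model reason inside a delimiter, then strip it before display, so the user sees a short answer that the hidden chain of thought still shaped. The tag name is illustrative, not any vendor's actual format.

      ```python
      import re

      def strip_scratchpad(reply):
          # Remove the hidden reasoning block; only the final answer is shown.
          return re.sub(r"<scratchpad>.*?</scratchpad>", "", reply, flags=re.S).strip()

      print(strip_scratchpad(
          "<scratchpad>12 * 4 = 48; half of 48 is 24.</scratchpad> The answer is 24."
      ))  # -> "The answer is 24."
      ```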

  • @theklue
    @theklue several months ago +1

    Very good content, thanks! I was comparing models manually, and I'll integrate Nemotron into the eval. One off-topic question: is the superimposed screen on top of your video a post-prod edit, or is there software that lets you record like this? Thanks!

    • @chrishayuk
      @chrishayuk  several months ago +1

      awesome, glad it was useful. the superimposed screen effect is a post-prod edit that i do. the way i set the lights and screen backdrop, combined with Lumetri settings and use of opacity, allows me to achieve the effect

    • @theklue
      @theklue several months ago +1

      @chrishayuk Thank you! It looks very good

    • @chrishayuk
      @chrishayuk  several months ago

      Thank you, I like to think it’s one of the techniques that gives it a little uniqueness. Glad you like it

  • @testales
    @testales several months ago +2

    I don't like that very much. Why? I absolutely hate getting walls of text and code thrown at me for simple yes/no questions all the time! Both ChatGPT and Claude have this issue. So in the end you hardcode a system prompt like "think step by step" into your model, and it's then very hard to make it give quick, short answers again. A hidden scratchpad is a good compromise, but it still slows down responses and could be achieved with a system prompt too. The system-prompt method could also include multiple agents or personas with different strengths to provide input.

    The best approach would be to also train the model to estimate the complexity of a question and then decide whether to do additional thinking or not. I've also seen open-weight models answer harder questions correctly with just one or very few words where others generated a text wall and still came to the wrong result. So whether explicit step-by-step thinking is really required remains debatable. Obviously the chances of a correct answer increase the more relevant (!) information is in the context, and that is all CoT etc. actually does: pull more information into the context.

    Another similar thing that I see Claude doing quite often, and which I like, is that it summarizes before responding. If the problem is complex and there was a lot of back and forth, perceptions of it may diverge. Summarizations greatly help to create a synchronization point between the LLM and the user, and to focus on the established and relevant intermediate results.
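
    The "estimate complexity first" idea from this comment, as a minimal sketch; classify and generate are hypothetical stand-ins for a small router model and the main model:

    ```python
    def answer(question, classify, generate):
        # Route: only questions judged hard get the slow step-by-step treatment.
        if classify(question) == "hard":
            return generate(f"{question}\nThink step by step, then answer.")
        # Simple questions get a direct, short answer with no wall of text.
        return generate(f"{question}\nAnswer in one short sentence.")
    ```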

    • @chrishayuk
      @chrishayuk  several months ago

      I agree, it’s a balance and a trade-off, and I think this is where RL can be used to bring this down to a more succinct response.

  • @raymond_luxury_yacht
    @raymond_luxury_yacht several months ago +1

    That explains why Claude's 200k context is more like 50k for me. So much of it is taken up by the scratchpad

  • @bamh1re318
    @bamh1re318 several months ago +2

    Can you please give a tutorial on how to load private data, train/RAG/evaluate, and deploy an open-source model on watsonx or another online platform (AWS, Azure, or Hugging Face)? Many thanks!
    BTW, Nemotron-4 broke down this noon (PST), maybe due to too many users. I was number 771 in the queue with a simple question, and it gave out some sort of communication error after two minutes of waiting.

    • @chrishayuk
      @chrishayuk  several months ago

      Sure, will add to the backlog

  • @ckpioo
    @ckpioo several months ago +1

    so this is why GPT-4o is so much better at maths

  • @rodneyericjohnson
    @rodneyericjohnson several months ago +1

    How can a full grown adult hide behind some decorations?

    • @chrishayuk
      @chrishayuk  several months ago

      Merry Christmas