Stanford CS224N | 2023 | Lecture 10 - Prompting, Reinforcement Learning from Human Feedback

  • Published on Nov 11, 2024

Comments • 34

  • @futuremojo
    @futuremojo several months ago +3

    I love this style of lecture where it starts from a basic solution, then runs into a problem which motivates the next level of refinement. Great teacher.

  • @ЕгорБедринский-щ2у
    @ЕгорБедринский-щ2у a year ago +14

    It's one of the most awesome lectures I have ever watched! The lecturer is wonderful!

  • @gemini_537
    @gemini_537 6 months ago +8

    Gemini: This lecture is about prompting, instruction fine-tuning, and RLHF, which are all techniques used to train large language models (LLMs). LLMs are trained on a massive amount of text data and are able to communicate and generate human-like text in response to a wide range of prompts and questions.
    The lecture starts with going over zero-shot and few-shot learning, which are techniques for getting LLMs to perform tasks they weren't explicitly trained for. In zero-shot learning, the LLM is given a natural language description of the task and asked to complete it. In few-shot learning, the LLM is given a few examples of the task before being asked to complete a new one.
    Then the lecture dives into instruction fine-tuning, which is a technique for improving the performance of LLMs on a specific task by fine-tuning them on a dataset of human-written instructions and corresponding outputs. For example, you could fine-tune an LLM on a dataset of movie summaries and their corresponding reviews to improve its ability to summarize movies.
    Finally, the lecture discusses reinforcement learning from human feedback (RLHF), which is a technique for training LLMs using human feedback. In RLHF, the LLM is given a task and then asked to complete it. A human expert then evaluates the LLM's output and provides feedback. This feedback is then used to improve the LLM's performance on the task.
    The lecture concludes by discussing some of the challenges and limitations of RLHF, as well as the potential future directions for this field. One challenge is that it can be difficult to get humans to provide high-quality feedback, especially for complex tasks. Another challenge is that RLHF can be computationally expensive. However, RLHF is a promising technique for training LLMs to perform a wide range of tasks, and it is an area of active research.
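
    As a rough illustration of the zero-shot vs. few-shot distinction described in this summary, here is a minimal Python sketch that builds both kinds of prompts for a made-up sentiment task; the `complete` function is a hypothetical stand-in for a real LLM call, not anything from the lecture.

    ```python
    # Minimal sketch of zero-shot vs. few-shot prompting (illustrative only).

    def complete(prompt: str) -> str:
        """Hypothetical stand-in for a call to some language model API."""
        return f"<model completion for a {len(prompt)}-character prompt>"

    # Zero-shot: describe the task in natural language, give no examples.
    zero_shot_prompt = (
        "Classify the sentiment of the following review as positive or negative.\n"
        "Review: The movie was a waste of two hours.\n"
        "Sentiment:"
    )

    # Few-shot (in-context learning): prepend a few worked examples of the task.
    few_shot_prompt = (
        "Review: An instant classic, I loved every minute.\n"
        "Sentiment: positive\n\n"
        "Review: The plot made no sense at all.\n"
        "Sentiment: negative\n\n"
        "Review: The movie was a waste of two hours.\n"
        "Sentiment:"
    )

    print(complete(zero_shot_prompt))
    print(complete(few_shot_prompt))
    ```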

  • @khalilbrahemkbr3584
    @khalilbrahemkbr3584 5 months ago +1

    Great lecture! Thank you Stanford and the lecturer for making this public

  • @ericchang9568
    @ericchang9568 2 months ago +1

    57:16 They did RLHF on the IFT model instead of the PT model because instructions are necessary for multi-task capability, e.g. summarization, Q&A, etc.; those capabilities are all 'infused' during the IFT phase.

  • @akhileshgotmare9812
    @akhileshgotmare9812 6 months ago +1

    The question at 50:10 is interesting! To combat this to a certain extent, what the Llama2 authors did was to collect annotator preference responses on a 4-point scale and use that to include a margin component in the RM training loss. See Sections 3.2.1 and 3.2.2 in the Llama2 paper. They report that the margin component can improve reward model accuracy.
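
    In case it helps, here is a rough PyTorch sketch (my own, not from the lecture or the Llama2 codebase) of a pairwise reward-model loss with a margin term in the spirit of Llama2 Section 3.2.2; all the numbers below are dummy placeholders, not real model outputs.

    ```python
    # Sketch of a pairwise reward-model loss with a preference-strength margin:
    # L = -log(sigmoid(r_chosen - r_rejected - m)), where m grows with how
    # strongly annotators preferred the chosen response.
    import torch
    import torch.nn.functional as F

    def reward_loss_with_margin(r_chosen, r_rejected, margin):
        # r_chosen / r_rejected: scalar reward-model scores for the preferred
        # and dispreferred responses in each comparison.
        return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

    # Dummy batch of 3 comparisons (placeholder numbers only).
    r_chosen = torch.tensor([1.2, 0.3, 2.0])
    r_rejected = torch.tensor([0.1, 0.2, -0.5])
    margin = torch.tensor([1.0, 0.0, 0.5])  # larger margin = stronger stated preference

    print(reward_loss_with_margin(r_chosen, r_rejected, margin))
    ```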

  • @susdoge3767
    @susdoge3767 5 months ago

    By far the best lecture on modern LLMs, great to witness this

  • @philippvetter2856
    @philippvetter2856 10 months ago +2

    Amazing lecture, really well presented.

  • @ReflectionOcean
    @ReflectionOcean 10 months ago +2

    - Utilize prompting and instruction fine-tuning to align language models with user intent (start: 25:29).
    - Implement penalty terms in RLHF to prevent models from deviating too far from pre-trained baselines (start: 52:14); a small sketch of this penalty appears after this list.
    - Train reward models on human comparisons instead of direct human responses for more reliable reinforcement learning (start: 47:09).
    - Normalize reward model scores post-training for better reinforcement learning outcomes (start: 49:27).
    - Explore reinforcement learning from AI feedback to reduce human data requirements (start: 1:11:12).
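
    To make the penalty-term bullet concrete, here is a small sketch (a paraphrase under my own assumptions, not code from the lecture) of how a KL penalty against a frozen reference model is commonly folded into the reward during RLHF; the tensors are placeholders.

    ```python
    # Sketch of the KL-penalized reward often used in RLHF-style training:
    # R(x, y) = r_RM(x, y) - beta * [log p_policy(y|x) - log p_ref(y|x)]
    import torch

    def penalized_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
        # rm_score: reward-model score for the sampled response.
        # logprob_policy / logprob_ref: summed log-probs of the response under
        # the current policy and the frozen reference (pretrained or SFT) model.
        kl_term = logprob_policy - logprob_ref  # per-sample KL estimate
        return rm_score - beta * kl_term

    rm_score = torch.tensor([0.8, 1.5])            # dummy reward-model scores
    logprob_policy = torch.tensor([-42.0, -35.0])  # dummy log-probs under the policy
    logprob_ref = torch.tensor([-45.0, -36.0])     # dummy log-probs under the reference

    print(penalized_reward(rm_score, logprob_policy, logprob_ref))
    ```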

  • @mavichovizana5460
    @mavichovizana5460 7 months ago

    great lecture! very helpful!

  • @willlannin2381
    @willlannin2381 8 months ago

    Fantastic lecture, thank you

  • @Pingu_astrocat21
    @Pingu_astrocat21 6 months ago +1

    Thank you for uploading this lecture :)

  • @uraskarg710
    @uraskarg710 a year ago +2

    Great Lecture! Thanks!

  • @ningzeng2239
    @ningzeng2239 11 months ago +1

    great lecture, tks!

  • @HoriaCristescu
    @HoriaCristescu 10 months ago

    1. RLHF updates the model based on whole sequences. Does that carry special meaning? Next-token prediction focuses too much on the short term, so this shifts the focus to the whole answer.
    2. RLHF uses model-generated outputs for training, so it is on-policy data. Does that make it more effective than training on random internet text?

    • @susdoge3767
      @susdoge3767 5 months ago

      Does backprop with PPO here mean updating the decoder-only model itself, or just a separate policy network? My understanding so far is that we have a pretty good decoder-only model that can summarise well enough (fine-tuned on tons of data), but the reward function and PPO are there to align it more closely with human preferences. Please correct me if I am wrong anywhere; I would like to know your insights!

  • @dontwannabefound
    @dontwannabefound 5 months ago

    38:20 for RLHF

  • @ThamBui-ll7qc
    @ThamBui-ll7qc 6 months ago

    If a model is just instruction-finetuned without any RLHF, does hallucination occur?

  • @asdf_12345
    @asdf_12345 16 days ago

    22:40

  • @munzutai
    @munzutai 11 months ago +1

    Are there any promising strategies to reduce the amount of data that's necessary to do RLHF?

  • @Andrewlim90
    @Andrewlim90 11 months ago +1

    I really like the question at 00:34:20! Anyone know if this is being explored? People who can produce questions like this seem like they'd make excellent researchers. I'm jealous.

    • @DrumsBah
      @DrumsBah 11 months ago

      Embedding distance is commonly utilised in contrastive-loss-based optimisation. It's actually been shown to be useful for training sentence embeddings; see SimCSE. Of course, in that setting the embeddings are directly of interest.
      I could imagine it being used for alignment. However, it has significant disadvantages compared to reward-model approaches in terms of how well embedding distance actually relates to human preference.
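
      For readers unfamiliar with the contrastive losses mentioned here, below is a generic in-batch contrastive (InfoNCE-style) loss over sentence embeddings, loosely in the spirit of SimCSE; the embeddings are random placeholders and this is not the SimCSE reference implementation.

      ```python
      # Generic in-batch contrastive (InfoNCE-style) loss over sentence embeddings.
      import torch
      import torch.nn.functional as F

      def contrastive_loss(emb_a, emb_b, temperature=0.05):
          # emb_a[i] and emb_b[i] are two embeddings of the same sentence (positives);
          # every other pairing in the batch is treated as a negative.
          emb_a = F.normalize(emb_a, dim=-1)
          emb_b = F.normalize(emb_b, dim=-1)
          sim = emb_a @ emb_b.T / temperature   # cosine similarity matrix
          labels = torch.arange(sim.size(0))    # positives sit on the diagonal
          return F.cross_entropy(sim, labels)

      batch, dim = 8, 128
      emb_a = torch.randn(batch, dim)  # e.g. sentences passed through the encoder once
      emb_b = torch.randn(batch, dim)  # same sentences with different dropout masks (SimCSE trick)

      print(contrastive_loss(emb_a, emb_b))
      ```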

  • @buoyrina9669
    @buoyrina9669 a year ago +2

    At 44:35, does theta_t refer to the LM's entire parameters?

    • @DrumsBah
      @DrumsBah 11 months ago +1

      In the case of InstructGPT, theta was the full parameter set of the foundation model. However, there's no reason RLHF couldn't be performed on the head or adapter parameters (e.g. LoRA) instead.
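
      As a rough illustration of the adapter option mentioned here, this is a minimal LoRA-style linear layer; the rank and scaling choices are arbitrary, and the sketch is not tied to any particular RLHF implementation.

      ```python
      # Minimal LoRA-style adapter around a frozen linear layer (illustrative only).
      # In RLHF with adapters, only A and B below would receive gradient updates.
      import torch
      import torch.nn as nn

      class LoRALinear(nn.Module):
          def __init__(self, in_dim, out_dim, rank=8, alpha=16):
              super().__init__()
              self.base = nn.Linear(in_dim, out_dim)   # frozen pretrained projection
              self.base.weight.requires_grad_(False)
              self.base.bias.requires_grad_(False)
              self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # trainable low-rank factors
              self.B = nn.Parameter(torch.zeros(out_dim, rank))
              self.scale = alpha / rank

          def forward(self, x):
              # Output = frozen base projection + scaled low-rank update (B @ A) applied to x.
              return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

      layer = LoRALinear(in_dim=64, out_dim=64)
      x = torch.randn(2, 64)
      print(layer(x).shape)  # torch.Size([2, 64])
      ```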

  • @theneumann7
    @theneumann7 11 months ago

    👌

  • @marshallmcluhan33
    @marshallmcluhan33 a year ago

    Cool 😎

  • @isalutfi
    @isalutfi a year ago

    💙💙💙

  • @梁某某-t3l
    @梁某某-t3l 7 months ago +1

    Great lecture! very helpful!