🐐Llama 3 Fine-Tune with RLHF [Free Colab 👇🏽]

  • Published Sep 6, 2024

Comments • 61

  • @_SHRUTIDAYAMA
    @_SHRUTIDAYAMA 8 months ago +2

    Hey,
    this video is really helpful.
    Can you please tell me how to give input and generate output after step 3?
    Also, when we create a UI, how will the feedback from the UI be given to the policy model?
    Can you please make a video on it? It would be really helpful!
    Thanks :)

    • @WhisperingAI
      @WhisperingAI  8 months ago +1

      Sure, I will try creating a short video for it within a couple of days.

    • @_SHRUTIDAYAMA
      @_SHRUTIDAYAMA 8 months ago +1

      That would really be helpful!
      Thanks :) @@WhisperingAI

    • @shrutidayama8193
      @shrutidayama8193 8 months ago

      Hey,
      It would really be helpful if you make it... please help me.

    • @WhisperingAI
      @WhisperingAI  8 months ago

      @@shrutidayama8193 It will be uploaded tomorrow. Thanks

  • @brainybotnlp
    @brainybotnlp 1 year ago +3

    Great content.

  • @user-rl8di9dc9l
    @user-rl8di9dc9l 11 months ago +2

    This video is really helpful. Thanks for sharing 🙌 But could you please share with us which versions of the libraries you are using?

    • @WhisperingAI
      @WhisperingAI  11 months ago

      All the libraries are the up-to-date ones, but I believe transformers was 4.29.

  • @rabin1620
    @rabin1620 1 year ago +1

    Excellent topic

    • @WhisperingAI
      @WhisperingAI  1 year ago

      Thank you. Glad you liked it

  • @dibyanshuchatterjee4126
    @dibyanshuchatterjee4126 5 months ago +1

    Great video. Just a quick question: is it possible to intercept just the reward model's output for an LLM response, before the reward produced for each response goes from the reward model into the LLM? Meaning, is there any way to just use the reward model to see which LLM responses were good vs. bad and store those results?

    • @WhisperingAI
      @WhisperingAI  5 months ago

      Yes, you can. In step 3 there is a line which takes the result from the policy model and passes it to the reward model for a score.
      You can print the output (see the sketch below).
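
      A minimal sketch of scoring the policy model's responses with the reward model and keeping the scores around, before they are fed back into PPO. The reward-model path and variable names here are illustrative, not taken from the notebook:

      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      # Illustrative path: wherever the trained reward model was saved.
      reward_tokenizer = AutoTokenizer.from_pretrained("summarization_reward_model/")
      reward_model = AutoModelForSequenceClassification.from_pretrained("summarization_reward_model/")
      if reward_tokenizer.pad_token is None:
          reward_tokenizer.pad_token = reward_tokenizer.eos_token

      def score_responses(prompts, responses):
          """Return one scalar reward per (prompt, response) pair and log it."""
          texts = [p + r for p, r in zip(prompts, responses)]
          inputs = reward_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
          with torch.no_grad():
              # the reward model is a sequence-classification head, so one logit per sequence
              scores = reward_model(**inputs).logits.squeeze(-1)
          for text, score in zip(texts, scores.tolist()):
              print(f"reward={score:.3f}  text={text[:80]!r}")   # inspect / store good vs. bad here
          return scores

      # e.g. score_responses(batch["query"], batch["response"]) gives the same values
      # that are later passed to ppo_trainer.step(...)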

  • @user-cr1sk9fq6o
    @user-cr1sk9fq6o 1 year ago +1

    Thanks for your insightful sharing. The fine-tuned Llama 2 model returns an incomplete last sentence. Do you have ways to solve this?

    • @WhisperingAI
      @WhisperingAI  1 year ago +1

      Try increasing the max length during inference (see the sketch below).
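
      A minimal sketch of what that could look like with the transformers generate API; the checkpoint path, prompt, and token limit are illustrative:

      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("summarization_policy_new/")   # illustrative checkpoint
      model = AutoModelForCausalLM.from_pretrained("summarization_policy_new/")

      inputs = tokenizer("Summarize: <your text here>", return_tensors="pt")
      # Raise max_new_tokens so the generation is not cut off mid-sentence.
      outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))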

  • @HaroldKouadio-gj7uw
    @HaroldKouadio-gj7uw 3 months ago

    What about doing a translation task with the LLMs and reinforcing it with RLHF?

    • @WhisperingAI
      @WhisperingAI  3 months ago

      We can do that.

  • @ivanleung6034
    @ivanleung6034 7 months ago +1

    I notice the reward model structure is the same as the fine-tuned model. As someone said, we can use a small model with far fewer parameters and layers as the reward model; that would work too, right?

    • @WhisperingAI
      @WhisperingAI  7 months ago

      That works, but in the case of the reward model it is basically a sequence-classification model with one head, so the output produced is only a single logit; I believe that is handled internally by the trl library.
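
      A minimal sketch of that single-logit setup with a small backbone; the checkpoint name is only an example:

      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      # Any small backbone can serve as the reward model; num_labels=1 gives the single-logit head.
      tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
      reward_model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=1)

      inputs = tokenizer("prompt text " + "candidate response", return_tensors="pt")
      with torch.no_grad():
          score = reward_model(**inputs).logits.squeeze(-1)   # shape (1,): one scalar reward
      print(score.item())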

  • @mahmoudmohamed-lr9ql
    @mahmoudmohamed-lr9ql 4 months ago +1

    Does this use a reference model and KL divergence?

    • @WhisperingAI
      @WhisperingAI  4 months ago

      Yes, it uses both (a sketch of where they appear is below).
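
      Roughly how the reference model and KL penalty show up in the (older) trl PPO API; exact argument names depend on the trl version, and the paths are illustrative:

      from transformers import AutoTokenizer
      from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, create_reference_model

      tokenizer = AutoTokenizer.from_pretrained("summarization_policy_new/")
      policy = AutoModelForCausalLMWithValueHead.from_pretrained("summarization_policy_new/")
      ref_model = create_reference_model(policy)   # frozen copy used only for the KL term

      config = PPOConfig(
          init_kl_coef=0.2,    # weight of the KL penalty against the reference model
          adap_kl_ctrl=True,   # adapt that coefficient toward a target KL during training
      )
      ppo_trainer = PPOTrainer(config=config, model=policy, ref_model=ref_model, tokenizer=tokenizer)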

  • @talhaanwar2911
    @talhaanwar2911 1 year ago +1

    thanks

  • @cookiearmy2960
    @cookiearmy2960 6 months ago

    How is the reward model trained? Can anyone explain in detail? I know that we used the starcoder model with chosen and rejected input ids, but how are these mapped to a particular score? Since the output of the reward model is not binary and it returns logits as its output, how is it done here?
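
    For context, chosen/rejected pairs are usually trained with a pairwise ranking objective (the one trl's RewardTrainer implements): the model only has to score the chosen response higher than the rejected one, which is why the outputs are unbounded logits rather than fixed labels. A minimal sketch with illustrative inputs:

    import torch
    import torch.nn.functional as F

    def reward_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
        # Each score is the single logit the reward model produces for one sequence.
        # The loss only pushes score_chosen above score_rejected; there is no fixed
        # "correct" score, so the raw outputs stay unbounded logits.
        return -F.logsigmoid(score_chosen - score_rejected).mean()

    # e.g. with batched scalar scores from the reward model:
    loss = reward_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9]))
    print(loss.item())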

  • @mohamedsatti3038
    @mohamedsatti3038 1 year ago +1

    thank you

  • @talhaanwar2911
    @talhaanwar2911 1 year ago +1

    Can you create a tutorial on inference for this?

  • @user-gv1bc2kk4b
    @user-gv1bc2kk4b 10 months ago +1

    Can you share the Jupyter folder? I don't have any idea about the paths.

    • @WhisperingAI
      @WhisperingAI  10 months ago

      I have updated the code base path; let me know if you still don't understand.

    • @user-gv1bc2kk4b
      @user-gv1bc2kk4b 10 months ago

      @@WhisperingAI How do I train a chatbot iteratively with user feedback, i.e. train the chatbot over time with user interactions and feedback? How do I do that with pretrained models?

    • @WhisperingAI
      @WhisperingAI  9 months ago

      @@user-gv1bc2kk4b The answer is simple: you must keep retraining the model if you wish to train it with user feedback. Take the model out of production and train it with the data, including the older versions of the data with timestamps. Evaluate the model's performance and, if it is good, move the retrained model into production. If you don't want to do that, the RAG method can be used instead; you can check this video of mine if it helps:
      th-cam.com/video/Db3VmvFnx9I/w-d-xo.html

    • @denidugamage2096
      @denidugamage2096 9 months ago

      @@WhisperingAI My idea is creating a low providing chatbot. It's a group project, and others are doing the chatbot part and the NLP part. My part is RLHF. The chatbot dataset is constituted. How do I train my chatbot with user feedback? I'm asking you because I don't have an idea about it 🥲

    • @WhisperingAI
      @WhisperingAI  9 months ago

      Please watch my earlier video th-cam.com/video/B5dhaZPJQx0/w-d-xo.html
      if you want to do it for feedback. I have used the Amazon reviews there.

  • @rajeepthapa5426
    @rajeepthapa5426 3 months ago +1

    Are you Nepali?

  • @sanduntharaka4256
    @sanduntharaka4256 7 months ago +1

    Can we use the same code for Llama 2?

    • @WhisperingAI
      @WhisperingAI  7 months ago

      Yes, you can, but I guess you can't run it on Google Colab unless you use LoRA or 4-bit quantization (a sketch is below).
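
      A minimal sketch of loading Llama 2 in 4-bit with a LoRA adapter so it fits on a Colab GPU; the checkpoint name and hyperparameters are illustrative:

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
      from peft import LoraConfig, get_peft_model

      model_id = "meta-llama/Llama-2-7b-hf"   # illustrative; gated checkpoint, needs Hub access

      bnb_config = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=torch.bfloat16,
      )
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

      # Train only small low-rank adapter matrices instead of the full 7B weights.
      lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                               lora_dropout=0.05, task_type="CAUSAL_LM")
      model = get_peft_model(model, lora_config)
      model.print_trainable_parameters()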

    • @sanduntharaka4256
      @sanduntharaka4256 7 months ago

      @@WhisperingAI I'm using Kaggle notebooks. I have created the policy model, but in reward training it gives "IndexError: index out of range in self". Why?

    • @sanduntharaka4256
      @sanduntharaka4256 7 months ago +1

      @@WhisperingAI And I have executed your same code in a high-RAM environment, but it gives the same error: IndexError: index out of range in self. I want to apply RLHF to Llama 2. Your video is the only one I found that relates to RLHF.

    • @WhisperingAI
      @WhisperingAI  7 months ago +1

      There might be some issue while loading the dataset or tokenizing.
      Can you share which step you are facing this issue on?

    • @WhisperingAI
      @WhisperingAI  7 months ago +1

      Please check your dataloader and try running each step individually (a debugging sketch is below).
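
      For what it's worth, "IndexError: index out of range in self" is raised by nn.Embedding when a token id is at least as large as the embedding table, which often happens when the tokenizer and model don't match or a pad/special token was added without resizing. A minimal check, with illustrative names (train_dataloader, model, tokenizer):

      batch = next(iter(train_dataloader))      # run one batch at a time
      max_id = int(batch["input_ids"].max())
      vocab_size = model.get_input_embeddings().num_embeddings
      print("max token id:", max_id, "embedding rows:", vocab_size)

      if max_id >= vocab_size:
          # Typical fix when special tokens were added to the tokenizer only:
          model.resize_token_embeddings(len(tokenizer))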

  • @user-kl8ov7dg7d
    @user-kl8ov7dg7d 1 year ago +1

    First, thank you for the good video. I would like to ask you two questions.
    1) In the third part of the Colab code, th-cam.com/video/R2paulc3P2M/w-d-xo.html , I am confused about which model goes into the "model_path" of "starcoder_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_path)".
    Is "model_path" "bigcode/tiny_starcoder_py" or "summarization_policy_new/"? Which one is correct?
    2) Can I think of the first part ("Creating the policy model for human Evaluation") of the Colab code in the previous video as SFT training? And can I think of the resulting policy model as the SFT model?

    • @WhisperingAI
      @WhisperingAI  1 year ago

      That's actually the policy model that we trained in the first step, that is, summarization_policy_new/, since in step 3 we are refining the model from step 1 with the reward. Hope that clarifies it. If you have any questions, feel free to ask; I would love to help.
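
      In other words, something like this (both names come from the thread; the step-1 checkpoint, not the base model, goes into model_path):

      from trl import AutoModelForCausalLMWithValueHead

      # Step 3 loads the step-1 policy checkpoint, not the original bigcode/tiny_starcoder_py base.
      model_path = "summarization_policy_new/"
      starcoder_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_path)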

    • @user-kl8ov7dg7d
      @user-kl8ov7dg7d 1 year ago

      @@WhisperingAI Thank you. It's clear. Could you please answer the second question regarding the SFT?

    • @WhisperingAI
      @WhisperingAI  1 year ago

      For the second question:
      Don't think of it that way. In the first step we are simply fine-tuning the model (the model can be anything, like Llama, GPT, or StarCoder). SFT is just the library we use to avoid writing the PyTorch code for the dataloader and training loop. So after the first step the resulting policy model is a fine-tuned model, not an "SFT model".
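
      For reference, a step-1 style fine-tune with trl's SFTTrainer looks roughly like this sketch; the dataset, text column, and output path are illustrative, and exact arguments depend on the trl version:

      from datasets import load_dataset
      from transformers import TrainingArguments
      from trl import SFTTrainer

      dataset = load_dataset("CarperAI/openai_summarize_tldr", split="train[:1%]")   # illustrative data

      trainer = SFTTrainer(
          model="bigcode/tiny_starcoder_py",       # base model being fine-tuned in step 1
          train_dataset=dataset,
          dataset_text_field="prompt",             # column holding the training text (illustrative)
          max_seq_length=512,
          args=TrainingArguments(output_dir="summarization_policy_new/",
                                 per_device_train_batch_size=2, num_train_epochs=1),
      )
      trainer.train()
      trainer.save_model("summarization_policy_new/")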

    • @sayansamanta3775
      @sayansamanta3775 10 months ago

      @@WhisperingAI Hey, can you please tell which model we are using in the second step, where we have MODEL_PATH = "model/"? Is it bigcode/tiny_starcoder_py or the policy model trained in the first step?

  • @developer_deepak_bhattarai
    @developer_deepak_bhattarai 1 year ago

    Are you Nepali?