WARNING: Bad News for LLM Fine-Tuning

  • Published Jul 7, 2024
  • 🔗 Links 🔗
    Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? Gekhman et al.!
    arxiv.org/pdf/2405.05904v2
    ❤️ If you want to support the channel ❤️
    Support here:
    Patreon - / 1littlecoder
    Ko-Fi - ko-fi.com/1littlecoder
    🧭 Follow me on 🧭
    Twitter - / 1littlecoder
    Linkedin - / amrrs
  • Science & Technology

Comments • 63

  • @unclecode
    @unclecode 12 days ago +18

    Hey, as usual, such a good paper you brought up. I read this paper a few days ago, and tbh I am a bit skeptical; let me share my points, I'd like to know yours. First, I totally agree with what fine-tuning is really about: it's not about teaching the model new facts, but helping it access its existing knowledge more efficiently or in specific ways. For example, if we want the model to always respond in JSON format, we'd fine-tune for that; we're not teaching it new info, just tweaking how it presents what it knows.
    Now, I've got three main concerns with this study:
    1/ They didn't mention how much of the model they actually fine-tuned. If they used something like LoRA, which is common, they're only training a tiny fraction of a massive model (see the sketch after this comment). That's a major flaw in their methodology, because fine-tuning a small portion of the model on unknown knowledge could just be adding noise to the model's activations, leading to hallucinations. This could invalidate their whole claim.
    2/ They only tested on a huge model like PaLM 2-M, which probably has over 340 billion parameters (if I am not wrong). What if the results are totally different for smaller models, like a 7B one? We really need to see this tested on a range of model sizes to draw any meaningful conclusions.
    3/ What if they fine-tuned most of the model, like 80%? That'd be more like pre-training, and the results could be way different. They didn't explore this scenario at all.
    These gaps make me skeptical about how useful or generalizable their findings really are. It feels like they missed some crucial aspects of how fine-tuning actually works in practice. I couldn't find these details in their paper. To be honest, I didn't go through it in detail; perhaps I have to check it again.
    Kudos to your taste in selecting papers for your channel.
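
    To make the LoRA point in 1/ concrete: a minimal sketch, assuming a Hugging Face transformers + peft setup (the model name is only an example, and this is not necessarily what the paper's authors did), showing how small the trainable fraction is when fine-tuning with LoRA.

        # Minimal LoRA sketch: only small adapter matrices are trained; the base weights stay frozen.
        from transformers import AutoModelForCausalLM
        from peft import LoraConfig, get_peft_model

        model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model

        lora_config = LoraConfig(
            r=8,                                  # rank of the adapter matrices
            lora_alpha=16,                        # scaling factor for the adapter updates
            target_modules=["q_proj", "v_proj"],  # adapters only on the attention projections
            lora_dropout=0.05,
            task_type="CAUSAL_LM",
        )

        model = get_peft_model(model, lora_config)
        # Typically reports well under 1% of parameters as trainable, which is the
        # commenter's concern: almost all of the original model is left untouched.
        model.print_trainable_parameters()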

    • @delojikuhlii1
      @delojikuhlii1 11 days ago +1

      What do you think about prompt tuning/learning as a solution? It showed good results without finetuning the whole model. Is there some similar study for this approach?

    • @davidlepold
      @davidlepold 11 days ago

      Prompt tuning? U mean "simple" prompt optimization? @@delojikuhlii1

    • @mirek190
      @mirek190 11 days ago

      You just described teaching the model to follow instructions better. That is a different learning method.

    • @unclecode
      @unclecode 10 days ago

      @@delojikuhlii1 Very good point. In-context learning and prompt tuning are fabulous, with a really interesting impact. I suggest you check out Anthropic's docs; they've done a lot of intriguing research, and their new 3.5 model uses this approach effectively. One community member deciphered the system prompt, showing how much we can improve a model with in-context learning.
      The best approach is to start with in-context learning and experiment. You might think you're at the edge of the model's ability, but often a new trick works better. If in-context learning hits its limit, then play around with RAG, which is another form of in-context learning, injecting facts and knowledge.
      When you see the model's issue isn't about knowledge but the response style, it's time for fine-tuning. Many times, a small training dataset is enough to improve the model without causing confusion or hallucinations. LLMs are trained to be chatty and provide answers no matter what.
      So, start with in-context learning, move to RAG, and then fine-tuning. In fine-tuning, consider the kind of data, the amount, and how much of the model's parameters you want to fine-tune. Decide which layers to freeze and which to make trainable (see the sketch below).
      There's a lot to discuss, and it's really fun. I suggest everyone explore this, as it helps you understand how these models think and act.
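
    On "decide which layers to freeze and which to make trainable": a minimal sketch in plain transformers/PyTorch (GPT-2 is used only because its module names are well known; other architectures name their blocks differently).

        # Freeze everything, then unfreeze only the last two transformer blocks and the final layer norm.
        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained("gpt2")

        for param in model.parameters():
            param.requires_grad = False        # start fully frozen

        for name, param in model.named_parameters():
            if name.startswith(("transformer.h.10.", "transformer.h.11.", "transformer.ln_f")):
                param.requires_grad = True     # train only the top blocks

        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in model.parameters())
        print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")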

  • @nocturnomedieval
    @nocturnomedieval 12 days ago +2

    Great. There is also a late-June paper that appeared in the Nature journal, applying semantic entropy to detect hallucinations. You can keep it in the backlog for calmer weeks.
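
    For context, the semantic-entropy idea in that Nature paper is roughly: sample several answers to the same question, cluster the ones that mean the same thing, and measure the entropy over the clusters. A rough sketch; generate_answer and means_the_same are hypothetical helpers you would back with your own model and an entailment check.

        import math

        def semantic_entropy(question: str, n_samples: int = 10) -> float:
            # Sample several answers from the model at a non-zero temperature.
            answers = [generate_answer(question, temperature=1.0) for _ in range(n_samples)]

            # Greedy clustering: group answers that express the same meaning.
            clusters: list[list[str]] = []
            for ans in answers:
                for cluster in clusters:
                    if means_the_same(ans, cluster[0]):
                        cluster.append(ans)
                        break
                else:
                    clusters.append([ans])

            # Many mutually inconsistent meanings -> high entropy -> likely confabulation.
            probs = [len(c) / n_samples for c in clusters]
            return -sum(p * math.log(p) for p in probs)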

  • @fhsp17
    @fhsp17 11 days ago +6

    Thumbnail: STOP FINE-TUNING.
    Opens the video. It's discussing a Google paper with a clickbait title. They state way more than they are entitled to by this paper, as if it were a general answer for every use case and every method. It's just for their own controlled closed-book setup, using precisely whatever method they used to fine-tune (which should be meticulously described in the paper for validation, because it's the only one the results are useful for). No more. Lol.

  • @Macorelppa
    @Macorelppa 12 days ago +2

    Love your videos 😊

  • @dlyog
    @dlyog 12 days ago

    Great work and completely agree

  • @marcfruchtman9473
    @marcfruchtman9473 11 days ago +2

    I have to say I am not convinced. We would need to see more "examples" where this adversely affects different models. Also, my guess is that the method used to fine-tune will have a different effect.
    Another issue is that I am not seeing too much in the way of "specifics". I would like to be able to see the example set of all questions with answers (without fine-tuning) vs. the hallucinated responses from the fine-tuned model, to see how it correlates with their definitions of hallucinations.

    • @1littlecoder
      @1littlecoder  11 days ago

      There was another paper about fine-tuning and knowledge forgetting, let me see if I can find it!

  • @aks8285
    @aks8285 11 days ago

    This I could correlate with my experience with vision models; they also perform similarly on fine-tuning, like you said.

  • @SonGoku-pc7jl
    @SonGoku-pc7jl 11 days ago

    Yes, the best example of fine-tuning I see is for the style of speech, e.g. making it sound like somebody by fine-tuning on a lot of their interview transcripts. As you said, it serves the style.

  • @DB-Barrelmaker
    @DB-Barrelmaker 11 days ago

    I thought since last year that the miracle of LLMs was that they managed to understand referencing, a.k.a. linguistic pointers. The increase in hallucination upon fine-tuning clearly points to a negative on that front.
    That means the door is open!

  • @KevinKreger
    @KevinKreger 11 days ago

    I can enhance hallucinations with one ICL example if there is a near void in that space.

  • @neffex-purpose
    @neffex-purpose 11 days ago

    @1littlecoder Could you please post a video on DSPy?

  • @Basant5911
    @Basant5911 11 days ago

    Fine-tuning creates misalignment in the weights, hence do it with caution.

  • @medirobot96
    @medirobot96 11 days ago +2

    How do we know whether the data we use in fine-tuning is unknown to the LLM or not?

    • @mirek190
      @mirek190 11 days ago

      Ask the model?
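
    One concrete way to approach the question above, roughly in the spirit of the paper's Known/Unknown categorisation (the paper samples the model with few-shot prompts and checks against the gold answer). sample_answer and matches_gold are hypothetical helpers around your own model and scoring metric.

        # Sketch: before fine-tuning, sample the base model on each QA pair and see whether
        # it ever produces the gold answer. Zero hits -> treat the fact as "unknown" to the model.
        def is_known(question: str, gold_answer: str, n_samples: int = 10) -> bool:
            hits = sum(
                matches_gold(sample_answer(question, temperature=0.7), gold_answer)
                for _ in range(n_samples)
            )
            return hits > 0

        # Usage: split a fine-tuning set into known vs. unknown examples before training.
        # known = [(q, a) for q, a in dataset if is_known(q, a)]
        # unknown = [(q, a) for q, a in dataset if not is_known(q, a)]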

  • @Tony-cw6om
    @Tony-cw6om 12 days ago +3

    Where can we find similar papers, to know what's happening and learn new things?

    • @supreme4256
      @supreme4256 12 days ago

      I wonder too. How can we know that this is the most up-to-date thing we should know about?

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 12 days ago

      Hugging Face papers. arXiv.

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 12 days ago

      Papers with Code

    • @Tony-cw6om
      @Tony-cw6om 12 days ago

      Thanks, I'll look at them. Let me know if there are other websites/sources as well.

  • @noorahmadharal
    @noorahmadharal 11 days ago

    How do you find the new papers on LLMs?

  • @testales
    @testales 11 days ago

    I wonder what the implications of this are for finetuning diffusion models or whether that is a completely different story.

  • @ChristianNode
    @ChristianNode 11 days ago

    just fully retrain the model on the new data.

  • @therobotocracy
    @therobotocracy 11 days ago

    How about diffusion models? Fine tuning is night and day!

  • @__________________________6910
    @__________________________6910 11 days ago

    Ohhh Noooo

  • @elon-69-musk
    @elon-69-musk 11 days ago

    👍

  • @Cat-vs7rc
    @Cat-vs7rc 12 days ago +1

    Fine-tuning is just additional pre-training.

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 12 days ago

      No, far from it

    • @Cat-vs7rc
      @Cat-vs7rc 11 days ago

      @@MichaelBarry-gz9xl why is it far from it?

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 11 days ago

      @@Cat-vs7rc They're completely different. After pretraining the model is imbued with knowledge, but it spits out random garbage. Fine-tuning is for showing it how to format its outputs correctly. Otherwise it just goes off on a tangent.

    • @Cat-vs7rc
      @Cat-vs7rc 10 days ago

      @@MichaelBarry-gz9xl But the video says don't finetune ;)

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 10 days ago

      @@Cat-vs7rc At this point I can't remember what the video says in its entirety without rewatching. I doubt he says not to fine-tune; rather, I think he is expressing what fine-tuning is and what it isn't, when to use it and when not.
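
    For what it's worth, part of this disagreement is about data rather than mechanics: continued pretraining and supervised fine-tuning usually minimise the same next-token loss, and the practical difference is mostly raw documents versus formatted instruction-response pairs. A small illustration (the example texts are made up, and the loss helper assumes a Hugging Face-style causal LM).

        # Same objective, different data: raw text imbues knowledge, instruction pairs teach format.
        pretraining_text = (
            "The Eiffel Tower was completed in 1889 for the Exposition Universelle in Paris..."
        )
        finetuning_text = (
            "### Instruction:\nWhen was the Eiffel Tower completed?\n\n"
            "### Response:\nIt was completed in 1889."
        )

        def next_token_loss(model, tokenizer, text):
            # Identical loss computation for both kinds of data:
            # predict every token from the tokens that precede it.
            batch = tokenizer(text, return_tensors="pt")
            return model(**batch, labels=batch["input_ids"]).loss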

  • @freeideas
    @freeideas 12 days ago +7

    I find this disturbing. How, then, do we give an LLM new knowledge? RAG makes prompt size quite a bit larger and more expensive, and there are a few pieces of information that will be fed to the LLM in the prompt over and over. Seems way more efficient to teach the LLM. One example: baby otters are very good swimmers but they can't dive because too much air is embedded into their fur. This is too obscure for most LLMs to know, but this information will dramatically affect the quality of reasoning about the lives of baby otters. Do I need to feed that plus 1000 other obscure truths into an LLM's prompt every time the LLM is used? Apologies if the answer is already in the video, but it was not clear to my simple mind. :)

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 12 days ago +3

      Continued pretraining imbues it with new knowledge. Fine-tuning only affects the "style", i.e. the way that it expresses that which it already knows. That being said, a mixture of RAG and FT is about as good as you'll get, unless you've got a small fortune to spend.

    • @freeideas
      @freeideas 12 days ago

      So let's say I want to teach an LLM all about the Star Trek show. If I fine-tune it on the Star Trek wikipedia, the Klingon dictionary, and the transcript of every Star Trek episode and movie, can I at least depend on that model to get Star Trek questions correct most of the time, perhaps at the expense of real-world knowledge? Using RAG for this purpose would probably not work well for questions like "which two species would be most likely to ally against Humans?", because, given a large number of species, too many vectors would be pulled to feasibly compare every pair. But good background knowledge might lead to an easy, plausible answer.

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 11 days ago +3

      First, check to see if the knowledge already exists in the LLM. If so, you're good to go. But if it's not already in there, it's going to hallucinate like crazy. It depends on whether the new data is in or out of distribution of the previous data (think Venn diagrams): the wider the gap between the "circles", the greater the hallucinations. There's a good chance models like Llama 3 already have what you want, considering the sheer volume of pretraining data. But let's pretend it knows nothing about Star Trek. Then you implement a vector database and use FT to teach it how to pull the data out of the vector database (see the sketch after this thread). It will be stuck with the reasoning abilities it already has, and you're effectively specialising it into one task at the expense of others, but it will do the trick. It won't have "depth of understanding"; it will basically be Google on steroids. But if the data is in distribution (overlapping Venn diagrams), then it will have a deep "depth of knowledge". That's the difference between pretraining (deeply inter-connected neurons) vs finetuning (a shallow layer of loosely connected neurons).

    • @MichaelBarry-gz9xl
      @MichaelBarry-gz9xl 11 days ago +1

      @@freeideas Also, the best data by far would be everything you mentioned PLUS scraping every website you can find where people talk about Star Trek, i.e. chat logs. You may think those chat logs and forums are low quality, but believe me, unstructured data is far superior to structured wiki-like data. To have both is even better.

    • @freeideas
      @freeideas 11 days ago

      @@MichaelBarry-gz9xl Yes, that's good enough; at least I can teach an LLM knowledge about Star Trek, possibly causing it to forget knowledge about Star Wars and the real world. Of course we are both theorizing and speculating until I actually try it, but at least you made me feel better about trying. :)
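
    A bare-bones sketch of the vector-database half of the suggestion above: embed the documents, retrieve the closest ones to the question, and prepend them to the prompt. embed and generate are hypothetical wrappers around an embedding model and the LLM, and the prompt template is only an example.

        import numpy as np

        def build_index(documents):
            # One embedding vector per document, stacked into a matrix.
            return np.stack([embed(doc) for doc in documents])

        def retrieve(question, documents, index, k=3):
            # Cosine similarity between the question and every document.
            q = embed(question)
            sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
            return [documents[i] for i in np.argsort(sims)[::-1][:k]]

        def answer(question, documents, index):
            context = "\n\n".join(retrieve(question, documents, index))
            prompt = (
                "Use only the context below to answer.\n\n"
                f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
            )
            return generate(prompt)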

  • @user-yg2qv4kf4r
    @user-yg2qv4kf4r 11 days ago

    We don't have a proper way to build generative AI.

  • @AA-wp8pp
    @AA-wp8pp 11 days ago

    I am gonna start writing these shitty papers LMFAO... It never happened to my models; don't fine-tune it for A and then ask it B.

    • @AA-wp8pp
      @AA-wp8pp 11 days ago

      Also, hallucination is worse with RAG... there are also papers about that, so no need for writing one myself lol

    • @1littlecoder
      @1littlecoder  11 days ago

      Should we co-author one 😁😁😁

  • @msokokokokokok
    @msokokokokokok 11 days ago

    This is a shitty paper. Fine-tuning only works to re-orchestrate prior skills. New skills cannot be learnt in fine-tuning. Try answering in French with an English pre-trained model: it will not only screw up French but also English.

    • @1littlecoder
      @1littlecoder  11 days ago

      The paper says exactly what you said; why do you think it's not a good paper?

  • @ArtisanTony
    @ArtisanTony 11 days ago

    My experience is that the data you fine-tune with is amalgamated (blended), so you cannot get the exact responses you want.

  • @AbderrahmaneDiop-f4k
    @AbderrahmaneDiop-f4k 12 days ago +1

    first

    • @TensorTom
      @TensorTom 12 days ago +1

      second

    • @JoanApita
      @JoanApita 11 days ago +1

      third, I'm not hallucinating

  • @manojtiwari7754
    @manojtiwari7754 11 days ago

    Dude, change your clickbaity and scammy title.

    • @1littlecoder
      @1littlecoder  11 days ago +2

      Explain what's scamming in this?

    • @figs3284
      @figs3284 11 days ago +1

      @1littlecoder nothing wrong with your title bro.