Distillation of Transformer Models

  • Published on 11 Jan 2025

Comments • 17

  • @gregmeldrum 3 months ago +3

    Thank you for consistently producing such in-depth, informative content. Your long-format videos are a treasure trove of knowledge. Really appreciate the effort you put into making these detailed explanations!

  • @sergerylenberg8711 3 months ago

    Thank you! This is fascinating AND instructive. You have a true talent for explaining complex ideas.

  • @loicbaconnier9150 3 months ago

    Always an excellent share, congratulations

  • @EternalKernel 3 months ago +1

    Nice work. So glad you go through these processes in such depth. Question: in this video you cover distillation with the goal of keeping as much of the original model's knowledge and functionality as possible. But what if you are really only interested in a smaller domain of that functionality? I would assume that, instead of using 2% of whatever dataset, you could use even fewer samples from a compatible dataset? You would end up with a very small, very specialized model that may be better than the original in your specific domain?
    Even better if I could train locally on a single 3090.

    • @TrelisResearch 3 months ago

      Yes perhaps.
      The thing is that the background knowledge may provide useful scaffolding for your smaller subset of knowledge.
      My guess is that you should distill on 2% plus your subset of data.
      And yes, if you are working with models under 1B parameters, then distilling on local hardware is possible. Much bigger is hard, although perhaps, with GaLore-style approaches or Adafactor, you could do a 4-5B model.
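
The "2% plus your subset" suggestion above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `build_distillation_mix`, its parameters, and the placeholder corpora are hypothetical and do not come from the video or from any library.

```python
import random


def build_distillation_mix(general, domain, general_fraction=0.02, seed=0):
    """Sample a fraction of the general corpus and add every domain sample.

    All names here are hypothetical; 0.02 mirrors the 2% figure in the thread.
    """
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    # Keep at least one general sample, but never more than are available.
    n_general = min(len(general), max(1, round(len(general) * general_fraction)))
    mix = rng.sample(general, n_general) + list(domain)
    rng.shuffle(mix)  # interleave general and domain samples
    return mix


# Placeholder data standing in for real training samples.
general_corpus = [f"general-{i}" for i in range(1000)]
domain_corpus = [f"domain-{i}" for i in range(50)]

mix = build_distillation_mix(general_corpus, domain_corpus)
print(len(mix))  # 20 general samples + 50 domain samples = 70
```

Whether 2% is the right amount of background data for a given domain is an open question; the fraction is exposed as a parameter so it can be tuned.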

  • @EternalKernel 3 months ago +1

    How would distillation compare to architecture search when you are only concerned with a smaller domain? For instance, in T2I, only pictures of animals. Would it take less compute in total to find and train a NOVEL 100M-param architecture vs. a 4B-param distilled model?
    I feel like there is more work to be done in model architecture.

    • @TrelisResearch 3 months ago +1

      Well, if the task you're developing a model for is novel, you may not be able to distill.
      However, maybe you could distill and then fine-tune, or fine-tune and then distill from that.

    • @EternalKernel 3 months ago

      @TrelisResearch Thank you. The purpose of the exercise would mainly be to find a new layer or sub-layer architecture for the same task as the original model.

  • @btaranto 3 months ago

    Hi! What coding models smaller than 48 GB do you recommend? Do you have any fine-tuned ones?

    • @TrelisResearch 3 months ago +1

      Check the latest Qwen and DeepSeek models.

  • @SiD-hq2fo 3 months ago

    Very helpful, thanks Trelis.
    Also, is there a Discord server we can join to get connected?

    • @TrelisResearch 3 months ago

      There is, but fair warning: it's paid lifetime access. You can find some free and paid options for support at trelis.com/about, though.

  • @danieladama8105 3 months ago

    Nice 🔥🔥🔥