Fast Inference of Mixture-of-Experts Language Models with Offloading

  • Published Jul 28, 2024
  • In this video we review a recent important paper titled "Fast Inference of Mixture-of-Experts Language Models with Offloading".
    Mixture of Experts (MoE) is a widely used strategy for improving the efficiency of transformer-based large language models (LLMs).
    However, MoE models usually have a large memory footprint, since the weights of all experts must be loaded. This makes it hard to run MoE models on low-tier GPUs.
    This paper introduces a method to efficiently run transformer-based MoE LLMs in a limited-memory environment using offloading techniques. Specifically, the researchers are able to run Mixtral-8x7B on the free tier of Google Colab.
    In the video, we provide a reminder of how mixture of experts works, and then dive into the offloading method presented in this paper (a rough code sketch of the offloading idea follows the video details below).
    -----------------------------------------------------------------------------------------------
    Paper page - arxiv.org/abs/2312.17238
    Soft MoE - • Soft Mixture of Expert...
    Code - github.com/dvmazur/mixtral-of...
    Post - aipapersacademy.com/moe-offlo...
    -----------------------------------------------------------------------------------------------
    ✉️ Join the newsletter - aipapersacademy.com/newsletter/
    👍 Please like & subscribe if you enjoy this content
    We use VideoScribe to edit our videos - tidd.ly/44TZEiX (affiliate)
    -----------------------------------------------------------------------------------------------
    Chapters:
    0:00 Paper Introduction
    1:34 Mixture of Experts
    3:44 MoE Offloading
    10:29 Mixed MoE Quantization
    11:13 Inference Speed
  • Science & Technology
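
The repository linked above contains the authors' actual implementation. Purely as an illustration of the offloading idea discussed in the video, here is a minimal, hypothetical sketch of an LRU expert cache in PyTorch: expert weights live in CPU RAM and are moved to the GPU only when the router selects them, with the least recently used expert evicted when GPU memory runs out. All names here (LRUExpertCache, moe_layer, capacity, top_k) are assumptions made for the sketch, not the paper's API, and the paper's speculative expert prefetching and mixed quantization are omitted.

```python
# Illustrative sketch only; not the paper's actual code.
from collections import OrderedDict

import torch
import torch.nn as nn


class LRUExpertCache:
    """Keep at most `capacity` experts resident on the GPU; evict the least recently used back to CPU."""

    def __init__(self, experts, capacity, device="cuda"):
        self.experts = experts          # all expert modules, initially on CPU
        self.capacity = capacity        # number of experts that fit in GPU memory
        self.device = device
        self.on_gpu = OrderedDict()     # expert_id -> GPU-resident module, in LRU order

    def get(self, expert_id):
        if expert_id in self.on_gpu:
            self.on_gpu.move_to_end(expert_id)       # cache hit: mark as most recently used
            return self.on_gpu[expert_id]
        if len(self.on_gpu) >= self.capacity:        # no room: offload the LRU expert to CPU RAM
            evicted_id, evicted = self.on_gpu.popitem(last=False)
            self.experts[evicted_id] = evicted.to("cpu")
        expert = self.experts[expert_id].to(self.device)  # move the requested expert's weights to the GPU
        self.on_gpu[expert_id] = expert
        return expert


def moe_layer(x, router, cache, top_k=2):
    """Route each token to its top-k experts, fetching expert weights through the cache.

    Assumes x and the router already live on the GPU; only expert weights are offloaded.
    """
    scores = router(x)                                        # (num_tokens, num_experts)
    weights, ids = torch.topk(scores.softmax(dim=-1), top_k)  # top-k routing weights per token
    out = torch.zeros_like(x)
    for k in range(top_k):
        for eid in ids[:, k].unique().tolist():               # run each selected expert once per batch
            mask = ids[:, k] == eid
            out[mask] += weights[mask, k:k + 1] * cache.get(eid)(x[mask])
    return out


# Illustrative usage: 8 experts, but room for only 3 on the GPU at a time.
dim, num_experts = 1024, 8
experts = [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
           for _ in range(num_experts)]
router = nn.Linear(dim, num_experts).cuda()
cache = LRUExpertCache(experts, capacity=3)
y = moe_layer(torch.randn(16, dim, device="cuda"), router, cache)
```

An LRU policy is a natural fit here because, as discussed in the video, consecutive tokens tend to reuse recently activated experts, so cache hits avoid most CPU-to-GPU weight transfers.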

Comments • 6

  • @winterclimber7520 • 6 months ago • +5

    Very exciting work! The speed the paper reports won't break any land speed records (2-3 tokens per second), but in my experience one of the most productive and practical applications of LLMs is prompting them with multiple-choice questions, which require only a single output token.
    This paper (and the provided code!) bringing GPT-3.5 levels of inference to local consumer hardware is a huge breakthrough, and I'm excited to give it a try!

  • @jacksonmatysik8007 • 6 months ago • +1

    I have been looking for a channel like this for ages, as I hate reading.

  • @fernandos-bs6544 • 4 months ago

    I just found your channel. It is amazing. Congratulations. Your numbers will grow soon, I am sure. Great quality and great content.

  • @PaulSchwarzer-ou9sw • 6 months ago

    Thanks! ❤

  • @ameynaik2743 • 3 months ago

    I believe this is applicable only to a single request? With multiple requests, the choice of experts changes, so you will most likely have many experts active across the various requests. Is my understanding correct? Thank you.