Fast Inference of Mixture-of-Experts Language Models with Offloading
- Published on Jul 28, 2024
- In this video, we review an important recent paper titled "Fast Inference of Mixture-of-Experts Language Models with Offloading".
Mixture of Experts (MoE) is an important strategy for improving the efficiency of modern transformer-based large language models (LLMs).
However, MoE models usually have a large memory footprint, since the weights of all experts must be loaded. This makes it hard to run MoE models on low-tier GPUs.
This paper introduces a method to efficiently run transformer-based MoE LLMs in a limited-memory environment using offloading techniques. Specifically, the researchers are able to run Mixtral-8x7B on the free tier of Google Colab.
In the video, we provide a reminder for how mixture of experts works, and then dive into the offloading method presented in this paper.
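The core offloading trick covered in the video is treating GPU memory as an LRU cache of expert weights: because MoE routers tend to reuse the same experts across consecutive tokens, most lookups hit the cache and avoid a CPU-to-GPU transfer. Here is a toy Python sketch of that idea (our own illustration with made-up names and a made-up routing trace, not the authors' actual code):

```python
from collections import OrderedDict

# Toy model of expert offloading: only CACHE_SIZE experts per layer are
# kept "on GPU"; the rest live in CPU RAM and are fetched on a cache miss.
CACHE_SIZE = 2    # experts resident on GPU per layer (illustrative value)

class ExpertCache:
    """LRU cache standing in for GPU-resident expert weights."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # expert_id -> placeholder "weights"
        self.loads = 0              # number of simulated CPU->GPU transfers

    def get(self, expert_id):
        if expert_id in self.cache:
            # Cache hit: mark as most recently used, no transfer needed.
            self.cache.move_to_end(expert_id)
        else:
            # Cache miss: evict the least recently used expert, then "load".
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)
            self.cache[expert_id] = f"weights[{expert_id}]"
            self.loads += 1
        return self.cache[expert_id]

cache = ExpertCache(CACHE_SIZE)
# Skewed routing trace (top-1 expert chosen per token) to mimic the
# temporal locality that makes caching pay off in real MoE models.
routing_trace = [0, 3, 0, 3, 0, 5, 3, 5]
for expert_id in routing_trace:
    cache.get(expert_id)

print(f"transfers: {cache.loads} (vs. {len(routing_trace)} without a cache)")
```

With this trace, only 4 of the 8 expert lookups require a transfer; the real system adds speculative prefetching and quantization on top of this caching idea.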
-----------------------------------------------------------------------------------------------
Paper page - arxiv.org/abs/2312.17238
Soft MoE - • Soft Mixture of Expert...
Code - github.com/dvmazur/mixtral-of...
Post - aipapersacademy.com/moe-offlo...
-----------------------------------------------------------------------------------------------
✉️ Join the newsletter - aipapersacademy.com/newsletter/
👍 Please like & subscribe if you enjoy this content
We use VideoScribe to edit our videos - tidd.ly/44TZEiX (affiliate)
-----------------------------------------------------------------------------------------------
Chapters:
0:00 Paper Introduction
1:34 Mixture of Experts
3:44 MoE Offloading
10:29 Mixed MoE Quantization
11:13 Inference Speed
Very exciting work! The speed the paper reports won't break any land speed records (2-3 tokens per second), but in my experience one of the most productive and practical applications of LLMs is prompting them with multiple-choice questions, which require only a single output token.
This paper (and the provided code!), bringing GPT-3.5-level inference to local consumer hardware, is a huge breakthrough, and I'm excited to give it a try!
I have been looking for a channel like this for ages, as I hate reading.
I just found your channel. It is amazing. Congratulations. Your numbers will grow soon, I am sure. Great quality and great content.
Thank you 🙏
Thanks! ❤
I believe this is applicable only to a single request? If the experts change across requests, you will most likely have many experts active for the various requests. Is my understanding correct? Thank you.