How to pick a GPU and Inference Engine?

  • Published on 22 Sep 2024

Comments • 38

  • @Moonz97 · months ago · +2

    Awesome overview. Love it. How come llama.cpp wasn't in the inference engine comparison?

    • @TrelisResearch · months ago · +1

      I probably should have included it, although it's slower.
      For Llama 3.1 8B on an H100 SXM it gives:
      batch 1: 90
      batch 64: 15
      This is slower than all the other engines.
      You can see the results here: docs.google.com/spreadsheets/d/15MJAjBoQdFacmNEEwmfqLVRtSgAOqgsn9APFYo_Q59I/edit?usp=sharing
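
      For anyone who wants to reproduce this kind of number, a minimal throughput check against an OpenAI-compatible endpoint (llama.cpp's server, vLLM and SGLang all expose one) could look like the sketch below; the URL and model name are placeholders.

```python
import time
import requests

# Placeholder endpoint; llama.cpp's server, vLLM and SGLang all expose
# an OpenAI-compatible /v1/completions route.
URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "llama-3.1-8b-instruct",  # placeholder model name
    "prompt": "Write a short story about a GPU.",
    "max_tokens": 256,
    "temperature": 0,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=300).json()
elapsed = time.time() - start

# Most servers report the completion token count in the usage block.
generated = resp["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} tokens/s at batch size 1")
```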

    • @Moonz97 · months ago

      @TrelisResearch Appreciate it! Thanks for sharing the results.

  • @abhijitnayak1639 · months ago · +2

    As always top-notch content, love it!!

    • @abhijitnayak1639 · months ago

      Have you compared LLM inference speed with ExLlamaV2 and SGLang?

    • @TrelisResearch · months ago

      I haven't done ExLlamaV2, but if it uses Marlin kernels it will be fast. As to whether it's fast for batching, that depends on the scheduler. The fastest models are the FP8 or INT4 AWQ models, and they can run with SGLang or vLLM.
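
      As a rough illustration of running one of those quantized checkpoints, here is a minimal sketch using vLLM's offline Python API; the AWQ model ID is an assumption, and SGLang offers an equivalent server-based route.

```python
from vllm import LLM, SamplingParams

# Assumed INT4 AWQ checkpoint, used here only for illustration.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",  # vLLM uses Marlin-backed AWQ kernels where the GPU supports them
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```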

  • @anunitb · months ago

    This channel is pure gold. Cheers mate

  • @InstaKane · months ago

    As a DevOps/platform engineer, I learned a lot watching your video! Cheers!

  • @bphilsochill · months ago

    Extremely insightful! Thanks for releasing this.

  • @gody7334-news-co8eq · months ago

    Thanks Trelis, thank you for conducting these experiments; you saved me a lot of time.

  • @Bluzë-o5b · months ago

    Great video

  • @mahermansour1131 · 19 days ago

    This is an amazing video, thank you!

    • @TrelisResearch · 18 days ago

      cheers, you're welcome

  • @BoeroBoy · months ago

    I love this. Interesting that there doesn't seem to be a mention of Google's TPUs though, which were built for lower-precision AI floating-point matrix math.

    • @TrelisResearch · months ago

      I just didn't think of it, but it's a good idea.
      It seems vLLM supports doing this: docs.vllm.ai/en/latest/getting_started/tpu-installation.html#installation-with-tpu
      I also created an issue on SGLang: github.com/sgl-project/sglang/issues/919
      Nvidia does benefit enormously from lots of libraries optimising heavily for CUDA, but I'd keep an open mind as to whether TPUs could still be faster.

  • @rbrowne4255 · months ago

    Excellent overview. With diffusion models and vision transformers, are the performance numbers similar?

    • @TrelisResearch · months ago

      In principle they could be, but these libraries are more focused on text outputs.

  • @fabioaloisio · months ago

    Thanks Ronan, awesome content as always. Suggestion for a future video: an MoA implementation (with 2 or more layers) would be very appreciated. Maybe using small language models (or 8B models!?) with llama.cpp or llama.mojo would achieve performance comparable to some frontier models. I don't know, it's only a hypothesis.

    • @TrelisResearch · months ago · +1

      Yeah, good shout. Interestingly, we've seen a move away from MoE with the Llama models; I suspect the training instability is underappreciated. I'll pin a comment now on llama.cpp.

  • @tikz.-3738 · months ago · +1

    vLLM seemed fastest a month ago, and now SGLang has dropped out of nowhere, or at least that's been my experience, and it's taken over by storm. Kinda crazy. It is missing a lot of features that vLLM has; hopefully both projects learn and improve.

    • @TrelisResearch · months ago

      Yeah SGLang has drawn a ton from vLLM.

  • @sammcj2000 · months ago

    MistralRS would be good to see as a comparison

    • @TrelisResearch · months ago

      have you got a link to that?

  • @peterdecrem5872 · months ago

    Do any of these give the same answer with the same parameters (temperature 0, etc.) as HF TRL or Unsloth? I am struggling to collect usable user feedback for the next iteration of fine-tuning if what the user sees doesn't tie with what the model sees. I suspect it has to do with optimizations/quantizations in the serving. Thanks

    • @TrelisResearch · months ago

      Howdy! Could you clarify your question?
      TRL and unsloth are both for training, whereas here I'm talking about inference

    • @peterdecrem5872 · months ago · +1

      @TrelisResearch Yes. I learned yesterday that the inference might differ because of the KV cache in 16 bits (so reordering) vs HF. The way to reduce this is beam search and higher precision. The rounding and ordering make a difference. You won't see it in averages, but compare individual results and then you will notice. Same model.
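
      One way to check this concretely is to compare greedy (temperature 0) generations from plain Hugging Face transformers against the serving engine, prompt by prompt rather than on average. A rough sketch, with the model name and endpoint as placeholders:

```python
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model name
PROMPT = "List three uses of a KV cache."

# Reference: greedy decoding with plain Hugging Face transformers.
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tok(PROMPT, return_tensors="pt").to(model.device)
ref_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
ref_text = tok.decode(
    ref_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Same prompt at temperature 0 against the serving engine's
# OpenAI-compatible endpoint (placeholder URL).
served = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": MODEL, "prompt": PROMPT, "max_tokens": 64, "temperature": 0},
).json()["choices"][0]["text"]

# Comparing individual generations (not averages) is where KV-cache
# precision and batching order show up as divergence.
print("match:", ref_text.strip() == served.strip())
```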

  • @Max6383-je9ne · months ago

    You only focused on Nvidia cards, but AMD seems to be competing pretty well from what I see, at least in terms of hardware. Is the software support for LLM inference decent enough?
    For example, you could fit 405B on 4x MI300X for a similar price per card to the H100. The AMD card should also beat the H100 in a head-to-head speed comparison, at least if the software efficiency does not disappoint.

    • @TrelisResearch · months ago

      Yeah, those are all fair comments. Running on AMD is a bit less well supported, but it was more a limit on how much I could stuff into this video.
      Indeed, I should do some digging on AMD; it would be interesting to benchmark it head to head against Nvidia.

  • @mahermansour1131 · 19 days ago

    Hi, can you do paid consultation calls and put the link in your description? I would like to book a call with you.

    • @TrelisResearch · 18 days ago

      there's an option on Trelis.com/About

  • @kunalsuri8316 · months ago · +1

    Any idea how well these optimizations work on T4?

    • @TrelisResearch · months ago

      Unfortunately, the T4 is now very old and doesn't support AWQ as far as I know, which means the templates here won't work well.
      Your best bet may be to run a llama.cpp server; you can check the one-click-llms repo.

    • @TrelisResearch · months ago

      You can also run TGI using --quantize bitsandbytes-nf4 or --quantize eetq (for 8-bit), but both will be slower than AWQ.
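
      For reference, a sketch of that setup: the TGI container launched with one of those --quantize flags (shown as a comment; the model ID is a placeholder), then queried from Python via its /generate route.

```python
# Start TGI first (shell), e.g. with NF4 quantization for a T4:
#   docker run --gpus all --shm-size 1g -p 8080:80 \
#     ghcr.io/huggingface/text-generation-inference:latest \
#     --model-id meta-llama/Llama-3.1-8B-Instruct \
#     --quantize bitsandbytes-nf4
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What does AWQ quantization do?",
        "parameters": {"max_new_tokens": 128},
    },
    timeout=120,
)
print(resp.json()["generated_text"])
```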

  • @TemporaryForstudy · months ago

    Hey Trelis, I have a question about API development with LLMs. Suppose my LLM takes 2 GB of RAM when loaded, and we can then run inference from the model. If I make an API and two requests come in at the same time, would another 2 GB copy of the model be loaded into RAM, or would the second request wait for the first to complete? And what if we get 100 requests? I just need to determine the RAM size I would need.

    • @TrelisResearch · months ago

      Actually, if you have multiple requests, they use the same weights.
      Calculations are done in parallel and the model weights are only read in once (per parallel calculation).
      VRAM usage does increase a bit as you increase batch size, but this is because more activations (layer outputs) are stored for each of the input sequences. This tends to be small relative to the model weights.
      The point of these inference engines is that they handle everything so sequences can be processed in parallel, including when one request starts after another; this just means, for example, that the fifth token of the first request might be processed in parallel with the first token of the second request.
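
      To make that concrete, here is a minimal sketch that fires many requests at a single vLLM or SGLang server concurrently; the weights are loaded once and the engine batches the sequences. The endpoint and model name are placeholders.

```python
import asyncio
from openai import AsyncOpenAI

# One server, one copy of the weights; the engine interleaves the requests
# (placeholder endpoint and model name).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(question: str) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": question}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    questions = [f"Give me one fact about GPU number {i}." for i in range(100)]
    # 100 concurrent requests do not load 100 copies of the model;
    # memory grows only with the extra activations and KV cache.
    answers = await asyncio.gather(*(ask(q) for q in questions))
    print(len(answers), "answers received")

asyncio.run(main())
```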

    • @TemporaryForstudy · months ago

      @TrelisResearch Okay, so does this come with the Hugging Face model, or do I need to write code myself to do all these things?

    • @TrelisResearch · months ago

      @TemporaryForstudy No need to write code; that's the point of using these inference engines like SGLang or vLLM. If you want to rent a GPU, the best approach is to use a Docker image, like the one-click templates I show. If you own a GPU, it's best to install SGLang or vLLM on your computer; they handle batching.