Comments •

  • @six1free
    @six1free 23 days ago +1

    hands down one of the best YouTube channels out there - and I'm not just saying that because you featured my question :D I really do love how thoroughly you've taken to answering it.
    .. this being the pause point... I'm going to guess that CUDA will do it all for you ("as if" - I'm sure :D)
    I am so envious of your test rig... as it is, though, I'd need a data center for power... as for adding the other cards, I'll research tensors further and rewatch this video when applicable :D - downloaded and saved to my good tutorials (very long) playlist... enjoy the well-deserved follow-through.

    • @RoboTFAI
      @RoboTFAI 21 days ago +1

      Thanks for the idea!

  • @jackflash6377
    @jackflash6377 19 days ago +1

    Outstanding!
    Glad I found this channel.
    Thank you sir.

    • @RoboTFAI
      @RoboTFAI 18 days ago

      Thanks for watching!

  • @246rs246
    @246rs246 22 days ago +2

    I'm blown away by this comprehensive answer to my question. Thumbs up and I'm looking forward to more interesting videos.

    • @RoboTFAI
      @RoboTFAI 21 days ago +1

      Awesome, thank you!

  • @kevinclark1466
    @kevinclark1466 7 days ago +1

    Great video! Looking forward to trying this…

    • @RoboTFAI
      @RoboTFAI 6 days ago

      Have fun!

  • @SphereNZ
    @SphereNZ 17 days ago

    Great video, great info, really appreciate it, thanks.

    • @RoboTFAI
      @RoboTFAI 16 days ago

      Appreciated!

  • @AkhilBehl
    @AkhilBehl 23 days ago +3

    This is absolutely awesome stuff.

    • @RoboTFAI
      @RoboTFAI 21 days ago

      Thanks!

  • @CoderJon
    @CoderJon 8 days ago

    Love your videos. I appreciate that you leave the interpretation of the results to us, but I would love a video talking about your interpretations of the data. For example: why your results for prompt tokens per second were higher with the 90/10 split. I assume it's because there is some sort of parallel processing happening in the evaluation of the prompt, but I am still new to the AI world, so I would love the education.

    • @RoboTFAI
      @RoboTFAI 8 days ago

      Much appreciated! I try to keep my mouth shut and let the data show the info. I'm definitely not an expert - just learning like everyone else. I never intended to create an actual channel: the first video was to settle a conversation with friends with hard data, the testing app is for other uses in my lab, etc. It's just turning into a place where we can all share some data and learn from it, or at least burn some of my power bill together!
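
      A possible intuition for the prompt-speed question above, as a toy sketch: prompt (prefill) tokens are all known up front, so they can be evaluated in large batches the GPU parallelizes across, while generated tokens come out strictly one at a time. All numbers below are made up for illustration, and a real batched pass over 512 tokens costs more than a single-token pass - just far less than 512x more:

      ```python
      # Toy prefill-vs-generation throughput model (illustrative numbers only).
      # Prefill amortizes one forward pass over many prompt tokens; generation
      # pays a full forward pass per generated token.

      def prefill_tps(batch_tokens: int, pass_seconds: float) -> float:
          """Tokens/s when batch_tokens are evaluated in one batched pass."""
          return batch_tokens / pass_seconds

      def generation_tps(pass_seconds: float) -> float:
          """Tokens/s when every token needs its own forward pass."""
          return 1 / pass_seconds

      # Hypothetical GPU doing one forward pass in 50 ms:
      print(prefill_tps(512, 0.05))  # batched prompt evaluation
      print(generation_tps(0.05))    # token-by-token generation
      ```

      This batching is why benchmarks report prompt tokens/s and generated tokens/s separately, and why the two can differ by orders of magnitude.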

  • @andre-le-bone-aparte
    @andre-le-bone-aparte 21 days ago +1

    Question: @03:14 - NVTOP is showing 90+ degrees (86 on the M40) Fahrenheit on each of those cards... WITHOUT any active usage?
    - That seems excessive. I'm currently running a 4x3090 setup at 79 degrees or lower in between queries.

    • @RoboTFAI
      @RoboTFAI 20 days ago +1

      The 4060s are stacked against each other on the bench node in this test (I don't recommend that - they could use space between them since they have side-facing fans, which is why I normally use a lot of PCIe extenders) and they don't run their fans unless there is a load. The M40 in this test has an active fan on all the time. Also, I live in a hot climate and it's been 85-100 degrees (75+ in the workshop, as it's not conditioned)🔥

    • @andre-le-bone-aparte
      @andre-le-bone-aparte 20 days ago +2

      @@RoboTFAI 👍- Just looking to learn ways to extend the life of these GPUs and increase performance for LLM usage when running them 10 hours a day (work day, remote work, as a code assistant)

  • @mbike314
    @mbike314 11 days ago

    Thank you for creating this valuable content. I am pleased to have discovered it. I am interested in some of the 4060s you mentioned. I sent an email.
    Please keep going with this channel!
    Wonderful stuff!

    • @RoboTFAI
      @RoboTFAI 7 days ago

      Thanks a ton! I didn't see any email - reach out to robot@robotf.ai or ping me on Reddit, etc.

    • @mbike314
      @mbike314 1 day ago

      Thank you. I did send it to the wrong address. Just resent it to the correct address.

  • @tbranch227
    @tbranch227 11 days ago

    Can you run a larger model when you span cards? Or does your model need to be able to fit on each card that you tensor-split across? And what happens to performance if you can run larger models by aggregating card RAM?

    • @RoboTFAI
      @RoboTFAI 9 days ago

      You can absolutely span a larger model between cards - these tests are actually doing that! Performance depends on the cards you are splitting between, but it will land somewhere between your lowest-end and highest-end card (if they are different models). Running multiple cards doesn't necessarily increase performance; it's really for expanding your VRAM capacity.
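
      To make the reply above concrete, here is a minimal sketch (layer counts and VRAM sizes are hypothetical) of how a llama.cpp-style tensor split might assign a model's layers across GPUs in proportion to each card's VRAM:

      ```python
      def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
          """Assign layers to GPUs proportionally to VRAM (largest-remainder rounding)."""
          total = sum(vram_gb)
          raw = [n_layers * v / total for v in vram_gb]
          counts = [int(r) for r in raw]
          # Hand leftover layers to the GPUs with the largest fractional remainders.
          leftovers = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
          for i in leftovers[: n_layers - sum(counts)]:
              counts[i] += 1
          return counts

      # e.g. a 32-layer model across a 16 GB card and a 24 GB card:
      print(split_layers(32, [16, 24]))  # -> [13, 19]
      ```

      Each token still flows through every layer in sequence, which is why splitting adds capacity rather than speed, and why the slowest card in the chain drags down the whole model's throughput.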

  • @tsclly2377
    @tsclly2377 22 days ago

    I think loading is still an important factor, so do you use NVMe drives, like the large, high-endurance Optane 900P series, for fast loads? And FPGAs for pre-staging data (like video, pictures) reconstructed in a faster-to-use form?

    • @RoboTFAI
      @RoboTFAI 21 days ago

      I normally leave the unloaded-model test off, as it doesn't allow as much resolution in the smaller charts. I use Gen 4 NVMe M.2 drives in each of these systems (rated up to 5000/4800 MB/s... yeah, right).

    • @Zeroduckies
      @Zeroduckies 15 days ago

      Or you can get 1 TB of RAM and have a 500 GB ramdisk ^^
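
      For anyone wanting to try the ramdisk idea, a minimal Linux sketch (sizes are examples, the model path is hypothetical, and tmpfs contents vanish on reboot):

      ```shell
      # Create a 500 GB tmpfs ramdisk (requires root)
      sudo mkdir -p /mnt/ramdisk
      sudo mount -t tmpfs -o size=500G tmpfs /mnt/ramdisk

      # Stage the model there once; subsequent loads come from RAM, not disk
      cp ~/models/model.gguf /mnt/ramdisk/
      ```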

    • @tsclly2377
      @tsclly2377 13 days ago

      @@Zeroduckies With HP ML350p machines you only get up to 768 GB of DRAM, and it has to be LRDIMM. That RAM runs on three channels, which actually slows it down compared to a 2-channel 256 GB configuration because of the required interleaving and processing - it is all in the specification PDF from HP. Only with the Gen11 model do you actually get significantly faster RAM (PCIe 5.0 - HP skipped the 4.0 architecture in these machines) and larger capacity, at an astronomical increase in price.

      So when you get a 'loaded' 256 GB DRAM ML350p Gen8 in a trade for an older gamer machine with a GTX 1660 Ti and a pre-10th-gen i7 (about a $300 value), you have to look for a fast, economical memory solution, and that is where the Optane 900P cards come in (with their ~4000 MB/s bursts). You also have to compare that against the rate the GPU can actually take data in - so it is a cheap way to move data in (and out) at DRAM-comparable rates, while only occupying a PCIe x4 slot.

      That is all fine and dandy, but on dual-CPU chipsets the PCIe lanes go all over the place, and that is a major consideration: the right and left sides are controlled by different CPUs, and SLI or NVLink can be required for the OS to recognize the linked GPU cards, which is inherently required for proper function logging... as are the PCIe controllers on these machines. They are going to be slower than single-CPU motherboards specifically designed for this - like the multi-PCIe-x16 SuperMicro or Gigabyte professional models that have come out for exactly this type of application and use NVMe arrays for storage - and then you are back to the number of writes being applied to the storage.