LocalAI LLM Single vs Multi GPU Testing scaling to 6x 4060TI 16GB GPUS

  • Published Jan 22, 2025

Comments • 84

  • @nlay42
    @nlay42 3 months ago +2

    New sub, thanks! I was researching the 4060Ti for a future purchase and your video popped up in my feed. Great content.

    • @RoboTFAI
      @RoboTFAI  3 months ago

      Hey thanks for the sub! You might want to check out our Mistral leaderboard video series also and our Mistral leaderboard at robotf.ai/Mistral_7B_Leaderboard for ideas on different hardware/costs/etc!

  • @NevsTechBits
    @NevsTechBits 7 months ago +8

    Thank you for this contribution to the internet Brother! I have learned from this. Subbed, liked and commented to show support. Good stuff sir! This will help us all.

    • @RoboTFAI
      @RoboTFAI  7 months ago +2

      I appreciate that!

  • @hienngo6730
    @hienngo6730 4 months ago +21

    If your model fits on a single GPU (which your 13B Q_K_S model does), there's no benefit to running it across multiple GPUs. In fact, at best you'll be flat performance wise spreading the model across more than one GPU. Generally, there will be a slight performance penalty having to coordinate across GPUs for the LLM inference. The primary benefit for multiple GPUs is to run bigger models like a 70B or Mixtral 8x7B that do not fit on a single GPU, or to run batched inference using vLLM.
    The smaller the model (7B/8B and below in particular), the more impact the single-threaded CPU performance will have on the tokens per second speed. For a LLaMA-2 13B Q5_K_S model on an Intel i9 13900K + 4090 for example, I get 82 tokens per second:
    llama_print_timings: eval time = 11228.67 ms / 921 runs ( 12.19 ms per token, 82.02 tokens per second)
    On the same machine using a 3090, 71 t/s:
    llama_print_timings: eval time = 8536.02 ms / 614 runs ( 13.90 ms per token, 71.93 tokens per second)
    If you took one of those 4060 Ti cards and put it into a gaming PC with a current gen i7/i9 or Ryzen X3D CPU, you should see a big improvement in tokens per second.
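
    A minimal sketch (Python) of the arithmetic behind the figures quoted above, assuming the eval time covers exactly the counted generated tokens:

    def tokens_per_second(eval_time_ms: float, n_tokens: int) -> float:
        # tokens/s = generated tokens divided by eval wall time in seconds
        return n_tokens / (eval_time_ms / 1000.0)

    print(tokens_per_second(11228.67, 921))  # ~82.0 t/s (the i9 13900K + 4090 numbers above)
    print(tokens_per_second(8536.02, 614))   # ~71.9 t/s (same machine with the 3090)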

    • @moozoo2589
      @moozoo2589 4 months ago +1

      How is your 13900K CPU loaded when running purely on the 4090? Are multiple cores helping, or is there a bottleneck on single-threaded performance?

    • @ribeiro4642
      @ribeiro4642 a month ago +2

      It depends on the objective...
      If you want more tokens per second, you need more GPUs in parallel.

    • @CaridorcTergilti
      @CaridorcTergilti a month ago +1

      I suggest an upgrade to Llama 3: the 8B is much better than the Llama 2 13B and much faster.

  • @clajmate69
    @clajmate69 a month ago +3

    So having more GPUs won't help that much? Is it better to stick with one solid GPU?

  • @andre-le-bone-aparte
    @andre-le-bone-aparte 7 months ago +6

    Just found your channel. Excellent Content - Another subscriber for you sir!

    • @RoboTFAI
      @RoboTFAI  7 months ago +2

      Welcome aboard!

  • @gaius100bc
    @gaius100bc 6 months ago +6

    I came here expecting to see some tests actually utilizing that combined VRAM, and now I'm left confused about what the point of this video was.
    Why would anyone expect speed to be different running the exact same model with the exact same prompt size on 1 GPU versus 6 GPUs?

    • @RoboTFAI
      @RoboTFAI  6 months ago +1

      haha I said the same exact thing to my friends (which is where this video came from), and this was an attempt to settle that convo

    • @luisff7030
      @luisff7030 5 months ago +1

      Because the work is divided between all the GPUs, and they are running at the same time.
      It's the same logic as why a 6-core CPU is faster when we enable all of the cores.

    • @salukage
      @salukage 3 months ago +1

      @@RoboTFAI I was curious too, glad that was settled

  • @animecharacter5866
    @animecharacter5866 5 months ago +1

    Amazing video, it was really helpful! Keep it up!

    • @RoboTFAI
      @RoboTFAI  5 months ago +1

      Glad you liked it!

  • @TazzSmk
    @TazzSmk 5 months ago +1

    So, is there an additional bottleneck from NVMe storage when "splitting" the load into the VRAM of multiple GPUs? I mean, at least for the initial loading stage...

  • @andre-le-bone-aparte
    @andre-le-bone-aparte 7 months ago +9

    Question: Do you have a video going through the Kubernetes setup for using multiple GPUs? Would be helpful for those just starting out.

    • @RoboTFAI
      @RoboTFAI  7 months ago +6

      No, but I def could put one together. Thanks for the idea!

  • @milutinke
    @milutinke 6 months ago +5

    Can you try out Llama 70B? Thanks.

    • @RoboTFAI
      @RoboTFAI  6 months ago +5

      sure, coming up

    • @milutinke
      @milutinke 6 months ago +1

      @@RoboTFAI Awesome, thank you.

  • @jackinsights
    @jackinsights 8 months ago +8

    Hey RoboTF, my thinking is that 6x 16GB 4060 Ti's for a total of 96GB of VRAM will allow you to run 130B params (Q4) and easily run any 70B model unquantised.

    • @RoboTFAI
      @RoboTFAI  8 months ago +4

      Yea I run mostly 70b+ models in my work/playing for more serious things. It however depends on the quant, and how much context you set the model up for. Newer Llama 3 models with huge context will eat your VRAM real fast. If you mix offloading to VRAM, and RAM (slower) you could run some very large models.
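
      A rough, weights-only back-of-envelope for those sizes (a sketch only; Q4 variants average roughly 4.5 bits per weight, and real usage adds KV cache and runtime overhead on top):

      def weights_gb(params_billion: float, bits_per_weight: float) -> float:
          # Approximate VRAM taken by the weights alone, in GB
          return params_billion * 1e9 * bits_per_weight / 8 / 1e9

      print(weights_gb(130, 4.5))  # ~73 GB  -> a 130B Q4 plausibly fits in 96 GB
      print(weights_gb(70, 4.5))   # ~39 GB  -> a 70B Q4 fits with room for context
      print(weights_gb(70, 16.0))  # ~140 GB -> an unquantised (fp16) 70B needs RAM offload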

    • @jamegumb7298
      @jamegumb7298 5 months ago +1

      @@RoboTFAI I always see LLM use where many are trying it out on the desktop for pictures, answers, and essays, but can this be used for accurate speech recognition on the desktop? What kind of parameters would one be looking at?
      Nothing fancy, just accurately understanding spoken commands. Would it be harder or easier if one were to try and command an average formula one would enter into Wolfram Alpha?

  • @six1free
    @six1free 8 months ago +2

    Thank you very much for this, it must have cost a small fortune

    • @RoboTFAI
      @RoboTFAI  7 months ago +2

      Thanks! It did cost a bit, but not as much as my other node you folks haven't seen yet - coming soon!

    • @six1free
      @six1free 7 months ago +1

      @@RoboTFAI What single card would you recommend, considering I'm looking for the 75% - not state-of-the-art mega bucks... but able to keep up. 4080? Will the 4070 have hope? (specifically thinking huge context)

    • @RoboTFAI
      @RoboTFAI  7 months ago +1

      Hmm, always depends on use case and the models you want to run - and whether you have lots of RAM to load into if there's not enough VRAM. I don't have a 4070/4080 to run tests with... but I could run some tests of 4060's vs A4500's (more enterprise-level cards that can be had fairly cheap these days for what they are)

    • @six1free
      @six1free 7 months ago

      @@RoboTFAI you don't think a 4080 would have been cheaper than 6x 4060's? ... or hindsight?

    • @RoboTFAI
      @RoboTFAI  7 months ago +4

      I am sure it would have been if looking for a large-VRAM single card with speed, however that is not my specific use case with these cards. I use these to run multiple smaller LLM models (7b/13b) in parallel and am not concerned about having them move as fast as possible, just decently. Agent-based/bot workflows/etc and playing around. For the much larger models I use in my daily work, I have another GPU node with several A4500's (20GB VRAM) for running 70b+ models at speed. That's why I run this all in Kubernetes, so multiple services can use multiple LLM models across different GPUs as needed, in parallel. And yes I am crazy....
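
      As a rough illustration of that pattern (a hedged sketch; the service URLs and model names are hypothetical), each small model runs as its own LocalAI service and gets queried concurrently through the OpenAI-compatible API:

      import concurrent.futures
      import requests

      # One LocalAI endpoint per small model / GPU (hypothetical Kubernetes service names)
      SERVICES = {
          "summarizer": ("http://localai-7b-a:8080/v1/chat/completions", "mistral-7b"),
          "classifier": ("http://localai-7b-b:8080/v1/chat/completions", "llama-2-13b"),
      }

      def ask(url: str, model: str, prompt: str) -> str:
          resp = requests.post(url, json={
              "model": model,
              "messages": [{"role": "user", "content": prompt}],
          }, timeout=120)
          return resp.json()["choices"][0]["message"]["content"]

      # Fan the prompts out in parallel; each request lands on a different model/GPU
      with concurrent.futures.ThreadPoolExecutor() as pool:
          futures = {name: pool.submit(ask, url, model, "Summarize: ...")
                     for name, (url, model) in SERVICES.items()}
          results = {name: f.result() for name, f in futures.items()}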

  • @xtvst
    @xtvst 3 months ago +1

    Noobie question, can you use the same 6x 4060TI 16GB for model training (as in, would it make a total of 96GB of available memory) in order to overcome memory limitations on a single GPU?

    • @RoboTFAI
      @RoboTFAI  3 months ago +1

      Yep, depending on the software you are using and its interaction with CUDA.

  • @GraveUypo
    @GraveUypo 3 months ago +2

    A reminder - you could get a GPU with lots of memory banks and upgrade the memory chips on it to have a huge-capacity card. You could, for instance, mod a 12GB RTX 3080 to have 24GB.

  • @jasonn5196
    @jasonn5196 8 months ago +2

    Doesn't look like it split the workload well.
    It could have sent an iteration simultaneously to each GPU, but it doesn't look like it does that.

  • @maxh96-yanz77
    @maxh96-yanz77 9 months ago +10

    Your test is very interesting. I use an old GTX 1060 with 6GB to run Ollama or LM Studio; the most significant performance impact is the size of the LLM model. And BTW, I tried to find someone who has tested the RX 7900 XT using ROCm - did not find any on all of YouTube.

    • @RoboTFAI
      @RoboTFAI  8 months ago +1

      I agree it's about LLM quant/size much more than performance. I don't have any AMD cards to test with, but would if I did. I do have several generations of Nvidia cards I might run through some tests with.

    • @blackhorseteck8381
      @blackhorseteck8381 8 months ago +4

      The RX 7900 XT (not the XTX) should be about on par with the RTX 3080 Ti to 3090 (around 23 tk/s), info sourced from a Reddit post (can't send links on YT sadly). For a point of comparison my RX 6700 XT hovers around tk/s. Hope this helps, and isn't too late :)

    • @malloott
      @malloott 7 months ago

      One number is missing, but thanks, it helps me! @@blackhorseteck8381

    • @shiro3146
      @shiro3146 4 months ago +1

      @@blackhorseteck8381 I'm sorry, around what tk/s?

    • @blackhorseteck8381
      @blackhorseteck8381 4 months ago +1

      @@shiro3146 Sorry, it sits around 13 tk/s in Llama 3 for me (I have 2 of them and they average about the same, 11 to 15, depending on the prompt and the length of the reply).

  • @alptraum360
    @alptraum360 6 months ago +2

    Just found your channel, this is some amazing stuff. Please do post a video of your Kubernetes setup. I'm currently working on designing an AI-powered helpdesk with SIEM and just really getting in deep with the ML/AI areas... really love playing with all this.

    • @RoboTFAI
      @RoboTFAI  6 months ago +2

      Much appreciated! A video on how I run it all in Kubernetes is coming soon! Now if I could get AI to do the videos for me while I eat pizza that would be great.

    • @Momo-Oki
      @Momo-Oki 6 months ago +1

      @@RoboTFAI Very interested in this as well. Especially some of your other videos that show disconnecting and connecting multiple GPUs. And using Grafana/Prometheus. Haven't found anything like your channel. Keep posting these unique videos!

  • @akirakudo5950
    @akirakudo5950 7 months ago +2

    Hi, thanks for sharing a great video! Would you please also share the hardware list used for this test if you get a chance? I am very interested in how the GPUs are connected to the mainboard.

    • @RoboTFAI
      @RoboTFAI  7 months ago +2

      Sure, I do have the specs listed on the app, but happy to do a follow-up video on the lab machines I use for these types of tests and the hardware involved in running that many GPUs on a single node.

    • @akirakudo5950
      @akirakudo5950 7 months ago +1

      Thanks for replying! I look forward to watching your new videos, cheers!

  • @jeroenadamdevenijn4067
    @jeroenadamdevenijn4067 7 months ago +2

    I'm curious about my use case, which is coding (needs high enough quantization). What t/s would I get with Codestral 22b 6-bit quantized with a moderate context size? 2x 4060TI 16GB should be enough for that and leave plenty of room for context. And secondly, what would be the speed penalty when going for 8-bit quantized instead of 6-bit? Around 33%, or am I wrong?

    • @RoboTFAI
      @RoboTFAI  7 months ago +2

      I could throw together a quick test with some of the 4060's and Codestral 22b at different quant levels. Tokens per second isn't just about the hardware, but also the model (and model architecture, MoE/etc), context size, etc.
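
      A rough, hedged estimate for the 6-bit vs 8-bit question: single-stream decoding is mostly memory-bandwidth bound, so tokens/s scales roughly with the inverse of model size; the bits-per-weight figures here are approximate llama.cpp values (Q6_K ~6.56, Q8_0 ~8.5):

      PARAMS = 22e9                         # Codestral 22B
      Q6_BPW, Q8_BPW = 6.56, 8.5            # approximate bits per weight

      size_q6 = PARAMS * Q6_BPW / 8 / 1e9   # ~18 GB -> fits in 2x 16 GB with room for context
      size_q8 = PARAMS * Q8_BPW / 8 / 1e9   # ~23 GB
      slowdown = 1 - size_q6 / size_q8      # ~0.23 -> expect roughly 20-25% fewer tokens/s at Q8
      print(size_q6, size_q8, slowdown)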

  • @stuffinfinland
    @stuffinfinland 3 months ago +1

    I was waiting for big models to be run on these :) How about 6x A770 16GB GPUs?

  • @dfcastro
    @dfcastro 21 days ago

    When you run multi-GPU, I am assuming no SLI setup is required. Is that right?

    • @RoboTFAI
      @RoboTFAI  21 days ago

      That is correct - no SLI required!

  • @CelsiusAray
    @CelsiusAray 7 months ago +2

    Thank you. Can you try larger models?

    • @RoboTFAI
      @RoboTFAI  7 months ago +1

      Certainly can show some tests with bigger models - if you have specifics, let me know! For this test 13b was just enough to fill one card and show results when split. Happy to do model vs model comparisons also.

    • @kyu9649
      @kyu9649 6 months ago +1

      @@RoboTFAI Try the Llama 405B when it comes out :D Although not sure if you can run it on this rig.

  • @Minotaurus007
    @Minotaurus007 20 days ago

    The intent of using multiple GPUs is to run larger models, so this test is not very useful and the results are not surprising. What about 33B or 70B models with this setup?

  • @aminvand
    @aminvand 7 months ago +3

    Thanks for the content. So two 16GB GPUs will act as 32GB for the models?

    • @RoboTFAI
      @RoboTFAI  7 months ago +5

      Yes! Most software for running LLMs/training/etc supports splitting the model between multiple GPUs; the newest version of LocalAI (llama.cpp under the hood) even supports splitting models across multiple machines over the network.
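
      For a concrete flavour of that split, here is a minimal sketch against llama-cpp-python (the same llama.cpp layer LocalAI drives); the model path and split ratios are hypothetical:

      from llama_cpp import Llama

      llm = Llama(
          model_path="models/llama-2-13b.Q5_K_S.gguf",  # hypothetical local GGUF path
          n_gpu_layers=-1,          # offload every layer to GPU
          tensor_split=[0.5, 0.5],  # spread the tensors roughly evenly over two cards
      )
      out = llm("Why doesn't splitting speed up a model that already fits on one GPU?",
                max_tokens=64)
      print(out["choices"][0]["text"])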

  • @rhadiem
    @rhadiem 4 months ago +1

    Just saw this, thanks for testing. I already have a 4090, but definitely chasing the almighty VRAM for testing bigger models and running different things at one time. What would you recommend for system RAM to run a 6x GPU setup like this?

    • @RoboTFAI
      @RoboTFAI  4 months ago +2

      That highly depends on your needs, wants, and wallet - if you want the ability to offload to RAM for any reason (super large models, or keeping KV in RAM), the more RAM the better. Otherwise, if offloading fully to the GPUs, you only really need enough to run the processes, which is fairly minimal.
      My rigs are Kubernetes nodes, so I keep them stacked to be able to do anything with them, not just inference/etc.

    • @hienngo6730
      @hienngo6730 4 months ago +1

      The biggest issue you'll run into is the lack of PCIe lanes on the motherboard. If you have a consumer motherboard with recent gen Intel or AMD CPU, you will only be able to run 2 GPUs, 3 if you're lucky. You will likely need to use PCIe extension cables to space out your GPUs as consumer GPUs are generally 3+ slots wide, so you will get no cooling if you stack them next to each other.
      Once you go over 2 GPUs, you have to go into workstation or server motherboards to get the needed PCIe lanes; so EPYC, XEON, or Threadripper motherboards are needed. Best to buy used from eBay if your wallet doesn't support brand new Threadripper setup.
      Best bang for the buck will be a used EPYC MB + used 3090s. You'll need multiple 1000+ W power supplies to feed the GPUs as well (I run with 2x 1600W).

  • @data_coredata_middle
    @data_coredata_middle 7 months ago +1

    What is your PC setup to support 6 GPU cards? Will greatly appreciate it, as I would like to set up something similar :)

    • @RoboTFAI
      @RoboTFAI  7 months ago +3

      I go over some of the specs of my GPU nodes in other videos, but plan to do an overview video on each node soon!

    • @data_coredata_middle
      @data_coredata_middle 7 months ago +1

      @@RoboTFAI that would be great thanks for the response :)

  • @GraveUypo
    @GraveUypo 3 months ago +2

    This is probably due to extreme PCIe bandwidth bottlenecking. Putting a ton of GPUs together without the bandwidth to push the data to them and back yields no improvement.

  • @tsclly2377
    @tsclly2377 6 months ago +2

    Two to three cards give better consistency, but it seems better to run them as individual nodes working on different problems, so the real test would be a more complex problem that has 6 parts... and load times are still part of the equation. Some type of problem that your LLM has shown to have some fault with (not 95% correct, more like 67-75%), and run each 3 times... so the single GPU should take 6 times longer, but from what I have seen (other reports) 6 GPUs will run only 5.5 times faster. Still, when problem solving, this is a time-saver. Total memory is also a real consideration, along with whether you are running PCIe 4 vs 5. I've read that the slower your bus speed, the more advantageous it is to have a bigger-VRAM GPU.

  • @JarkkoHautakorpi
    @JarkkoHautakorpi 4 months ago

    Instead of Llama 13b, please try for example llama3.1:70b-instruct-q8_0 to use all the VRAM?

    • @RoboTFAI
      @RoboTFAI  4 months ago +1

      Yep, for sure, we do that in a lot of other videos. This one was just to focus on the point that scaling out GPUs isn't for speed, it's for memory capacity - and to settle a convo amongst some friends.

  • @Matlockization
    @Matlockization 4 months ago +2

    Great video, but I need a magnifying glass to make out the names and numbers of anything.

    • @RoboTFAI
      @RoboTFAI  4 months ago +1

      Thanks for the feedback, I've tried to do better for those with smaller screens in more recent videos.

    • @Matlockization
      @Matlockization 4 months ago +1

      @@RoboTFAI Thank you.

  • @Ray88G
    @Ray88G 4 months ago

    Would 4x 3090's perform better?

    • @RoboTFAI
      @RoboTFAI  4 months ago

      absolutely, but def 💰 involved. I have another video (and more coming) that pits a 3090 vs some other cards/etc if interested.

  • @steveseybolt
    @steveseybolt 5 months ago

    why??????

  • @jackflash6377
    @jackflash6377 6 months ago

    Would like to see a test of a larger LLM spread across multiple GPUs since that is what I'm wanting to do.

    • @RoboTFAI
      @RoboTFAI  6 months ago

      Lots coming!

    • @marekkroplewski6760
      @marekkroplewski6760 6 months ago +1

      I second that. Splitting one small model over many GPUs, since it fits on one, is less exciting than fitting bigger models and comparing on the same prompt. Great channel!

    • @marekkroplewski6760
      @marekkroplewski6760 6 months ago +1

      So it's either one bigger model, or using many instances of the same model in parallel on this one rig. So if you find a model that works for you, now you want to serve it efficiently in parallel and max out your hardware.