3090 vs 4090 Local AI Server LLM Inference Speed Comparison on Ollama

  • Published Jan 19, 2025

Comments • 92

  • @DigitalSpaceport
    @DigitalSpaceport  a month ago

    AI Hardware Writeup digitalspaceport.com/homelab-ai-server-rig-tips-tricks-gotchas-and-takeaways

  • @I1Say2The3Truth4All
    @I1Say2The3Truth4All 3 months ago +25

    Looks like if the GPU use case is LLMs, then the 3090 will be hugely economical! Great comparison with surprising results indeed, and exactly what I wanted to see. Thank you! :)

    • @critical-shopper
      @critical-shopper 3 months ago +7

      llama3.2:3b-instruct-fp16
      105 tps 4090 Strix
      95 tps 3090 PNY
      gemma2:27b-instruct-q5_K_M
      38.4 tps 4090 Strix
      34.5 tps 3090 PNY
      I can buy four 3090s for the price of the 4090 Strix.
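
      A minimal sketch for reproducing figures like these against a local Ollama server, assuming the default port 11434, that the two tags above are already pulled, and an illustrative prompt; the eval_count and eval_duration fields of the /api/generate response give the generation rate:

        import requests

        OLLAMA = "http://localhost:11434"

        def tokens_per_second(model: str, prompt: str) -> float:
            # Non-streaming generate call; the final JSON carries timing stats.
            r = requests.post(f"{OLLAMA}/api/generate",
                              json={"model": model, "prompt": prompt, "stream": False},
                              timeout=600)
            r.raise_for_status()
            s = r.json()
            # eval_count = generated tokens, eval_duration = generation time in nanoseconds
            return s["eval_count"] / s["eval_duration"] * 1e9

        for model in ("llama3.2:3b-instruct-fp16", "gemma2:27b-instruct-q5_K_M"):
            tps = tokens_per_second(model, "Write a short story about a robot.")
            print(f"{model}: {tps:.1f} t/s")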

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago +4

      In my instance the 2x 4090s equaled the price of 4x 3090s plus the new pads that I had to apply to them. If you buy used, lower-priced 3090s, you should expect to clean and repad them imo.

    • @fcmancos884
      @fcmancos884 3 months ago +1

      @@critical-shopper Around 10-11% more. What about a 3090 with a memory OC? The 3090's memory is 19.5 Gbps and the 4090's is 21 Gbps...

    • @ken860000
      @ken860000 2 months ago

      @@DigitalSpaceport I am looking for a motherboard to use for AI training and inference. I have a question: if I have 2x 3090s, do both PCIe slots need to connect to the CPU directly, or can one slot connect to the CPU and the other to the chipset?
      It is too expensive to get a motherboard that has two slots connected directly to the CPU. ;(

    • @Larimuss
      @Larimuss a month ago

      Yeah, diffusion and training are the real benchmark. For LLMs it's not gonna matter, only VRAM, and they're both the same there.

  • @claybford
    @claybford 3 months ago +2

    THANK YOU for making a chart! I've been poking around your videos trying to get some straightforward info on how GPU/quantity affects TPS in inference, and this is super helpful

  • @arnes12345
    @arnes12345 3 months ago +4

    Great comparisons! Just one minor note: You mentioned that the last question is to check how fast it is with a short question. But when you ask it in the same thread as everything else, Open WebUI will pass that entire thread as input context + your short question. So you're really testing speed at long context size at that point. The token/sec numbers being consistently slightly lower for the long story and the final question confirms that. TL;DR - Start a new chat to test short questions.
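
    To see the effect directly, a rough sketch that sends the same short question once in a fresh chat and once appended to a long prior exchange via Ollama's /api/chat (default port, an illustrative model tag, and made-up messages assumed):

      import requests

      OLLAMA = "http://localhost:11434"
      MODEL = "llama3.2:3b-instruct-fp16"

      def chat_stats(messages):
          r = requests.post(f"{OLLAMA}/api/chat",
                            json={"model": MODEL, "messages": messages, "stream": False},
                            timeout=600)
          r.raise_for_status()
          s = r.json()
          gen_tps = s["eval_count"] / s["eval_duration"] * 1e9    # generation speed
          prompt_ms = s.get("prompt_eval_duration", 0) / 1e6      # time spent parsing input context
          return gen_tps, prompt_ms

      short_q = [{"role": "user", "content": "How many moons does Mars have?"}]
      history = [{"role": "user", "content": "Tell me a very long story about space travel."},
                 {"role": "assistant", "content": "Once upon a time, " * 800}]

      print("fresh chat:      %.1f t/s, prompt eval %.0f ms" % chat_stats(short_q))
      print("after long chat: %.1f t/s, prompt eval %.0f ms" % chat_stats(history + short_q))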

    • @ScottDrake-hl6fk
      @ScottDrake-hl6fk a month ago

      you can also reuse the seed value for testing

  • @reverse_meta9264
    @reverse_meta9264 3 months ago +12

    TL;DR - for LLM inference, the 4090 is generally not meaningfully faster than the 3090

    • @tsizzle
      @tsizzle 2 months ago

      @@reverse_meta9264 what about for LLM training?

    • @reverse_meta9264
      @reverse_meta9264 2 months ago

      @ I don't know, but I can't imagine any single card with 24GB of VRAM doing particularly well at training large models

  • @YUAN_TAO
    @YUAN_TAO 3 months ago

    Great video man, thank you!😊

  • @KayWessel
    @KayWessel 2 months ago +2

    Thanks for the nice comparison. Today I use a GTX 1080 8GB and am planning to upgrade to a 3090 24GB, but I'm wondering about the 4070 Ti Super 16GB. What do you think? My PC is an Intel Core i5-9400F @ 2.90GHz with 32GB 3200MHz memory. Prices are 20% higher on the 4070.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +1

      VRAM is always #1 and 24GB is a lot. It allows a good step up in models to run at higher stored parameters.

  • @moozoo2589
    @moozoo2589 3 months ago +5

    4:26 Again seeing CPU utilization at 100% and GPU at 81%. Have you tried using a more performant CPU to see if you can saturate the GPU fully?

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago +4

      This is actually a good time for me to test this. I have a threadripper out of its case right now. Great Q!

    • @selub1058
      @selub1058 2 months ago

      @@DigitalSpaceport results?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      @selub1058 results in this video th-cam.com/video/qfqHAAjdTzk/w-d-xo.html

  • @KonstantinsQ
    @KonstantinsQ 3 months ago +5

    It would probably be interesting to compare 2x 4090s with 3x 3090s, as they cost about the same. And to look at user-friendly motherboards that can fit 2-3x 3090s. Are there, for example, X670E motherboards that can fit 3x GPUs with 8 PCIe lanes each and make it work? Or would it make sense to run two 3090s at x8 and one at x4? It's not easy to find even an X670E that can fit 2x 3090s, because not every motherboard has dual x8 PCIe slots, and second, not every one has the slot spacing to plug them in directly without risers. So there are probably only a few options. And the question: is there a benefit from the 3090s' NVLink connection, does it bring value? Many questions, I'd be glad for clarification on any of these! A build with an X670E would be much more cost- and ease-friendly for a home setup. But 2x 24GB VRAM might not be enough for larger models; 3x 24 = 72GB VRAM can theoretically fit 70B LLMs, but is it even possible with X670E or similar motherboards, and is it worth it? Thanks in advance! And good content! :)

    • @loktar00
      @loktar00 3 months ago +3

      @@KonstantinsQ PCIe lanes don't matter much for inference, just initial loading. I'm using 2x 3090s and a 3060 Ti on a x1.

    • @KonstantinsQ
      @KonstantinsQ 3 months ago +1

      @@loktar00 Please explain more or suggest where to dive deeper to understand what you meant. :)

    • @ScottDrake-hl6fk
      @ScottDrake-hl6fk a month ago +1

      NVLink is irrelevant here; 8 lanes are fine; PCIe Gen 4 is important; fast storage is important. There are several older consumer mobo/CPU combos that get you x16/x8/x8/x8/x8/x8/x8, and 128GB of DDR4 is enough to run. Good luck everyone.
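
      To check what link each card actually negotiated, a small sketch that shells out to nvidia-smi using its standard query fields (note that cards drop to a lower PCIe generation at idle, so query under load):

        import subprocess

        # Current PCIe generation and lane width per GPU.
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
        print(out.stdout)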

  • @CYBONIX
    @CYBONIX 3 months ago

    Well done! I'll be looking out for the image generation comparison between the two cards.

  • @markldevine
    @markldevine 2 months ago

    Great content. My new DIY idea: WRX90 (or whatever might be next) with a Zen 5 9965X (perhaps), which has the new Zen 5 I/O die (not a disappointment like the desktop chips), for an in-home build. Getting residential quotes for 2x new electrical circuits/outlets now. Completed in 2026H1, probably.
    Your channel will be required viewing. Thanks for posting content!

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      In '26 it may be a newer variant, but you NEED to watch my next video before you go all in on the top-of-the-line WRX90 with a 7995WX. Shocking findings, my friend.

  • @aravintz
    @aravintz 3 months ago +1

    Thank you so much!! Will SLI make a difference when we use two GPUs? And please do test 70B.

  • @janreges
    @janreges 3 months ago

    Hi Jerod,
    thank you for this video and your YouTube channel in general! I've been watching your videos since 2021 (because of XCH; I'm a farmer with 2.3 PiB of drives in the house, now an effective 4.93 PiB with one RTX 4090) and you are one of my favorite creators.
    The RTX 3090 is definitely the most sensible choice for AI homelabbers today. However, I would be very happy, if you have the HW and time, to see a comparison of the RTX 3090 with e.g. the 6900 XT (with ROCm) for LLM models that can fit in 16GB VRAM.
    Btw, according to my measurements and for my project's needs, Qwen2.5:7b (Q4_K_M) is the best current model. It runs very fast even on a single GPU like the RTX 3080 10GB (about 90 tokens/s) and returns really high-quality and precise information like other 30B+ models.
    Thank you for your work, and I look forward to all your other videos in the AI HW/SW realm as well ;)

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago +1

      Sweet, I may try to get a loaner ROCm-capable card to test. I think their prices are very good; if they can drop in and work with something like Ollama, it would be great. I agree Qwen 2.5 is very good. I do like the new Nemotron tune Nvidia released, but that is a 70B, and Qwen supports a great variety of sizes all the way down. Cheers

    • @slowskis
      @slowskis 3 months ago +2

      I use the qwen2.5:14b with a 6800XT and it runs great. Uses 15.2/16 GB of dedicated GPU memory 🥰

    • @janreges
      @janreges 3 months ago

      @@slowskis Thanks for the information, man! How many tokens/s are you running? And on what platform (Win/Linux/macOS) and with what LLM tool are you running?

    • @slowskis
      @slowskis 3 months ago

      @@janreges I am running on Windows 10 with Ollama. I have not tested token/s.

  • @IvanRosaT
    @IvanRosaT 6 hours ago

    Considering that one puts the two cards, a 3090 and a 4090, together in the same PC, would that work with Ollama and/or LM Studio regardless of the model?

  • @mattfarmerai
    @mattfarmerai 3 months ago

    Great video, I would love to see an image generation comparison.

  • @simongentry
    @simongentry a month ago

    So before I got into AI and learning about the power of local LLMs, I had already bought a 7900 XTX for my X670E and 7950X system… it works great, I love it - but for a local LLM I can't get it to work, even with ROCm. Have you tested the 7900 XTX like the 3090 etc.?

    • @DigitalSpaceport
      @DigitalSpaceport  a month ago +1

      No, I've avoided ROCm-based cards as I have read your story a lot of times. I do intend to add some to the mix, but I have a studio expansion underway and capex is limited until next year.

    • @simongentry
      @simongentry a month ago

      @@DigitalSpaceport I feel you. I haven't even been able to find anyone to help me with the install. Looks like I'll have to try and find a deal on a 4090 when the 50 series comes out. :( Hey, thank you for replying... most don't. Subscribed.

  • @KonstantinsQ
    @KonstantinsQ 3 months ago +1

    And as it's already clear that for some time the 3090 will be the best choice for the money, it would be interesting to see a comparison or analysis of different 3090s. Are they all the same for LLMs, or might there be differences and benefits between models? And how do you choose the best 3090 from a distance, on eBay or anywhere else? Thanks

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago

      I'll put together a written thing on the website and send it over here to you.

  • @ДмитрийКарпич
    @ДмитрийКарпич 2 months ago

    Interesting results - it seems that's where the real improvement in the new GPU generation is, besides the frame generator etc. To my mind the main problem is memory: on paper they have nearly the same bandwidth, 936.2 GB/s vs 1.01 TB/s. But for image generation the huge ~1.5x CUDA advantage may count for something.

  • @theunsortedfolder6082
    @theunsortedfolder6082 3 months ago +1

    So, in your case, the model fits into a single GPU, and then it did not matter whether there were two GPUs or one? Speed was the same? What about the case of, say, 3x 3070 Ti vs 1x 3090? I'm thinking a lot of these cards exist because they were used for mining, and were probably purchased at the very last phase of the mining era, so they exist in large numbers and are relatively new. Any test with that?

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago +1

      So if a model can fit into a single GPU, the Ollama service will put it on just one. Like 1x 3090 vs 2x 3060 12GB is what you're asking, since they both add up to 24GB? I don't have an answer on that. Yes, many are available and indeed were used for mining. A good reason to anticipate a repad on a 3090, as they are very hot cards. Other 3000-series cards generate much less heat stress.
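
      A quick way to confirm that placement is to look at per-GPU memory while the model is loaded; a small sketch using standard nvidia-smi query fields (a model that fits in 24GB should show up on one card only):

        import subprocess

        # VRAM in use per GPU, e.g. right after sending a prompt to Ollama.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
        print(out.stdout)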

    • @Espaceclos
      @Espaceclos 3 months ago

      @@DigitalSpaceport It would be cool to see the difference between two 12GB cards and one 24GB card of the same series, i.e., 3060s vs a 3090.

  • @alx8439
    @alx8439 2 months ago

    Interesting to see that the TPS degrades if you continue to use the same chat with every next question asked. I have a gut feeling this happens because the whole conversation history is sent to the model each time and it has to process it first; there's no caching in Ollama / Open WebUI. It also looks like some of them (Ollama or Open WebUI) calculate the generation speed wrongly, without accounting for the time it took to parse the context. To overcome this you need to start a new conversation for each new query.
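
    The distinction shows up in Ollama's raw stats: eval_count over eval_duration is the pure generation speed, while also dividing by the prompt-parsing time gives the effective end-to-end rate. A rough sketch, assuming the default port and an illustrative model and prompt:

      import requests

      r = requests.post("http://localhost:11434/api/generate",
                        json={"model": "gemma2:27b-instruct-q5_K_M",
                              "prompt": "Summarize the plot of Hamlet.",
                              "stream": False},
                        timeout=600)
      r.raise_for_status()
      s = r.json()

      gen_only = s["eval_count"] / s["eval_duration"] * 1e9
      end_to_end = s["eval_count"] / (s.get("prompt_eval_duration", 0) + s["eval_duration"]) * 1e9
      print(f"generation only: {gen_only:.1f} t/s, including context parsing: {end_to_end:.1f} t/s")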

  • @sebastianpodesta
    @sebastianpodesta 3 months ago

    Great!! Thanks for your video!!
    Do you know if the GPUs were running at x4, x8 or x16 PCIe lanes? I'm looking forward to making a mini AI server, and I see that the new Intel Core Ultra 200 series will have more PCIe lanes - enough to run two GPUs at x16. But I don't know how much of a difference this will make.
    I hope in the future we can combine NPU and GPU VRAM for different tasks, since the new motherboards will support 192GB, but only at DDR5 speeds, for now.
    Can't wait to see the Stable Diffusion benchmarks!!

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago +1

      Hey, thanks for your question. I read a lot of findings, but it's nice to validate the assumptions on 3090 vs 4090. On PCIe lanes, if you are sticking to pure inference, the performance impact will be negligible outside of initial model loading. I will likely attempt to quantify this, however, to get a firm reading. If you are doing a lot of embeddings, with say a large document collection for RAG, it can utilize more bandwidth, but that depends a lot on the embedding model and the documents themselves. In those instances x4 at Gen 4 is still a massive amount of bandwidth. I would factor it in for, say, a x1 riser versus x16, but it's negligible at x4 or greater. Do also check any specific consumer motherboard for limitations that impact onboard M.2 usage if you enable, say, a third slot. Sometimes that is shared and can disable some portion of the onboard M.2. I really like my NVMe storage, as it speeds everything up quite noticeably.

  • @viniciusms6636
    @viniciusms6636 3 months ago

    Great job.

  • @tsizzle
    @tsizzle 2 months ago

    Since NVLink is no longer supported on the 4090 (Ada Lovelace), is it better to get two 3090s and NVLink them together? Also, is the VRAM pooled together, 24GB + 24GB = one GPU of 48GB? Does a whole LLM need to fit in the VRAM of a single 24GB GPU, or can it fit in 48GB of VRAM across two GPUs? Or do you have to use some sort of model sharding, distributed training, paged optimization, QLoRA, KV caching, etc.? How many GPUs does it take to run Llama 3.1 405B?

    • @lietz4671
      @lietz4671 2 months ago +1

      In one video I saw a message that running the 405B model requires at least 221GB of VRAM. So you would need at least nine 24GB 3090 cards to run the 405B model. With fewer cards than that, processing gets correspondingly slower. In another video I saw the 405B model being run on a 4090, and it took more than 20 minutes of waiting before output started. So on a personal computer it's probably best to give up on running the 405B model.
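
      That ~221GB figure lines up with a back-of-the-envelope estimate: weight memory is roughly parameter count times bits per weight, before KV cache and runtime overhead. A sketch with approximate bits-per-weight values (not exact for any particular quant):

        import math

        # Rough VRAM for the weights alone: params * bits_per_weight / 8 bytes.
        # KV cache and runtime overhead add more on top, growing with context length.
        PARAMS = 405e9
        for name, bpw in [("fp16", 16.0), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
            gb = PARAMS * bpw / 8 / 1e9
            print(f"{name:7s} ~{gb:5.0f} GB of weights -> at least {math.ceil(gb / 24)}x 24GB cards")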

  • @nandofalleiros
    @nandofalleiros 3 months ago

    I'm using a 4090 and a 3090 for image generation with Flux1-dev here. The 4090 generates a 1024x1024 image in 16 seconds; the 3090, capped at 280W, generates one in 30s.

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago

      That's a big difference

    • @jeffwads
      @jeffwads 3 months ago

      Twice as fast? Hmm.

    • @nandofalleiros
      @nandofalleiros 3 months ago

      Yes, but you should consider that it's a 450W card against 280W. I'll undervolt and change the power limit of the 3090 to 400W and test again. Also, I'll add at least 500MHz to the memory speed.
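
      For the power-limit half of that experiment, a small sketch using nvidia-smi (requires root/admin; the GPU index and the 400W target mentioned above are assumptions, and the card's vBIOS must allow that limit):

        import subprocess

        GPU = "0"  # index of the 3090 in this system

        # Raise the board power limit to 400W, then confirm the limit and current draw.
        subprocess.run(["nvidia-smi", "-i", GPU, "-pl", "400"], check=True)
        out = subprocess.run(["nvidia-smi", "-i", GPU,
                              "--query-gpu=power.limit,power.draw",
                              "--format=csv,noheader"],
                             capture_output=True, text=True, check=True)
        print(out.stdout.strip())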

  • @masmullin
    @masmullin 2 months ago

    FYI: I've tested the 4090 vs 3090 vs 7900 XTX for image generation. Rough numbers, as these just come from my memory:
    Flux1.dev fp8 (from the ComfyUI Hugging Face repo):
    20 steps, 1024x1024:
    7900 XTX: 42-45 seconds
    3090: 35-38 seconds
    4090: ~12 seconds
    TL;DR: the 4090 is a monster for image gen. Similar differences can be seen with SD3.5 and SDXL.
    For reference, the 7900 XTX will run qwen2.5-32b:Q4 at 22 t/s

    • @DigitalSpaceport
      @DigitalSpaceport  a month ago

      TY for dropping stats! I appreciate it. Helps guide me immensely.

  • @MrAtomUniverse
    @MrAtomUniverse 2 months ago

    What about Llama 3.2 in Q4? The models on Ollama are generally Q4, right?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +1

      If you grab the default pull tag, yes, that is Q4. However, if you click Tags you can download Q8 or FP16, as well as several other variants.

    • @MrAtomUniverse
      @MrAtomUniverse 2 months ago

      @@DigitalSpaceport If you have a chance in the future, do help test Llama 3.2 in Q4 on the 4090, thank you so much!

  • @Alkhymia
    @Alkhymia 2 months ago

    What CPU do you recommend for the 3090 and AI stuff?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Fast single-core max speed matters and core count doesn't. I can't report on P-cores vs E-cores as I don't have Intel.

  • @sciencee34
    @sciencee34 2 months ago

    Hi, have you looked at laptop GPUs and how those compare?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +1

      I have a 6GB 3060 in my laptop and it works great for inference workloads (like Ollama), but unfortunately it's 6GB and not the 8GB or 12GB that desktop 3060-class GPUs have. I could have spent a bit extra and gotten a larger-VRAM card, but now it's this.

  • @nosuchthing8
    @nosuchthing8 3 months ago

    How do they fit these monster video cards into laptops?

  • @ewenchan1239
    @ewenchan1239 3 months ago

    This is consistent with the 3090 results that I've posted as comments back to this channel/your videos, albeit with different models.
    It averages out to somewhere between 5-8% slower than a 4090, which really isn't a lot.
    I'm glad that I bought the 3090s during the mining craze (where I was able to make my money back from said mining), and now I can also use said 3090s for AI tasks like these.

  • @Espaceclos
    @Espaceclos 3 months ago

    What CPU were you using? It seemed to be at 100%

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago

      This is getting tested right now. Video on that very soon. Maybe tomorrow. Interesting stuff.

  • @ScottDrake-hl6fk
    @ScottDrake-hl6fk a month ago

    I started with two 12GB 3060s (availability), then added two 10GB 3080s. The 3080s finish about twice as fast. 70B Q4 and Q8 are not blazing fast, though usable. My upgrade path could be replacing the 3060s with two more 10GB 3080s OR maybe one 24GB 3090. I suspect more smaller-capacity cards can apply more computation than one big card doing it all - I may be wrong, can anyone confirm?

    • @DigitalSpaceport
      @DigitalSpaceport  a month ago

      @@ScottDrake-hl6fk A single 3090 is faster, as it doesn't split the process workload. This is due to how llama.cpp handles parallelism: if you split 4 ways, each GPU process runs at 25% right now. There are ways to change that, but it's up to the devs.

  • @KonstantinsQ
    @KonstantinsQ 3 months ago

    By the way, how about 2x 1080 Ti with 11GB VRAM vs a 3090? VRAM 22GB vs 24GB, watts 250x2 vs 350, price about $200 x 2 vs $1000. That means for $1000 I can get 4x 1080 Tis. What other downsides do 2x 1080 Ti have for AI?

    • @KonstantinsQ
      @KonstantinsQ 3 months ago

      Or even the 2060 12GB VRAM version or the 3060 12GB VRAM versions for $200-250 are a good catch, no?

  • @silentage6310
    @silentage6310 2 months ago

    It's cool!
    We need this comparison for image generation, Flux1-dev or SD.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      It is kinda weird. This didn't start out as a testing lab thing, but it has evolved into one. I'm working on it and have a ComfyUI workflow set up, but man, I am a total noob with imgen. Any recommendations on how to start?

    • @silentage6310
      @silentage6310 2 months ago

      @@DigitalSpaceport I recommend simply using Automatic1111 (or the new fork for Flux, Forge). It's portable and simple to use.
      It shows the time spent on image generation.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Flux Forge looks fantastic. Thanks!

  • @issa0013
    @issa0013 3 months ago

    Can you upgrade the VRAM in the 3090 to 48GB?

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago +1

      Those are skills I do not have myself, but is this like a service you can send GPUs in to?

  • @melheno
    @melheno 2 months ago

    The performance difference of the 3090 vs the 4090 is only the memory bandwidth difference, which is expected.
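
    That matches a simple bandwidth-bound estimate: each generated token streams roughly the full set of weights from VRAM, so tokens/s is capped near bandwidth divided by model size. A rough sketch using the bandwidth figures quoted in an earlier comment and an assumed ~19GB weight size for the 27B q5_K_M model:

      # Ceiling for a memory-bandwidth-bound decoder: tokens/s <= bandwidth / bytes per token,
      # where bytes per token is roughly the size of the weights read for each token.
      WEIGHTS_GB = 19.0  # approx. gemma2 27B q5_K_M weight size (assumption)
      for card, bw in [("3090", 936.2), ("4090", 1010.0)]:
          print(f"{card}: ~{bw / WEIGHTS_GB:.0f} t/s ceiling (measured ~34-38 t/s above)")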

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +2

      I was in the Ollama repo last night, and there are apparently issues with both Ada performance and FP16 that have been addressed and will land in upcoming releases. Looks like I'm going to be doing some retesting!

  • @alirezashekari7674
    @alirezashekari7674 3 months ago

    Awesome

  • @Dj-Mccullough
    @Dj-Mccullough 3 months ago

    Man, you can really tell that Nvidia GPUs hold their value due to RAM amounts. The 3080 Ti is about half the used price of the 3090... I finally found a 3090 though for $680. Good enough for me. People who want to get into playing with this stuff can realistically do it with as little as a 1070 8GB.

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago

      Yeah, I showed the 1070 Ti back in that mix-n-match video and the performance is really very good. I was surprised, and I hope to get the message out more that pretty much any Nvidia card can do some level of inference work.

  • @crazykkid2000
    @crazykkid2000 3 months ago

    Image generation is where you will see the biggest difference

    • @DigitalSpaceport
      @DigitalSpaceport  3 months ago

      Good to know. Will be testing soon.

  • @DataJuggler
    @DataJuggler 2 months ago

    My 3090 only has 23 gigs I think. Or that is what Omniverse shows.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Should be 24. Maybe it's displaying GiB and not GB.
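
      For reference, the marketed 24 GB already comes out to about 22.4 GiB when a tool reports in binary units, and tools sometimes also subtract a small reserved slice; a one-line sketch of the conversion:

        print(f"24 GB = {24 * 10**9 / 2**30:.2f} GiB")  # ~22.35 GiB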

    • @DataJuggler
      @DataJuggler 2 months ago

      @@DigitalSpaceport You are right. I looked it up in Task Manager > Performance, and then Bing Chat explained to me the dedicated memory and the shared 64 gigs the GPU can offload to.

  • @autohmae
    @autohmae 3 months ago

    Honestly, I think VRAM makes the biggest difference in performance.

  • @Larimuss
    @Larimuss a month ago +1

    For less than a 4090 you can run 2x 3090s, if power isn't an issue for you 😂 - like $1000 less.
    Of course there is a speed difference 😂 - it's pretty substantial for training, but the 3090 is still great speed for the price. Nothing else has 24GB VRAM. So until Nvidia stops being c*nts on VRAM and price, I'm not buying a new card.

  • @mz8755
    @mz8755 2 months ago

    The results seem a bit odd to me... it can't be that close