LocalAI LLM Testing: Llama 3.1 8B Q8 Showdown - M40 24GB vs 4060Ti 16GB vs A4500 20GB vs 3090 24GB

  • Published 25 Nov 2024

Comments • 73

  • @hammadusmani7950
    @hammadusmani7950 2 months ago +1

    This is great, and a very professional test environment. I was especially impressed by the ability to switch GPUs using the Kubernetes cluster.

    • @RoboTFAI
      @RoboTFAI  2 months ago +1

      Thanks, much appreciated!

  • @jksoftware1
    @jksoftware1 3 months ago +5

    GREAT video... Learned a lot from this video. It's hard to find good AI benchmark videos on YouTube.

    • @RoboTFAI
      @RoboTFAI  3 months ago

      Glad you enjoyed it!

  • @jackflash6377
    @jackflash6377 4 months ago +3

    Very informative, thank you!
    Detailed, to the point, and exactly what we need to know.
    I use a 3090 at home, a 4060 at work, and on my coding machine I use an old GTX 1080Ti with 11GB. It does OK for Continue in VSCode but it is slow.
    Tell your wife "it's for science" in your best Doc from Back to the Future voice.
    Thank you again.

    • @RoboTFAI
      @RoboTFAI  3 months ago +1

      I tried, she wasn't having it lol!

    • @k1tajfar714
      @k1tajfar714 3 months ago

      Wish I could have the 1080Ti someday if you wanted to get rid of it. Don't forget about me in a third world country, lol.

  • @shawnvines2514
    @shawnvines2514 4 months ago +3

    Great video. It is definitely nice to see a benchmark across different Nvidia boards with something similar to what I have run before. At the end of June, I bought parts and built a computer for AI development with a Ryzen 7 7800X3D for $339 and a 4060 Ti 16GB for $450. I bought it to begin local development while waiting on the RTX 5090, but it looks like that will be delayed for a while.
    I've just been using LM Studio and Anything LLM for running local LLMs to analyze data, and using many Python open source projects for audio and image processing.

    • @RoboTFAI
      @RoboTFAI  4 months ago +2

      LM Studio and Anything LLM are both in my toolbox as well for my daily-driver laptop! Excellent tools. I also use Continue/etc integrated into VSCode pointing at the lab or LM Studio locally.

    • @lppoqql
      @lppoqql 2 months ago +1

      That's great, I'm thinking of doing the same thing. Do you mind telling me how your setup is working so far? Is the 4060 Ti 16GB good enough for code generation or are you seeing lots of errors? Thanks!

  • @dllsmartphone3214
    @dllsmartphone3214 4 months ago +3

    Exactly as requested. More useful videos. Thank you for your content...
    I'm thinking about buying an H100 80GB because I wanna run Mistral Large 2 so badly 😅

    • @iheuzio
      @iheuzio 4 months ago +2

      Get a Gaudi 3, it is cheaper at 15k per card for the same performance as an H100 and you get double the VRAM.

    • @dllsmartphone3214
      @dllsmartphone3214 4 months ago +2

      @@iheuzio What a nice suggestion, it looks promising.
      More VRAM, more speed, and a lower price. If this is true... well, I can wait a little; I will definitely consider the Intel one now. Thank you very much!

  • @ArtificialLife-GameOfficialAcc
    @ArtificialLife-GameOfficialAcc 2 months ago +10

    Undervolt the 3090 and it will give you basically the same performance at around 220-250 watts.

    • @maxmustermann194
      @maxmustermann194 27 days ago

      No need to undervolt, there is a simple nvidia-smi command to set the power limit.

    • @ArtificialLife-GameOfficialAcc
      @ArtificialLife-GameOfficialAcc 27 days ago

      @@maxmustermann194 That doesn't work as well, the frequency jumps around like crazy (well, if you use tensor cores it works, because tensor cores only need a GPU frequency of around 1500 MHz; more than that makes almost no difference).
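A minimal sketch of the nvidia-smi power-limit approach mentioned above, wrapped in Python for convenience. The GPU index and the 250 W cap are placeholder values, and setting the limit typically requires root:

```python
import subprocess

# Show the current power limit of GPU 0 (read-only query, safe to run).
print(subprocess.check_output(
    ["nvidia-smi", "-i", "0", "--query-gpu=power.limit", "--format=csv"],
    text=True,
))

# Cap GPU 0 at 250 W; usually needs root privileges (e.g. run under sudo).
subprocess.run(["nvidia-smi", "-i", "0", "-pl", "250"], check=True)
```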

  • @makerspersona5456
    @makerspersona5456 4 months ago +4

    Test out the 3060 12GB cards comprehensively please! It would also be nice to hear your opinions on what the best card combos might be for cost to performance.

  • @Viewable11
    @Viewable11 4 months ago +2

    Pro tip: you can reduce the power draw of the RTX 3090 by 90 watts via undervolting without any speed reduction during LLM inference.

    • @RoboTFAI
      @RoboTFAI  3 months ago

      Yep, for sure, and good info for people on power limiting. I wasn't going to do that for this test, of course.

  • @Viewable11
    @Viewable11 4 months ago

    The read speed of an LLM (prompt eval tokens/s) depends only on the compute speed of the hardware (which depends on the number and frequency of tensor cores, the number and frequency of CUDA cores, and the chip generation). The write speed of an LLM (eval tokens/s) depends only on the memory bandwidth (in GB/s) of the hardware and the chip generation.
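As a rough illustration of the bandwidth point above: generation speed is often estimated as memory bandwidth divided by the bytes streamed per token (roughly the model size). A back-of-envelope sketch with assumed, not measured, numbers:

```python
# Back-of-envelope ceiling: each generated token streams the full weight set
# from VRAM, so eval tokens/s <= bandwidth / model size.
# Both numbers below are assumptions, not measurements from the video.
memory_bandwidth_gb_s = 936   # RTX 3090 spec-sheet bandwidth
model_size_gb = 8.5           # Llama 3.1 8B at Q8_0, roughly

ceiling_tps = memory_bandwidth_gb_s / model_size_gb
print(f"theoretical ceiling: ~{ceiling_tps:.0f} tokens/s")  # ~110 t/s
# Measured numbers land below this because of compute and framework overhead.
```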

  • @animation-nation-1
    @animation-nation-1 1 month ago +1

    Nice, but then there is price too if it's just a test lab. Considering in Australia a 3090 is USD $1500-$2000, and a 4090 $2500 USD. So tempted to get an old Tesla, but the 3090 just works in a simple motherboard.

  • @vulcan4d
    @vulcan4d 3 months ago +6

    What about AMD using ROCm?

  • @hablalabiblia
    @hablalabiblia 1 month ago +2

    Superb! Could you make a tutorial on how to set up and implement everything needed (SOFTWARE WISE) to achieve what you did here?

    • @RoboTFAI
      @RoboTFAI  1 month ago +2

      Yes, soon

    • @nlay42
      @nlay42 1 month ago

      Yes, I would love to learn how to do what you do in the setup. Looking forward to the video @RoboTFAI

  • @Viewable11
    @Viewable11 4 months ago

    The GGUF model file format is meant for CPU inference. For GPU inference, use the GPTQ format (.safetensors). The GGUF format takes more space and has lower quality, but can be used on a CPU.

    • @fontenbleau
      @fontenbleau 4 months ago

      GGUF is my territory; for that you need a decent CPU and a server motherboard with plenty of RAM slots (12), the only cheap way to get a real terabyte of RAM and run at the best Q8 quality. Speed doesn't affect quality in this area: if you can afford the space for quality, even the slowest setup gives the same result as the cloud, just later in time.

  • @iheuzio
    @iheuzio 4 months ago +5

    Can you please test the A770 16GB card? Thanks

    • @nßultz1440
      @nßultz1440 4 months ago +3

      Right, even just Intel and AMD in general.

  • @rhadiem
    @rhadiem 2 months ago

    Me with a 4090 and a 1500W PSU, chuckling about your concern for burning the house down at 300W. :D I tripped a 15A breaker earlier this week, found out the outlet I have my microwave plugged into is on the same circuit as my home office, and I must have been pushing the GPU at the time. So glad I don't have insane power costs like Europe. Btw, if you need a "the youtubers asked for it, it's a business write-off" excuse for a 4090... you really need a 4090 for testing data comparisons for the people. :D Thanks for all the tests. Would love to see a 4070 Ti in there too, to fight with the 4060.

    • @RoboTFAI
      @RoboTFAI  2 months ago +1

      Haha - but I have been known to burn up a power supply or three, a big UPS, a couple of breakers, etc - luckily the lab has a few dedicated 20 amp circuits these days. There is absolutely a fire extinguisher hanging in the workshop/lab!
      Hey, I would love to buy a 4090, and a million other cards! To be honest, I never really planned on a channel; I put up a video from a discussion with friends (basically to prove them wrong with data) and somehow you folks seem to like what this crazy guy does in his lab? If the channel continues to grow and happens to make money one day, I'm happy to throw it all back into the channel. For now my budget is not much 💸

    • @rhadiem
      @rhadiem 2 months ago

      @@RoboTFAI Haha, well you've earned this sub, curious what you end up testing next. "Not much" as you have a handful of $1k GPUs. Carry on, good sir. o7

  • @mwwhited
    @mwwhited 4 months ago

    For shootouts you should set your seed value for the run so the results are deterministic between cards.
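One way to do that with LocalAI's OpenAI-compatible API is to pin the seed and temperature in the request. A sketch using the OpenAI Python client; the port is LocalAI's usual default, the model name is a placeholder, and full determinism still depends on the backend honoring the seed:

```python
from openai import OpenAI

# LocalAI exposes an OpenAI-compatible API; 8080 is its usual default port.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",   # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    seed=42,                         # fixed seed so runs are repeatable
    temperature=0,                   # greedy sampling removes most randomness
)
print(resp.choices[0].message.content)
```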

  • @Vaasref
    @Vaasref 4 months ago +3

    Something I am really wondering about is Radeon VII vs RX 6950 XT (to keep it inside the AMD family).
    Getting stuff to work with ROCm is bothersome and most of what is available for NVIDIA just refuses to run, but as long as only inference is involved it works well (tried some tuning with no success so far).
    Would the massive HBM2 bandwidth be able to score any win against more recent, more capable compute? Or if no win were to be seen for the HBM2, how would it affect the scaling?

  • @tsclly2377
    @tsclly2377 4 months ago

    I'm impressed... perhaps because I used HP DL580 G7s to mine ETH years ago; they are just sitting around and will take these M40s in pairs nicely, plus PCIe Optane and a 25Gb/s RJ45 card so they can all 'talk together'... like 4 of them, with 1200W (208/240V, or 1050W@125V), of course bandwidth-limited on the board. PCIe 2 can run PCIe 3 cards pretty well, but I'm not so sure about PCIe 4 cards. I would have liked it if you could have had the RTX Titan run with this group, though. My other thought is that quant size may vary the accuracy of the output on more subjective matters, with questions that can be interpreted in differing ways at the lower quant levels varying the output results, especially in training. I think that if time is not that big a consideration, the cheapness of the M40 makes it an appealing card set at the 48GB+ level running SLI connections, but I still have not evaluated the bus connections (lane 0 connected to lane 32). Presently setting up one machine with dual P40s.

  • @minagornas4285
    @minagornas4285 3 months ago

    Hey, you make really interesting and comprehensive videos! Many thanks for that. What I always ask myself, and I think maybe many others do too:
    What exactly do you use to connect the GPUs? Your system looks like a mining rig. Is there any performance loss with these extenders versus a direct connection via a PCIe 16x slot?
    Have you been able to test things like NVLink with your systems? Does it make sense to use different GPU models, or does this create some kind of bottleneck?
    What do you think is important when it comes to choosing hardware to build such a system?
    Sorry for all the questions. I just find the whole topic really exciting.

    • @RoboTFAI
      @RoboTFAI  3 months ago

      Hey, much appreciated!
      I use PCIe extenders - make sure they are 8/16x capable at your PCIe level (3, 4, 5) - I have had good luck with these ones www.amazon.com/gp/product/B09NB9D9PH
      I haven't noticed any performance difference between direct connection and using these extenders... but that sounds like a good idea for a test...
      NVLink = I haven't seen a reason to do it for my needs. It adds extra cost and almost all LLM software/CUDA/etc supports splitting without it. Could it be a performance increase when using two cards... I don't know; again, sounds like something we could test, but I have no budget left at the moment.
      Your last question is fairly subjective without knowing your requirements, as I don't think most people are doing what I do with my systems - do you want to be able to run big models? Small models? Multiple parallel models? Are you looking for tokens per second or power usage?
      Performance vs Power vs Cost vs Needs (let's be honest, it's Wants) - I find this tends to be different for everyone, since most people will put one of those at the top of their priorities.

  • @AnOldMansView
    @AnOldMansView 3 days ago

    Hey, question for you: what drivers/process are required to have a 3090 and a K80 running side by side? Depending on the driver I install, it's either one or the other. I believe I need multi-GPU support enabled? Not sure... maybe you might have the clue I need. Cheers.

    • @RoboTFAI
      @RoboTFAI  2 days ago +1

      K80? That's a Kepler card, which I think Nvidia removed support for a few years ago in the drivers/CUDA. Not sure you will get them to fully function together just from that.

  • @fontenbleau
    @fontenbleau 4 months ago

    The paradox of this area: speed does not affect quality. If you can afford max quality (lots of RAM) but on very slow hardware, you'll get the same result as the cloud, just later. Slow AI is even getting more popular in the corporate sector.

  • @Flixerine
    @Flixerine 4 months ago

    Good video, very detailed. I like that you looked at all aspects, power, price, efficiency etc.
    I don't suppose you have an AMD card lying around to compare as well? :D

    • @RoboTFAI
      @RoboTFAI  3 months ago

      Thanks! I do not have any AMD cards around but would be willing to test them if I got my hands on a few to borrow

  • @AndroidFerret
    @AndroidFerret 1 month ago

    My new phone runs the Rocket 3B LLM (~3GB) and gives answers in under 2 seconds.
    It has a 3.3GHz chip on a 4nm process with AI hardware support + 16GB of DDR5 RAM.
    I can use an offline picture generation AI which finishes a 512x512 picture with 20 steps in around 2 minutes.
    That's absolutely INSANE IMO.

    • @RoboTFAI
      @RoboTFAI  1 month ago

      It's crazy, mobile is where I always predicted small models would reign. The technology and the software are advancing at a pace I haven't seen in my career.

  • @marekkroplewski6760
    @marekkroplewski6760 3 months ago

    Great job! Llama 3.1 is really much better, so I would encourage you to go on a quest: how to run different flavours of 3.1 most efficiently on commodity hardware. The IT projects around LLMs will explode IMO, because the model family is good and a lot of companies cannot share their data with public clouds.

  • @tbranch227
    @tbranch227 4 months ago +2

    I think you can take your evaluation a little further and tell us cost/token and total power/token. I'm interested in seeing some more high-end builds too. What hardware do we need to achieve 100t/s for instance and beyond? Thanks for the video! This was great!

    • @Viewable11
      @Viewable11 4 months ago

      Some people achieved 100 t/s with an RTX 4090. More important than choosing hardware is choosing the right software.

    • @noth606
      @noth606 4 months ago

      @@Viewable11 I'd say that your argument has flaws. Changing software is a lot easier than getting your money back fully for a GPU and buying another.

  • @Rewe4life
    @Rewe4life 1 month ago

    I have two Tesla P40s here, but I have been unsuccessful in my tries at making use of both for my AI workloads. Especially my Stable Diffusion trainings are taking very long. Do you know how I could make them appear as one large GPU?

  • @jcdenton7914
    @jcdenton7914 1 month ago

    How many shrouds and fan sizes have you tried on Tesla GPUs? I want to get a quieter run, which a larger fan could theoretically do, but the shroud funneling might be a source of noise, so I don't know what to get for best silence.

    • @RoboTFAI
      @RoboTFAI  1 month ago

      Hmm, a few. I originally had them in a server that got retired, so just some 3D printed shrouds. For bench testing I use high-speed fans (very loud)... They require a good amount of air through them to keep them cool.

  • @davidtindell950
    @davidtindell950 4 months ago +1

    Great work! 😊

    • @RoboTFAI
      @RoboTFAI  3 months ago

      Many many thanks

  • @Bjarkus3
    @Bjarkus3 2 months ago +1

    I am curious to know if anyone has done tests of e.g. 3060 vs 3090 vs 4090 on big models that do not fit in VRAM, using GPU offload??? E.g. I get 2 tokens/second with a 3060 and a 7950X for 40GB models... Anyone know how a 3090 performs here? DDR5 RAM at 6000MT/s btw.

    • @RoboTFAI
      @RoboTFAI  2 months ago +1

      Mixing GPUs and CPUs is something we do here; we can dive in further.
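For anyone wanting to try that kind of partial offload themselves, here is a minimal llama-cpp-python sketch; the model path and layer count are placeholders to tune against your VRAM:

```python
from llama_cpp import Llama

# Offload part of the model to the GPU; remaining layers run from system RAM.
llm = Llama(
    model_path="models/llama-3.1-70b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=30,   # raise until VRAM is nearly full, lower on OOM errors
    n_ctx=4096,
)

out = llm("Explain GPU offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```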

  • @suprdiddy
    @suprdiddy 3 months ago

    @RoboTF AI Thanks for the video(s), they have been very helpful. I would love to see one that goes over the software/drivers as well as the CPU, memory, and motherboard you use to set this up. One that would answer the question "If I wanted to combine 2 RTX 3090s so that LM Studio would be able to utilize 48GB of VRAM, what software would I need?" The problem is there's a ton of content for exactly the opposite use case, so much so that the GPT bots I've asked assume that I want to share one GPU with many VMs. Does that video exist? If not... I'll subscribe and wait.
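LM Studio drives llama.cpp under the hood and exposes multi-GPU splitting in its UI; for reference, a rough llama-cpp-python sketch of the same kind of split (model path and ratios are placeholders):

```python
from llama_cpp import Llama

# Spread the layers across two visible GPUs so their VRAM pools add up.
llm = Llama(
    model_path="models/llama-3.1-70b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,          # -1 offloads every layer
    tensor_split=[0.5, 0.5],  # fraction of the model assigned to each GPU
)
```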

  • @k1tajfar714
    @k1tajfar714 3 months ago

    Thank you for the great video!
    I have actually zero bucks. I hope you can give me recommendations on this one.
    I currently have spent 200 bucks on an X99-WS motherboard, so I'll have 4 PCIe slots at full x16 if I don't hook up any M.2 NVMe drives, I assume.
    So that's awesome; it also has a low profile 10C/20T Xeon, 32GB of RAM, and an okayish CPU cooler.
    I have already saved $200 more and I don't know what to do. I was going to buy one or two P40s and later upgrade to 4 of them, but now I cannot even afford one;
    they're going for almost 300 bucks, I'm afraid. One option is to go with M40s, but I'm afraid they're trash for LLMs and specifically for Stable Diffusion stuff. They're pretty old, although your video shows they're quite good.
    I'm lost. I'd love to get help from you. If you think you'd have time, we can discuss it. I can mail you or anything you'd think is appropriate.
    Special thanks.
    K1

    • @RoboTFAI
      @RoboTFAI  2 months ago +1

      Feel free to reach out, 'tis a community! I have several M40s from when I first started down this road that I would be willing to part with... it's a slippery slope.

    • @k1tajfar714
      @k1tajfar714 2 months ago

      @@RoboTFAI You're fantastic! Thanks. I'd love to reach out. Would appreciate having your email or something so I can discuss! Maybe we can make a deal on your M40s if you have any spare ones that you don't use? Thanks.

    • @RoboTFAI
      @RoboTFAI  2 months ago +1

      @@k1tajfar714 robot@robotf.ai or can find me on reddit/discord/etc - though not as active as I would like to be.

  • @krisiluttinen
    @krisiluttinen 1 month ago

    Can someone explain in a nutshell what this is? Is it an AI language model like ChatGPT that runs entirely offline on my own computer?

    • @RoboTFAI
      @RoboTFAI  1 month ago +1

      That's exactly what it is, if you're talking about LocalAI (localai.io): an open source API that mimics OpenAI (ChatGPT) to run open source models.

  • @tedguy2743
    @tedguy2743 4 months ago

    Just wanted to say I really appreciate your content, and I would appreciate it even more if you can find a way to enlarge the text so it'll be easier to read. Thank you so much.

    • @RoboTFAI
      @RoboTFAI  3 months ago

      Sorry - recorded and best viewed at 4K - I'll try to do better at making things larger for people on smaller screens. Thanks for the feedback!

  • @iniyan19
    @iniyan19 2 months ago

    Could you please test out the 4070 Ti Super?

    • @RoboTFAI
      @RoboTFAI  2 months ago +1

      There may or may not be one in the lab the channel hasn't seen yet 😜

  • @StartUpRight-dp3qz
    @StartUpRight-dp3qz 3 months ago

    The text is very small. It's like looking at a blank screen.

    • @RoboTFAI
      @RoboTFAI  3 months ago

      Sorry - recorded and best viewed at 4K

  • @emil8367
    @emil8367 4 months ago

    How does it feel? 🙂 It feels like we don't see much of your screen 😀 We can trust that you're telling the truth 😀 Joking a bit; thanks for the review, but please do zoom in a bit next time so we can see more.

    • @RoboTFAI
      @RoboTFAI  3 months ago +1

      Sorry - recorded and best viewed at 4K - I'll try to do better at making things larger for people on smaller screens. Thanks for the feedback!

    • @emil8367
      @emil8367 3 months ago

      @@RoboTFAI 👍 many thanks in advance

  • @kborak
    @kborak 4 months ago

    Llama is pure garbage. Worse than GPT. It refuses to answer some of the most basic questions.

  • @fontenbleau
    @fontenbleau 4 months ago

    I don't understand why you use a Mac to view the server; that's the most questionable part of the whole system. I've used MacBooks myself, but the latest macOS is a dead OS compared to earlier ones; devs abandoned it, which is why they shove iPhone apps onto it. There's a Mac mini lying on my table; I would never use it for this, it overheats like an oven even just for web browsing.