It’s over…my new LLM Rig

  • Published Dec 26, 2024

Comments • 342

  • @irrelevantdata
    @irrelevantdata 2 หลายเดือนก่อน +22

    If you are running a GGUF model, Ollama will split the work, putting as many layers as it can on the GPU and the rest on the CPU. It will run slower, but still faster than CPU only (see the sketch at the end of this thread for how to check the split).

    • @paulhorn24
      @paulhorn24 2 หลายเดือนก่อน +1

      👍

    • @tasdude3227
      @tasdude3227 หลายเดือนก่อน +2

      @@paulhorn24 Yeah, I think a 4090 is paradoxically not as good a choice as, say, a GPU with no graphics output but way more VRAM, right?
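
    A quick way to check how Ollama actually split a GGUF model (a minimal sketch; assumes a standard Linux install of Ollama, and the model tag is only an example):

        ollama run llama3.1:70b --verbose    # --verbose prints the eval rate in tokens/s
        ollama ps                            # PROCESSOR column shows e.g. "38%/62% CPU/GPU"
        # the server log usually records the exact layer split, something like
        # "offloaded 42/81 layers to GPU":
        journalctl -u ollama | grep -i offloaded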

  • @blackhorseteck8381
    @blackhorseteck8381 2 หลายเดือนก่อน +83

    Mini PCs have revolutionized the boring PC market. The power they are able to squeeze inside these small boxes gives me hope for the future of computing.

    • @kroeken
      @kroeken 2 หลายเดือนก่อน +8

      While I agree that they are cool, they kind of become pointless once you add one of these giant eGPUs. You could build a mini-ITX PC instead and avoid the OcuLink restrictions while also gaining the upgradability of a full desktop.

    • @rochester3
      @rochester3 2 หลายเดือนก่อน +1

      @@kroeken But you can't overclock your CPU.

    • @JustSomeGuy009
      @JustSomeGuy009 2 หลายเดือนก่อน +2

      @@rochester3 Huh? Yes you can. A proper ITX or mATX build is better in every way than these mini PCs with external GPUs.

    • @TheRaretunes
      @TheRaretunes 2 หลายเดือนก่อน +1

      Mini PCs are (for the most part) just laptop internals transplanted into a little box.
      They have their use cases, for sure (hospitals, HTPCs, schools, thin clients, emulation boxes, offices), but as soon as you ramp up the performance, the price and the drawbacks (mainly thermals) climb quickly too, to the point where it's better to stick with a custom build unless you REALLY need the small form factor.
      I admit that today's mainstream PC market is VERY boring though; it peaked around pre-COVID and stayed there (I'm still using my Ryzen 3600 + 6700 XT and I don't feel the need to upgrade to PCIe 5, while I bled money into my home server instead).

    • @ZeerakImran
      @ZeerakImran 2 หลายเดือนก่อน +1

      @@TheRaretunes True. But I hope the tech keeps getting better for them. I find PC cases to be really ugly; didn't mind them back in the day, but those aren't best for airflow. And speaking of airflow, just how much air do you need? I like small ITX cases a lot more, and surprisingly they show that you don't really need that much space for airflow and good temps. "Building computers" is also an overhyped hobby; it's really not all that it's cracked up to be.

  • @harryhall4001
    @harryhall4001 2 หลายเดือนก่อน +16

    Serious question: why not just use wall power directly if your UPS isn't big enough? This isn't a mission-critical server with important information; it doesn't need 24/7 operation during a power outage.

  • @blackhorseteck8381
    @blackhorseteck8381 2 หลายเดือนก่อน +86

    Some points here, Alex:
    1. The power cable that splits into three plugs is an adapter for the 12VHPWR connector, which is too recent for most power supplies out there, so they supply a splitter that is fed by 3 or 4 of the 8-pin PCI-E connectors.
    2. The drivers for your GPU are provided by Nvidia themselves (just Google "game ready drivers for RTX 4090"); the AIB drivers (Gigabyte's) are outdated.
    3. All modern GPUs (from 2010 onwards) are set to keep the fans at zero RPM below about 60° C.

    • @R1L1.
      @R1L1. 2 หลายเดือนก่อน +9

      point 3 is not right.

    • @blackhorseteck8381
      @blackhorseteck8381 2 หลายเดือนก่อน +1

      @@R1L1. You're talking about exceptions to the rule. I only recall the Radeon VII having its fans running all the time, because it ran hot as the sun. Apart from that, point 3 is the rule.

    • @R1L1.
      @R1L1. 2 หลายเดือนก่อน +2

      @@blackhorseteck8381 That doesn't change the fact that you are wrong; you could have just said most modern GPUs, not all. Even then, my RX 6500 XT always keeps the fans on, same with my 3060, and nothing changed after buying, so yeah... idk.

    • @cor74
      @cor74 2 หลายเดือนก่อน +2

      @@R1L1. my 970 also has always on fans

    • @pid1790
      @pid1790 2 หลายเดือนก่อน

      @@R1L1. are you using 2 monitors?

  • @Krath1988
    @Krath1988 2 หลายเดือนก่อน +6

    Haven't seen anyone do a video using multiple video cards in parallel to run a large model. So that is my humble request. Love the content.

    • @paulhorn24
      @paulhorn24 2 หลายเดือนก่อน

      👍

  • @eternalnightmare2749
    @eternalnightmare2749 2 หลายเดือนก่อน +51

    Chinese modders transplanted a chip from an RTX 4090D to a custom board or a 3090 board and soldered on 48 GB of memory. A real beast for an AI rig. However, I'm not sure about the warranty on such a Frankenstein card.

    • @Larimuss
      @Larimuss 2 หลายเดือนก่อน +12

      It's honestly sad that at this point neither Nvidia, Intel, nor AMD has come out with a $700 card that just has 48GB of VRAM, and a $900 model with 128GB. The memory is cheap; Nvidia just wants to squeeze you into spending $2,000 on a 4090.
      With AI now, I think even just one such consumer model from somebody would be good. I mean, you'd sell at least 2M units.

    • @cena777248
      @cena777248 2 หลายเดือนก่อน +1

      @@Larimuss As far as I know, Nvidia has a monopoly on AI because of their CUDA technology.
      Until AMD or Intel is capable of competing with Nvidia in this space, there won't be any card with that much memory.

    • @Larimuss
      @Larimuss 2 หลายเดือนก่อน +1

      @cena777248 Yeah, they do for now. Intel has already shipped some inference support with their tech, I believe. And since CPU-only models exist, I'm sure that if AMD made a cheap 48GB VRAM card the internet would figure out how to convert models or whatever it is. I'm pretty sure you can already run some models on AMD too? Just with much less support, whereas Nvidia runs everything right now, and most optimally I think.

    • @jeremykothe2847
      @jeremykothe2847 2 หลายเดือนก่อน

      @@Larimuss Agreed. Everyone wants more vram, not more speed. But monopolists do monopoly stuff.

    • @jeslinmx22
      @jeslinmx22 2 หลายเดือนก่อน

      @@Larimuss it’s not good business for the average consumer to have access to good, fast, general-purpose local AI. We’d pay, what, maybe $600 more for that at an enthusiast level, less for the average consumer. But as long as AI remains the domain of big cloud companies and startups backed by VCs with deep pockets, people will keep flocking to them for this seemingly exclusive godlike power, and they will keep paying enterprise money to Nvidia.

  • @serikazero128
    @serikazero128 2 หลายเดือนก่อน +14

    Small advice regarding ollama: use --verbose.
    Example: ollama run llama3.1:8b --verbose
    Technically these commands, including the one you ran, keep the model loaded. You have to unload it manually, or you can tell ollama to unload it after you type /bye:
    ollama run llama3.1:8b --verbose --keepalive 10s
    --verbose will print the tokens per second generated.
    --keepalive 10s will drop the model from memory after 10 seconds (full commands are sketched after this thread).

    • @Lemure_Noah
      @Lemure_Noah 2 หลายเดือนก่อน +3

      @@serikazero128 Or, if you're already "inside" Ollama, type /set verbose

    • @paulhorn24
      @paulhorn24 2 หลายเดือนก่อน

      👍

    • @paulhorn24
      @paulhorn24 2 หลายเดือนก่อน

      @@Lemure_Noah👍
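
    Putting the flags from this thread together (a minimal sketch; the model tag and timings are only examples):

        ollama run llama3.1:8b --verbose --keepalive 10s   # print tokens/s, unload 10 s after you exit
        ollama run llama3.1:8b --keepalive 0               # 0 unloads as soon as the response finishes
        # inside the REPL the equivalent toggle is:
        #   /set verbose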

  • @corvinyt
    @corvinyt 2 หลายเดือนก่อน +11

    The UPS going off was hilarious! 😂

    • @blackhorseteck8381
      @blackhorseteck8381 2 หลายเดือนก่อน

      I thought it was mine for a sec 😅

  • @autoboto
    @autoboto 2 หลายเดือนก่อน +8

    I experienced the same VRAM problems. I have a 32-thread i9 and 128GB of system RAM; it runs the large models in slow motion, but it works. Small models run fast enough on the i9 to be usable, but if the model fits in the GPU's 16GB it's really fast, and enough to use as a service for a few clients. I'm using a mobile 4090 (roughly a desktop 4080). For large models these days, the Mac's unified RAM seems to be the way to go: slower, but at least it runs and the wait is not too long.

    • @testales
      @testales 2 หลายเดือนก่อน

      I'm not convinced about the Mac's unified RAM. What all these Mac youtubers don't show you is the impact of larger prompts. You may be fine with 5 tokens/s of generation, but it's going to suck if you have to wait a minute before it even starts generating anything.

  • @itiswhatitis-yes
    @itiswhatitis-yes 2 หลายเดือนก่อน +3

    This was a crazy video!! One of your best!!!

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +2

      Glad you liked it!!

  • @jukiy67
    @jukiy67 2 หลายเดือนก่อน +1

    i love this man, never bored watching

  • @3monsterbeast
    @3monsterbeast 2 หลายเดือนก่อน +19

    I think you meant 40 gigabits/s for Thunderbolt 4, not gigabytes.

  • @TazzSmk
    @TazzSmk 2 หลายเดือนก่อน +3

    Ollama is smart enough to use two GPUs simultaneously, so for that 40GB LLM you really have to use two GPUs with 24GB of VRAM each.
    Once you go over the GPU's VRAM capacity, things spill into system RAM and through the CPU, which is terribly slow. At that point Apple Silicon Macs have the advantage of shared RAM, so something like a 64GB Mac Studio "outperforms" a PC that lacks GPU VRAM.

  • @ldandco
    @ldandco 2 หลายเดือนก่อน +2

    I've got a 1080 Ti from 2017 and a PC I built in 2016, overclocked to almost 5GHz, 6 cores / 12 threads, with 64 GB of RAM.
    I'm currently running Llama 3.2 7B models lightning fast on this PC.

  • @Tarbard
    @Tarbard 2 หลายเดือนก่อน +7

    Running "ollama ps" will show you how much of the model is loaded in system RAM vs GPU VRAM. You want 2x 4090s to have enough VRAM to run a 70B at a good speed.

    • @reezlaw
      @reezlaw 2 หลายเดือนก่อน +1

      2 used 3090s would do a more than decent job for a fraction of the price

    • @soumyajitganguly2593
      @soumyajitganguly2593 2 หลายเดือนก่อน

      @@reezlaw but it wont be a mini PC anymore

    • @reezlaw
      @reezlaw 2 หลายเดือนก่อน

      @@soumyajitganguly2593 indeed

    • @NGC1433
      @NGC1433 6 วันที่ผ่านมา

      @@soumyajitganguly2593 How much of a mini PC is it with a 4090 and that PSU hanging off of it? Adding another card won't change that.

  • @gamingengineering565
    @gamingengineering565 2 หลายเดือนก่อน

    That UPS sound at the perfect moment 🤣. Doing a nice job, keep it up.

  • @Heythisismychannel
    @Heythisismychannel 2 หลายเดือนก่อน +3

    I have a 3090 with 24GB and yes you can run 13b. Nice setup

  • @fontenbleau
    @fontenbleau 2 หลายเดือนก่อน +5

    Also, ISTA-DASLab (on Hugging Face) managed to squeeze the original Llama 70B, a 140GB model, down to 22GB while keeping 90+% of the quality, so it can run on one 3090. They've also made it possible to run the 8B model on smartphones.

  • @thewreckedship5526
    @thewreckedship5526 2 หลายเดือนก่อน +17

    this is so cool, thanks for showing us

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +1

      Thanks for watching!

  • @albertjeremy3956
    @albertjeremy3956 2 หลายเดือนก่อน

    Thanks for the demo. Now I understand how the LLM works, especially how it consumes power and memory. With this info, I can manage the usage properly.

  • @monkeyfish227
    @monkeyfish227 2 หลายเดือนก่อน +3

    I can't understand why I find unboxing all this stuff so interesting.

    • @ely_twix9580
      @ely_twix9580 2 หลายเดือนก่อน +2

      Me too, I’m addicted to these videos 😅

  • @georgioszampoukis1966
    @georgioszampoukis1966 2 หลายเดือนก่อน +8

    Well, there is always the RTX 6000 Ada, which is essentially a 4090 with 48GB of VRAM, but it costs around 10,000 USD I believe 😅

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +5

      only $5000 😝

    • @samizdat_eth
      @samizdat_eth 2 หลายเดือนก่อน +2

      @@AZisk The Ada is $7200 at the cheapest. Where are you finding it for $5k? Or are you mixing it up with the previous gen RTX A6000?

    • @soumyajitganguly2593
      @soumyajitganguly2593 2 หลายเดือนก่อน +1

      @@samizdat_eth the previous gen A6000 is not that bad.. it's a 3090 with 48GB and you can get it for ~4k.

    • @testales
      @testales 2 หลายเดือนก่อน +1

      @@AZisk You are confusing it with the older RTX A6000 from the previous generation. I got the Ada one and it was nearly €8k on a good day... so my car has to somehow survive for a few more years now. :-|

  • @mptcz
    @mptcz 2 หลายเดือนก่อน +1

    Loving the LLM testing videos! Now I need a 4090 😁

  • @DanieleBordignon
    @DanieleBordignon 14 วันที่ผ่านมา

    "the reason it fits is because that's where it belongs"
    amazing

  • @yonathandevash7657
    @yonathandevash7657 2 หลายเดือนก่อน +4

    Love your videos

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +1

      Thanks

  • @jake-ep9wq
    @jake-ep9wq 2 หลายเดือนก่อน +9

    You can see additional stats on model performance, like tokens per second, by using the --verbose flag with ollama run.
    So: ollama run llama3.1 --verbose
    Love the videos!

    • @johnmarshall4_
      @johnmarshall4_ 2 หลายเดือนก่อน

      This is a great way to benchmark new local LLMs. Just use a common prompt and compare.

    • @zandanshah
      @zandanshah หลายเดือนก่อน

      Can you list all the ollama commands/flags? Thanks.
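
    The built-in help covers them (assuming a current Ollama build):

        ollama --help        # lists the subcommands: run, pull, ps, list, show, ...
        ollama run --help    # flags for `run`, including --verbose and --keepalive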

  • @baoxinlong2455
    @baoxinlong2455 2 หลายเดือนก่อน +2

    Coming from a computer engineering background, I'm 200% sure your spikes come from the GPU being bottlenecked: basically it only gets work for a few milliseconds, then waits for the data transfer in the pipeline, and keeps looping like that. You are only feeding it at 63Gbps over the link versus the 1008GB/s the card's memory can do (rough comparison sketched at the end of this thread).

    • @MultiMojo
      @MultiMojo 2 หลายเดือนก่อน

      Oculink is limited to 63Gbps transfer rate

    • @ChrisMartinPrivateAI
      @ChrisMartinPrivateAI 2 หลายเดือนก่อน

      @@MultiMojo As they say in the boat business, isn't that 63Gbps "a hole below the waterline"? If OcuLink is the limiting factor, why have such a beefy GPU if you can't use it to full effect?

    • @JamesBedford
      @JamesBedford 16 วันที่ผ่านมา

      Yeah, pretty sure that's the case: it's waiting for the next inference step, which is bottlenecked by the CPU passing the latest context (with the last tokens appended) across to the GPU.

    • @JamesBedford
      @JamesBedford 16 วันที่ผ่านมา

      I'm just guessing this is the case, and presumably it could be made a lot faster if the output were fed back into the context for the next inference step inside GPU memory. It's doable with CUDA or whatever, but I'm not sure where in the AI stack it would be implemented. The GPU needs to be told to write its output into the location on the GPU where the context window is stored for the next inference.
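
    For rough scale (back-of-the-envelope numbers, not measurements; OcuLink here is assumed to be PCIe 4.0 x4):

        echo "scale=1; 64/8" | bc      # OcuLink ~64 Gbit/s  -> ~8 GB/s over the cable
        echo "scale=0; 1008/8" | bc    # the 4090's ~1008 GB/s VRAM is ~126x that link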

  • @AlmorTech
    @AlmorTech 2 หลายเดือนก่อน +2

    Definitely need to build something! Awesome video 😍

  • @chrisa4072
    @chrisa4072 2 หลายเดือนก่อน +11

    Yo your surge protector gonna explode. Haha.
    The GPU spike is normal. It's likely just the model going back and forth.
    Beeeeeeep!!!

  • @gaiustacitus4242
    @gaiustacitus4242 2 หลายเดือนก่อน +2

    The white video cards are usually bought by people building a "snow blind" PC: white case, white video card, white power supply, white cables, etc. These white cards can be difficult to source, and during periods of short supply they command a premium price with no benefit other than matching the color of the build.
    Larger LLMs perform very poorly when they spill over from the maximum VRAM of the Nvidia RTX 4090 into the 128 GB of RAM in my tower PC. I get much better performance running up to 70-billion-parameter LLMs on my MacBook Pro M3 Max. This is why I will be buying a Mac Studio M4 Ultra with maximum RAM when it is available.

  • @Dr3x0w
    @Dr3x0w 2 หลายเดือนก่อน +1

    My best guess about the GPU spikes: the whole LLM does not fit into VRAM, which means the GPU has to keep loading new parts into VRAM. 13B models run fine on the GPU, but beyond that it's CPU + GPU.

  • @Gixion01
    @Gixion01 2 หลายเดือนก่อน +1

    Hi, nice test. Does it also work with a card like an A6000 or H100?

  • @milleniumdawn
    @milleniumdawn 2 หลายเดือนก่อน +1

    Thanks for the content.
    But as you pointed out, as long as consumer CUDA video cards are limited to 24GB, speed doesn't matter if you are that limited in model size.

  • @dustinwenzel1446
    @dustinwenzel1446 2 หลายเดือนก่อน

    Seasonic definitely makes amazing power supplies. I've used them for all of my builds.

  • @ujjwalbhatt4766
    @ujjwalbhatt4766 2 หลายเดือนก่อน +2

    What keyboard are you using😅?

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +2

      this one: www.keychron.com/products/keychron-q1-max-qmk-via-wireless-custom-mechanical-keyboard?ref=azisk

    • @ujjwalbhatt4766
      @ujjwalbhatt4766 2 หลายเดือนก่อน +2

      Thanks for replying btw i just saw gear links in description ​@@AZisk

  • @ToddWBucy-lf8yz
    @ToddWBucy-lf8yz 2 หลายเดือนก่อน +1

    If you want to run larger models on that card, look for a Mixture of Experts model; otherwise stick with models smaller than 10-12GB in size, especially if you choose one with a large context length. I recently sold my 4090 to help finance a second A6000. I can get Llama 70B q8 models to run with a 98k context length, but only just barely: the weights eat up about 56GB of VRAM, and the rest goes to the context window and all sorts of other stuff.

  • @cooky842
    @cooky842 2 หลายเดือนก่อน +1

    Of course it just works: OcuLink is just PCIe over a cable. That's why it's not hot-swappable. So it works just as easily as plugging a GPU into a regular PC, with the bandwidth limitation.

  • @martin777xyz
    @martin777xyz 2 หลายเดือนก่อน +2

    Is it possible to chain the external dock to use multiple GPUs?

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +2

      unfortunately no

  • @Kvantum
    @Kvantum 2 หลายเดือนก่อน +1

    That PSU would work great in a full desktop with a single 4090. It's not really enough for 2x 4090s, though; that really needs a 1600W PSU.

    • @Lemure_Noah
      @Lemure_Noah 2 หลายเดือนก่อน

      Actually, LLM inference doesn't need that much power. Just limit the GPU's max power to ~250W per GPU with nvidia-smi or Afterburner and be happy (a minimal sketch follows this thread).
      It has a very, very low impact on tokens/sec.
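
    A minimal sketch of that power cap on Linux (250 W is just an example value; needs root):

        sudo nvidia-smi -pm 1      # persistence mode, so the limit sticks
        sudo nvidia-smi -pl 250    # cap board power at 250 W
        nvidia-smi -q -d POWER     # verify the enforced limit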

  • @momendo
    @momendo 2 หลายเดือนก่อน +11

    LM Studio can split the model between CPU and GPU. You can tweak the split between them in the UI. Try that.

    • @linklovezelda
      @linklovezelda 2 หลายเดือนก่อน +1

      Ollama does this automatically as well. If you look at Task Manager while the 70B model is running, the VRAM is filled up (see the sketch below for forcing the split by hand).
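
    If you want to control the split yourself rather than rely on the automatic behaviour, Ollama exposes a num_gpu parameter (the number of layers to offload). A sketch, with example values:

        # one-off, inside the REPL:
        #   /set parameter num_gpu 25
        # or bake it into a named variant:
        printf 'FROM llama3.1:70b\nPARAMETER num_gpu 25\n' > Modelfile
        ollama create llama3.1-70b-partial -f Modelfile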

  • @elwii04
    @elwii04 2 หลายเดือนก่อน +3

    11:15 It automatically offloads some layers of the LLM to system RAM, so it uses both the GPU and the CPU.

    • @Nik.leonard
      @Nik.leonard 2 หลายเดือนก่อน +1

      Check the ollama logs; there you will see something like "offloaded xx/yy layers to GPU".

  • @agiverreviga4592
    @agiverreviga4592 2 หลายเดือนก่อน +1

    Which monitor are you using?

  • @klaymoon1
    @klaymoon1 2 หลายเดือนก่อน +3

    Great video! But, didn't you do something similar before and your macs destroyed NVIDIA?

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +3

      I did it with the rtx4090 in a laptop

  • @Victornemeth
    @Victornemeth 2 หลายเดือนก่อน +1

    Ollama on Linux is even more performant. If the model is too large to fit in VRAM, it will normally "overflow" into system memory but still use the GPU; you'd normally get around 30% GPU utilization, depending on the ratio of VRAM usage to system memory (an easy way to watch this is sketched below). I like this sort of video, please go deeper into this. This is from experience; I didn't check the documentation, it just worked the way I wanted it to.
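
    An easy way to watch that partial utilization live (standard nvidia-smi on Linux):

        watch -n 1 nvidia-smi     # refresh the usual table every second
        nvidia-smi dmon -s um     # compact rolling view of GPU utilization and memory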

  • @theontologist
    @theontologist 2 หลายเดือนก่อน +4

    WOW. How much did this cost? I’m sensing an Amex Black Card in the vicinity.

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +6

      🤣 the dock was only $99

  • @abhiranjan0001
    @abhiranjan0001 2 หลายเดือนก่อน +2

    Hey which keyboard is that ?

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +1

      this one: www.keychron.com/products/keychron-q1-max-qmk-via-wireless-custom-mechanical-keyboard?ref=azisk

  • @timsubscriptions3806
    @timsubscriptions3806 2 หลายเดือนก่อน +3

    If you had gone with a 750W PSU you would have:
    - plenty of power
    - saved $100
    - NO alarm bells going off 😂

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +3

      i got greedy with power

  • @abhilashkp6864
    @abhilashkp6864 2 หลายเดือนก่อน +2

    11:00 Ollama has this feature (?) where, once you exceed a certain percentage of the GPU's VRAM, it meets the rest of the model's requirements from system memory and the CPU. That is why it is significantly slower. But it makes sure you get an output, unlike other runtimes that give you an OutOfMemoryError.

    • @anuragshas
      @anuragshas 2 หลายเดือนก่อน

      Isn't it a llama.cpp port to Go?

    • @abhilashkp6864
      @abhilashkp6864 2 หลายเดือนก่อน

      @@anuragshas Ollama is built on top of llama.cpp, but everything about Ollama is handled automatically (in this case the CPU/GPU allocation).

  • @TechCarnivore1
    @TechCarnivore1 หลายเดือนก่อน

    That power supply is complete overkill.

  • @siddu494
    @siddu494 2 หลายเดือนก่อน +1

    Wow Alex, that's surprising! Even the RTX 4090 seems to struggle, but as someone suggested in the comments, try updating your Nvidia drivers to see if it helps. I ran the same model on an AMD graphics card, and most of the time, it defaults to the CPU, with 0% GPU utilization. Since AMD lacks CUDA cores, I assumed that's why it ignores the GPU. Seeing this happen with your Nvidia card makes me wonder if we can force GPU usage.
    However, running models locally on our PCs doesn’t seem practical right now, as online models often deliver better results. Local setups might still be useful for generating small code snippets, though.

  • @dtesta
    @dtesta 2 หลายเดือนก่อน +2

    Yep, that is why I saved my money and only got a 3060 12GB GPU. I still get about half the speed of the top cards, and most models are 8B. The next step up is usually too much for 24GB anyway. Better to save the money for now and buy something in a few years, when VRAM will probably be a lot cheaper or more specialised cards get released.

    • @fontenbleau
      @fontenbleau 2 หลายเดือนก่อน

      You can run very big 70B models on your card at ~90% quality, like the ones from the ISTA-DASLab research team (Austria/EU) on Hugging Face.

    • @reezlaw
      @reezlaw 2 หลายเดือนก่อน

      "Next step is usually too much for 24GB" Sorry but no, you CAN put those 24GB to good use, you just have to use quantized models. You can squeeze every bit of performance from a 3090 or 4090 by finding models that are around 20 GB, leaving some headroom for the rest of the OS/UI and your context window.

    • @fontenbleau
      @fontenbleau 2 หลายเดือนก่อน

      @@reezlaw yes! like ISTA-DASlab on huggingface incredible work with models

    • @dtesta
      @dtesta 2 หลายเดือนก่อน +1

      @@reezlaw By that logic, I could use a quantized model to fit in 12GB too. 70B models quantized down to under 24GB simply won't perform well, so it's useless in my opinion. The prices for such cards are simply insane, unless you get lucky with a used one.

  • @Bogomil76
    @Bogomil76 2 หลายเดือนก่อน +14

    Thunderbolt 5, announced by Intel, brings significant improvements over Thunderbolt 4. It supports data transfer speeds of up to 80 Gbps, which can double to 120 Gbps in certain configurations (using a feature called “Bandwidth Boost” for high-bandwidth applications like displays). Key improvements in Thunderbolt 5 include:
    • Increased Bandwidth: Standard 80 Gbps, with the ability to dynamically boost up to 120 Gbps.
    • Display Support: Enhanced support for multiple high-resolution displays, including up to 3 x 4K displays or 2 x 8K displays.
    • Expanded PCIe Bandwidth: 3x the PCIe data bandwidth of Thunderbolt 4, enabling faster external storage and support for high-performance peripherals like GPUs.
    • Backward Compatibility: It remains backward compatible with Thunderbolt 4, Thunderbolt 3, USB4, and USB3.
    Thunderbolt 5 is expected to appear in devices starting in 2024.

    • @johnpp21
      @johnpp21 2 หลายเดือนก่อน +1

      This is clearly AI-generated and doesn't know what TB5 actually is.

    • @f.iph7291
      @f.iph7291 2 หลายเดือนก่อน

      Lmao. Even then, the bandwidth is much lower than the 4090's, which is around 1TB/s.

    • @Bogomil76
      @Bogomil76 2 หลายเดือนก่อน +1

      @@johnpp21 "Clearly", and why is that?

    • @Bogomil76
      @Bogomil76 2 หลายเดือนก่อน

      @@f.iph7291 Yeah, but faster than Oculink

    • @R1L1.
      @R1L1. 2 หลายเดือนก่อน

      @@f.iph7291 Before you open your mouth, please first research what you are talking about: the only thing in the 4090 that runs close to that bandwidth is the memory, at 1008 GB/s. It has nothing to do with Thunderbolt 5.

  • @GCkernkraft235
    @GCkernkraft235 2 หลายเดือนก่อน

    Power supplies often hit peak efficiency around 40-60% load, so that's a justification for 1200W.

  • @anoopramakrishna
    @anoopramakrishna 2 หลายเดือนก่อน +1

    Really cool, wondering if there are dual or Quad GPU solutions

  • @giridharpavan1592
    @giridharpavan1592 หลายเดือนก่อน +1

    really killing that poor ups

  • @ErikFrits
    @ErikFrits 2 หลายเดือนก่อน +1

    Can you explain the downsides of quantizing models?
    What are the disadvantages of running a 70B model on a 4090 with increased quantization?

    • @slavko321
      @slavko321 2 หลายเดือนก่อน +1

      Quantizing lowers the numerical precision of the model's weights; it is less precise, but can still give good results unless quantized into oblivion (rough size numbers are sketched after this thread).
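
    Rough weight-only sizing for a 70B model (approximate bits per weight; KV cache and overhead come on top), which is why it takes aggressive quantization to reach 24 GB:

        echo "scale=1; 70*16/8" | bc    # FP16   -> ~140 GB
        echo "scale=1; 70*8/8" | bc     # Q8     -> ~70 GB
        echo "scale=1; 70*4.5/8" | bc   # Q4_K_M -> ~39 GB, still too big for a 4090
        echo "scale=1; 70*2.5/8" | bc   # ~2.5-bit quants -> ~22 GB, fits but quality drops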

  • @purian23
    @purian23 2 หลายเดือนก่อน +8

    Come on Alex, the fact that you bought a 4090 and don't have a home built PC is crazy lol. Bring that in as your next project Mr. Laptop / Mini PC guy 😂

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +7

      idk man, i just don’t want a big box standing around heating up the place.

    • @purian23
      @purian23 2 หลายเดือนก่อน +10

      @@AZisk don't worry the new 4090 will do that itself haha. Great video!

    • @tablettablete186
      @tablettablete186 2 หลายเดือนก่อน +2

      ​@@purian23My 4080 is a heater already lol
      It is nice when the room is cold

  • @boris---
    @boris--- 2 หลายเดือนก่อน +3

    NO WAY, he's installing Nvidia drivers from GIGABYTE.

  • @andrewowens5653
    @andrewowens5653 2 หลายเดือนก่อน +3

    You should have bought an AMD Radeon™ PRO W7900 Dual Slot, which has 48 GB of GDDR6.

  • @deal2live
    @deal2live 2 หลายเดือนก่อน +1

    ORAC. Coming into reality!!!😂

  • @s.patrickmarino7289
    @s.patrickmarino7289 2 หลายเดือนก่อน +1

    For a moment near the beginning of the video I thought you had found a way to run a GPU on a Mac.
    I do have one quick question: if you are running a Linux box, can you use more than one GPU to increase the amount of available memory?

  • @johnathaan1
    @johnathaan1 2 หลายเดือนก่อน

    Awesome video. I can't watch it with the volume up, though: your UPS freaking out makes my dogs bark, they think it's the fire alarm.

  • @BinaryClay
    @BinaryClay หลายเดือนก่อน

    I have a laptop with TB4 and a water-cooled eGPU with a 3090 in it. It works perfectly since I always offload the LLM to VRAM anyway. With TB5 starting to show up, OcuLink will need an OcuLink v2 to compete :)

  • @SuperFredAZ
    @SuperFredAZ 2 หลายเดือนก่อน +1

    I'm not sure why you chose a mini computer and OcuLink when you could have used an ITX motherboard with full 16-lane PCIe 5 capability. It's not smaller, more convenient, or more efficient; it just doesn't make a lot of sense.

  • @habios
    @habios 2 หลายเดือนก่อน

    Yes, it will run on whatever has enough memory to hold it. 4090s are extremely good for Stable Diffusion, but not for big LLMs.

  • @ye849
    @ye849 2 หลายเดือนก่อน +1

    The issue you see above 24GB is not because it's running on the CPU; it's because ollama uses your RAM as a spillover for the VRAM. But your RAM is at best ~55GB/s, while the VRAM on that 4090 is probably 650GB/s or more. Add to that that you're using the OcuLink link symmetrically; I'm assuming you get around 20GB/s each way? (I don't know the OcuLink standard.)
    Also note you are changing the latency by orders of magnitude, from nanoseconds to just under milliseconds (due to distance, several standards of translation, etc.), and this is a huge matrix made of very small chunks of data that the algorithm needs constant access to in its entirety.
    So you're left with a processor moving around lots of data but unable to keep feeding the parallel units of the graphics card.

  • @gregsLyrics
    @gregsLyrics 2 หลายเดือนก่อน +2

    Amazing vid, and helpful for understanding how to set up our private computing. The 4090 is limited by VRAM, as you demonstrated. Why not attempt this with an A6000 (48GB VRAM)? The A6000 is optimized for AI workloads, whereas the 4090 is optimized for gaming and 3D rendering. Thoughts?

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +1

      sure there are better gpus for AI, but i would say they are a bit out of the budget of most consumers. even the 4090 is at the top end of what the consumer market can bear.

  • @ap99149
    @ap99149 2 หลายเดือนก่อน +1

    Alex, are you sure Ollama is running on the CPU?
    When I train PyTorch models and accidentally exceed the GPU's dedicated memory, the system starts to use the "shared GPU memory" (or whatever Windows calls it).
    I get crazy high CPU usage, which I assume is the CPU managing the data transfer over the bus between the GPU and system RAM.
    It is significantly slower than keeping everything in GPU memory, but faster than running on the CPU alone; the computation is still on the GPU.
    I'd love to see a test where you disconnect the GPU as a control (one way to do that without unplugging anything is sketched below)!
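
    One way to run that control test without unplugging anything is to hide the GPU from the runtime; since Ollama uses CUDA underneath, an empty CUDA_VISIBLE_DEVICES should force a genuine CPU-only run (a sketch, assuming Linux and a default install):

        # stop any already-running server first (e.g. systemctl stop ollama)
        CUDA_VISIBLE_DEVICES="" ollama serve &     # start the server with no visible GPU
        ollama run llama3.1:8b --verbose           # compare tokens/s against the GPU run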

  • @12ricky04
    @12ricky04 2 หลายเดือนก่อน +1

    why doesn't it run on the NPU?

  • @DanFrederiksen
    @DanFrederiksen หลายเดือนก่อน +1

    Also why did you ever use laptops for serious computing??

  • @borisitkis3724
    @borisitkis3724 2 หลายเดือนก่อน +1

    any plans to test Beelink GTi14?

    • @AZisk
      @AZisk  2 หลายเดือนก่อน

      no but i got a beelink ser9 that looks pretty amazing

    • @borisitkis3724
      @borisitkis3724 2 หลายเดือนก่อน

      @@AZisk AFAIK the SER9 doesn't have a PCIe slot and the GTi14 does (4.0 x8). It would be nice to see its performance on LLM-specific tasks vs OcuLink.

    • @AZisk
      @AZisk  2 หลายเดือนก่อน

      @@borisitkis3724yes, that’s right. the direct pcie connection is nice

  • @carlosdominguez3108
    @carlosdominguez3108 2 หลายเดือนก่อน +2

    "I was expecting a lot more noise from the 4090 to be honest. It's not that loud." Maybe that's because there's literally 0% load on the GPU? Who knows.

    • @AZisk
      @AZisk  2 หลายเดือนก่อน

      literally?

    • @carlosdominguez3108
      @carlosdominguez3108 2 หลายเดือนก่อน +1

      @@AZisk And figuratively.

  • @henrylawson430
    @henrylawson430 2 หลายเดือนก่อน

    Llama 3.2 3b is fantastic for text summarisation.

  • @potsandjacks
    @potsandjacks 2 หลายเดือนก่อน +1

    My Kizer is ALSO my favorite knife!!

  • @psychurch
    @psychurch หลายเดือนก่อน +1

    Just saw the news: Thunderbolt 5 can transfer up to 120 Gbps 😮

  • @erproerpro903
    @erproerpro903 2 หลายเดือนก่อน

    Alex, beautifully done! Now I'll pick one up too ✌️

  • @puyansude
    @puyansude 2 หลายเดือนก่อน +1

    Wow, cool video, Thank You 👍

  • @clivefoley
    @clivefoley 2 หลายเดือนก่อน +1

    One of the things I love about hardware is that there is zero need for instructions: "Does this fit in there? It does. Well, that's where it's supposed to go!" Brilliant.
    I laughed out loud when the UPS kicked in.

  • @KaranSinghSikoria
    @KaranSinghSikoria 2 หลายเดือนก่อน +1

    My Mac heated up by just watching this in 4k lolz

  • @monstercameron
    @monstercameron 2 หลายเดือนก่อน

    llama.cpp and everything based on it can offload layers of the model to system memory, meaning you can run larger models at a huge performance penalty (see the sketch below). Strong rec for LM Studio!!!
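
    For reference, in plain llama.cpp that offload is an explicit flag, --n-gpu-layers / -ngl (the model path and layer count below are placeholders; older builds name the binary ./main instead of ./llama-cli):

        ./llama-cli -m ./models/llama-3-70b-Q4_K_M.gguf -ngl 30 -p "Hello"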

  • @BelarusianInUk
    @BelarusianInUk 2 หลายเดือนก่อน +1

    Why do you need a UPS for a dev machine?

    • @AZisk
      @AZisk  2 หลายเดือนก่อน

      i have a bunch of stuff on my desk that’s not just laptops

  • @slaynnyt8130
    @slaynnyt8130 26 วันที่ผ่านมา

    In the 90s we quickly went from PCs with 32MB of RAM to 1GB... with AI we really need a second coming of that boom lol. Imagine having affordable PCs with 1TB of RAM and GPUs with 512GB; one can only dream.

  • @SCHaworth
    @SCHaworth 2 หลายเดือนก่อน

    I use a 4070 Ti, a 1080 Ti, a 1060 6GB, and a 5700 XT.
    Works fine. If I'm paying that much for anything, it's going to be a new Threadripper first.

  • @eudy97
    @eudy97 2 หลายเดือนก่อน

    USB4 2.0 80Gbps transfer speed should be pretty sweet when it lands.

  • @zerogravityfallout4228
    @zerogravityfallout4228 2 หลายเดือนก่อน

    Thank god you got that ATX 3.0 PSU to prevent the meltdown stuff on the 4090.

    • @DriverGear
      @DriverGear 2 หลายเดือนก่อน

      @@zerogravityfallout4228 having an ATX 3.0 PSU doesn’t mean you’re protected against the meltdown unfortunately…

  • @Rushil69420
    @Rushil69420 2 หลายเดือนก่อน +1

    Didn’t realize that was a Kizer; we all have the same hobbies lmfao.

    • @AZisk
      @AZisk  2 หลายเดือนก่อน

      love that knife!

  • @jasonhoffman6642
    @jasonhoffman6642 2 หลายเดือนก่อน

    How would you feel about doing some fine-tuning? Maybe your fastest Mac Studio vs. This thing?

  • @topticktom
    @topticktom 2 หลายเดือนก่อน +1

    Wait why did you buy a white one?

  • @dough.9241
    @dough.9241 2 หลายเดือนก่อน +1

    OMG, that UPS is waaaaaaaay too small for that load. What was this guy thinking?

    • @AZisk
      @AZisk  2 หลายเดือนก่อน +3

      haha, this guy forgot he even had a ups

  • @bobbastian760
    @bobbastian760 2 หลายเดือนก่อน +6

    eGPUs are the future, best of all worlds.
    Plug it into an OcuLink laptop though, like a GPD Win Max; then you only need one machine.

  • @futurerealmstech
    @futurerealmstech 2 หลายเดือนก่อน

    Would love to see you do fine-tuning and inference tests on a 16" 2019 MacBook Pro with an eGPU.

  • @FujishiroX
    @FujishiroX 2 หลายเดือนก่อน +2

    woah that's sick

  • @johnpp21
    @johnpp21 2 หลายเดือนก่อน

    At 9:56, the CUDA-Z Device-to-Host figure is fluctuating; it lines up with the spikes at 10:07.
    It means the GPU is sending large chunks of data back to the system, which is why CUDA-Z shows lower available bandwidth.

  • @MrMackievelli
    @MrMackievelli 13 วันที่ผ่านมา

    I wonder if the bandwidth limitations of OcuLink are just a minor issue, since the LLM is loaded into memory once rather than being a constant data stream like games require? Also, you should look at AOOSTAR; they have an eGPU dock with power and both OcuLink and USB4.

  • @collinsutherland311
    @collinsutherland311 2 หลายเดือนก่อน +1

    I always get platinum or titanium PSUs cause I love overkill 😁

  • @kevinlantw
    @kevinlantw 2 หลายเดือนก่อน

    I prefer another mini PC also from Minis Forum, the GTi4 Mini PC with EX Docking, which includes a 600W power supply. The mini PC can work independently or be plugged into the EX Docking to use a 4090 graphics card. It looks much cleaner since there are no extra cables.

  • @JBoy340a
    @JBoy340a หลายเดือนก่อน

    It would be interesting to see how this compares to a full-size PC with a 4090. I am starting to build a system around a 4090 tomorrow. I went the more traditional route with a full-size case, i9-13900, Asus TUF Gaming motherboard, Corsair 1200W PSU, 64GB of RAM, etc. Something smaller might be nice, so I am watching closely.

  • @DanFrederiksen
    @DanFrederiksen หลายเดือนก่อน +1

    Need to know how M4 max compares to a proper 4090!

  • @MeinDeutschkurs
    @MeinDeutschkurs 2 หลายเดือนก่อน

    What connection is this?

  • @djayjp
    @djayjp 2 หลายเดือนก่อน +1

    10:00 A MiB ≠ a megabit. 6100 MiB is just under 6 GiB, which is roughly 48 Gibit (about 51,000 Mb), not 5,100 Mb.