Download Docker Desktop: dockr.ly/4fWhlFm
Learn more about Docker Scout: dockr.ly/3MhG5dE
This video is sponsored by Docker
Ollama docker-compose file: gist.github.com/notthebee/1dfc5a82d13dd2bb6589a1e4747e03cf
Docker installation on Debian: docs.docker.com/engine/install/debian/
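If you'd rather not open the gist, here is a rough equivalent of the stack as plain docker run commands. This is an approximation, not the exact contents of the compose file; the device flags assume an AMD GPU with ROCm, like the card used in the video.
```sh
# Ollama with AMD GPU acceleration via ROCm (drop the --device flags for CPU-only)
docker run -d --name ollama \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:rocm

# Open WebUI, pointed at the Ollama API published above
docker run -d --name open-webui \
  --add-host host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -p 3000:8080 \
  ghcr.io/open-webui/open-webui:main
```
Open WebUI then ends up on port 3000 of the host, while editor plugins like Continue talk directly to Ollama on port 11434.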
Brother, I would recommend using CodeQwen1.5 with Ollama and Continue - it's less power hungry and gives better results. I run it on my laptop with 16 GB RAM, an i5 13th gen and a 4050, and it's also very accurate.
For my BS comp sci senior project, my team designed and built an AI chatbot for the university using GPT4All, LangChain and TorchServe, with a FastAPI backend and React frontend. It used local docs and limited hallucinations using prompt templating. It also had memory and chat history to maintain contextual awareness throughout the conversation. Was a lot of fun and we all learned a lot!
Can you share the project? I'd love to see it.
There are ways to accelerate AI models on Intel iGPUs, but they need to be run through a compatibility layer, if I'm not mistaken. I couldn't test their performance, but it would work instead of hallucinating and throwing errors like that. I didn't know you could plug in locally run models for coding, so I loved the video!
Yes, the layer is called ipex-llm. I would love to see an update video testing that.
Thanks, I was scouring my wiki to find that info, but without success.
After scrolling through some GitHub issues, it would appear Ollama supports Vulkan, which can utilize the iGPU.
It's been a few months since I started using Ollama under Linux with an RX 7800 XT (16 GB) inside a 128 GB DDR4-3600 Ryzen 9 5950X system (ASRock X570S PG Riptide motherboard). The models sit on an ASUS Hyper M.2 card with four Seagate 2 TB NVMe Gen 4 x4 drives. The GPU uses an x4 electrical (x16 mechanical) slot, since the first PCIe slot is taken up by the four drives. So far, I am very happy with this hardware/software setup.
The best way to run low-power LLMs is to utilise the integrated GPU. It can use regular system RAM, so no large VRAM is required, and it's faster than the CPU.
I really think this is the only viable way of doing it.
You watch M.D House to kick back and relax?
I like you more than I used to.
everybody lies
@@Giftelzwerg lol who hurt you?
@@tf5pZ9H5vcAdBp woooosh
@@tf5pZ9H5vcAdBp It's a quote from the show
lol I thought the same thing
I started going down this road a few weeks ago.
i9-9600k, 48GB RAM, 8GB RX 5700
Thanks for the tips on some pieces I've been missing!
The Qwen 2.5 Coder model is now available in the Ollama model library.
According to benchmarks, it is the most capable coding model right now and is getting close to GPT-4o.
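If anyone wants to try it, it's just a pull away (the exact tag below is an assumption; check the Ollama library page for the current ones):
```sh
# Grab the 7B variant and give it a quick smoke test
ollama pull qwen2.5-coder:7b
ollama run qwen2.5-coder:7b "Write a binary search in Go"
```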
It's good to mention that you can get decent inference speed with a CPU only, if your CPU supports AVX-512 and you have 16+ GB of RAM. No idea if there are mini PCs out there with those kinds of specs.
Even if there are such machines, the cost would be high. Are a few suggestions really worth that?
@@bzuidgeest I have a Minisforum UM780 XTX with an AMD Ryzen 7 7840HS. It supports up to 96 GB of RAM and it has AVX-512. The barebone wasn't costly.
@@СергейБондаренко-у2б The barebone price alone is meaningless; what was the price for the total system, complete? And consider that it's a system that does one thing permanently: running an LLM. No gaming, no secondary uses.
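For anyone comparing mini PCs for this, a quick way to check whether a given CPU actually exposes AVX-512 under Linux:
```sh
# Lists any AVX-512 feature flags the kernel sees; no output means no AVX-512
grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u
```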
Would be interesting to see the viability of using an AMD-powered mini PC for that, with something like a 7840HS/U with the 780M. There seems to be some work being done to fix the memory allocation (PR 6282 on the Ollama GitHub). I've tried small-ish models (3B) that fit into my reserved VRAM and they seem to run faster this way, even if still constrained by memory speed.
A used RTX 3090 24 GB or 3090 Ti 24 GB will most likely work better than a top-end AMD card.
Another option is a pair of RTX 4070 Ti Supers for a combined 32 GB of VRAM with a proper Docker setup.
I think the biggest potential in all-round homelab AI use is to effectively make use of a gaming PC when not playing games :)
I found this helpful enough that I've included it in a presentation on self-hosting an AI solution that I'm working on for work. It's part of an effort to raise overall AI knowledge, not for a particular use case. Now with that said, I've had much better luck with Nvidia GPUs. I even bought a laptop with a discrete Nvidia GPU for just this purpose, back in May; I think the price was around $1600 USD. Nvidia seems to be 'better' for now, but it is good to see AMD becoming viable. I'd suspect the Nvidia options are better in some ways, but that is likely around power usage or time. The prices are still bonkers. I'm running an early Intel Ultra 9 in a laptop. This thing is nice.
I will buy a 2000+ workstation for local LLM development, which is a good idea IMO.
I know you're a Mac guy (you convinced me to start using a Mac), so even though it would be on the more expensive side, an Apple silicon Mac with lots of RAM is another option.
Not expensive AT ALL compared to getting the same amount of GPU memory from Nvidia. And with the coming M4 Macs, it is expected that 16 GB will become the minimum RAM on a Mac.
Apple silicon's unified memory model can provide up to around 75% of the total RAM to the GPU.
I have achieved pretty decent results with Deepseek Coder V2 on a moderately priced RTX 4060 ti 16GB.
First: some really good coding models have come out recently, like Qwen 2.5 Coder. In its 14B version, it's capable of doing FIM (fill-in-the-middle) tasks not only for a single file, but also across multiple ones. Not sure if Continue supports that, but Twinny (my personal favorite LLM plugin for VSCodium) does.
Second: if you're aiming for GPU-less builds, look at the AMD Ryzen 7000 series or higher, like the 7940HS. Not only is it a great powerhouse in terms of CPU performance, it also supports DDR5 with higher bandwidth, which is crucial when it comes to LLM tasks. There are already plenty of mini PCs with such CPUs.
The 4090 is such a scam. 24 GB of VRAM should not cost more than 99% of the systems in existence.
Ollama works well on Apple M-Series Chips if you have enough RAM. A Mac Mini might be a good server for this, but it's kind of expensive.
Correct! I am running Ollama and LM-Studio on a Mac Studio and a Macbook Pro. The Mac Studio (M2 Ultra) pulls a max of 150-170 watts while inferencing with 14B and lower parameter models, but idles at 20-30 watts. It is not the most efficient, but it is fast, and I can load Llama 3.1 70B Q4 without spending $3000+ on just the GPUs, and the added power cost. The Mac Minis with 16GB of Memory should be far more efficient.
Ollama is nice, but doesn't stand a chance against Tabby (from TabbyML). I run that on a desktop with an Intel i3 10100F CPU, single channel 16 GB RAM (DDR4, 3200MHz), Corsair SSD (500BX model) and a MSI GTX 1650 with 4 GB VRAM (75 Watt model). This meager self-build gives ChatGPT a run for its money in response times when accessed via the Tabby web-interface.
Tabby can be run directly in your OS, set up in a VM, or run at your cloud provider. Windows, Linux and macOS are supported. Tabby also provides extensions for VS Code, JetBrains IDEs and Neovim. Auto-complete and chat become available, just as shown in this video.
Tabby can be used for free when 5 accounts or fewer use it simultaneously.
A disadvantage of Tabby is that it doesn't support many models: 6 chat models and 12 code models. Many of the models used in this video are supported by Tabby. You can hook at least 3 different Git repositories to it (which is what I have done at the moment), but you can also use a document for context. And not just via the extension of your favorite editor, but also via the Tabby web interface.
Now, with only 4 GB of VRAM, I cannot load the largest Chat & Code models and these models tend to hallucinate. However, if you have a GPU with 8 GB or more, you can load the 7B models for Chat/Code and that improves the quality of responses a lot.
And finally, Tabby has an installer for CUDA (NVidia), for ROCm (AMD) and Vulkan. Haven't tried ROCm or Vulkan, but Tabby in combination with NVidia is very impressive. My suggestion would be to make another video with Tabby in your 24 GB VRAM GPU using the largest supported models for both Chat and Code. I fully expect you'll come to a different conclusion.
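If you want to try it, the Docker quick start looks roughly like this. The model names and flags below are from memory and may have changed, so double-check the TabbyML docs before copying:
```sh
# Serve Tabby with CUDA; ~/.tabby keeps downloaded models and settings between runs
docker run -it --gpus all \
  -p 8080:8080 \
  -v "$HOME/.tabby:/data" \
  tabbyml/tabby serve \
    --model StarCoder-1B \
    --chat-model Qwen2-1.5B-Instruct \
    --device cuda
```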
It should give you the same results as Ollama with the same models.
How surprised he was when it "kinda just worked" 😂
Right? I figured the state of this was some super delicate, ready to break in 2 seconds setup. Was shocked when it wasn't awful.
For low power machines you'll need a good Tensor Processing Unit to process all those instructions for machine learning and AI. Ones like the Google Coral and Hailo would be best for the Latte Panda. Jeff Geerling made a pretty good video about this project. I think you're on the right path, just need some good TPUs to make this small server a reality.
Thanks for another educational and well-executed video, despite your hairstyle malfunction. (Which I probably wouldn't have noticed until you told on yourself.) Keep doing what you're doing.
Great video! I installed Ollama with CodeLlama + Open WebUI on my Ubuntu server via the Portainer app on my CasaOS install, to make things as easy as possible.
My server is an old Dell Precision T3620 that I bought for around 350 euros a couple of months ago.
Specs
Intel Xeon E3-1270 V5 - 4-Core 8-Threads 3.60GHz (4.00GHz Boost, 8MB Cache, 80W TDP)
Nvidia Quadro K4200 4GB GDDR5 PCIe x16 FH
2 x 8GB - DDR4 2666MHz (PC4-21300E, 2Rx8)
512GB - NVMe SSD - () - New
Crucial MX500 1TB (SFF 2.5in) SATA-III 6Gbps 3D NAND SSD New
Things are running well, but, of course, not as fast as on a beefed-up machine like yours. :D
Really nice video!
One point though about forwarding port 8080 in your compose file: that also punches a hole in your firewall, allowing traffic from everywhere to connect to that port.
Just as a warning for anyone running this on a server that's not sitting behind another firewall.
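Good point - Docker publishes ports by inserting its own iptables rules, so a host firewall like ufw won't necessarily block them. One workaround is to bind the published port to localhost only (in compose syntax that's `"127.0.0.1:8080:8080"` under ports) and reach the UI through an SSH tunnel, a VPN, or a reverse proxy. A sketch of the tunnel approach, with hostnames as placeholders:
```sh
# Publish Open WebUI on localhost only instead of every interface
docker run -d --name open-webui \
  -p 127.0.0.1:8080:8080 \
  ghcr.io/open-webui/open-webui:main

# Then reach it from your laptop through an SSH tunnel to the server
ssh -N -L 8080:127.0.0.1:8080 user@your-server
```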
We'll have to see how this all plays out on the new laptop processors with NPU. I'm still hoping to be able to buy a $1,000 all-in-one mini PC - without a graphics card but with enough NPU power. The question also arises as to what is more necessary: RAM or NPU power.
Nice self hosting.
Amazing content as usual! Thank you for all the work you put into this!
I will give it a try next week on my nvidia tesla k80. It has been good enough for 1080p remote gaming.
I've been running Ollama in a MacBook Pro M3 Pro with 36 gigs of RAM and it works pretty well for chats with llama3.1:latest. I'll test the other models you suggested also with Continue. I tried using Continue in the past in Goland, but the experience was quite mediocre. Interesting stuff, thanks for the recommendations in the video.
I have a 7735HS. Would love to see how powerful iGPUs like the 680M perform.
You can run Meta Llama 3.2 8 billion parameters, quantized, with just a CPU using GPT4All, and control it via PyCharm or VS Code. A good option for those wanting to build something like this on old, cheap hardware that would normally be thrown in the garbage.
Amazing video, Wolfgang - incredibly informative. It would be really cool if you could do the same setup on Proxmox using a CT or something like that, for companies that might have extra hardware lying around waiting to be repurposed. This video has really answered my question on this topic.
Thanks a lot for this topic! I am very interested in running AI in my homelab, but those AI chips are very expensive and power hungry. It would be interesting to review other options such as iGPUs, NPUs, cheap eBay graphics cards, or any other hardware that can run AI inference.
I'm waiting for the M4 Mac mini; if they offer it with 64 GB of RAM, it will most likely be amazing value for AI.
macOS can run Ollama very well, and the unified memory model on Apple silicon is perfect for it.
And how is the performance when hosted on your MacBook Pro? On another topic: could a mini PC with something like a Hailo-8 improve performance enough?
I set up my 800W "home lab" to wake on lan and hibernate after ten minutes of inactivity, which seems like a decent compromise in terms of power consumption
Am I really gonna burn 800 W of compute on local code generation? Bet.
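For anyone copying this idea, the "go to sleep when idle" half can be as simple as a cron job like the sketch below (the thresholds and the choice of suspend vs. hibernate are placeholders; adapt them to your own setup):
```sh
#!/bin/sh
# Suspend the box if nobody is logged in and the load looks idle.
# Run from cron or a systemd timer every few minutes; wake it again with WOL.

# Anyone logged in over SSH or locally? Stay up.
if who | grep -q .; then
    exit 0
fi

# 1-minute load average above 1.0? Something is probably still running.
if awk '{exit ($1 > 1.0) ? 0 : 1}' /proc/loadavg; then
    exit 0
fi

systemctl suspend   # or: systemctl hibernate
```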
Personally, I'm hyped for the Hailo-10H, which claims 40 TOPS in an M.2 form factor at just 5 watts of power consumption. I hope all their claims are true, and maybe you'd be interested in it yourself (:
Great info. Thank you
You could also run the "autocompletion" from VS Code on your Mac.
I'm running Ollama on my M2 Pro. The power consumption is not really worth mentioning, and you don't have to run a separate PC.
(But of course this only makes sense if you don't want to share the Open WebUI with others.)
Dude, this was a really cool project. I recently built an AI home server, and I think there are some parts of your project that could be done better. AMD is fine for general-purpose graphics, but for AI, Nvidia is always your first option. Of course it works fine with Ollama and similar tools, but most AI projects out there support Nvidia first, so your build would be more future-proof with an Nvidia graphics card and you'd be able to mess around with other projects like Whisper more easily. Of course, Nvidia is way more expensive. I'm using an RTX 3060, which cost me like $320; there's a really excellent video by Jarods Journey comparing that card with more high-end Nvidia cards, and it works great for its price. Its performance isn't that far from the 4090, especially considering the price; the main difference is the VRAM, but there is some tuning you can do to run heavier models using less RAM, or balance both RAM and VRAM, with tools such as Aphrodite and KoboldCPP for text generation.

Lastly, in regard to power consumption: yes, it does cost a lot of power to run decent models, and there isn't a way around it. However, you could just turn your machine on when you need the text generation and turn it off when you aren't using it. If you want a more elaborate solution, you could enable Wake-on-LAN and use a Raspberry Pi as a client for turning it on/off every time you need it; at least, that's what I'm planning to do with my server.

At the moment there aren't a lot of videos about deploying local inference servers on YT, so I'm really happy you made this one. Looking forward to more AI-related videos in the future.
I have a similar setup. I run all the AI stuff on Windows so I don't have to dual boot; it's also easier to set up the PC to sleep on Windows compared to a headless Linux box.
Maybe it's the SFF case, but holy smokes that CPU cooler is huge!
Nah, she's just a thick boy
There is an ongoing effort to allow for OneAPI / OpenVINO on Intel GPUs. Once this lands, we'll be able to use low-power iGPUs with lots of RAM. I'm always checking on the issue for Ollama; there are also a couple of questions regarding NPU support. Holding my breath for Battlemage GPUs here, though I've seen impressive results with Ollama running on modern Quadro GPUs... for those who can afford it. Not me! Thanks for this! I've tried this same stack in the past with both a 1080 Ti and a 6950 XT. Ollama runs perfectly fine on both of them, but Continue seems to have improved a lot since my last try. I will give it another shot!
Nice vid wolfie. Perfect !
Could you tell us how you got the models used in your testing? I have tried to find them, including looking at the Open WebUI community site, looking through the models available on the Ollama site, and even attempting to pull them directly into Ollama (the file was not found). So where did you get those models, and how did you get them into Ollama?
Is that MonHun I spotted at 16:41? Nice.
3:25 I bought my RTX 3090 used for 600€ with 24GB* of VRAM. Just for reference - that's a great deal.
Is it a modded card? Stock 3090 has 24GB of VRAM
@@WolfgangsChannel Oh man, I just wanted to update my comment :D
No, I'm just an idiot. Looked at the wrong row. 24GB of VRAM.
"you've been looking for more stuff to self host anyway"....
GET OUT OF MY BRAIN!
Off to Micro Center.... "Which aisle has the 7900 XT?"
I wonder if JetBrains will ever allow a similar plugin for their IDEs
They have this, no idea if it's any good - www.jetbrains.com/ai/
But also, Continue supports JetBrains IDEs: plugins.jetbrains.com/plugin/22707-continue
@@WolfgangsChannel The first is SaaS, but I didn't notice that Continue was available there, too. It doesn't look like it works all that well at the moment, but I'll try it at some point. Thanks!
I will only self-host an AI service when the hardware needed to run these things comes down to mainstream levels. Even then, the architecture is not power optimized, and maybe this is something we have to wait for from both the hardware and software side. Power efficiency is the main issue with self-hosting anything.
Have you considered TabbyML instead of Continue + Ollama? It also has a Neovim plugin.
Plus, Fedora supports ROCm out of the box.
Thanks for the recommendation! I'll try it out
1:39 sigma 🥶
Thanks for the tutorial.
But you can host it on the same powerful machine you program on, and run the LLM when you need it, then play games when you don't, right? I am planning to do the same thing, but I want my low-power home server to act like a switch for my PC, so I can game on it when I need to, but also use its power to help me with my prompts... It can be done using WOL if you need it to work remotely, I think. Or is it a bad idea in terms of security?
You can do it all on one machine
@@WolfgangsChannel But I need WOL to turn on my PC over WAN. I tried port forwarding, but it seems I need to set something in my router so it knows where to forward the packet it gets, because if my PC is off, the router cannot find its local IP for some reason. I use OpenWrt.
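One way around that: don't forward the magic packet to the PC at all - SSH into the router (or any always-on box on the LAN) and send it from there, so the broadcast happens locally. A sketch, assuming OpenWrt with the etherwake package installed; the interface name and MAC address are placeholders:
```sh
# From anywhere that can reach the router over SSH:
ssh root@your-router "etherwake -i br-lan AA:BB:CC:DD:EE:FF"

# Or from another always-on machine on the same LAN, using the wakeonlan tool
# and the LAN broadcast address:
wakeonlan -i 192.168.1.255 AA:BB:CC:DD:EE:FF
```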
counting watts consumption is useful, but I wouldn't let it stop me.
*is* useful
You should look into TPUs in order to run this on a lower spec machine.
I've seen a few people plug a GPU into the M.2 slot of a mini PC with an M.2-to-PCIe adapter. I believe you need to power the GPU with an external power supply, but I've always wondered what the idle power consumption is like for a setup like this. Maybe a future video :)
Probably not much different from a desktop PC with the same setup. M.2 is just PCIe with extra steps
I think my 4th gen i5 unraid server would immediately catch fire if I put Ollama on it, lol.
Great video, subbed
What do you think of the Nvidia RTX A2000 with 12 GB of VRAM? Is that enough?
Hey, thanks for the video, I’ve always wanted something like this. Would love to see an update with neovim and a nvidia gpu setup
You can make a Frankendebian and install amdgpu-dkms and rocm-hip-libraries from the AMD repos, but yeah, better run with Ubuntu.
I don't recognise half of those icons in the dock, ...but I can see 1847 unopened emails 😂😂😂
Very cool video. I would love to see more videos like this. Target Java developers ;-)
A comparison with an Nvidia setup.
Maybe some tuning of the LLM to use less power ("eco mode")?
A fast CPU and lots of RAM vs. an expensive graphics card comparison.
Tabnine has offered this for years now already
IMO, getting a second-hand M1 Mac mini with 16 GB of RAM might be the cheapest AI solution: very decent performance with Ollama at a price around 500-600 €. Otherwise, for a GPU, a cheap option is a second-hand 3090; even better, if you get two, you have access to 48 GB models.
Perhaps GPUs built for compute workloads would work better? (I'm not sure, I'm genuinely asking.) I'm thinking something along the lines of an RTX A2000 for low power draw, or an A5000?
I have an RTX A4000 Ada. Idle power consumption (per nvtop) is about 13 W. When a large LLM is running it draws the full 130 W, which I think is a perfect compromise between power consumption and performance.
I am not a gamer at all - just using this GPU in my homelab for LLMs.
At the same time, dedicating an entire GPU to a single task like this is kind of nonsensical. Unless you're a small company that dedicates a server to the task, I don't see this making sense - and let's face it, it's mostly a memory issue that the industry makes us pay a premium for.
Thank you for answering a question that I have had in the back of my mind for the last few months :).
So, I would try to get my hands on a mini PC based on the AMD 8840HS (there are a lot on AliExpress for around 300 barebone, so you add RAM and an SSD). Run it with 32 GB of RAM and you've got yourself a small, low-power AI assistant.
This AMD APU comes with a nice integrated GPU but also an NPU (16 TOPS), so it is a nice upgrade from that Intel you have, and even cheaper.
Once the newest APUs from AMD and maybe Intel come to mini PCs (with around 50 TOPS NPUs), they will be more than enough for this.
Maybe a TPU rather than a GPU would work more efficiently?
How do you think an A400 GPU would perform?
By the way, Zed is becoming a more prominent and apt replacement for Neovim and VSCode, and it has built-in support for Ollama (as well as other services). It doesn’t have AI suggestions directly, but those can be (kind of) configured via inline assists, prefilled prompts, and keybindings.
But the main problem is speed, and it doesn’t matter if it’s a service from a billion-dollar company or your local LLM running on top-end hardware. Two seconds or five, it breaks the flow. And the result is rarely perfect.
It’s very cool that it’s possible, but it’s not there yet, and we don’t know if it will ever be.
Hey, my country blocked internet access in my area, but if I connect and start the internet outside the blocked area and come to the region that's blocked without disconnecting, it keeps working for weeks or months. But if I reconnect, I lose access to the internet. No VPN can bypass it. HELP
HAHAHAH @9:55 that killlllleeeedddd me!!!!!!!!!!!!!!!!!!!!!
Why not just run Ollama on your Mac? It supports GPU acceleration on macOS as well.
I'm still waiting for a "real" Cortana AI.
Can I run this Docker instance on TrueNAS?
Why do you have two Unix devices, /dev/kfd and /dev/dri, for a single GPU?
kfd is for „Kernel Fusion Device“, which is needed for ROCm
What about using a google tensor card?
TPUs are currently not supported: github.com/ollama/ollama/issues/990?ref=geeek.org
This is unacceptable. You deserve 5 million views and 39 million likes in 2 minutes 😅
The 3090 also has 24 GB of G6X memory and is nowhere near the cost of a 4090.
I'm a Neovim Chad.
And like all neovim users, you let us know it, as smugly as possible.
Edit: Having now reached the end of the video I realize this is just a reference.
Hey. What font do you use in terminal?
Comic Code
Are there any Continue-like plugins that support Xcode?
Apple made their own AI code completion with the latest Xcode and the latest macOS. It requires 16 GB of RAM (making it even more insane that they sell "Pro" computers with only 8 GB of RAM). Apple's AI code completion is kind of bad, to be honest.
Are you going to make an updated video on VPN?
Thx
I wouldn't put my faith in any LLM small enough to allow local hosting for coding, when ChatGPT can't write something as mundane as an actually working AutoHotkey 2.0 script. If you can troubleshoot the hogwash output, good for you. If you can't, tough luck... it can't either.
Also not being able to utilize AWQ models is shooting yourself in the foot from the get-go... 5:13 - case in point.
The other option is having your work spied on and stolen.
@@firstspar No, the other option is to not use LLMs for things they weren't meant to be used for.
Transformer models, whose working principle is to give you a probabilistic distribution of words as results, can't do specifics! Why do you think it struggles with math concepts as simple as adding two single-digit numbers accurately when they're accompanied by a text representation of what those numbers are? This is an example from when I asked it how much of each thing I'll need (I repeat, I'll need, one person) for a 9-day stay without the possibility of going to the store to get more:
Personal Hygiene
Toilet Paper: 2-3 rolls per person per week, so about 18-27 rolls for 9 days.
It outputs 2*9 to 3*9 instead of rounding up 9/7*2 to 9/7*3, not being able to reconcile the concept of a week with days. And THIS is what you want to entrust with coding?! Solving coding problems?! This isn't just a minor error; it's a fundamental failure to apply basic arithmetic concepts correctly, and this with 4 HUNDRED BILLION parameters. What do you expect out of a 3-billion-parameter model? A lobotomized version of ChatGPT won't just suddenly get a concept right that ChatGPT didn't, just by tightening its training data to only cover coding!
A second-hand 3090 is one of the best options for ML inference right now. You can get one for 500 bucks. It's faster than AMD cards for many tasks and will allow you to run way more things. Most projects out there are built purely on CUDA.
30gb for the drivers!?
14:38 😂 halu
Hey man, it would be really helpful if you could provide your Continue config.json file as well.
I didn't edit the config at all, apart from replacing the model names and URLs
@@WolfgangsChannel got it. Thanks ✨
I'm an Emacs chad; I only use Neovim for simple config edits.
What about an Nvidia Jetson board?
10:41 how the hell did you get tab autocomplete to work so easily? I've been banging my head on this problem for months and even tried to copy your configuration but still it just refuses to work for some reason
What hardware are you running Ollama on?
@@WolfgangsChannel Ollama runs fine, but the tab autocomplete doesn't want to work
If I do Ctrl+L or Ctrl+I it works fine
@@WolfgangsChannel ryzen 3600 32gb 3200 ddr4 rtx 3060 ti
Which model are you using for the tab autocomplete?
@@WolfgangsChannel In the VS Code extension I can choose any model and it will not even try to load it (the Ollama log doesn't even show a request to load the model).
It will load the models I've set for the chat functions (Ctrl+L and Ctrl+I), but never for tab autocomplete.
(For chat: llama3.1:8b-instruct-q6_K, gemma2:9b-instruct-q6_K, llama3.1:70b-instruct-q2_K (slow but works), for tab autocomplete: deepseek-coder-v2:16b-lite-instruct-q3_K_M, codegeex4:9b-all-q6_K, codestral:22b-v0.1-q4_K_M)
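For reference, the autocomplete section of a Continue config.json pointing at a remote Ollama box looks roughly like this (host, port, and model tags are placeholders). One easy thing to miss: `tabAutocompleteModel` needs its own `apiBase` entry, separate from the chat models, otherwise the extension quietly falls back to localhost.
```json
{
  "models": [
    {
      "title": "Chat",
      "provider": "ollama",
      "model": "llama3.1:8b-instruct-q6_K",
      "apiBase": "http://192.168.1.50:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b-base",
    "apiBase": "http://192.168.1.50:11434"
  }
}
```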
What about a Raspberry Pi 5 with the AI Kit?
The problem with many TPUs is that they don't have onboard storage, or if they have some, it's very small. One of the reasons why Ollama works so well on GPUs is fast VRAM. Running LLMs on a TPU would mean that your output is bottlenecked by either USB or PCIe, since they're slower than the interconnect between the GPU itself and the VRAM (or the CPU and the RAM).
Based House enjoyer
0:45 - that sweet, sweet Mövenpick yoghurt
nestle 🤮
I wish that some day we'll be able to use local AI at a cheaper hardware cost.
Most local AI needs expensive hardware. 😪
1840 unread mails..
In a b e e e e e e ee e e e e e e e e e e e e e e e e e
Key takeaway: My Raspberry Pi 3B+ is not an option :'(