Host Your Own AI Code Assistant with Docker, Ollama and Continue!

  • Published on 20 Nov 2024

Comments •

  • @WolfgangsChannel
    @WolfgangsChannel  2 months ago +12

    Download Docker Desktop: dockr.ly/4fWhlFm
    Learn more about Docker Scout: dockr.ly/3MhG5dE
    This video is sponsored by Docker
    Ollama docker-compose file: gist.github.com/notthebee/1dfc5a82d13dd2bb6589a1e4747e03cf
    Docker installation on Debian: docs.docker.com/engine/install/debian/
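
    A minimal sketch of what a docker-compose file for this Ollama + Open WebUI stack can look like, with an AMD GPU passed through for ROCm. Image tags, volume names and ports are taken from the projects' public docs and may differ from the linked gist, so treat it as a starting point rather than the exact file used in the video:

    ```yaml
    services:
      ollama:
        image: ollama/ollama:rocm            # ROCm build for AMD GPUs; plain ollama/ollama for CPU or Nvidia
        devices:
          - /dev/kfd:/dev/kfd                # ROCm compute interface
          - /dev/dri:/dev/dri                # GPU render nodes
        volumes:
          - ollama:/root/.ollama             # downloaded models are stored here
        ports:
          - "11434:11434"                    # Ollama API, used by Continue and Open WebUI

      open-webui:
        image: ghcr.io/open-webui/open-webui:main
        environment:
          - OLLAMA_BASE_URL=http://ollama:11434
        ports:
          - "8080:8080"                      # web UI (the port discussed in the comments below)
        volumes:
          - open-webui:/app/backend/data
        depends_on:
          - ollama

    volumes:
      ollama:
      open-webui:
    ```

    After `docker compose up -d`, models can be pulled inside the container, e.g. `docker compose exec ollama ollama pull qwen2.5-coder:7b`.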

    • @Indiginous
      @Indiginous 2 months ago

      Brother, I would recommend using codeqwen1.5 with Ollama and Continue instead. It's less power hungry and gives better results. I run it on my laptop with 16GB RAM, an i5 13th gen and a 4050, and it's also very accurate.

  • @criostasis
    @criostasis 2 months ago +38

    For my BS comp sci degree senior project, my team designed and built an AI chatbot for the university using gpt4all, langchain and torchserve with a FastAPI backend and React frontend. It used local docs and limited hallucinations using prompt templating. It also had memory and chat history to maintain contextual awareness throughout the conversation. Was a lot of fun and we all learned a lot!

    • @tf5pZ9H5vcAdBp
      @tf5pZ9H5vcAdBp 2 months ago +4

      Can you share the project? I'd love to see it.

  • @TommyThousandFaces
    @TommyThousandFaces 2 months ago +40

    There are ways to accelerate AI models on Intel iGPUs, but they need to be run through a compatibility layer if I'm not mistaken. I couldn't test the performance of those, but it would work instead of hallucinating and throwing errors like that. I didn't know you could plug in locally run models for coding, so I loved the video!

    • @zekefried2000
      @zekefried2000 2 months ago +5

      Yes, the layer is called ipex-llm. I would love to see an update video testing that.

    • @TommyThousandFaces
      @TommyThousandFaces 2 months ago +1

      Thanks, I was scouring my wiki to find that info, but without success.

    • @arouraios4942
      @arouraios4942 2 months ago +2

      After scrolling through some GitHub issues, it would appear Ollama supports Vulkan, which can utilize the iGPU.

  • @AndreiNeacsu
    @AndreiNeacsu 2 months ago +3

    It's been a few months since I started using Ollama under Linux with the RX 7800 XT (16GB) inside a 128GB DDR4 3600MT/s Ryzen R9 5950X system (ASRock X570S PG Riptide MB). Models sit on an Asus Hyper M.2 card with four Seagate 2TB NVMe Gen 4 x4 drives. The GPU uses an x4 electrical (x16 mechanical) slot, since the first PCIe slot is taken up by the 4 drives. So far, I am very happy with this hardware/software setup.

  • @slim5782
    @slim5782 2 months ago +21

    The best way to run low-power LLMs is to utilise the integrated GPU. It can use regular system RAM, so no large VRAM is required, and it's faster than the CPU.
    I really think this is the only viable way of doing it.

  • @anibalubilla5689
    @anibalubilla5689 2 months ago +55

    You watch House M.D. to kick back and relax?
    I like you more than I used to.

    • @Giftelzwerg
      @Giftelzwerg 2 months ago +8

      everybody lies

    • @tf5pZ9H5vcAdBp
      @tf5pZ9H5vcAdBp 2 months ago +1

      ​@@Giftelzwerg lol who hurt you?

    • @Giftelzwerg
      @Giftelzwerg 2 months ago

      @@tf5pZ9H5vcAdBp woooosh

    • @namarrkon
      @namarrkon 2 months ago +5

      @@tf5pZ9H5vcAdBp It's a quote from the show

    • @twistedsaints
      @twistedsaints 2 months ago

      lol I thought the same thing

  • @spotted0wl.
    @spotted0wl. 2 months ago

    I started going down this road a few weeks ago.
    i9-9600k, 48GB RAM, 8GB RX 5700
    Thanks for the tips on some pieces I've been missing!

  • @dibu28
    @dibu28 7 days ago

    The Qwen 2.5 Coder model is now available on the Ollama models list.
    According to benchmarks, it is the most capable model for coding right now and is getting close to GPT-4o.

  • @montytrollic
    @montytrollic 2 months ago +10

    It's good to mention that you can have good inference speed with CPU only, if your CPU supports AVX512 and you have 16+ GB of RAM. No idea if there are mini PCs out there with these kinds of specs.

    • @bzuidgeest
      @bzuidgeest 2 months ago

      Even if there are such machines, the cost would be high. Are a few suggestions really worth that?

    • @СергейБондаренко-у2б
      @СергейБондаренко-у2б 2 months ago

      @@bzuidgeest I have a Minisforum UM780 XTX with an AMD Ryzen 7 7840HS. It supports up to 96 GB of RAM and has AVX512. The barebone wasn't costly.

    • @bzuidgeest
      @bzuidgeest 2 months ago

      @@СергейБондаренко-у2б The barebone is useless; what was the price for the total system? Complete! And consider that's a system that does one thing permanently: running an LLM. No gaming, no secondary uses.

  • @pinklife4310
    @pinklife4310 2 months ago +9

    Would be interesting to see the viability of using an AMD-powered mini PC for that, with something like a 7840HS/U with the 780M. There seems to be some work being done to fix the memory allocation (PR 6282 on the Ollama GitHub). I've tried small-ish models (3B) that fit into my reserved VRAM and they seem to run faster this way, even if still constrained by memory speeds.

  • @TazzSmk
    @TazzSmk 2 months ago +4

    A used RTX 3090 24GB or 3090 Ti 24GB will most likely work better than a top-end AMD card,
    another option is a pair of RTX 4070 Ti Supers for a combined 32GB of VRAM with a proper Docker setup,
    I think the biggest potential for all-round homelab AI use is to effectively make use of a gaming PC when not playing games :)

  • @brmolnar
    @brmolnar 2 months ago +2

    I found this helpful enough that I've included it in a presentation on self-hosting an AI solution that I'm working on for work. This is part of an effort to raise overall AI knowledge and not for a particular use case. Now with that said, I've had much better luck with Nvidia GPUs. I even bought a laptop with a discrete Nvidia GPU for just this purpose. It was back in May and I think the price was around $1600 USD. Nvidia seems to be 'better' for now, but it is good to see AMD becoming viable. I'd suspect the Nvidia options are in some ways better, but that is likely around power usage or time. The prices are still bonkers. I'm running an early Intel Ultra 9 in a laptop. This thing is nice.

  • @MelroyvandenBerg
    @MelroyvandenBerg 2 months ago +4

    I will buy a 2000+ workstation for local LLM development, which is a good idea IMO.

  • @camsand6109
    @camsand6109 2 months ago +2

    I know you’re a Mac guy (you convinced me to start using Mac) so even though it would be on the more expensive side, an apple silicon Mac with lots of ram is another option.

    • @bjarne431
      @bjarne431 1 month ago

      Not expensive AT ALL compared to getting the same amount of GPU memory from Nvidia. And with the coming M4 Macs it is expected that RAM will start at a minimum of 16GB on the Mac.
      Apple silicon's unified memory model can provide up to around 75% of the total RAM to the GPU.

  • @helidrones
    @helidrones 2 months ago +1

    I have achieved pretty decent results with Deepseek Coder V2 on a moderately priced RTX 4060 ti 16GB.

  • @alx8439
    @alx8439 1 month ago +1

    First: some really good coding models have come out recently, like qwen-2.5-code. In its 14B version it's capable of doing FIM (fill in the middle) tasks not only for a single file, but also across multiple ones. Not sure if Continue supports that, but Twinny (my personal favorite LLM plugin for VSCodium) does.
    Second: if you're aiming towards GPU-less builds, look at the AMD Ryzen 7000 series or higher, like the 7940HS. Not only is it a great powerhouse in terms of CPU performance, it also supports DDR5 with higher bandwidth, which is crucial when it comes to LLM tasks. There are already plenty of mini PCs with such CPUs.

  • @Larimuss
    @Larimuss 28 days ago +2

    The 4090 is such a scam. 24GB of VRAM should not cost more than 99% of the systems in existence.

  • @MarcinSzklany
    @MarcinSzklany 2 months ago +11

    Ollama works well on Apple M-Series Chips if you have enough RAM. A Mac Mini might be a good server for this, but it's kind of expensive.

    • @victorc777
      @victorc777 2 months ago +2

      Correct! I am running Ollama and LM-Studio on a Mac Studio and a Macbook Pro. The Mac Studio (M2 Ultra) pulls a max of 150-170 watts while inferencing with 14B and lower parameter models, but idles at 20-30 watts. It is not the most efficient, but it is fast, and I can load Llama 3.1 70B Q4 without spending $3000+ on just the GPUs, and the added power cost. The Mac Minis with 16GB of Memory should be far more efficient.

  • @geroldmanders9742
    @geroldmanders9742 1 month ago +2

    Ollama is nice, but doesn't stand a chance against Tabby (from TabbyML). I run that on a desktop with an Intel i3 10100F CPU, single-channel 16 GB RAM (DDR4, 3200MHz), a Corsair SSD (500BX model) and an MSI GTX 1650 with 4 GB VRAM (75-watt model). This meager self-build gives ChatGPT a run for its money in response times when accessed via the Tabby web interface.
    Tabby can be run directly on your OS, set up in a VM, or run at your cloud provider. Windows, Linux and macOS are supported. Tabby also provides extensions for VS Code, JetBrains IDEs and NeoVim. Auto-complete and chat become available, just as shown in this video.
    Tabby can be used for free when 5 accounts or fewer use it simultaneously.
    The disadvantage of Tabby is that it doesn't support many models: 6 chat models and 12 code models. Many of the models used in this video are supported by Tabby, though. You can hook at least 3 different Git repositories into it (that is what I have done at the moment), but you can also use a document for context. And not just via the extension for your favorite editor, but also via the Tabby web interface.
    Now, with only 4 GB of VRAM, I cannot load the largest chat & code models, and these models tend to hallucinate. However, if you have a GPU with 8 GB or more, you can load the 7B models for chat/code and that improves the quality of responses a lot.
    And finally, Tabby has installers for CUDA (Nvidia), ROCm (AMD) and Vulkan. I haven't tried ROCm or Vulkan, but Tabby in combination with Nvidia is very impressive. My suggestion would be to make another video with Tabby on your 24 GB VRAM GPU, using the largest supported models for both chat and code. I fully expect you'll come to a different conclusion.

    • @OlegShulyakov
      @OlegShulyakov 24 days ago

      It should give you the same results as Ollama with the same models.

  • @TomSzenessy-nr6ry
    @TomSzenessy-nr6ry 2 months ago +20

    How surprised he was when it "kinda just worked" 😂

    • @BeefIngot
      @BeefIngot 14 days ago

      Right? I figured the state of this was some super delicate, ready to break in 2 seconds setup. Was shocked when it wasn't awful.

  • @cap737
    @cap737 2 months ago +1

    For low power machines you'll need a good Tensor Processing Unit to process all those instructions for machine learning and AI. Ones like the Google Coral and Hailo would be best for the Latte Panda. Jeff Geerling made a pretty good video about this project. I think you're on the right path, just need some good TPUs to make this small server a reality.

  • @deechvogt1589
    @deechvogt1589 2 months ago +2

    Thanks for another educational and well-executed video despite your hairstyle malfunction. (Which I probably wouldn't have noticed until you told on yourself.) Keep doing what you're doing.

  • @saljardim
    @saljardim 2 months ago

    Great video! I installed the Ollama codellama + WebUi on my Ubuntu Server via the Portainer app on my CasaOS install to make things as easy as possible.
    My server is an old Dell Precision T3620 that I bought for around 350 euros a couple of months ago.
    Specs
    Intel Xeon E3-1270 V5 - 4-Core 8-Threads 3.60GHz (4.00GHz Boost, 8MB Cache, 80W TDP)
    Nvidia Quadro K4200 4GB GDDR5 PCIe x16 FH
    2 x 8GB - DDR4 2666MHz (PC4-21300E, 2Rx8)
    512GB - NVMe SSD - () - New
    Crucial MX500 1TB (SFF 2.5in) SATA-III 6Gbps 3D NAND SSD New
    Things are running well, but, of course, not as fast as on a beefed machine like yours. :D

  • @Manuel-gm4zs
    @Manuel-gm4zs 2 months ago +1

    Really nice video!
    One point though about forwarding port 8080 in your compose file: that also punches a hole in your firewall, allowing traffic from everywhere to connect to that port.
    Just as a warning for anyone running this on a server that's not sitting behind another firewall.
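
    A sketch of the safer binding this comment is describing. Ports published by Docker bypass ufw-style host firewalls, so restricting the published port to 127.0.0.1 keeps it reachable only from the host itself (a reverse proxy or VPN can then front it). The service name and port are just the ones used in this video's setup:

    ```yaml
    services:
      open-webui:
        ports:
          # - "8080:8080"                # listens on all interfaces: reachable from anywhere
          - "127.0.0.1:8080:8080"        # only reachable from the host itself
    ```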

  • @mydetlef
    @mydetlef 2 months ago +2

    We'll have to see how this all plays out on the new laptop processors with NPUs. I'm still hoping to be able to buy a $1,000 all-in-one mini PC - without a graphics card, but with enough NPU power. The question also arises as to what is more necessary: RAM or NPU power.

  • @ericdanielski4802
    @ericdanielski4802 2 months ago +11

    Nice self hosting.

  • @hakovatube
    @hakovatube 2 months ago

    Amazing content as usual! Thank you for all the work you put into this!

  • @luisalcarazleal
    @luisalcarazleal 2 months ago

    I will give it a try next week on my Nvidia Tesla K80. It has been good enough for 1080p remote gaming.

  • @bretanac93
    @bretanac93 2 months ago

    I've been running Ollama on a MacBook Pro M3 Pro with 36 gigs of RAM and it works pretty well for chats with llama3.1:latest. I'll also test the other models you suggested with Continue. I tried using Continue in the past in GoLand, but the experience was quite mediocre. Interesting stuff, thanks for the recommendations in the video.

  • @Splarkszter
    @Splarkszter 2 months ago +1

    I have a 7735HS. Would love to see how powerful iGPUs like the 680M perform.

  • @godned74
    @godned74 1 month ago

    You can run Meta Llama 3.2 8 billion parameters quantized with just a CPU using GPT4All, and control it via PyCharm or VS Code. A good option for those wanting to build something like this on old, cheap hardware that would normally be thrown in the garbage.

  • @Urxiel
    @Urxiel 2 months ago

    Amazing video, Wolfgang. Amazing information and incredibly informative. It would be really cool if you could do the same setup on Proxmox using a CT or something like that for companies that could have extra hardware lying around waiting to be repurposed. This video has really answered my question related to this topic.

  • @Quozul
    @Quozul 2 months ago +3

    Thanks a lot for this topic! I am very interested in running AI in my homelab, but those AI chips are very expensive and power hungry. It would be interesting to review other options such as iGPUs, NPUs, cheap eBay graphics cards or any other hardware that can run AI inference.

    • @bjarne431
      @bjarne431 1 month ago

      I'm waiting for the M4 Mac mini; if they offer it with 64GB of RAM it will most likely be amazing value for AI.
      macOS can run Ollama and it runs very well, and the unified memory model on Apple silicon is perfect for it.

  • @thiagobmbernabe
    @thiagobmbernabe 27 days ago

    And how is the performance when hosted on your MacBook Pro? On another topic, can a mini PC with something like a Hailo-8 improve the performance enough?

  • @Horesmi
    @Horesmi 2 months ago +1

    I set up my 800W "home lab" to wake-on-LAN and hibernate after ten minutes of inactivity, which seems like a decent compromise in terms of power consumption.
    Am I really gonna burn 800W of compute on local code generation? Bet.

  • @albertocalore2414
    @albertocalore2414 2 months ago +2

    Personally, I'm hyped for the Hailo-10H, which claims 40 TOPS in an M.2 form factor at just 5 watts of power consumption. I hope all their claims are true, and maybe you'd be interested in it yourself (:

  • @benve7151
    @benve7151 19 days ago

    Great info. Thank you

  • @Konsi2
    @Konsi2 2 months ago

    You could also run the "autocompletion" from VS Code on your Mac.
    I'm running Ollama on my M2 Pro. Power consumption is not really worth mentioning, and you don't have to run a separate PC.
    (But of course this only makes sense if you don't want to share the Open WebUI with others.)

  • @panconqueso9195
    @panconqueso9195 2 months ago

    Dude, this was a really cool project. I recently built an AI home server and I think some parts of your project could be done better. AMD is fine as a general-purpose graphics card, but for AI, Nvidia is always your first option. Of course it works fine with Ollama and similar tools, but most AI projects out there support Nvidia first, so your build would be more future-proof with an Nvidia graphics card and you'd be able to mess with other projects like Whisper more easily. Of course, Nvidia is way more expensive. I'm using an RTX 3060, which cost me around $320; there is a really excellent video by Jarods Journey comparing that card with higher-end Nvidia cards, and it works great for its price. Its performance isn't so far from the 4090, especially considering the price; the main difference is the VRAM, but there is some tuning you can do to run heavier models using less RAM, or to balance both RAM and VRAM, such as Aphrodite and KoboldCPP for text generation. Lastly, in regard to power consumption: yes, it does cost a lot of power to run decent models and there isn't a way around it. However, you could just turn the machine on when you need text generation and turn it off when you aren't using it. If you want a more elaborate solution, you could enable Wake-on-LAN and use a Raspberry Pi as a client for turning it on/off every time you need it; at least that's what I'm planning to do with my server.
    At the moment there aren't a lot of videos about deploying local inference servers on YT, so I'm really happy you made this one. Looking forward to more AI-related videos in the future.

  • @gatu-net
    @gatu-net 2 months ago

    I have a similar setup. I run all the AI stuff on Windows so I don't have to dual boot; it's also easier to set the PC up to sleep on Windows, compared to a headless Linux box.

  • @AnticipatedHedgehog
    @AnticipatedHedgehog 1 month ago

    Maybe it's the SFF case, but holy smokes that CPU cooler is huge!

  • @jandramila
    @jandramila 2 months ago

    There is an ongoing effort to allow for oneAPI / OpenVINO on Intel GPUs. Once this comes, we'll be able to use low-power iGPUs with lots of RAM. I'm always checking the issue for Ollama; there are also a couple of questions regarding NPU support. Holding my breath for Battlemage GPUs here, though I've seen impressive results with Ollama running on modern Quadro GPUs... for those who can afford it. Not me! Thanks for this! I've tried this same stack in the past, both with a 1080 Ti and a 6950 XT. Ollama runs perfectly fine on both of them, and Continue seems to have improved a lot since my last try. I will give it another shot!

  • @oscarcharliezulu
    @oscarcharliezulu 2 months ago

    Nice vid wolfie. Perfect !

  • @robertbrowniii2033
    @robertbrowniii2033 2 months ago

    Could you tell us how you got the models that were used in your testing? I have attempted to find them, including looking at the Open WebUI community site, looking through the models available on the Ollama site, and even attempting to pull them directly into Ollama (the file was not found). So where did you get those models, and how did you get them into Ollama?

  • @MrSemorphim
    @MrSemorphim 2 months ago

    Is that MonHun I spotted at 16:41? Nice.

  • @seifenspender
    @seifenspender 2 months ago

    3:25 I bought my RTX 3090 used for 600€ with 24GB* of VRAM. Just for reference - that's a great deal.

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      Is it a modded card? Stock 3090 has 24GB of VRAM

    • @seifenspender
      @seifenspender 2 months ago

      @@WolfgangsChannel Oh man, I just wanted to update my comment :D
      No, I'm just an idiot. Looked at the wrong row. 24GB of VRAM.

  • @BLiNKx86
    @BLiNKx86 2 months ago

    "you've been looking for more stuff to self host anyway"....
    GET OUT OF MY BRAIN!
    Off to Micro Center.... "Which aisle has the 7900 XT?"

  • @FunBotan
    @FunBotan 2 months ago +1

    I wonder if JetBrains will ever allow a similar plugin for their IDEs

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago +1

      They have this, no idea if it's any good - www.jetbrains.com/ai/
      But also, Continue supports JetBrains IDEs: plugins.jetbrains.com/plugin/22707-continue

    • @FunBotan
      @FunBotan 2 months ago +1

      @@WolfgangsChannel The first is SaaS, but I didn't notice that Continue was available there, too. It doesn't look like it works all that well at the moment, but I'll try it at some point. Thanks!

  • @AndréBarbosa-z4n
    @AndréBarbosa-z4n 1 month ago

    I will only self-host an AI service when the hardware needed to run these things comes down to mainstream levels. Even then, the architecture is not power-optimized, and maybe that is something we have to wait for from both the hardware and software side. Power efficiency is the main issue for self-hosting anything.

  • @iraqinationalist7778
    @iraqinationalist7778 2 months ago +1

    Have you considered TabbyML instead of Continue + Ollama? It also has a Neovim plugin.

    • @iraqinationalist7778
      @iraqinationalist7778 2 months ago

      Plus, Fedora supports ROCm out of the box.

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      Thanks for the recommendation! I'll try it out

  • @AndrewChikin
    @AndrewChikin 2 months ago +1

    1:39 sigma 🥶

  • @cugansteamid6252
    @cugansteamid6252 2 months ago

    Thanks for the tutorial.

  • @postcanonical
    @postcanonical 2 months ago +1

    But you can host it on the same powerful machine you program on: run the LLM when you need it, then play games when you don't, right? I am planning to do the same thing, but I want my low-power home server to act like a switch for my PC, so I can game on it when I need to, but also use its power to help me with my prompts... It can be done using WOL if you need it to work remotely, I think. Or is it a bad idea in terms of security?

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago +1

      You can do it all on one machine

    • @postcanonical
      @postcanonical 2 months ago

      @@WolfgangsChannel But I need WOL to turn on my PC over WAN. I tried to do port forwarding, but it seems I need to set something in my router so it knows where to forward the signal it gets, because if my PC is off it cannot find its local IP for some reason. I use OpenWrt.

  • @HussainAlkumaish
    @HussainAlkumaish 2 months ago +1

    Counting watt consumption is useful, but I wouldn't let it stop me.

  • @Shield-Anvil-Itkovian
    @Shield-Anvil-Itkovian 2 months ago

    You should look into TPUs in order to run this on a lower spec machine.

  • @Invisible-z7w
    @Invisible-z7w 2 months ago

    I've seen a few people plug a GPU into the M.2 slot of a mini PC with an M.2 to PCIe adapter. I believe you need to power the GPU with an external power supply, but I've always wondered what the idle power consumption is like for a setup like this. Maybe a future video :)

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      Probably not much different from a desktop PC with the same setup. M.2 is just PCIe with extra steps

  • @jonjohnson2844
    @jonjohnson2844 2 months ago +1

    I think my 4th gen i5 unraid server would immediately catch fire if I put Ollama on it, lol.

  • @SornDP
    @SornDP 2 months ago

    Great video, subbed

  • @peterjauch36
    @peterjauch36 2 months ago

    What do you think of the Nvidia RTX A2000 with 12GB VRAM? Is that enough?

  • @binh1298ify
    @binh1298ify 2 months ago

    Hey, thanks for the video, I've always wanted something like this. Would love to see an update with a Neovim and Nvidia GPU setup.

  • @sjukfan
    @sjukfan 2 months ago

    You can make a Frankendebian and install amdgpu-dkms and rocm-hip-libraries from the AMD repos, but yeah, better run with Ubuntu.

  • @bionborys1648
    @bionborys1648 2 months ago

    I don't recognise half of those icons in the dock, ...but I can see 1847 unopened emails 😂😂😂

  • @AndreiRebegea
    @AndreiRebegea 2 months ago

    Very cool video. I would love to see more videos like this. Target Java developers ;-).
    A comparison with an Nvidia setup.
    Maybe some tuning of the LLM to use less power ("eco mode")?
    A fast CPU and lots of RAM vs. an expensive graphics card comparison.

  • @MichaelBabcock
    @MichaelBabcock 6 days ago

    Tabnine has offered this for years now already

  • @benoitmarioninaciomartins1602
    @benoitmarioninaciomartins1602 2 months ago

    IMO, getting a second-hand M1 Mac mini with 16 GB of RAM might be the cheapest AI solution. Very decent performance with Ollama and a price of around 500-600 €. Otherwise, for a GPU, a cheap option is a second-hand 3090; even better, if you get two you have access to 48 GB models.

  • @Jetta4TDIR
    @Jetta4TDIR 2 months ago

    Perhaps GPUs built for compute workloads would work better? (I'm not sure, I'm genuinely asking.) I'm thinking something along the lines of an RTX A2000 for low power draw, or an A5000?

    • @CoolWolf69
      @CoolWolf69 8 days ago +1

      I have an RTX A4000 Ada. Power consumption at idle (per nvtop) is about 13W. When a large LLM is running it takes the full 130W, which I think is a perfect compromise between power consumption and performance.
      I am not a gamer at all - just using this GPU in my homelab for LLMs.

  • @pixelsafoison
    @pixelsafoison 2 months ago

    At the same time, dedicating an entire GPU to a single task like this is kind of nonsensical. Unless you're a small company that dedicates a server for the task I do not see this making sense - and let's face it, it's mostly a memory issue which the industry makes us pay at a premium.
    Thank you for answering a question that I have had in the back of my mind for the last few months :).

  • @DS-pk4eh
    @DS-pk4eh 2 months ago

    So, I would try to get my hands on a mini PC based on the AMD 8840HS (there are a lot of them on AliExpress for around 300 barebone, so you add RAM and SSD). Run it with 32GB of RAM and you've got yourself a small, low-power AI assistant.
    This APU from AMD comes with a nice integrated GPU but also with an NPU (16 TOPS), so it is a nice upgrade from that Intel you have, and even cheaper.
    Once the newest APUs from AMD and maybe Intel come to mini PCs (with around 50 TOPS NPUs), they will be more than enough for this.

  • @Marc42
    @Marc42 2 months ago

    Maybe a TPU rather than a GPU would work more efficiently?

  • @عدنانرزوق-ك5ق
    @عدنانرزوق-ك5ق 2 months ago

    How do you think an A400 GPU would perform?

  • @eugeneglybin
    @eugeneglybin 2 months ago

    By the way, Zed is becoming a more prominent and apt replacement for Neovim and VSCode, and it has built-in support for Ollama (as well as other services). It doesn’t have AI suggestions directly, but those can be (kind of) configured via inline assists, prefilled prompts, and keybindings.
    But the main problem is speed, and it doesn’t matter if it’s a service from a billion-dollar company or your local LLM running on top-end hardware. Two seconds or five, it breaks the flow. And the result is rarely perfect.
    It’s very cool that it’s possible, but it’s not there yet, and we don’t know if it will ever be.

  • @henokyohannes9910
    @henokyohannes9910 2 months ago

    Hey, my country blocked internet access in my area, but if I connect and start the internet outside the blocked area and come to the region that's blocked without disconnecting, it keeps working for weeks or months. But if I reconnect, I lose access to the internet. No VPN can bypass it. HELP

  • @wildorb1209
    @wildorb1209 2 months ago +2

    HAHAHAH @9:55 that killlllleeeedddd me!!!!!!!!!!!!!!!!!!!!!

  • @mrfaifai
    @mrfaifai 2 months ago

    Why not just run Ollama on your Mac? It supports GPU acceleration on the Mac as well.

  • @sobskii
    @sobskii 2 months ago

    I'm still waiting for a "real" Cortana AI.

  • @ehm-wg8pd
    @ehm-wg8pd 2 months ago

    Can I run this Docker instance on TrueNAS?

  • @evgeniy9582
    @evgeniy9582 2 months ago

    Why do you have two Unix devices, /dev/kfd and /dev/dri, for a single GPU?

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      kfd is for „Kernel Fusion Device“, which is needed for ROCm
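
      For context, a sketch of how those two device nodes end up in the compose file so the ROCm build of Ollama can see the AMD GPU; the exact image tag is an assumption, but the device paths are the standard Linux ones:

      ```yaml
      services:
        ollama:
          image: ollama/ollama:rocm
          devices:
            - /dev/kfd:/dev/kfd      # ROCm compute interface
            - /dev/dri:/dev/dri      # Direct Rendering Infrastructure (render nodes)
      ```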

  • @ragadrop
    @ragadrop 2 months ago

    What about using a google tensor card?

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      TPUs are currently not supported: github.com/ollama/ollama/issues/990?ref=geeek.org

  • @Rohan-rb5le
    @Rohan-rb5le 2 months ago +3

    This is unacceptable. You deserve 5 million views and 39 million likes in 2 minutes 😅

  • @SpookFilthy
    @SpookFilthy 2 months ago

    The 3090 also has 24 GB of G6X memory, at nowhere near the cost of a 4090.

  • @RazoBeckett.
    @RazoBeckett. 2 months ago +2

    I'm a Neovim Chad.

    • @BeefIngot
      @BeefIngot 14 days ago

      And like all neovim users, you let us know it, as smugly as possible.
      Edit: Having now reached the end of the video I realize this is just a reference.

  • @valeriianderson9226
    @valeriianderson9226 2 months ago

    Hey. What font do you use in terminal?

  • @buisiham
    @buisiham 2 months ago

    Are there any Continue-like plugins that support Xcode?

    • @bjarne431
      @bjarne431 1 month ago

      Apple made their own AI code completion with the latest Xcode and latest macOS. It requires 16GB of RAM (making it even more insane that they sell "pro" computers with only 8GB of RAM). Apple's AI code completion is kind of bad, to be honest.

  • @habatau8310
    @habatau8310 2 months ago

    Are you going to make an updated video on VPN?

  • @svensyoutube1
    @svensyoutube1 2 months ago

    Thx

  • @dominic.h.3363
    @dominic.h.3363 2 months ago +1

    I wouldn't put my faith in any LLM small enough to allow local hosting for coding, while ChatGPT can't write something as mundane as an actually working AutoHotkey 2.0 script. If you can troubleshoot the hogwash output, good for you. If you can't, tough luck... it can't either.
    Also, not being able to utilize AWQ models is shooting yourself in the foot from the get-go... 5:13 - case in point.

    • @firstspar
      @firstspar 2 months ago

      The other option is having your work spied on and stolen.

    • @dominic.h.3363
      @dominic.h.3363 2 months ago

      @@firstspar No, the other option is to not use LLMs for things they weren't meant to be used for.
      Transformer models, whose working principle is to give you a probabilistic distribution of words as results, can't do specifics! Why do you think it struggles with math concepts as simple as accurately adding two single-digit numbers when it's accompanied by a text representation of what those numbers are? Here is an example from when I asked it how much of each thing I'll need (I repeat, I'll need; one person) for a 9-day stay without the possibility of going to the store to get more:
      Personal Hygiene
      Toilet Paper: 2-3 rolls per person per week, so about 18-27 rolls for 9 days.
      It outputs 2*9 to 3*9 instead of rounding up 9/7*2 to 9/7*3, not being able to reconcile the concept of a week with days, and THIS is what you want to entrust with coding?! Solving coding problems?! This isn't just a minor error, it's a fundamental failure to apply basic arithmetic concepts correctly, and that's with 4 HUNDRED BILLION parameters. What do you expect out of a 3 billion model? A lobotomized version of ChatGPT won't suddenly get a concept right that ChatGPT didn't, just by tightening its training data to only entail coding!

  • @BaldyMacbeard
    @BaldyMacbeard 2 months ago

    A second-hand 3090 is one of the best options for ML inference right now. You can get one for 500 bucks. It's faster than AMD cards for many tasks and will allow you to run way more things. Most projects out there are built purely on CUDA.

  • @steveiliop56
    @steveiliop56 2 months ago

    30GB for the drivers!?

  •  2 months ago +1

    14:38 😂 halu

  • @AkashPandya9
    @AkashPandya9 2 months ago

    Hey man, it would really be helpful if you could provide your Continue config.json file as well.

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      I didn't edit the config at all, apart from replacing the model names and URLs

    • @AkashPandya9
      @AkashPandya9 2 months ago

      @@WolfgangsChannel got it. Thanks ✨

  • @rafal9ck817
    @rafal9ck817 2 months ago

    I'm an Emacs chad; I use Neovim only for simple config edits.

  • @Arombreak
    @Arombreak 2 months ago

    What about an Nvidia Jetson board?

  • @Napert
    @Napert 2 months ago

    10:41 how the hell did you get tab autocomplete to work so easily? I've been banging my head on this problem for months and even tried to copy your configuration but still it just refuses to work for some reason

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      What hardware are you running Ollama on?

    • @Napert
      @Napert 2 months ago

      @@WolfgangsChannel Ollama runs fine, but the tab autocomplete doesn't want to work
      If I do Ctrl+L or Ctrl+I it works fine

    • @Napert
      @Napert 2 months ago

      @@WolfgangsChannel Ryzen 3600, 32GB 3200MHz DDR4, RTX 3060 Ti

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      Which model are you using for the tab autocomplete?

    • @Napert
      @Napert 2 months ago

      @@WolfgangsChannel In the VS Code extension I can choose any model and it won't even try to load it (the Ollama log doesn't even show a request to load the model).
      It will load the models I've set for the chat functions (Ctrl+L and Ctrl+I), but never for tab autocomplete.
      (For chat: llama3.1:8b-instruct-q6_K, gemma2:9b-instruct-q6_K, llama3.1:70b-instruct-q2_K (slow but works); for tab autocomplete: deepseek-coder-v2:16b-lite-instruct-q3_K_M, codegeex4:9b-all-q6_K, codestral:22b-v0.1-q4_K_M)

  • @pingas533
    @pingas533 2 months ago

    What about a Raspberry Pi 5 with the AI Kit?

    • @WolfgangsChannel
      @WolfgangsChannel  2 months ago

      The problem with many TPUs is that they don't have onboard storage. Or if they have some, it's very small. One of the reasons why Ollama works so well on GPUs is fast VRAM. Running LLMs on a TPU would mean that your output is bottlenecked by either USB or PCIe, since they're slower than the interconnect between the GPU itself and the VRAM (or the CPU and the RAM).

  • @p504504
    @p504504 2 months ago +3

    Based House enjoyer

  • @Adam130694
    @Adam130694 2 months ago +1

    0:45 - that sweet, sweet Mövenpick yoghurt

    • @o_q
      @o_q 2 months ago +1

      nestle 🤮

  • @naeemulhoque1777
    @naeemulhoque1777 2 months ago

    I wish someday we'd be able to use local AI at a cheaper hardware cost.
    Most local AI needs expensive hardware. 😪

  • @spoilerkiller
    @spoilerkiller 2 months ago +1

    1840 unread mails..

  • @sjukfan
    @sjukfan 2 months ago +1

    In a b e e e e e e ee e e e e e e e e e e e e e e e e e

  • @lloydbush
    @lloydbush 1 month ago

    Key takeaway: My Raspberry Pi 3B+ is not an option :'(