The Secret Behind Ollama's Magic: Revealed!

  • Published on 18 Feb 2024
  • Ollama is amazing and lets you run LLMs locally on your machine. But how does it work? What are the pieces you need to use? This video covers it all.
    Be sure to sign up to my monthly newsletter at technovangelist.com/newsletter
    And if interested in supporting me, sign up for my patreon at / technovangelist
  • Science & Technology

Comments • 94

  • @mo_i_nas
    @mo_i_nas 2 months ago +7

    that 10 second pause at the end of the video was like expecting post-credits scene from a Marvel movie :D

  • @jayk806
    @jayk806 6 days ago

    Really enjoyed this video. I especially appreciated the section on creating derivative models. Very informative. Thanks!

  • @erniea5843
    @erniea5843 3 months ago +2

    Appreciate you taking the time to put this video together!

  • @piero957
    @piero957 3 months ago +6

    Excellent content and explanation, your videos get my like when they start, just to be sure I don't forget it.

  • @aaronk9910
    @aaronk9910 1 month ago

    Awesome and calm explanation. By far the best channel about this topic.

  • @darenbaker4569
    @darenbaker4569 3 months ago +5

    Sorry but you are so cool, so much enjoying your videos and saving me months of learning the docs. This AI world is moving so fast for an old person like me but making me feel young again, but it feels like we are still at 1400k modem stage of dev. Love what you're doing.

  • @lucioussmoothy
    @lucioussmoothy 3 months ago +1

    Great job compiling and explaining this information. Fantastically useful video - keep up the great work brother

  • @Slimpickens45
    @Slimpickens45 3 months ago +3

    Thanks for the in depth review. Good stuff sir!

  • @Autom4tic
    @Autom4tic 3 months ago

    Hey, Matt! Thank you for such a clear explanation! Cheers!

  • @fabriai
    @fabriai 3 months ago

    Thanks! Good architecture video. Amazing.

  • @bjaburg
    @bjaburg 2 months ago

    And I thought I knew everything. Until I found your video(s).
    Thanks again!!

  • @AnythingGodamnit
    @AnythingGodamnit 3 months ago

    Thanks for this. This helped me click that the model is not the big thing, but rather the weights are. I find that a bit confusing because people talk about training a model, not training the weights.

  • @s11-informationatyourservi44
    @s11-informationatyourservi44 3 months ago

    awesome tutorial

  • @isaacyimgaingkuissu3720
    @isaacyimgaingkuissu3720 2 months ago +1

    Great job. Thanks for going deep into how the architecture works and where the possible break points are. One question: do you have an architecture diagram showing each block of Ollama and how they interact with each other?

  • @TheOnlyEpsilonAlpha
    @TheOnlyEpsilonAlpha 2 months ago

    The explanation that LLAMA doesn’t upload things was the most interesting for me and it got answered in that video. It’s no wonder based on where llama comes from

  • @lukasksiezak4929
    @lukasksiezak4929 13 days ago +1

    It is still unclear to me how Ollama really works; I was hoping for a deeper dive into that topic.

  • @timtensor6994
    @timtensor6994 2 months ago

    Hi Matt, great content. I think it would be great if you had some kind of cookbook with your experiments, similar to Hugging Face. On another note, it would be great to have videos on RAG + open source and on running agents. They seem to be hot topics for now.

  • @xugefu
    @xugefu 2 months ago +1

    Thanks!

  • @MrJaggy123
    @MrJaggy123 3 months ago +3

    I do have an "anything else" question: it seems like Ollama has its own custom place and way of storing models. Does this mean that if I want to do other stuff with the models, say for instance using EleutherAI's benchmarking stuff, I'll need a copy of all the models? I currently have lots and lots of models, almost 3 TB worth, so I'd rather not have to have 6 TB if I can help it...

  • @DonBranson1
    @DonBranson1 3 months ago

    another great video. I just need to implement rag so i can remember it.

  • @natekidwell
    @natekidwell 1 month ago

    Amazing presentation. Is there a paper or post discussing the dockeresque style of ollama's manifest?

  • @richardpulliam9096
    @richardpulliam9096 1 day ago

    Great presentation @Matt. I do have one question about the save option: if you save once and then continue asking questions, will it continue to save, or do you need to save again during or after each session?

    • @technovangelist
      @technovangelist  1 day ago

      That’s not really the purpose of save. It creates a new model based on the current state. Once you have set the system prompt and parameters, save so that you can use that configuration later.

  • @Techonsapevole
    @Techonsapevole 3 months ago

    Great video, can you make a video on autogen-studio + Ollama? Is there a local model able to run the playground examples?

    • @technovangelist
      @technovangelist  3 months ago

      Yes. Been meaning to cover that

  • @m12652
    @m12652 2 months ago

    Thanks for this... just what I needed 👍 What would really help would be a list of example prompts, or sequences of prompts, that produce workable code. So far for me there have been almost no good results for anything more than a very simple question (which I likely would never ask)... for example, I asked for a popup list that, when a list item is clicked, dismisses the popup and returns the item. What I got was a non-modal contact form with a select list. Many iterations later, still nothing useful. Cheers...

  • @sandrocavali9810
    @sandrocavali9810 3 months ago

    Excellent

  • @felixchin1
    @felixchin1 3 months ago

    Hey! Big fan and user of Ollama, thank you for everything you do!
    Just out of curiosity, how does Ollama work on M1/M2 Macs, which lack NVIDIA GPUs?
    When I try to run Mixtral through PyTorch, I get an error about not having NVIDIA CUDA. How does Ollama bypass this problem?

    • @technovangelist
      @technovangelist  3 months ago

      The team runs on Apple Silicon Macs, so that support came first. Metal and Apple Silicon are super powerful. Ollama doesn't use PyTorch or Python, which avoids a lot of problems.

    • @felixchin1
      @felixchin1 3 months ago

      @technovangelist So it’s PyTorch that’s dependent on CUDA, not the LLM itself?
      Are there any frameworks besides Ollama that work with Apple Silicon? It seems like TensorFlow doesn’t work on Apple Silicon either?
      Asking because I’m trying to fine-tune models locally. Wishing for the day Ollama adds local fine-tuning as a feature!

  • @jdray
    @jdray 3 months ago

    This is super helpful as a reference for my efforts to get the Security people to stop freaking out about what demons might be infecting people's computers if they allow something like Ollama to be distributed. Can you comment further on what happens with those "memory" files? I presume they stay local even if someone pushes the model that they're associated with. Also, I'd like to see a short video (or maybe there is one) on the concept of someone "customizing a model and pushing it to the Library". What customizations are available? If customizing doesn't change the weights, why would someone do it? How is a "model" different than "the weights"? These are all things that Security is going to ask.

  • @podunkman2709
    @podunkman2709 3 months ago

    Hi Matt! Thanks for your videos - you inspired me to get into the subject.
    How can I use my own data in Ollama? I would like to build a search engine that finds products according to some given attributes. I have the product title, SKU, and description. It would be good if I could get product details (including the product number) when posting a query like "mineral oil, 5 w40" or so.
    How can I start with that?

    • @technovangelist
      @technovangelist  3 months ago

      If you have structured data and you want to search for things that match, then a more traditional database lookup is better and faster.

    • @podunkman2709
      @podunkman2709 3 months ago

      @technovangelist Well, it is not so structured 🙂 Very often product descriptions are quite chaotic, and users don't type exactly what's in the descriptions. Anyway, whatever it is - how do I use my own data?

  • @bobbaganush1
    @bobbaganush1 1 month ago

    Why do none of the models work on my end? When I try mistral or llama 2, I get a message stating those terms are considered hate speech. Is there a workaround for this?

  • @brinkoo7
    @brinkoo7 3 months ago

    For those on Arch Linux, don't make the mistake I did if you have NVIDIA. I installed the ollama package from the AUR but never noticed the ollama-cuda package, which is so much faster.

  • @karlofranic1299
    @karlofranic1299 2 months ago +1

    Can someone explain how Ollama manages to run large models on my server when I can't even run smaller ones without it? (Ollama can run 34B versions of models, and without it I already run out of memory on 7B versions.)

    • @technovangelist
      @technovangelist  2 months ago +1

      I assume without Ollama you are trying to work with unquantized models? A 7B unquantized model needs at least 28 GB of VRAM for the weights alone, because it’s 4 bytes per parameter. Quantization shrinks each parameter with very little impact on output quality. So it can be reduced to needing 3.5 GB (4 bits per parameter) for best performance, or as little as under 2 GB. A 34B model could fit in as little as 9 GB.
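
The arithmetic in the reply above can be checked with a few lines of Python. This is a rough sketch that counts only the weights (parameter count times bits per parameter); the function name `weight_memory_gb` is made up for illustration, and real usage is higher because of the context cache and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # (label, parameters in billions, bits per parameter)
    cases = [
        ("7B at 32-bit (unquantized)", 7, 32),
        ("7B at 4-bit", 7, 4),
        ("7B at 2-bit", 7, 2),
        ("34B at 2-bit", 34, 2),
    ]
    for label, params, bits in cases:
        print(f"{label}: {weight_memory_gb(params, bits):.2f} GB")
```

The four cases come out at 28 GB, 3.5 GB, 1.75 GB, and 8.5 GB, which lines up with the "3.5", "under 2", and "as little as 9" figures quoted in the reply.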

  • @daryladhityahenry
    @daryladhityahenry 3 months ago

    Question about Ollama:
    I have 8 GB of VRAM.
    In the first request I ask it to run mistral7b, for example.
    In the second request I ask it to run llama2.
    Does it change the model automatically, or what? Is it able to do what I described? Because if yes, it's very useful for people with small VRAM like me :D Thanks.

    • @technovangelist
      @technovangelist  3 months ago +1

      Yeah. With smaller models this is closer to magic. It dumps the first model from memory and then loads the second and answers the question. With larger models you feel the time taken to load and unload models.
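
A toy sketch of the swap behaviour described above, keeping only one model resident at a time. This is not Ollama's actual code: the class `SingleSlotRunner` and its methods are hypothetical, and "loading" here is just bookkeeping rather than reading weights into VRAM.

```python
from typing import Optional


class SingleSlotRunner:
    """Keeps at most one model resident; asking for another evicts the current one."""

    def __init__(self) -> None:
        self.loaded: Optional[str] = None  # name of the model currently in memory
        self.loads = 0                     # how many (slow) load operations happened

    def _ensure_loaded(self, model: str) -> None:
        if self.loaded == model:
            return  # already resident, no reload cost
        # Assigning the new name stands in for "unload the old model, load the new one".
        self.loaded = model
        self.loads += 1

    def generate(self, model: str, prompt: str) -> str:
        self._ensure_loaded(model)
        return f"[{model}] reply to: {prompt}"


if __name__ == "__main__":
    runner = SingleSlotRunner()
    print(runner.generate("mistral", "first question"))
    print(runner.generate("llama2", "second question"))
    print("load operations:", runner.loads)
```

Repeated requests to the same model pay the load cost once; alternating between models pays it on every switch, which is why the delay is more noticeable with larger models.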

    • @daryladhityahenry
      @daryladhityahenry 2 months ago

      @technovangelist Whoa... Why am I only learning about this now? Thank you so much. This is exactly what I need later for my personal project.

  • @atrocitus777
    @atrocitus777 3 months ago

    How can I point Ollama to my own registry where, say, I'm hosting models on my own on-prem hardware?

    • @technovangelist
      @technovangelist  3 months ago

      That will come soon. I think the person on the team who wrote the registry was able to reuse his code from Docker Hub when he created it. But a lot had to change to support Ollama. Docker images are minuscule compared to LLMs.

  • @fuba44
    @fuba44 3 months ago

    If you "export OLLAMA_HOST = YOUR_IP_HERE ollama serve" to serve Ollama on another IP, like inside your tailnet as I am, the Ollama instance you get cannot use the models downloaded from the "normal" "ollama run" instance. You have to download every model again and manage them in two (or more) places. Is that a bug, or is there a reason for doing it like that?
    Also, programs like autogenStudio do not set "num_ctx", so it defaults to 2048. Is there a way to set that on a per-model basis, like just for "mistral:7b"? I guess I could create a model file, maybe? But then I have to do it for each model or new model I get. It would be so much easier if it just pulled that from the Ollama site as a model-specific default value.
    ALSO, love your videos, just subbed!

    • @technovangelist
      @technovangelist  3 months ago +1

      Yes, if you set up environment variables the wrong way as you showed, then it’s a different user accessing them. So the models aren't where they are expected to be. I have a couple of videos on here that show how to set environment variables for Mac, Linux, and Windows. When you set them properly you won't have any issues.

    • @technovangelist
      @technovangelist  3 months ago +1

      For the num_ctx question, that’s where a Modelfile comes in. You can set that parameter and get the larger context, but be warned: a large context takes a huge amount of memory.
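
A minimal sketch of the Modelfile approach mentioned above. `FROM` and `PARAMETER num_ctx` are Modelfile instructions; the helper `render_modelfile`, the value 8192, and the derived model name `mistral-8k` are assumptions chosen for illustration.

```python
def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text deriving a model from base_model with a custom context size."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


if __name__ == "__main__":
    # Print the Modelfile text; save it to a file named "Modelfile" to use it.
    print(render_modelfile("mistral:7b", 8192), end="")
```

Saving the printed text as `Modelfile` and running `ollama create mistral-8k -f Modelfile` should give a derived model that tools can call by name without setting num_ctx themselves. It still has to be repeated for each base model.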

    • @fuba44
      @fuba44 3 months ago

      @technovangelist Thank you, I will go find those videos.

    • @technovangelist
      @technovangelist  3 months ago +1

      actually, you are right. I should do another that shows that scenario a bit more. thanks

    • @sarthaksarangi1
      @sarthaksarangi1 3 months ago

      @technovangelist Can you please provide more context on how to do it for the num_ctx question?

  • @user-xh3ig8qq7j
    @user-xh3ig8qq7j 2 months ago

    Do I have to duplicate the model to add my permanent system prompt?

    • @technovangelist
      @technovangelist  2 months ago

      Creating a new model based on another with a new system prompt will mean an additional few kb. It will use the same weights layer.
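
Why a derived model costs only a few KB can be sketched with a content-addressed store, where each layer is stored under the hash of its contents. This illustrates the idea only; the `BlobStore` class and the manifest layout are hypothetical and not Ollama's real on-disk format.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Content address of a blob: the SHA-256 of its bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


class BlobStore:
    """Content-addressed store: identical blobs are kept only once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, blob: bytes) -> str:
        key = digest(blob)
        self.blobs.setdefault(key, blob)  # no second copy if the key already exists
        return key


def make_manifest(store: BlobStore, weights: bytes, system_prompt: str) -> dict[str, str]:
    """A manifest just lists layer digests; the bytes live in the store."""
    return {
        "weights": store.put(weights),
        "system": store.put(system_prompt.encode("utf-8")),
    }


if __name__ == "__main__":
    store = BlobStore()
    weights = b"\x00" * 1_000_000  # stand-in for multi-gigabyte weights
    base = make_manifest(store, weights, "You are a helpful assistant.")
    derived = make_manifest(store, weights, "You answer only in haiku.")
    print("same weights layer:", base["weights"] == derived["weights"])
    print("blobs stored:", len(store.blobs))
```

Both manifests point at the same weights digest, so the store holds one weights blob plus two small system-prompt blobs.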

    • @user-xh3ig8qq7j
      @user-xh3ig8qq7j 2 months ago

      Great, @technovangelist, but how do I do this?

  • @nevokrien95
    @nevokrien95 3 months ago

    On Linux you can just Ctrl-C out of ollama serve, and that's good enough to free the GPU memory.

    • @technovangelist
      @technovangelist  3 months ago

      Yes. Do that and wait 5 minutes.

    • @nevokrien95
      @nevokrien95 3 months ago

      @technovangelist Doesn't it force the GPU to clear? On my machine at least it seems to do that.
      Because when you kill the Ollama parent process, the OS clears the GPU memory automatically.

    • @technovangelist
      @technovangelist  3 months ago

      Ahh sorry, you are running ollama serve at the CLI rather than the recommended way of running it as a service. Got it. I think most use it the normal way, and that's what I cover in the video.

  • @Xiripyu
    @Xiripyu 3 months ago

    Ollama vs. LM Studio: which is better, and what is the difference?

    • @technovangelist
      @technovangelist  3 months ago +1

      The way I have heard most describe it is that LM Studio is great to start with but you quickly hit walls that are difficult to deal with. Ollama will let you do far more with better speed and efficiency. Plus Ollama is open source.

  • @sarthaksarangi1
    @sarthaksarangi1 3 months ago

    Can you help us out with a detailed video on how to add a new layer to an Ollama model (6:05)?

    • @technovangelist
      @technovangelist  3 months ago

      are you asking about doing the fine tuning, or adding the fine tune adapter to ollama?

    • @sarthaksarangi1
      @sarthaksarangi1 3 months ago

      @technovangelist I meant doing the fine-tuning.

    • @sarthaksarangi1
      @sarthaksarangi1 3 months ago

      @technovangelist And I'd also be very thankful if you could give me more context on adding the fine-tune adapter to Ollama. (I'm new to all these concepts and appreciate your advice 😃🫡)

    • @sarthaksarangi1
      @sarthaksarangi1 2 months ago

      Waiting on your knowledge 😉 😀 @technovangelist

  • @FloodGold
    @FloodGold 3 months ago

    Discord is Chaos, and I don't like Chaos, but I really like your videos, so thanks.

    • @technovangelist
      @technovangelist  3 months ago

      I think the Discord is lining up nicely with what the maintainer team wants. It’s not chaos, it’s activity.

  • @nathank5140
    @nathank5140 3 months ago

    You should cover whether it’s running in a container runtime. I think it is.

    • @technovangelist
      @technovangelist  3 months ago

      If what’s in a container runtime? Ollama? No, Ollama by default does not use Docker. That said, some choose to run it in Docker, and the members of the Ollama team created Docker Desktop and Docker Hub when they worked at Docker.

    • @nathank5140
      @nathank5140 3 months ago

      @technovangelist Then I’m curious how the NVIDIA drivers work, and how they are isolated and don’t conflict with my system ones.

    • @technovangelist
      @technovangelist  3 months ago +1

      It is using the system ones

  • @tiredofeverythingnew
    @tiredofeverythingnew 3 months ago +2

    Brilliant, thanks, Matt, but you do look a bit thirsty after that video; better keep your bottle closer.

    • @technovangelist
      @technovangelist  3 months ago +3

      That water bottle lights up when it’s time to drink and I still forget.

    • @synaestesia-bg3ew
      @synaestesia-bg3ew 3 months ago

      @technovangelist I have two suggestions for a new video for you:
      1) Show a way to use the models downloaded by Ollama from both Windows and Linux (WSL2) on the same machine, avoiding downloading the same models twice.
      2) What is the best hardware spec for running Ollama from a fast and reliable flash drive over USB 3.2?

    • @technovangelist
      @technovangelist  3 months ago +1

      I'm unable to do the first. I don't have a Windows machine... just 6 Macs in this house, plus a few Proxmox machines mostly running Kubernetes. And since I can't get WSL on any cloud instances, I can't get it working. If you're asking me for the best hardware spec to use Ollama... well, based on what I said above, it's hard to beat a Mac Studio.

  • @tuurblaffe
    @tuurblaffe 3 months ago

    lol at surfin' the waves xD

  • @TalkingWithBots
    @TalkingWithBots 3 months ago

    ☀🌊🏄‍♀

  • @santoshshetty6
    @santoshshetty6 2 months ago

    Ollama doesn't support parallel processing. So it can't be used for commercial purposes.

    • @technovangelist
      @technovangelist  2 months ago

      Those are two completely unrelated statements.

    • @technovangelist
      @technovangelist  2 months ago

      A well-architected solution will get you everything you want. But plenty are using it as is today for commercial use.

  • @ikjb8561
    @ikjb8561 2 months ago

    The issue with Ollama is that it does not scale beyond the user using it. Requests get queued.

    • @technovangelist
      @technovangelist  2 months ago

      Not everyone has to like it. You don’t like that it does what it’s designed to do: serve one user really, really well rather than slowing things down for everyone.

  • @HUEHUEUHEPony
    @HUEHUEUHEPony 3 months ago

    How does it not run on Windows?

    • @technovangelist
      @technovangelist  3 months ago

      Huh? It does run on Windows. Last week they released the native app, and for months before that it worked with WSL.

  • @timmygilbert4102
    @timmygilbert4102 3 months ago

    Insert a palworld joke here

    • @technovangelist
      @technovangelist  3 months ago

      Hmm, never heard of Palworld, and looking it up I don't see the connection. Can you clue me in?

  • @ShanyGolan
    @ShanyGolan 3 months ago

    Problem is that even Mixtral sucks compared to GPT-4.

    • @technovangelist
      @technovangelist  3 months ago +1

      There are some things that gpt does better and others that open source models do better. Even some of the smallest models are on par with what chatgpt offers but far faster, which is a pretty big deal. Now if you just look at benchmarks then it might look like chatgpt wins everywhere, but I haven't seen a benchmark that reflects reality yet.