Just found this channel, you are the G.O.A.T sir!
Have wondered several times how to do this. Thanks!
One area I'm interested in is local function-calling LLMs (ideally with Ollama). Would love an explanation of how to get those working, if there are any reasonable solutions yet
That’s a great idea. Thanks
I should also mention that typescript + typesafety would be a big plus as well, perhaps with zod. But really any info on plumbing open-source solutions would be great
I’m a much bigger fan of typescript. It’s so much easier for me to write than python.
Thank you for the video. It's easy to get the GGUF format now, and I was stupid enough to miss the fact that you can go with just a Modelfile and have it all in Ollama. Thanks.
This video is the solution to exactly what I was looking for. And I'm also glad to see someone else using the terminal like me :P
I wonder if it would be possible to have very small LLMs for particular use cases, for example Java, Ruby, or Python, almost like modules. That way, if only a particular programming language is required, you would only have to load that particular language model.
That is exactly the scenario I am most excited about
How do you produce a Modelfile for embedding models?
thank you so much this was really helpful
really good video 👍
What markup do I need to add to the title or description to get the blue bubbles with the tag label like "7B", "70B", etc.?
The one thing that brings me anxiety is creating and using the dataset. A video on that would be great. Love your videos.
Can you tell me more about what you mean by that? What dataset?
@@technovangelist omg! I just read what I wrote. Sorry. I meant to say: when we are required to fine-tune a model for a specific domain or for required functionality, there is the curation of data, preparing it properly, and ensuring the correct settings are used, as well as the evaluation of the dataset and testing.
What if I created a model with a new architecture or I made an architectural tweak to an existing model? In other words, something that changes the number/type of model layers or the number of training parameters, etc. Is there a path for porting a model with this kind of customized architecture to run on Ollama? What would the process be?
New architecture? No idea
(1:16) What architectures are supported? I only found these options in the comments of a PR..
1. LlamaForCausalLM
2. MistralForCausalLM
3. RWForCausalLM
4. FalconForCausalLM
5. GPTNeoXForCausalLM
6. GPTBigCodeForCausalLM
I see a lot of OCR models use `VisionEncoderDecoderModel`; is that supported?
The ollama/quantize Docker Hub page lists the following model architectures as supported:
1. LlamaForCausalLM
2. MistralForCausalLM
3. YiForCausalLM
4. LlavaLlama
5. RWForCausalLM
6. FalconForCausalLM
7. GPTNeoXForCausalLM
8. GPTBigCodeForCausalLM
9. MPTForCausalLM
10. BaichuanForCausalLM
11. PersimmonForCausalLM
12. GPTRefactForCausalLM
13. BloomForCausalLM
I seem to remember that isn't 100% accurate.
The ollama/quantize Docker Hub page lists the following architectures:
1. LlamaForCausalLM
2. MistralForCausalLM
3. YiForCausalLM
4. LlavaLlama
5. RWForCausalLM
6. FalconForCausalLM
7. GPTNeoXForCausalLM
8. GPTBigCodeForCausalLM
9. MPTForCausalLM
10. BaichuanForCausalLM
11. PersimmonForCausalLM
12. GPTRefactForCausalLM
13. BloomForCausalLM
Looks like you sent that twice. Sorry, comments wait for me to approve them. In the distant past there was a spam problem, and doing it this way I ensure that I can answer every question that comes in. The interface for comments is worse than the Ollama Discord, and I want to make sure I address everything that comes in.
Great video! How do I add tags (what do I have to type in the terminal) so that I can upload different quants to the same Ollama model repo?
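For anyone else with the same question, a rough sketch of the idea: tags are just the part of the model name after the colon, so you build one model per quant and push each under its own tag. The names below (myuser, mymodel, the Modelfile names) are placeholders, and this assumes you are already signed in to ollama.com with your key added:
ollama create mymodel-q4 -f Modelfile.q4
ollama cp mymodel-q4 myuser/mymodel:q4_0
ollama push myuser/mymodel:q4_0
# repeat with a Modelfile pointing at a different quant for myuser/mymodel:q8_0, etc.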
Thanks!
Wow. That’s the first time I have seen one of those. Thanks so much. I don’t know what to say. Thank you.
@@technovangelist I hope, no, - I'm sure you will see more of those. Just continue.
Hi,
I am trying hard to get an embedding model for the German language going on Ollama. The architecture in the config file is named "BertForMaskedLM".
I assume it does not work because the architecture is not supported. I have two questions regarding that:
- can you tell me where I can find a list of architectures supported by ollama? I am unable to find one.
- is there a way to get it working with ollama even with the named architecture?
I don't know where that list is. And there's not a way that I know of.
Hey, thanks for the video. I'm trying to quantize a Llama 3 model with the Docker image shown in the video, but I think it is not supported. Will the Docker image be updated?
Hello Matt. I am working with an already quantized model in EXL2 format. Since it is already quantized, I wanted to make it compatible with Ollama. I created the Modelfile and ran the ollama create command, but I am running into an error: "unsupported content type: unknown". Could you help me out? Or at least let me know if it is even possible to convert from EXL2 to GGUF?
I don’t know of any converter. May be easier just to convert from the original to gguf then quantize. Even really big models take just a few minutes with normal hw.
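A minimal sketch of that convert-then-quantize route, assuming a llama.cpp checkout with its Python requirements installed (the script and binary names have changed across versions, so treat these as approximate):
python convert-hf-to-gguf.py /path/to/original-hf-model --outtype f16 --outfile model-f16.gguf
./quantize model-f16.gguf model-q4_0.gguf q4_0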
I am quite new to all of this and there are some things that I don't understand. First, when I run the command hfdownloader -s : -m it tells me that hfdownloader is an unknown command, even though I downloaded the executable. Secondly, I don't understand what you mean at 1:54 when you say to go to where you want your model to be downloaded, since you don't show any folder being selected, only the terminal. Could you please explain? Thank you in advance.
hfdownloader probably isn't in your PATH. For the second part, you have to decide where to download something, same as with downloading anything from the Internet. Choose that place and run the command there.
@@technovangelist Thanks so much for such a quick answer; however, things are still not so clear for me. How do I put hfdownloader in the PATH?
I suggest you look for tutorials on how to work with your OS, especially the command line.
./hfdownloader
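In other words, either run it with an explicit path from the folder it sits in, or put the binary somewhere already on your PATH. A rough sketch for macOS/Linux (the folder names are just examples):
cd ~/Downloads && ./hfdownloader ...
# or move it onto the PATH so the bare name works from anywhere
chmod +x hfdownloader
sudo mv hfdownloader /usr/local/bin/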
Thank you for this!
This is nuts! How large is the ollama team? How many programmers?
I think it's just this guy
There are a few on the team, but under 10. I'm not one of them.
I wish you could do a video on how to fine-tune TinyLlama for inference in Ollama.
Yes. I really want to do one on this.
@@technovangelist thanks in advance.
Very nice sharing 👍
Have a nice day 😊
Thanks
Thank you so much for such a great video❤❤
Hello sir, whenever I run local models they are not using my VRAM. I checked the usage in Task Manager and it is not much. Does making the GPU the default increase the speed of the LLM, and how can I make my VRAM the default for running any local LLM? Sorry, I only have 4 GB of VRAM on an NVIDIA GTX 1650.
Looks like it supports the 1650 Ti but not the 1650. You'd need to upgrade to get that. Nvidia doesn't support it with their drivers.
Can models of all architectures be converted to GGUF, or is there a specific list?
Not everything but a lot of them can.
@@technovangelist I need speech to text. Do you know of any model that can be converted to GGUF, or can you help if there is one? I would be highly grateful.
Ollama is for text to text and text/image to text. For speech to text take a look at OpenAI’s Whisper models that you can install locally.
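For reference, Whisper runs fine locally outside of Ollama. A minimal sketch, assuming Python and ffmpeg are installed and with recording.mp3 as a placeholder file name:
pip install -U openai-whisper
whisper recording.mp3 --model small
There is also whisper.cpp if you prefer a llama.cpp-style local build instead of the Python package.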
What's the best file format for RAG? Is there a list of what works best?
Where can I find the architectures that are compatible with Ollama?
I don't know if there is any one list anymore
Where can we find compatible models other than HuggingFace? Are TensorFlow HUB formats compatible?
I don’t know about that. But it’s safetensors and PyTorch models that are supported.
This was such a great video just perfect for what I wanted to learn next. And this new model would work automatically with the api/chat and api/generate? Let's say we follow these steps on the Whisper model and somehow it works. The Whisper model has a translate() method. How would we add an api/translate custom endpoint to the Ollama API? Previously I used a nginx container to make a proxy so I could add my custom endpoint to that proxy. The Ollama API is golang. It would be great if there was some kind of plugin folder that I could drop Golang scripts that the API would automatically include. With the docker mount, if there was some predefined named mount point that Ollama monitored to automatically pull in your API additions that would be great!
If a model works on ollama it gets all the endpoints. But a non llava model can’t do image stuff. The whisper model doesn’t have the endpoint. It’s the runner in front that does.
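To make that concrete, an imported model can be hit through the standard endpoints like any other model; a quick sketch assuming the model was created under the name "emoji" (any name you used with ollama create works):
curl http://localhost:11434/api/generate -d '{
  "model": "emoji",
  "prompt": "Summarize this sentence with emoji: I love pizza.",
  "stream": false
}'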
This should really include some resources to go to if one runs into problems. I've noticed that a lot of these videos you do are a bit too narrow in focus to actually help people get into this, unless they were already into something of a similar nature.
Lol, I'm still stuck on a lot of things from these videos. I really just wanted to do one of his recent videos incorporating web search into local Llama responses, but there were no real instructions.
Thanks Matt!
This is way too woofin' hard for me. I thought I'd simply right-click and Save As a file on Hugging Face and save it to whichever directory Ollama wants them in. But I need to convert them to work, and select a quantization type? My thoughts get hazy as soon as simply getting and placing a file requires me to learn CMD commands. = ,=;
It's not for everyone. It's a dev tool first and requires some basic CLI skills...
@@technovangelist That's fine, I'm going to try out LM Studio next, Oobabooga was quick and easy to use but FlashAttention in it only supports GPUs as new as Ampere now, unlike my 1080, so I'm looking for something quicker that I can use.
Omg. Ollama is sooo much easier than either of those. No question
Just get the model you want from ollama. Too much work getting them from hf
@@technovangelist respectfully, I’d disagree. I have some custom finetuned and merged models and corresponding quants that I made but I can’t quite figure out how to convert them to a Modelfile. Maybe it’s the Go syntax throwing me off. But incidentally, since these are GGUF files, couldn’t the relevant instruction metadata be imported automatically? And the GGUF parameters templated into the generated Modelfile? Maybe I’m missing something, but why not just use llama.cpp/GGUF files directly? I’m quite surprised how much trouble I’m having despite Ollama being a quite popular tool.
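For anyone stuck at the same point, a minimal Modelfile wrapping an existing GGUF quant can be very short. This is only a sketch; the file name, template, system prompt, and stop token are placeholders you would adjust to your fine-tune's chat format:
FROM ./my-finetune.Q4_K_M.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
SYSTEM """You are a helpful assistant."""
PARAMETER stop "</s>"
Then create and run it:
ollama create my-finetune -f Modelfile
ollama run my-finetune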
How do you convert GPT-2?
This guy is a beast, when does he sleep?! Keep it up!!! Thanks!!! I was just wondering today how to do that, especially since many models don't come in this standard format that is easy to import into Ollama.
Will Ollama support the OpenAI HTTP API format, so it can be integrated more easily into AutoGen Studio? ^^
Will Ollama support easier RAG and web requests from the console?
So will it support OpenAI? I doubt it. Will Ollama do RAG on the console? That's a bit out of scope for the project. But there are plenty of extensions that are doing it well.
Re OpenAI: just kidding, watch out for the next release video.
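That tease turned into an OpenAI-compatible /v1/chat/completions endpoint in later Ollama releases. A rough sketch, assuming a local server and a pulled mistral model:
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "mistral",
  "messages": [{ "role": "user", "content": "Say hello in three languages." }]
}'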
That was good! Thanks! I'd be interested to see how to make quantized LLaVA models. Even better if it could be the new moondream1 vision model by vikhyat.
Interesting. Thanks.
Great video! I followed this video just to try running Ollama with Mistral and the emoji model mentioned here, but it is painfully slow on my Windows 11 machine even though I have 22 GB of RAM and an AMD 6800S video card. Is anyone facing the same issue on Windows? I tried running it using WSL2 and it was a bit faster, but still slow compared to what these videos show. Any suggestions? It ran on my M1 Mac with 16 GB and it's even faster than WSL2.
I assume that's an AMD card. I don't think AMD support is enabled yet on Windows.
@@technovangelist Could be. Also, Windows Defender is going crazy with ollama.exe: it detected Trojan:Script/Wacatac.B!m and will remove Ollama. It is concerning that the installer has a trojan!
The Ollama team is working with Microsoft to get them to fix Defender, because it is broken on this. There is no trojan in Ollama, or in the hundreds of other tools built with the latest Go compiler that this false positive affects.
Issues with architectures that aren't supported.
unknown architecture LlavaQwenForCausalLM
Yup, can't work with architectures it doesn't know about.
This seems like a pain in the neck!
For models not already on Ollama, this process is a single command and done in 5 minutes. It's pretty painless. And this process goes away soon. It's an old video. But it's still faster than anything else out there.
Please keep the boomer jokes to yourself.
Had to watch it again. I don’t have any jokes in this one.
Oh was that the problem? I didn’t include any. Got it
It already was a model. How do you make your own model from scratch? Nobody knows, LOL. And we don't need to do it, because you did it and shared it. Someone takes your model and quantizes it again, LOL. Nobody starts from scratch.
The example I showed was just the model weights. In Ollama, a model is everything needed to make it useful. The weights are just a part of it, along with the system prompt and template. There are plenty of places that show how to make a model from scratch. The downside is that no one has done it for less than $100k.
@@technovangelist It can be done, it just takes long. A PC costs 4k; how many years would it need to train? LOL. What I kind of meant was: how do you make a model where you say "hi mom" and the AI answers "hi son"? That would not take long. I just want to know all the commands, LOL.
Let's train a model: use AI to make the questions and answers, then train on them, lol. I don't get it xD, we already had Jarvis.
PS C:\programmazione\ollama\ollama-ita> docker run --rm -v .:/model ollama/quantize -q q4_0 /model
/workdir/llama.cpp/gguf-py
Loading model file /model/model-00001-of-00004.safetensors
Loading model file /model/model-00001-of-00004.safetensors
Loading model file /model/model-00002-of-00004.safetensors
Loading model file /model/model-00003-of-00004.safetensors
Loading model file /model/model-00004-of-00004.safetensors
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, f_norm_eps=1e-05, n_experts=None, n_experts_used=None, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=, path_model=PosixPath('/model'))
Traceback (most recent call last):
  File "/workdir/llama.cpp/convert.py", line 1658, in <module>
    main(sys.argv[1:])  # Exclude the first element (script name) from sys.argv
  File "/workdir/llama.cpp/convert.py", line 1614, in main
    vocab, special_vocab = vocab_factory.load_vocab(args.vocab_type, model_parent_path)
  File "/workdir/llama.cpp/convert.py", line 1409, in load_vocab
    path = self._select_file(vocabtype)
  File "/workdir/llama.cpp/convert.py", line 1384, in _select_file
    raise FileNotFoundError(f"{vocabtype} {file_key} not found.")
FileNotFoundError: spm tokenizer.model not found.
anyone?
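If anyone hits the same error: that convert.py step is looking for a SentencePiece tokenizer.model file, which Llama 3 style repos do not ship (they use a BPE tokenizer in tokenizer.json instead). A possible workaround, assuming a current llama.cpp checkout rather than the old ollama/quantize image (script names vary by version), is the HF-aware conversion script, followed by the same quantize step sketched earlier in the thread:
python convert-hf-to-gguf.py /model --outtype f16 --outfile model-f16.gguf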