Please consider a microphone. It's really difficult to understand you with the mic/noise gate cutting out so often. Thanks for the video.
Thanks, I've been making improvements to my mic in my more recent videos.
Had Ollama working great last night, but this morning I realized I must have been running it off the net or something, because it was gone. I spent 5 hours planning my home network's future upgrades to totally deck out my setup, and it's gone. All that typing, and it's gone.
I was just looking at a Supermicro H11DSI-NT REV 2.0 motherboard + 2x AMD 7B12 3.3GHz 64C/128T 240W CPUs + 16x32GB DDR4-3200 RAM (512 GB) for about $140 on AliExpress. That's a dual-CPU board, for a total of 256 threads! I was wondering how well it would run Llama 3.1 70B, and whether it would even attempt the Llama 405B. I kind of need that machine because I train XGBoost models on CPU, but I can't pull the trigger; it seems too good to be true. Also, it's worth trying the llamafile versions of the Llama models. Llamafile is a Mozilla project that packages models to run much faster on CPU.
It's a good question, and it's hard to gauge whether it would be fast enough, especially for large models running from RAM on the CPU. The electricity cost might also be quite high depending on where you live. GPUs will always outperform a CPU, so if the board has plenty of PCIe slots you could install GPUs on it as well. If it doesn't work as intended, you can still use it for other things like file storage, training models on your own data, and other server tasks. There's a rough memory estimate for the 405B question sketched just after this thread.
@chris_php Thank you for your answer. I want to buy it, but only if there is any chance it will run the Llama 405B; even a bit slow, like 3 to 5 tokens per second, would do it for me. One other interesting thing you might find worth sharing is a new Android app called LM Playground. It can run local models on Android devices; I'm running Llama 3.1 8B on my Galaxy Note 10 at good speeds, and Gemma 2 9B at very low speeds. It heats up the phone quite a bit, but it runs.
Interesting, I might have to check out the app and see what it's like.
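On the 405B question: here's a rough back-of-envelope sketch I use to see whether a quantized model would even fit in RAM. This is just my own assumption (weights dominate memory, plus ~15% overhead for KV cache and runtime), not a benchmark, and the bits-per-weight figures are typical values for common quantizations rather than exact numbers.

```python
# Rough, hedged estimate of whether a quantized model fits in RAM.
# Assumption: weights dominate memory; overhead factor (~15%) is a guess.

def model_ram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.15) -> float:
    """Approximate resident size in GB for a quantized model."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for name, params, bits in [("Llama 3.1 70B  @ ~Q4", 70, 4.5),
                           ("Llama 3.1 405B @ ~Q4", 405, 4.5),
                           ("Llama 3.1 405B @ ~Q2", 405, 2.5)]:
    print(f"{name}: ~{model_ram_gb(params, bits):.0f} GB")

# Roughly: 70B @ Q4 -> ~45 GB, 405B @ Q4 -> ~260 GB, 405B @ Q2 -> ~145 GB.
# So 512 GB of RAM can hold even 405B at Q4, but on CPU alone the speed
# would likely be well under your 3-5 tokens/s target.
```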
I have my eye on a ProLiant Gen8 server with 393 GB of RAM (dual socket, 12 threads each). I know more RAM would handle more parameters, but would more RAM speed up simpler models?
More RAM by itself won't speed things up; RAM bandwidth is what matters, since the whole LLM has to be read through for every token, so the smaller the model the quicker you get a response. If the LLM is 40 GB in size and your memory bandwidth is 40 GB/s, that works out to roughly 1 token every second.
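A minimal sketch of that rule of thumb, assuming token generation is memory-bandwidth bound (which it usually is for CPU inference); the bandwidth and model-size numbers below are just illustrative, not measured.

```python
# Rule of thumb: tokens/sec ~= memory bandwidth / model size,
# assuming generation is memory-bandwidth bound (typical for CPU inference).

def est_tokens_per_sec(model_size_gb: float, mem_bandwidth_gbs: float) -> float:
    return mem_bandwidth_gbs / model_size_gb

# Illustrative numbers only:
print(est_tokens_per_sec(40, 40))    # 40 GB model, 40 GB/s bandwidth        -> ~1 tok/s
print(est_tokens_per_sec(4.7, 40))   # ~8B model at Q4 (~4.7 GB), same RAM   -> ~8.5 tok/s
print(est_tokens_per_sec(40, 200))   # 40 GB model, fast quad-channel server -> ~5 tok/s
```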
Hi Chris! Great video. Do you know what the minimum specs are for setting up an older system with no GPU to run Ollama locally at a reasonable speed? I know there are many variables here, including the size of the LLM, but if we choose a small-to-medium model and assume RAM is not an issue, how many cores and what clock speed would give a generation speed that doesn't require a nap while you wait? I also recently picked up an old NVIDIA GRID K1 (16GB VRAM) extremely cheap and could not get it to run with Ollama. My workstation is a Dell T5600 with two E5-2690 CPUs and 128GB of DDR3 RAM. I could not use the GRID K1 in this unit because these older Dell workstations will not even POST with one installed. I am not planning to leave the system on 24x7, so I'm not counting energy costs as significant for short runs. FYI, the RAM was gifted to me and the T5600 was under $100 US, so I really could not afford not to try...
Hello, that's a good deal you got on that system, and its speed might already be decent since those CPUs have the AVX instruction set, which greatly speeds up Ollama. Generally, the more cores the better, since it's a lot of data to work through. The speed of the DDR3 RAM will matter and might be the bottleneck, but this system should be fine running a 7B or 13B at a good speed. Since the CPU already has AVX, there's no need to disable it in generate_linux.go.
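If anyone wants to check whether a given CPU has AVX/AVX2 before committing to a build, one simple way on Linux is to read the kernel's reported CPU flags. A small sketch (Linux-only; on other systems you'd check the CPU's spec sheet instead):

```python
# Check for AVX / AVX2 support on Linux by reading the kernel's CPU flags.

def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("AVX: ", "avx" in flags)
print("AVX2:", "avx2" in flags)
```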
An old PC with 128-512GB of DDR3 RAM + a lot of PCIe slots + some cheap Nvidia P-series cards can lift a heavy workload without issue, and fits a tight €1000 budget.
Yeah, you can do some good work with a setup like that; it's also decent for training your own models on your own data.
What would you recommend for running a chatbot trained on website data locally? Thanks for the video.
If it's a small model like a 3B, you don't need much RAM, around 8GB, so it can even run on a lower-end graphics card like a 2060 and give quick responses. There's a quick example of querying a local model through Ollama's API just after this thread.
Will do once I get more RAM xD
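Once you have a small model pulled in Ollama, you can wire your own chatbot to it through its local HTTP API on port 11434. Here's a minimal sketch; the model name and prompt are placeholders, and grounding the bot in your website data would be a separate retrieval/indexing step on top of this.

```python
# Minimal example of querying a locally running Ollama server.
# Assumes Ollama is running on the default port and a small model
# (here called "llama3.2" as a placeholder) has already been pulled.
import json
import urllib.request

payload = {
    "model": "llama3.2",  # placeholder: use whichever model you pulled
    "messages": [{"role": "user", "content": "Summarise what this site sells."}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["message"]["content"])
```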
I have plenty of old servers, but the requirements have other hidden dependencies, such as AVX2 or better. AVX2 arrived in Intel CPUs in 2013 with Haswell, so the CPU in your server could predate it.
Yes, the CPU predates any AVX, which contributes to its very slow speed and is why I had to disable AVX entirely to get it to run at all.
HP ML350p with 256GB RAM and dual 1200W power supplies: two dedicated x16 slots, plus one x16-at-x8 slot for the NVLinked pair. The video cards are up to you as far as speed and power consumption go, but A5000s should do for the linked pair, or A6000(s) in the dedicated x16 slots. Three video cards will leave you short on slots for PCIe accelerator cards and high-bandwidth connections, like to your mass storage. Also, expect your internet provider to complain if your LLM has access to the web; you may have to upgrade to a business plan. These programs are hard on SSDs, so choose accordingly, or your consumer NVMe will be written to death in months; have at least a big backup platter (HDD) and back up often. Petabyte-write-rated drives.
So yeah, but hella slowly
This generation of server is e-waste, and anything with dual CPUs of this generation is worse still. The minimum would be E5 v3/v4 with quad-channel memory, not the cheap AliExpress remade boards.
A 20B at Q5 on an E5-2660 v3 is usable, but the cash would be better spent on a P40.