Check out: FULL R1 671B Testing here th-cam.com/video/yFKOOK6qqT8/w-d-xo.html
Notes on FULL R1 671b local testing: digitalspaceport.com/running-deepseek-r1-locally-not-a-distilled-qwen-or-llama/
I wish you had also attempted to compile and run the Flappy Bird code on the 671B model.
You didn't test DeepSeek R1, you tested R1-Distill-Qwen-14B, a distilled version of the Qwen2.5 model. This is not R1 (which is fantastic); this is a small distilled model, Qwen (at Q4, no less...)
Why do you say that?
@@jesusleguiza77 Because these results are nowhere close to R1, yet it is being framed (broadly) as R1 testing with chapter titles like "Code Testing Deepseek r1"
Yup, just saw these comments this morning and I pinned a detailed reply on a very similar mention about these being distilled versions. Will be doing a follow-up video with the real deal as soon as I can get my exo cluster humming.
Qwen 2.5 distilled is Qwen 2.5 distilled. R1-distilled is R1-distilled; don't mix them up. It's still a smaller version, but Qwen 2.5 is a totally different base model.
@@dubesor almost everyone on TH-cam seems to be doing the distilled versions unknowingly
"the testing is fundamentally broken and does not matter to real world use cases"
Amen brother. Glad people are catching on to this
Can you turn your volume up to match other YouTube videos please? I have to double the volume on my TV and it still sounds quiet. Please, thanks.
Agreed, it's quite low.
I have it at 70% on my notebook and it is pretty loud.
Yeah, I forgot to run the YouTube normalizer in DaVinci on this one. Happens from time to time.
You can make a template to omit the <think> part. Also, why are you comparing DeepSeek R1 Qwen distill 14B Q4 against Claude (11:15)? Your video title should reflect the model you are testing. The distill models are just fine-tunes of Qwen and Llama.
BIG NOTE: Although these models are trained to internally "reason", this is just one part of test-time compute. Adding an external chain-of-thought generator and deep context handling is what will be truly required to make them useful for everyday tasks. This is what o1+, as well as Perplexity and other products are doing more explicitly and what is coming to open source imminently.
@vladislavdonchev1271 I'll add there also needs to be an additional layer here: test-time training. Otherwise, these models will keep going through the same steps over and over even if they knew the answer before. The problem with reasoning models, and most LLMs for that matter, is that they don't know when to stop and cut off the chain of thought at the correct response. Also, this would help so that you won't necessarily have to prompt the model the same way every time if it encounters a similar scenario repeatedly.
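To make the "external chain-of-thought generator" idea above concrete, here is a minimal sketch of a wrapper loop: generate reasoning, critique it, revise, and stop once the critic is satisfied. It assumes a local Ollama server at its default port and uses `deepseek-r1:14b` as a placeholder model name; swap in whatever you actually run.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"   # default local Ollama endpoint (assumed)
MODEL = "deepseek-r1:14b"                        # placeholder: use whatever model you've pulled

def chat(prompt: str) -> str:
    """One-shot call to a local Ollama model; returns the reply text."""
    r = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    })
    r.raise_for_status()
    return r.json()["message"]["content"]

def solve_with_external_cot(question: str, max_rounds: int = 3) -> str:
    """Generate reasoning, critique it, revise, then extract only the final answer."""
    reasoning = chat(f"Think step by step about this problem:\n{question}")
    for _ in range(max_rounds):
        critique = chat("Critique the reasoning below. Reply with just DONE if it is "
                        f"already correct.\n\n{reasoning}")
        if critique.strip().upper().startswith("DONE"):
            break  # crude stop condition so the loop doesn't re-reason forever
        reasoning = chat(f"Revise the reasoning using this critique:\n{critique}\n\n"
                         f"Original reasoning:\n{reasoning}")
    return chat(f"Using this reasoning, give only the final answer to '{question}':\n{reasoning}")
```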
Love this content, and super jealous of your rig!!! I would say it's worth noting you are running the distilled models, not exactly DeepSeek R1. Honestly the models you tested are the ones I care about the most; just noting I first thought you were going to run the actual DeepSeek R1.
Regarding coding, they did say in their paper that they did not do a lot of RL on coding/engineering reasoning for R1, so it didn't get a huge improvement over V3. They mention that adding RL on reasoning for coding is one of their next tasks. So a lot more improvement to come yet.
This is not R1 🙄
I'm aware. They updated the model page a few days ago it looks like and added a lot of information that was not there prior. Really not sure why this wasn't included day 1 on that page, but I am working to get an exo cluster running to test the real version out. I did read the card for ctx but skipped right past the base being different; that's totally on me. I clearly do state in my videos I am not an AI expert nor is my testing scientific, and this would be a good example of why. Apologies for this review, but clearly it's pushing me to test out a new engine, so that's something decent that came from it. Future video to address this very soon.
@@DigitalSpaceport I am using open-webui like you are and also have the 14b model but I am confused what both of you are referring to. All I could gather from context (since I have very little experience) is the line of the model description on the ollama models page for Deepseek being "general.basename DeepSeek-R1-Distill-Qwen" and the line on the README being "Below are the models created via fine-tuning against several dense models widely used in the research community using reasoning data generated by DeepSeek-R1."
So is this just someone uploading a replica of Deepseek? Supposedly the actual Deepseek authors have uploaded their version to huggingface but I have not been there to check. For all I know, the authors of models typically may only share specifications for hobbyists to use to build usable models, but again, what do I know? I am just getting into this stuff. Thanks for creating content like this. This is the first video that has introduced me to your channel which I have subscribed to and of course hit like on. Cheers!
@@DigitalSpaceport Can't you update the title?
This is AMAZING. 8B on my Mac Pro M2, and its intelligence is up there with Sonnet 3.5, and it's FAST AF.
I've put the LLAMA architecture 8B model on a GTX1080 with 32GB system RAM in LM studio and it is frankly astonishing and indeed fast.
I'm running the 8B well and the 13B "ok", and wow, there's just something about this LLM. Running in LM Studio and hitting it from AnythingLLM. 45 tokens/s on the quantized 8B on my RTX 3060 and 12-year-old i7. EDIT: Forgot to mention that the coding capability is decent. It understands grammars. It made sense of a complex EBNF grammar I made a few years ago for a custom language. It was smart about neurochemistry, some C#, some high-level conceptual thinking. Just gotta say, I have never had this type of interaction with an LLM on my local machine before. It feels like o1 a bit, even on my very limited 8B. I found it sensible and I'm looking forward to automating it. It's going to replace llama3.2 as my go-to creative helper. The [thinking] block is brilliant, and if you just hit your LLM host with some Python you can hide the [thinking] block.
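For reference, hiding the reasoning with "some Python" really is only a few lines, assuming the model wraps its thinking in `<think>...</think>` tags the way the R1 distills served through Ollama do:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove the model's <think>...</think> reasoning block and return only the answer."""
    # DOTALL so the pattern spans multiple lines; non-greedy in case of multiple blocks.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Counting the letters one by one...</think>\nThere are 3 p's in 'peppermint'."
print(strip_thinking(raw))   # -> There are 3 p's in 'peppermint'.
```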
@@tristanvaillancourt5889 I am trying to get a cluster up with exo to increase the VRAM amount. The compose setup isn't working out great.
What's your experience with Phi-4 14B in LM Studio?
@@DigitalSpaceport Well, I changed my opinion of this LLM. Also, no LLM passes my test, described below. The reasoning is pretty good, but for my budget build I can't get any significant code generation out of it. It's ok for short snippets. My test is always getting the LLM to generate a "B+ tree implementation" in C# or something like that. NO LLM HAS EVER PULLED OFF THE "Delete()" METHOD. o1 can't do it. The B+ tree's delete-node method requires next-level thinking. It must re-balance the tree as needed, and so far, no LLM has managed it. The code always fails under test and edge cases are always missed. Back to DeepSeek R1... now that I've played with a dozen different local LLMs since last week, I can say that it has a place in my home setup, but for my budget setup Mistral 7B, Dolphin 3.0 and some BERTs are more useful for automation and chatting. So what's the use case for DeepSeek R1? Planning with RAG and agents?
So 14B is too small for this model... but wouldn't the 32B fit for you?
That's what surprised me. And 4 bit no less.
Could you even run that with this hardware?
@@MelroyvandenBerg I can run the 32b on my computer, it's just very slow. I don't have a multi-GPU setup, just a "gaming PC".
The full-size DeepSeek model does get the peppermint question right. The 14-billion model just doesn't have the power to do it properly in spite of the deep thinking.
How come you only test the 14B version? Obviously bigger ones would be closer to Claude
Great video. Do you have any videos for an absolute beginner? I have a pretty beefy system now. I also have quite a few GPUs I can use. Would like very much to get started in AI. Just don’t know really where to begin.
On my Mac, deepseek-r1 + Ollama is broken based on my internal benchmarks. A simple test is to ask it: if it takes 3 hours to dry a shirt on a clothes line, how long will 10 shirts take? If it fumbles all over itself, it's broken. What I mean by fumble is it continues to second-guess itself rather than give a confident answer. When I run LM Studio with the MLX versions of DeepSeek, it gives correct answers with confidence. If you have access to LM Studio, try the DeepSeek models on there.
This is great, thank you for doing this! Have you tried to check your ports and see if it is doing any external communication? I'm VERY concerned that this is Chinese; I don't care if they are private.
There is a big problem with the reporting of Deepseek R1. The Ollama versions are not really Deepseek, with the exception of the 680GB 4bit (half-precision model). They are the distilled versions (llama/qwen) and they are also 4bit. This makes a huge difference when testing prompts. LM Studio now offers the Deepseek 680GB 8bit model as well as lesser quants. I wish streamers would inform people on this issue, because it is quite misunderstood. Only the largest model is a true Deepseek architecture.
Wait one! "2 vowels" is correct in terms of your query. Only two vowels, "e" and "i", appear in peppermint. Now, were you to ask >how many times do vowels appear in "peppermint"...< Yes, it does miss the double "p", which is a common LLM weak spot.
Regarding the driving problem, I think the explanation is a simple case of GIGO, as Phi-4 made the same type of mistake, and using kilometers in some way doesn't fix the problem. To simplify checking, I tweaked the problem to include "using the Interstate highway system" so I could compare Google Maps' answer(s). I find it odd that Phi-4 and DeepSeek both ignore using I-12 to avoid NOLA (saves 20+ miles, and driving I-10 in NOLA sux).
ADDED: So I said here's the latitude and longitude for Austin and Pensacola. DeepSeek-R1:32b went off to compute the great circle distance (a rigorous answer) between start and finish. In the end, it essentially read back to me my input of "Google Maps says", with no explanation of the error. In fact, it still flunked the quiz with "Via I-35 to I-10 :
This route takes you from Austin north on I-35 to San Antonio, then east on I-10 toward Pensacola. According to Google Maps, this is approximately 762 miles ." Last time I drove from Austin to Baton Rouge on I-10, I drove >southwest< on I-35 to get to "San Antone".
IMNSHO DeepSeek's "thinking" is eyewash to amuse us organic, wetware-based units. In this investigation, and using it to analyze large pieces of text, what it finally "said" and what the "thinking" showed didn't quite match. Further, both the 70b and 32b versions spend a lot of time heating my 4090 before putting anything up on Open WebUI. What happens before the "</think>" tag, only DeepSeek-R1, and maybe the developers, know.
Compared to Llama 3.3:70b, when it comes to DeepSeek, it gets the "deep six" (rm).
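For what it's worth, the great-circle number the model chased is easy to reproduce with the haversine formula. The coordinates below are rough approximations for Austin and Pensacola (my assumption, not taken from the comment), and the road distance will of course be longer:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in statute miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))   # 3958.8 mi is the mean Earth radius

# Approximate coordinates for Austin, TX and Pensacola, FL.
print(round(haversine_miles(30.27, -97.74, 30.42, -87.22)))  # roughly 630 miles as the crow flies
```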
Thanks for your awesome videos; I'm also trying to build a similar rig. But I guess taking a model that has been distilled from 671B FP16 to 14B and then quantized to Q4 (a 192-times reduction in size) just cannot have the same results as the original.
I think it would make more sense to test the 671B one on your 7995WX, even if it needs hours to respond, just to see if the model is really comparable to o1 given the required hardware. I mean, it's not DeepSeek's fault that we don't have a couple of H200s lying around to play with :)
There will be over 100 comments dumping on this model, but I think it has merit. I appreciate these chatty Kathys because getting a glimpse of their reasoning helps me write better prompts, and I put a lot of effort into making better prompts, since "better text in" is most of what I can do to get better text out.
70B models are the sweet-spot size; hopefully you get to test it.
The peppermint question may be a prompt issue. You specified how many p's there are in the word, and how many vowels. There are only 2 vowels out of 5 possible vowels. What if you adjust the prompt to indicate how many a, e, i, o, and u are in the word? The LLM may have interpreted "how many vowels" as how many unique vowels there are in the word.
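Either reading of the question is easy to check directly; a quick sketch:

```python
word = "peppermint"
vowels = set("aeiou")

print(word.count("p"))                            # 3 p's
print(sum(ch in vowels for ch in word))           # 3 vowel occurrences (e, e, i)
print(len({ch for ch in word if ch in vowels}))   # 2 unique vowels (e, i)
```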
You just got another subscriber. Good video mate 🎉
Thanks for sharing your knowledge with us! I just built a similar rig inspired by yours, featuring 4x 4090s on the same board (Rev 3) and an Epyc Milan 7C13 CPU, paired with Noctua NH-U14S TR4-SP3 air cooler. I opted for the C-Payne SlimSAS Gen4 risers after reading about potential issues with traditional ones, and they're working flawlessly. One tip I'd like to pass on is to definitely consider adding a second PSU - I've got two AX1600i PSUs in the same chassis as yours. I'm using the ADD2PSU board to power on the second PSU, which is dedicated to 2/4 4090s (just a heads up: make sure to use one PSU per GPU and avoid mixing/matching). Thanks again for the inspiration, and I hope this helps others building their own rigs!
@@n.lu.x does the Gen4 PCIe riser work? Or does it have to be downgraded to Gen3?
@@wayne8863 It works with full PCIe 4.0 bandwidth; I've tested it with the p2p script from nvidia-samples (GPU-to-GPU memory transfers). I've also run stress tests with no issues!
@ I don't know why but my comment keeps getting deleted. In short, it works in gen4 mode
It would be nice to figure out how to add FIM (fill-in-the-middle) tokens for text autocompletion. It's challenging, since it actually would be nice to view the <think> ... </think> block before the final model output.
Additionally, it would be nice to figure out how to add tool-calling support. It could be really good for coding: the model, within its chain of thought, would be able to test the code that it has written.
What app is he hosting on the local server that mimics ChatGPT? E.g., he can choose the model, save his chats, etc. I've seen it once but I don't remember the name of it. It's time I host it on my own.
@@mochalatte3547 it's called 'Open WebUI'
@@mochalatte3547 Open WebUI
It's called Open WebUI
@@Unfinished_Projects Thanks! I saw it earlier on this feed (what you need).
Hi, I hugely appreciate that you provide translations into different languages; in return, you have one more subscriber. I ran the tests with the 671B-parameter model and the precision and clarity of the answers is mind-blowing. I know using the 14B model is important for you, but it doesn't do justice to how superlative the full model is compared to o1 or Claude 3.5. Plus, you can customize or fine-tune the underlying models to your own requirements. I think DeepSeek takes the crown for leadership in AI models, both on the model side and on pricing. Personally, it became my benchmark for comparison, just as o1 was until December, then replaced by Gemini 2.0, and now by this monster. A temporary goodbye to OpenAI!
Note: We also shouldn't overlook Hailuo's MiniMax with its new models...
What is the neat utility you have running in the powershell window that provides all the GPU information, updating regularly?
nvtop
@@MikeHowles which also works for AMD.. Just saying
It would be RL-trained on very specific benchmarks. Have you tried questions closer to its training set? That might rule out overfitting vs. the low quant.
Thanks for the video. Please fix the mic; your voice seems off. Mic issue? Thanks anyway.
For reliable answers, 4 bits is not enough when you compare an LLM to OpenAI 🧐.
Yes. The 32b at 8-bit works amazingly well. Hopefully it gets tested here.
I would like to indeed see the same test with more parameters.
And maybe also more lower language games?
did I catch that you are using a 2 bit model of DeepSeek? You do know they run faster but the accuracy drops a lot.
Thanks for being honest instead of jumping on the hype train.
I'm also a newb, but if you want quality answers, limit your context window to save VRAM since you aren't using any long context with your questions, and stick to Q8 or better with the largest model possible.
Big fan of the channel, but I don't think this was a great test for R1. I've tested with the API for the full-sized version and it is very good.
Do you think you could get the full-sized version working completely locally? I'd love to watch that video!
So as it turns out... that's not R1. It's a fine-tune, which explains a lot. TBF the Ollama page didn't mention it originally when I visited it, like it clearly states now. I did read the model card to ensure the ctx but failed to read that it had a different base. Pinned a comment on this, but yes, will be redoing it. Trying to get exo working as it's the best chance for locally hosting it. In short, this video is obsolete, and that likely explains the poor performance of the LLM.
@DigitalSpaceport that makes sense. Looking forward to the exo vid to see what this thing can really do!
Man, you need full context only when you really need it. Otherwise, always prioritize model thinking capabilities (model size given the same model family) and ensure that you have just enough context for the job at hand. For all your tests 4k will be more than enough. For my questions with PhD-level math a 16k window is absolutely fine. If I need to feed in my whole paper I set the window to 32k. In your VRAM-rich case I would load a 70b llama fine-tune at Q6_K with a 32k context window and you would be covered in 95% of your use cases.
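If you hit the Ollama API directly, the context window can be right-sized per request instead of baked in at 128k. A minimal sketch, assuming a local Ollama at its default port and a placeholder model name:

```python
import requests

# Right-size the context per request instead of always loading a 128k window.
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "deepseek-r1:14b",   # placeholder: whatever model you've pulled
    "messages": [{"role": "user", "content": "Summarize this abstract ..."}],
    "options": {"num_ctx": 8192},  # 4k-16k is plenty for short Q&A; raise only for long documents
    "stream": False,
})
resp.raise_for_status()
print(resp.json()["message"]["content"])
```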
Might I suggest some tests that would interest me greatly: how about tools and function calling, or perhaps even MCP servers? Cline has an awesome MCP setup for VSCode.
Great channel! Why are you running your four 3090s at a max of 175w when their max is 350w?
The recent Ollama update appears to be hitting harder on performance, which is good I think; unsure if it's model specific, but it blipped my PSU into overload. Usually, the way Ollama ran parallel processes, the load was 25% on each. This hit over 60%. Going to be adding a second PSU and moving this rig to a 30-amp circuit soonish.
@ that’s interesting, with my three 3090s at 350w, I get strange thermal trips for the CPU (Epyc on ASRock Rome mobo) even though the CPU temp is under 50 Celsius. I stop getting thermal trip errors at 185w. This also happens with ollama.
I ran this model on my usual setup at the 32b size, and about an 8K context. I could have gone higher on the context, probably around 32K based on the memory used, but I doubt it would make a difference for these tests. I got acceptable TPS even on my 4060 Tis, with mid-20s on most responses. Will this replace LLaMA 3.3? It's interesting, and I would say around the same quality. It's pretty locked down on creative writing, however, and I'd have to see more examples with better prompting. But I would say it's "boring", for lack of a better term. And my god it's verbose. I was going to run it on the second API server, but I, um, had an accident and messed up the motherboard (ripped out one of my x16 slots - don't do that!). So that's limited now to one GPU, with Proxmox having issues using some of the other PCIe slots due to addressing PCI IDs higher than 4 digits. Never knew that was a problem, and there isn't much out there about it.
This is a better demonstration than all the hype I've read everywhere online. Thank you.
As far as I'm concerned, at least compared to me you're a SME, so agree to disagree? Always curious to see how 70B and better models run since I can't handle them at home.
Gave the 14B and 32B models a go this morning, and 32B was noticeably better imho. Armageddon with a twist was a failure for me... didn't give a final answer, got stuck hemming and hawing.
My rig is a real battleship; I can run Qwen 72B at full speed ($25,000 rig...). I have been playing a game called AI Roguelite, which now needs to be updated to handle the <think> sections...
Wow! May I ask what your hardware setup would cost to replicate?
There is a full breakdown of the rig costs at the end of this build video in a spreadsheet. th-cam.com/video/JN4EhaM7vyw/w-d-xo.html
I don't understand why there are tests like the first 100 decimals of pi or counting letters; this is something LLMs struggle with and will not get better at. Why don't we focus on adding sandboxed code-running tools instead?
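A sandboxed code-running tool is conceptually simple. Here is a minimal sketch that runs model-generated Python in a subprocess with a timeout; note that a subprocess is not real isolation, so untrusted model output still belongs in a container or jail (Docker, firejail, gVisor, etc.):

```python
import subprocess
import sys
import tempfile
import textwrap

def run_python_snippet(code: str, timeout_s: int = 5) -> str:
    """Run model-generated Python in a separate process and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(code))
        path = f.name
    try:
        out = subprocess.run([sys.executable, path], capture_output=True,
                             text=True, timeout=timeout_s)
        return out.stdout if out.returncode == 0 else f"ERROR:\n{out.stderr}"
    except subprocess.TimeoutExpired:
        return "ERROR: timed out"

print(run_python_snippet("print(sum(range(10)))"))   # -> 45
```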
Hi @Digital Spaceport. Thank you for your videos. I have a near-identical dual 3090 setup (using your videos as inspiration) on an SM mobo. The hope is to add more VRAM with time. I'm on a mining rack right now but would really like to have it enclosed in a 12U server rack with a ventilated mesh. Do you think that would limit airflow and my temps would suffer?
What is the case you are checking out? I'll take a look and let ya know.
@@DigitalSpaceport I was considering the Tripp lite 12u (SRW12US). Have a Zotac 3090 which tends to run a bit hot.
@@ArvindKumar-ph6nc how's that work..does that 12u have places to mount a mobo? Sorry if thats a dumb question
@@tvstation8102 actually had it built out in a mining rig similar to what’s shown on the video. Was planning to just stuff it into the server rack (which based on the dims should fit)
If it worked for mining, it will work for LLMs. The only major difference is that the CPU matters (fast single thread), the RAM speed doesn't really, and the PCIe lanes for more complex workflows like RAG need real risers. The x1 USB risers do work, in case you were wondering.
It's not all about speed, but have you tried what I'd call another "engine"? With SGLang instead of Ollama I'm getting 100 tokens/s with the 1.5B model on my poor RTX 2080.
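SGLang (like vLLM) serves an OpenAI-compatible endpoint, so pointing a standard client at it is straightforward. A sketch, assuming the server's usual default port of 30000 and a placeholder model name (check your own launch flags and the SGLang docs):

```python
from openai import OpenAI

# Assumes an SGLang server already running with an OpenAI-compatible API on port 30000.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-1.5b",   # placeholder: use the model path you actually served
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(reply.choices[0].message.content)
```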
Can we see a setup using 4x RTX 5070 and see how they perform with the added extra VRAM compared to one 5090, and the cost savings in doing it this way, power consumption aside?
I don't get anything free from Nvidia 😅 I mean, if I can afford it, yeah, but that's a lot of money. I have been told by a friend who works at a big tech hardware co. that the 5090 perf per $ will be very hard to touch. However, a 5090 will also be hard to touch lol.
@ That's fair enough, but in saying that, that's why I'm curious whether 3x RTX 5070, having more combined VRAM than a 5090 and only costing under $1800, makes sense. Well, hopefully someone sees your skills and helps you out.
I tried r1:7b locally and it makes very simple coding mistakes, like writing variable names with spaces in them or importing Python packages that don't exist. So at least for the lower-parameter models I don't see any superiority in coding.
I gave the 70b a quick shot and found it quite dumb, despite all the reasoning spam. QwQ is better and a lot smaller.
I see PowerShell, so I have a question: is using Windows with WSL2 efficient?
I'm not running these in Windows, just remoted into the machine from the Windows terminal.
14b @ 4bpw with 4 3090s 💀 bruh…
At least try an 8bpw quant or even FP16; you don't even need the full context for those tests. You could also offload the KV cache to RAM if you did want full context, or even quantize the KV cache to squeeze in a higher quant than 4bpw...
How much does DeepSeek R1's chain of thought think in English and Western norms? Is it largely copying and generating the implicit reasoning of a latent space parameterized by a Western corpus?
Keep it coming!
This analysis is first class.
Can you use vision capabilities locally, like on the website? Or only text?
Text only on this one right now. Would love for every model to have an optional vision variant. The future we need.
Thanks for the Spanish dubbing.
It seems that everyone who has run tests has had good results.
That is good news! Thank you for the feedback, I appreciate it. 👍
Running only a 14b model on a quad 3090 setup is CRIMINAL. I have dual 3090s and I run 70b-parameter models. You could easily run the 70b R1 model at 8-bit quant, so I assume you're using full precision... WHY?
I showed live that the 128k ctx does not fit into VRAM at a larger size. VRAM is 82% full here. I'm setting up an exo cluster to bang the non-distill out right now, so an updated video as soon as I can get it stable.
@DigitalSpaceport I'd understand better if any of your tests even used that much, but I'd bet none of them even hit 16k!
Even apart from that, I can run a 70b with 64k context; with double the VRAM you should easily be able to run at 128k. 5bpw, even with a 16-bit KV cache, should fit into your VRAM even with your unreasonable context requirement.
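The KV cache is what actually eats the VRAM as context grows, and it's easy to estimate. A back-of-the-envelope sketch, assuming Llama-3-70B-style hyperparameters (80 layers, 8 KV heads with GQA, head dim 128; check the model config for the real values):

```python
# Back-of-the-envelope KV-cache size, to see how much VRAM a given context really costs.
def kv_cache_gib(ctx_tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_val=2):
    # factor of 2 = keys and values
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_val / 1024**3

for ctx in (4_096, 16_384, 65_536, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):5.1f} GiB (fp16 KV), "
          f"{kv_cache_gib(ctx, bytes_per_val=1):5.1f} GiB (8-bit KV)")
# With these assumptions: 4k ~= 1.3 GiB, 64k ~= 20 GiB, 128k ~= 40 GiB at fp16.
```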
I wonder what would be the performance of 14b q8. I think testing on q8/f16 if possible should be a fairer comparison.
I'm going to test the 32b q4 and see what context I can fit in. 48k ctx didn't fit in 96GB VRAM. 32k should, but yes, hitting q8... I need more GPUs.
Agreed, Q4 models aren't really good indicators of the quality of non-quantized versions. At least Q6, but preferably Q8.
@@tungstentaco495 I've heard Q8 minimum for coding work; Q4 is fine for creative or roleplay stuff.
Can you disable chain-of-thought generation? Will it affect the results?
I don't think it can be disabled, but you could hide it via a template.
Dude, deepseek-R1-70b is not true DeepSeek, it’s just Llama-3-70b trained on synthetic data obtained as a by-product during the DeepSeek R1 training process.
I'd like to see you using LM Studio and exploring its full potential. I'm getting access to your content thanks to TH-cam's automatic dubbing 🇧🇷
Any chance you figure out how to spin up the full 671B-param version of R1 on a cloud and show us its performance?
Maybe it's better to use an AMD Threadripper PRO with 8-channel RAM to run it? 😊 The data transfer speed would be around 200 GB/s, which is just about five times worse than GDDR6 VRAM.
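Since token generation is roughly memory-bandwidth bound, you can sketch a theoretical ceiling from the bandwidth and the bytes touched per token. The numbers below are illustrative assumptions, not benchmarks:

```python
# Rough upper bound for CPU token generation: decoding is approximately memory-bandwidth bound,
# so tokens/s <= bandwidth / bytes read per token. All figures here are assumptions for illustration.
def peak_tps(bandwidth_gbs, active_params_billion, bytes_per_param):
    return bandwidth_gbs / (active_params_billion * bytes_per_param)

ddr4_8ch = 8 * 3200e6 * 8 / 1e9          # 8 channels of DDR4-3200 ~= 204.8 GB/s
print(peak_tps(ddr4_8ch, 37, 0.5))       # R1 MoE, ~37B active params at ~4-bit: ~11 tok/s ceiling
print(peak_tps(ddr4_8ch, 70, 0.5))       # dense 70B at ~4-bit: ~5.9 tok/s ceiling
```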
In fairness, O1 is pretty bad. The initial replies are decently okay, but it doesn’t stick to instructions (it tends to rewrite unrelated code, often claiming to add validation or similar, but removing critical functionality).
For example, it likes to check if every variable in a bash script is set. Fine. But it'll move those checks above the call that sets the variable. Or it'll replace an if >=1 with a >0... okay, what about 0.5?
Another time it escaped the escape characters in a complex string, completely unrelated to the code it was intended to fix (the fix worked, but only if you copied just the lines you wanted changed; it broke unrelated working code).
Beyond that, if you point out a problem or add something else it goes insane, 2-3 messages in and it writes dissertation length explanations and chunks of code that are only tangentially related to the original problem, and in this state the only fix is to grab the last “goodish” code, clear the context and give it the code it wrote and ask it to bug fix.
O1’s initial replies aren’t bad. Maybe better slightly, maybe not, but it craps all over itself shortly after.
Fun to play with? Yeah. Useful in production? No. Other models are far more useful, in my opinion.
It's still a Chinese LLM, which I personally find kinda suspicious. But maybe that's just me?
I keep seeing R1 doing weird translation errors, it's like it's translating from English to Chinese to English in its brain. You can give it certain text phrases and it completely sees them as something else when it outputs them.
What's this web interface? I'm currently using Ollama. Can anyone tell me what this nice-looking web interface is?
Open WebUI. I have a guide that might be of interest to you on it, and there are compose cheat sheets linked in that description also. th-cam.com/video/lNGNRIJ708k/w-d-xo.html
Right after cramming dual 3090s into my case, I saw this video 😂
So the conclusion is: LACKLUSTER. OK. Thanks, that's helpful. But what are you comparing it to? Another 14B model at Q4? If you are comparing it to OpenAI's o1 model, that's unfair because it doesn't run locally. You could compare it to the Phi-3 models; I suppose it would be lackluster in comparison. So it's lackluster compared to what, please?
Bro, you have so much hardware, can you not test the largest model? I wish to see performance on Epyc/TR CPU/RAM only, and with 2-4 GPUs (3090+) (because it's a MoE model, it will fit into the GPUs). I really wish to see the performance in these scenarios. I don't want to blindly spend 20k on a server and then maybe it's just not worth it :(
Open source baby. Deekspeek is for the people.
I'm sure that is their officially approved slogan as well... :D
Jokes aside, the achievement here is just grand!
Open WebUI needs an update so that the thinking process isn't shown.
That thumbnail. Honestly thought I clicked on a TechnoTim vid 😅
Same, when I saw it I thought it's Tim 😂
This is not R1. Can you please test the real R1 with 671b params? With an RTX 5090, and also an Apple Mac Mini and Mac Studio. Otherwise all of this is nonsense. I realize this hardware is probably going to have a hard time running that model, but we need to see how much it can handle.
Yeah, there's a pinned comment on it. I'm trying to get enough GPUs to run this. It is a huge model. Hopefully out very soon.
Is it just me or is the sound incredibly bad? I have some good headphones plugged into my computer but I had to turn the volume up so high that I almost had a heart attack when a notification came in. 😅
I'm usually a huge fan of your videos but thought I'd at least leave this as feedback.
You should try the 32b with 64k context.
Hey.. I've sent you an email to connect with you to fully test out this model on your channel
While I love these videos… I also dislike them. Makes me jealous. See… I have an RTX 4080 with 16GB. I am thinking about getting a 5090 for the 32GB and already know that it's not enough. It's for hobby purposes - so I actually have to stay with distilled models or APIs. I got myself a big Ryzen 7950X3D with 128GB RAM, and the whole wonderful new AI world can't find a way to make use of it. I sincerely hope that some very intelligent people on this planet may find ways to optimize the whole LLM thing. I know, I am just a 3D artist and don't have the deep insights - but even in my business I have to optimize my stuff for smaller rigs without making it look worse.
I'm not sure why the extinction level event question is hard. We let generals and army leaders everywhere make this kind of call all the time. X die so Y live. X < Y therefore ....
Ask it to solve P vs NP
Intro was way too long, people want to see what they came for asap
Thanks for the feedback, I'll condense it down, but also this video is functionally obsolete. There is a pinned comment you should read; a review of the actual R1 model is coming soon. Hopefully on exo, but if not then likely one of the CPU-heavy rigs.
14B isn't enough to hold all the knowledge that 200B-class models attempt to. If Nvidia weren't such a douche on RAM amount and AMD weren't such clowns on the same, we could run bigger models. GDDR6 costs less than $2/GB, so there should be 128GB versions.
Try 32b at 64k context
The audio mix is very low; you should fix it.
Hey just saying that I did notice this as well. I have lots of range in my speakers, so not an issue here, but it was low for me.
@@anthonyperks2201 You can right-click the video and show statistics. Anything below -10 dB is very low. Full use of the range would be 0 dB; -5 might be pretty normal.
Try running it without Ollama. Use only Python.
Yep, they updated the page a few days ago it seems. I pinned a comment on this from another commenter and will address it soon.
This is not DeepSeek, that's just a fine-tune of Qwen. Shitty misinformation video.
Please check the pinned comment. I'm not going to pretend to be above mistakes, but the Ollama page changed a lot a few days ago. New video out on a real DeepSeek R1 test soon.
... Oh boy, after this DeepSeek model publication I will never be able to buy an Nvidia GPU for my gaming PC...
It's an insane amount of VRAM to run the real-deal 671B DeepSeek at home. It is amazing though.
@ What about a very fast PCIe SSD? Will that let you run big models? I know it will be slow, but can it be done with a 7,000 MB/s SSD?
Why are all you YouTubers testing the small shitty models?
Ollama sucks; it's not very user-friendly for the average John and Jane. No wonder AI acceptance among regular people is rather low, other than using it as a more powerful search engine, when there are so many nerd-only applications to run them.
Nice reveal. Surprising failures.
Only English and Chinese. In all other languages it may hallucinate terribly!!!
Um, I've used German with both :32b and 70b and didn't have problems. OTOH see my earlier comment re: the driver problem, for hallucination (e.g., San Antonio is SW of Austin, not N) in English.
Haha! Unaligned SAD FACE