Make Local DeepSeek THINK LONGER! 💥 Local Test-Time Scaling 💥
- Published Feb 9, 2025
- Paper Abstract
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces, relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps.
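To make the budget-forcing idea concrete, here is a minimal sketch of the lengthening case. The `END_THINK` tag, the `generate_fn` callable, and the number of "Wait" injections are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of budget forcing (lengthening case). END_THINK and
# generate_fn are assumptions: END_THINK is the model's end-of-thinking tag,
# and generate_fn(prompt) returns the model's continuation as a string.
END_THINK = "</think>"

def budget_force(generate_fn, prompt, num_waits=2):
    """Each time the model tries to close its thinking block, strip the
    closing tag and append "Wait" so it keeps reasoning before answering."""
    text = prompt
    for _ in range(num_waits):
        out = generate_fn(text)
        if END_THINK not in out:
            return text + out  # hit the token budget while still thinking
        # Keep only the thinking so far, drop the closing tag, nudge it onward.
        text += out.split(END_THINK)[0] + "\nWait"
    # Final pass: let the model close its thinking and produce the answer.
    return text + generate_fn(text)
```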
In this video, we turn DeepSeek-R1-Distill-Qwen-1.5B into a deep-thinking model, enabling test-time scaling locally.
Note: this works with all models that generate thinking tokens!
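For anyone who wants to follow along, the rough shape of the demo with MLX LM looks like this. The repo id, the question, and the single "Wait" nudge are my own placeholders; Awni Hannun's gist linked below is the reference implementation.

```python
# Rough sketch of the demo with MLX LM (not the exact gist code).
from mlx_lm import load, generate

# Assumed repo id; any R1-style model that emits <think>...</think> should work.
model, tokenizer = load("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# First pass: normal generation, thinking plus answer.
first = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Budget forcing: cut the output at the end-of-thinking tag, append "Wait",
# and let the model keep thinking from there.
forced_prompt = prompt + first.split("</think>")[0] + "\nWait"
second = generate(model, tokenizer, prompt=forced_prompt, max_tokens=2048)
print(second)
```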
🔗 Links 🔗
s1: Simple test-time scaling
arxiv.org/pdf/...
MLX LM - pypi.org/proje...
Code by Awni Hannun - gist.github.co...
❤️ If you want to support the channel ❤️
Support here:
Patreon - / 1littlecoder
Ko-Fi - ko-fi.com/1lit...
🧭 Follow me on 🧭
Twitter - / 1littlecoder
This is a groundbreaking paper. Well done, and thank you for presenting the video on it. Great stuff.
Glad you enjoyed it! Might do another paper video, explained more clearly!
Make that scientific - run a benchmark!
Just watched one of your old videos and I must say, the quality of your in-video talking and content presentation has improved drastically. Also love how the videos now are short and to the point.
Excellent! I had read about people automating a 'continue' response to get better responses. Love this ❤!
That's great to know!
Can you point me to this?
DeepSeek hidden powers!😯
Did I miss where you put the wait state?
Hi bro, one question. When DeepSeek or another similar CoT model thinks, it displays its thoughts on screen. Do these thoughts consume tokens?
Yes
Lol, yeah! OpenAI's o1 didn't show the thinking tokens, but it still charged you for them.
The other one was crashing because you weren't using the special tags needed to format the text prompt, like the user or system tags. That is the format the model needs as input. You can use the chat template that transformers uses to format the input from a list of dicts (OpenAI chat-history format).
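For anyone hitting the same crash, this is roughly what the comment above means. The model id and message here are placeholders; the key call is transformers' apply_chat_template, which inserts the model's special user/system/assistant tags for you.

```python
# Sketch: turn an OpenAI-style chat history (list of dicts) into the
# prompt string the model actually expects, special tags included.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

messages = [
    {"role": "user", "content": "How many r's are in 'strawberry'?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string, not token ids
    add_generation_prompt=True,  # append the tag that opens the assistant turn
)
print(prompt)
```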
I think you're absolutely right. I guess I got carried away with the working solution and forgot about this.
Liked this video a lot. Ignore the comment that said to scrap it. This video is actually actionable, which is what we need. No point in papers if viewers can't use them at home. Now, could you make a video showing how to use it? e.g. maybe in Roo Cline / Claude Desktop / Aider / ChatGPT? Basically other chatbots?
Keep up the great work!
I sent you an email detailing how I reproduced the work in the paper with no special tools and using the same seed (because YouTube comment-section moderation bots hated it in comment form). I'm not saying your coverage of the paper was bad, but your approach to reproducing the effects was not good. Not to nitpick, but this whole paper is what I would consider obvious; I've been doing very similar things to these reasoning models from the moment I got them. There's no easier way to jailbreak a model than to edit what it thinks it wrote, so I have been doing worse things to models for even longer. But yeah, apparently more people need to see it. And a "reason harder" button to automate this would probably be good in some cases and would be trivial to implement.
Hi, I'm very interested to hear more about your work. Could you share the example through GitHub or upload it to Drive and share it?
@jackmartin1146 the method is as simple as editing the output to delete the answer and the closing thinking tag and appending "wait". While it is trivial to make a script that does this at the click of a button, it is also trivial to do it in LM Studio manually.
Even more important is the dataset.
Yes, please do one for llama.cpp too, I am waiting. :)
Bro, I just ran the 14B one to solve those kinds of problems. But the issue with your solution is that these kinds of problems are really numerous and hard to hand-roll every time.
What do you mean by hand-rolled here?
@1littlecoder I meant the numerous times you have to give it reinforcement to get the expected output. Local LLMs are therefore hopeless (on home-use PCs). I have deleted mine.
Smashed
@@d.d.z. thank you
DEMOS NEVER WORK WHILE RECORDING 🤣 That is a known law of the universe my friend!! Or when you are on Zoom.
wow
I didn't understand a single thing actionable from this video 😅
😭
@1littlecoder sowy 😳
@@timmygilbert4102 I'll try another one. Thanks for the feedback
Same. The paper says add "wait", but his instruction is to install mlx-lm and download that specific Qwen model. There is nowhere in the code to add "wait" or anything, so I assume the "wait" stuff is fine-tuned into the particular Qwen model he mentioned?
:(
Not good ?
@1littlecoder No, but also you can do this in LM Studio, you don't need any Mac thing. But also your example failed. This is a clinker; I would have scrapped the video and not posted it. Deleting the end-of-thinking token and appending "wait" is trivial to implement. But the thinking model you used was failing before you got to a point where this could be helpful; maybe you had the temperature off or something.
@zyxwvutsrqponmlkh Out of 3, 2 worked. And why would you think it failed?
@1littlecoder "How many c's are in...", "how many d's are in..." You delete the finished-thinking token and append "wait" (which it practically did itself), but it still gets it wrong. Making it do more wait-and-rethink cycles does not address the fact that your LLM setup is fundamentally flawed. You need something more stable before applying this technique. And you can keep the seed the same so you can measure apples to apples how the output changes after the extra thinking.
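A small sketch of the apples-to-apples comparison suggested here, assuming MLX LM and sampling with a fixed seed; the repo id and prompt are placeholders, and with greedy decoding the seed is irrelevant anyway.

```python
# Sketch: same prompt, same seed, with and without the appended "Wait",
# so the effect of the extra thinking can be compared directly.
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")  # assumed repo id
prompt = "How many d's are in 'hidden'?"  # chat template omitted for brevity

mx.random.seed(0)
baseline = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

mx.random.seed(0)  # reset so both runs start from the same sampling state
forced = generate(
    model, tokenizer,
    prompt=prompt + baseline.split("</think>")[0] + "\nWait",
    max_tokens=2048,
)
print(baseline, "\n---\n", forced)
```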