Great video, looking forward to seeing that leaderboard get filled up!
Thanks! Lots more coming to fill it out!
Okay .. take an average... small sample size but a lot more statistically valid...
But those stats seem to consistently show the 1st run to be significantly faster than the subsequent ones .. presumably even with some overhead because it is the first run (?)
My questions:
- why are the subsequent runs faster?
- what is decaying/increasing to give the curve?
What do "first run" and "subsequent runs" mean?
By that I mean, I'm wondering .. does "first" mean "since mounted" *or* "first use of *this* prompt since mounted"?
Does alternating between different prompts, or using 30 different prompts in the benchmark set, make a difference to the patterns observed?
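One way to tease these cases apart is a small harness that times each run and tags whether the prompt has been seen since the model was loaded. Everything below is a hypothetical sketch, not the benchmark app from the video; in particular `generate` is just a placeholder for whatever inference call is being timed.

```python
import time
from collections import defaultdict

def generate(prompt: str) -> str:
    """Hypothetical stand-in for the real inference call being benchmarked."""
    return prompt[::-1]  # placeholder work

def run_benchmark(prompts, repeats=3):
    seen = set()  # prompts already used since model load ("mount")
    results = defaultdict(list)
    first_overall = True
    for _ in range(repeats):
        for p in prompts:
            t0 = time.perf_counter()
            generate(p)
            dt = time.perf_counter() - t0
            # Tag each run so "first since load" and "first use of this
            # prompt" can be compared as separate buckets.
            if first_overall:
                label = "first-since-load"
            elif p not in seen:
                label = "first-of-this-prompt"
            else:
                label = "subsequent"
            results[label].append(dt)
            seen.add(p)
            first_overall = False
    return results

stats = run_benchmark(["prompt A", "prompt B", "prompt C"])
```

Comparing the three buckets (and shuffling prompt order between repeats) would show whether the curve tracks model load or per-prompt caching.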
Thank you for providing comprehensive testing! Can you please share the name of the software that you use to do the tests?
The testing suite is a custom app I built for my lab that uses Streamlit, LangChain, Python, etc.
The M40 at $85 is great when paired with my 4080 Super, giving me 40 GB of GPU memory 🧐.
Nice, I’m thinking of doing the same, does the M40 GPU slow down the overall inference speed?
Yep I agree, for what they cost if you can feed them power and keep them cool, they are still good value depending on your use cases! Tensor splitting can be fun to play with when using mixed cards.
Mixed cards will affect your speeds (I have a video covering tensor splitting and mixed GPUs), but it's a great way to expand your total VRAM for larger models!
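For anyone curious what tensor splitting looks like in practice, llama.cpp exposes it as a `--tensor-split` ratio across visible GPUs. The model path and the 16/24 split below are just illustrative numbers for a 4080 Super + M40 pairing, not a tested config:

```shell
# Hypothetical example: split tensors roughly in proportion to VRAM
# between a 16 GB 4080 Super (device 0) and a 24 GB M40 (device 1).
./llama-server \
  -m ./models/some-70b-model-q4.gguf \
  --n-gpu-layers 99 \
  --tensor-split 16,24
```

The split values are proportions, so `16,24` just means "give device 1 more of the model"; you'd tune them to whatever actually fits.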
@InstaKane The M40 will be 1/5 the speed of the 4080 Super, but I'm using 8 GB of it to build a 24 GB card. At level 3, my Dell T7820 runs Llamafile with 128 GB RAM and a 40-core Xeon CPU. Building this machine to launch my ML/AI channel 🥹.
The M40 and P40 cards are good at FP32; they are not built for FP16 or INT8. So if you have two P40s or M40s, you should try an FP32 LLM and you will be surprised by the result.
Also, great video.
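That FP16 point generalizes: on hardware without native half-precision units, FP16 math gets emulated and can actually be slower than FP32. A rough CPU-side illustration with NumPy (this only mimics the effect, it doesn't measure the M40 itself, and timings will vary by machine):

```python
import time
import numpy as np

n = 256
a32 = np.random.rand(n, n).astype(np.float32)
b32 = np.random.rand(n, n).astype(np.float32)
a16, b16 = a32.astype(np.float16), b32.astype(np.float16)

def bench(a, b, reps=20):
    """Time reps matrix multiplies at the arrays' native precision."""
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    return time.perf_counter() - t0

t32 = bench(a32, b32)
t16 = bench(a16, b16)  # float16 is emulated on most CPUs, so often slower
print(f"fp32: {t32:.4f}s  fp16: {t16:.4f}s")
```

Same idea on Maxwell/Pascal cards: without fast FP16 paths, staying at FP32 can be the better-performing choice.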
I agree! They are still good budget cards for expanding VRAM; that's where I started in my lab. I have 3 of them sitting here. They are aging out of CUDA support a bit, and they're power hungry, with passive cooling, so you gotta handle that also.