Thank you for the great breakdown. I really like how you went back to the first paper to explain the theory underlying BitNet and then explained what was different in the new paper. In general these kinds of advancements excite me because they could potentially make running, and in some cases training, these huge models something we mere mortals without infinite compute can actually do locally.
Oh! I've been thinking about this myself. How nice to see it realized!
Excellent review! Thanks
This paper will put a ceiling on Nvidia's stock price
Does using FP8 for activations as opposed to INT8 offer a significant accuracy benefit?
I suppose an integer adder is even simpler than a floating-point adder and may save additional power
Ok I found a paper discussing these details: arxiv.org/pdf/2303.17951.pdf
I have no idea why I said FP8 during the video 😳
INT8 is used just like you said since FP8 doesn't offer anything over INT8
@@gabrielmongaras actually I have seen a paper that reports a stability benefit of FP8 over INT8 for LLMs during training once they scale beyond a certain size.
@@cbuchner1 That sounds really interesting. Can you please send the paper? I wonder if other FP8 formats would work better, similar to how BFLOAT16 is usually better than FP16.
@@gabrielmongaras Chapter 4.6 here: arxiv.org/pdf/2303.17951.pdf. But I misremembered insofar as they state it works better for Transformers (not limited to very large ones) and that there are ways to also make it work well with INT8.
I will have to keep looking for papers that compare INT8/FP8 for training GPTs.
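For anyone wondering what the INT8 activation quantization discussed above looks like in practice, here is a minimal sketch of absmax-style scaling in the spirit of what the BitNet papers describe; the function name and the per-tensor scaling choice are my own assumptions, not code from either paper.

```python
import torch

def quantize_activations_int8(x: torch.Tensor, eps: float = 1e-5):
    """Absmax quantization of an activation tensor to the signed INT8 range.

    Scales by the tensor's max absolute value so values land in [-127, 127];
    returns the int8 tensor plus the scale needed to dequantize (x ~= x_int * scale).
    """
    scale = x.abs().max().clamp(min=eps) / 127.0  # one scale for the whole tensor
    x_int = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return x_int, scale

# Quantize, then dequantize to see that the rounding error stays small.
x = torch.randn(4, 8)
x_int, scale = quantize_activations_int8(x)
x_hat = x_int.float() * scale
print((x - x_hat).abs().max())  # bounded by about scale / 2
```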
Can't wait for 0.5-bit models!
According to the calculation, log2(0.5) = -1, so does that mean you need a base -1 number system?
😂😂😂
Analog computing
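On the actual numbers: the 1.58 in BitNet b1.58 is just log2(3), the information content of a ternary weight with states {-1, 0, +1}; by the same formula, a hypothetical 0.5-bit weight would only need about 1.41 distinguishable states. A quick sanity check:

```python
import math

# Bits per weight = log2(number of distinct weight states).
print(math.log2(3))  # ~1.585 -> ternary weights {-1, 0, +1} (BitNet b1.58)
print(math.log2(2))  # 1.0    -> binary weights {-1, +1} (original BitNet)
print(2 ** 0.5)      # ~1.414 -> states a hypothetical "0.5-bit" weight would allow
```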
Excellent explanation!! Can you please make a video on speculative streaming?
Technically impressive that it's possible, but I only see limited application for this. Practically, most models below 8-bit quantization are way less "aware" of input and context: if I alter a situation in an 8-bit model it can adjust its output accordingly, while any model below that is very rigid. That being said, maybe you can mitigate those effects when you train the model at that quantization to begin with, rather than compressing it after it was trained at higher precision, if that makes sense.
Will this work for CNNs or only LLMs? Does training still need to use GPUs, or is this only an inference benefit? I don't see any examples or data other than this paper.
Does this technology make Nvidia's tech and NPUs obsolete?
Noob question… so binary/ternary quantization has been around for a while… which part was the major innovation/discovery in the BitNet paper?
Mainly that a model can be trained from scratch using binary weights and still be competitive in terms of perplexity and accuracy.
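To make that concrete, here is a rough sketch of the kind of quantization-aware training the BitNet papers describe: ternary (absmean) weight quantization in the forward pass, with a straight-through estimator so gradients still reach the full-precision latent weights. The class name and exact scaling details are my own simplification, not code from the papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer whose weights are quantized to {-1, 0, +1} on the fly.

    Full-precision weights are kept for the optimizer; the rounding step is
    bypassed in the backward pass via the straight-through estimator (STE).
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Absmean scaling, then round-and-clip to the ternary set.
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale).round().clamp(-1, 1) * scale
        # STE trick: forward uses w_q, backward treats it as the identity on w.
        w_q = w + (w_q - w).detach()
        return F.linear(x, w_q, self.bias)

# Training proceeds as usual; only the layer type changes.
layer = BitLinear(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()            # gradients reach the latent full-precision weights
print(layer.weight.grad.shape)  # torch.Size([4, 16])
```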
very cool
we need the code
I found an implementation already on pip under the name "bitnet". From looking at the code, they fully implemented the first paper and are now making the changes to implement the second paper. They even have a BitNet version of llama (bit_llama) in the repo. They also have a function where you can replace all of the linear layers in a model with BitLinear layers.
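I haven't verified that package's actual function names, so here is just a hypothetical sketch of what such a layer-swapping utility could look like for any PyTorch model, passing in a BitLinear-style class like the one sketched earlier; it is not the pip package's real API.

```python
import torch.nn as nn

def replace_linear_layers(model: nn.Module, bitlinear_cls) -> nn.Module:
    """Recursively swap every nn.Linear in `model` for `bitlinear_cls`.

    `bitlinear_cls` is any drop-in replacement with the nn.Linear constructor
    signature (e.g. a BitLinear like the sketch above). Existing weights are
    copied so the swap can also be applied to an already-trained model.
    """
    for name, child in list(model.named_children()):
        if isinstance(child, nn.Linear):
            new_layer = bitlinear_cls(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            new_layer.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                new_layer.bias.data.copy_(child.bias.data)
            setattr(model, name, new_layer)
        else:
            replace_linear_layers(child, bitlinear_cls)
    return model
```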
Hi bro, could you recommend a large model which can provide a girlfriend API, or that I can fine-tune to be a girlfriend model? I need an uncensored model for a girlfriend role (not NSFW, just a warm girlfriend; when you ask her "can you be my girlfriend", the model won't reply "I am an AI model", that is so annoying), or any other way to solve the problem. I watched your video "Talking to girlfriend", but I worried that the model you mentioned might be outdated. I am looking forward to your reply. Thank you!
I don't think LLMs should be used for this task; you can interact with them, that's okay, but a "girlfriend" is something between humans. You shouldn't make money on the back of lonely people, and even if it's free, it's unhealthy to form a forced relationship with a machine (apart from being morally and ethically questionable).