1-Bit LLM SHOCKS the Entire LLM Industry !

  • Published on Feb 28, 2024
  • In this video, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs. (A minimal sketch of the ternary weight quantization is included just after the description below.)
    Paper Name: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
    Paper Link: arxiv.org/abs/2402.17764
    Let’s do this!
    Join the AI Revolution!
    #1bitLLM #bitnet #1.58 #milestone #AGI #openai #autogen #windows #ollama #ai #llm_selector #auto_llm_selector #localllms #github #streamlit #langchain #qstar #webui #python #llm #largelanguagemodels
    CHANNEL LINKS:
    🕵️‍♀️ Join my Patreon: / promptengineer975
    ☕ Buy me a coffee: ko-fi.com/promptengineer
    📞 Get on a Call with me - Calendly: calendly.com/prompt-engineer4...
    ❤️ Subscribe: / @promptengineer48
    💀 GitHub Profile: github.com/PromptEngineer48
    🔖 Twitter Profile: / prompt48
    🤠Join this channel to get access to perks:
    TIME STAMPS:
    🎁Subscribe to my channel: / @promptengineer48
    If you have any questions, comments or suggestions, feel free to comment below.
    🔔 Don't forget to hit the bell icon to stay updated on our latest innovations and exciting developments in the world of AI!
  • Science & Technology
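
A minimal sketch of the ternary weight representation mentioned in the description above: the BitNet b1.58 paper describes an "absmean" quantization that scales a weight matrix by its mean absolute value and rounds every entry to the nearest value in {-1, 0, +1}. The function below is illustrative PyTorch, not an official implementation.

```python
import torch

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Absmean quantization to {-1, 0, +1} (illustrative sketch of the paper's recipe)."""
    gamma = w.abs().mean()                 # scaling factor: mean absolute value of the matrix
    w_scaled = w / (gamma + eps)           # scale so most values fall near [-1, 1]
    return w_scaled.round().clamp(-1, 1)   # every entry becomes -1, 0 or +1

# Tiny usage example with random weights
w = torch.randn(4, 4)
print(quantize_weights_ternary(w))         # tensor containing only -1., 0. and 1.
```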

Comments • 106

  • @duanelachney4358 · 3 months ago +34

    As I consider myself a member of the LLM community, I must say that I wasn't shocked 1-Bit! lol

    • @PromptEngineer48 · 3 months ago +4

      Haha 😅. Now let's build some LLMs using 1 bit.

    • @duanelachney4358 · 3 months ago +3

      @@PromptEngineer48 I wonder how soon we might see a Mixture of 1-bit Experts Model?

    • @PromptEngineer48 · 3 months ago +5

      The paper mentions MoE as future work. Let's hope for the best. Since this model supports Hugging Face, vLLM, and llama.cpp, I don't think we would have to wait more than a week. 👨

    • @spencerfunk6697 · 1 month ago

      Thought it would come sooner. I really wanna build this now. I'm going to try my hardest lol. It would be really cool to be able to convert pretrained models to 1 bit and then retrain them.

  • @TheZEN2011 · 3 months ago +21

    I'm creating a Transformer that can use the b1.58 method. I like it because it will work with a CPU, and that's what I am designing it for.

  • @mikrchzichy · 3 months ago +11

    This opens the door for simple IC assembly language... ASIC-style speed... oh boy oh boy 🎉

  • @JohnSmith762A11B · 3 months ago +2

    This is an incredible breakthrough. Thank you for the video.

  • @NPC.T · 3 months ago

    Wow this is so cool! Especially given the advancements in state space architecture I think this is going to be revolutionary for our compute crisis.

  • @yahm0n · 3 months ago +4

    So is the key to this strategy that the model needs to be trained specifically to be a 1-bit model, rather than quantizing a larger model down?

    • @PromptEngineer48 · 3 months ago +3

      Yes. I'm 99% sure that is what they are talking about.

  • @TobiasWeg · 3 months ago +1

    A few years ago there was a paper claiming that deep learning matrix multiplication is, in the end, just working like a decision tree.
    This paper drives that claim home quite nicely. I am really looking forward to seeing how this works in real-life test scenarios, and I'm curious whether bigger labs have already started using this technique.
    I am also curious whether one could take a model like Mistral with FP16 weights and use this technique for a few additional epochs of training to reduce it to 1.58 bits.
    This should generally work, if I understand the idea of the paper correctly.

  • @TheExcellentVideoChannel · 3 months ago +2

    I'm new to this area, so forgive the potentially silly question, but... instead of training at full FP resolution and quantizing down, would you get acceptable performance if you pre-trained from scratch at the quantized-down resolution? That could open the door to full local training of LLMs being within the grasp of us plebs and constitute an open-source revolution. Does anyone know of any examples of performance comparisons between identical LLMs pre-trained at different resolutions?

  • @BrianMosleyUK · 3 months ago +1

    Hey, that SHOCKS the industry meme really works! Never seen your channel before. 😂

    • @PromptEngineer48 · 3 months ago +1

      Thanks. SHOCK makes the discovery of the channel easy. But the content should make you stick around forever. 😂

  • @droidcrackye5238 · 3 months ago +2

    1.58 bits for the quantized weights, 8 bits for the activations, 16 bits for the gradients. It is not a BNN.

  • @benshums · 3 months ago

    Get the word out!

  • @unclecode · 3 months ago +3

    Before I get shocked, I've got to understand. The trainable parameters only have 3 possible values. How come LLM accuracy is so sensitive to quantization, yet switching everything to just 3 values keeps the performance? The paper doesn't explain much. Does this mean all attention layers, feed-forward layers, and embeddings switch to this 1.58-bit layer?

    • @dankkush5678 · 3 months ago

      True, it seems hard to reproduce the results.

    • @scottmiller2591 · 3 months ago +3

      TL;DR - they quantize continuously during training, rather than quantizing the weights when training is completed like everybody else.
      The architecture is almost identical to LLaMA. The trick here is that they use full floating point throughout the network, but during each step of training, after accumulating the logits and computing the neuron outputs, they quantize the output to [-1, 0, +1] in the forward pass, while CONTINUING TO STORE THE FULL FLOATING-POINT VALUES. In the backward pass, they replace the non-differentiable quantizer with a linear transfer. The backprop uses the previously stored full floating-point values.
      During inference, they only use the final quantized values, which is what they use in their published tables. BTW, they cheat in some of the tables. Table 2, for instance, has some values in the BitNet b1.58 3B row in bold (best value) that are NOT the best value, nor even tied with the best value. Only in the last line of that table, where they use a 3.7B model and compare it with a 3B model (which is also cheating), do they come out on top. It's still impressive, and there was no need to cheat to be impressive, but this is unfortunately not uncommon in Chinese papers these days - I haven't seen any other nationality doing this.
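
To make the trick described in the comment above concrete, here is a minimal quantization-aware-training sketch with a straight-through estimator in PyTorch: the forward pass sees ternary weights, while gradients and the stored master weights stay in full precision. This is a generic illustration of the technique, not the authors' code, and it omits the activation quantization and scaling used in the actual paper.

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Quantize to {-1, 0, +1} in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w):
        gamma = w.abs().mean()
        return (w / (gamma + 1e-5)).round().clamp(-1, 1)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat the quantizer as identity for gradients,
        # so the full-precision master weights keep receiving updates.
        return grad_output

class STELinear(torch.nn.Linear):
    def forward(self, x):
        w_q = TernarySTE.apply(self.weight)   # forward pass uses ternary weights
        return torch.nn.functional.linear(x, w_q, self.bias)

# The optimizer updates the full-precision master weights; only the forward pass sees ternary values.
layer = STELinear(8, 4)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
loss = layer(torch.randn(2, 8)).pow(2).mean()
loss.backward()
opt.step()
```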

    • @dankkush5678 · 3 months ago

      @scottmiller2591 OK, so it is just another quantization, and they don't actually train using ternary values?

    • @PromptEngineer48 · 3 months ago +1

      @scottmiller2591 I wholeheartedly thank you for the explanation.
      Regarding the cheating part, I would like to point out that the paper mentions that as the parameter count increases, BitNet is able to beat LLaMA. They also mention that up to 3.7B, BitNet looks just normal, but once the parameters cross 3.7B there is really a performance boost.

    • @scottmiller2591 · 3 months ago

      @PromptEngineer48 Yeah, I hated to mention it, but they did it.

  • @P-G-77 · 3 months ago +1

    As they say, "paper will accept whatever is written or typed on it"; we will see the actual results later. The thing certainly turns out interesting as described, but I have learned well that the best test is an actual comparison: try it on the road and check the pros and cons, and who knows...

  • @babbagebrassworks4278 · 2 months ago +1

    The Russians made ternary computers that used 1, 0, -1.

  • @hypercube717 · 3 months ago

    Interesting

    • @PromptEngineer48 · 3 months ago

      Really, actually! I am building some use cases for this!

  • @luciengrondin5802 · 3 months ago +1

    This only applies to inference, right? I mean, not training?

    • @PromptEngineer48 · 3 months ago +5

      No. It applies to training as well.

    • @luciengrondin5802 · 3 months ago

      @@PromptEngineer48 How can gradient descent work on one-bit weights?

    • @cristianpercivati5097 · 3 months ago

      @luciengrondin5802 That's a good question. I think what others say is: in every epoch it first applies the quantization function, then during backpropagation it applies normal floating-point gradient descent (just as you know it) to those weights, and then it applies the quantization function again during the forward pass. At the end of the day you get -1, 0, or 1 as weights, which is the same as saying the coefficients X1, X2... Xn stay the same, flip sign, or become zero. The model seems to be designed to boost inference efficiency, not so much training efficiency. This is not explained in the paper (they only describe the quantization function there); I got this reasoning from the community, so please keep in mind it might not be what really happens behind the scenes.

  • @user-ek3qi9bx8c · 3 months ago +1

    Is this model ready to use?

    • @PromptEngineer48 · 3 months ago +5

      No. But just wait for a day or two. 😁

    • @user-ek3qi9bx8c · 3 months ago

      Okay, thank you for the valuable information @PromptEngineer48

  • @elyakimlev · 3 months ago +3

    07:41 you missed the point of the table. It's not "decent" accuracy. It shows that for models below 3B, it has a "decent" accuracy, but from 3B model size and up, it beats LLaMA of the same size.

    • @PromptEngineer48 · 3 months ago

      Thanks for pointing that out! Yes, once the size crosses 3.7B, BitNet outperforms LLaMA. Nice to have a smart audience... I am very grateful.

    • @elyakimlev · 3 months ago +1

      @@PromptEngineer48 Hehe, not smart. I just read the paper 10 minutes before watching your video, so it was still fresh in my mind. Good news for all of us.

    • @PromptEngineer48 · 3 months ago

      Right!

  • @Macorelppa · 3 months ago +2

    How do you find time to read these papers? Don't you have a full-time job?

    • @PromptEngineer48 · 3 months ago +4

      I have a 9-6 job. I have numerous projects from my clients which I am not able to complete. I am restless and sad. 😅 That's why I don't get time to even edit the videos. But hey, was life ever easy! I need to accept it all and move along.

    • @NPC.T · 3 months ago +2

      @PromptEngineer48 Inspiring, thank you for putting in the effort to disseminate useful info.

    • @PromptEngineer48 · 3 months ago +2

      Thanks for watching the content!!

    • @chrismachabee3128 · 3 months ago +1

      Dedication. 24 hours in a day. We'll sleep tomorrow.

    • @truthwillout2371 · 3 months ago +1

      How do you not find time to read a paper? Do you need a different job?

  • @ronnetgrazer362 · 3 months ago +4

    I'll allow the shocking clickbait, because this seems to be a huge deal. Like, feasible locally-run-AGI-within-a-few-years huge. What if hardware like Groq gets an ASIC specifically for this kind of workload? Coupled with 24, nay, 48 gigabytes of SRAM, that would be lovely, take my money!

    • @JohnSmith762A11B · 3 months ago +1

      It works better the larger the model gets, and supports Mixture of Experts models very well. Smartphone AGI in a couple of years?

    • @PromptEngineer48 · 3 months ago +1

      Yes. I am very hopeful. But it may not be a couple of years; maybe a couple of months instead.

    • @ronnetgrazer362 · 3 months ago

      I love it when other people are even more optimistic :)
      And I can't wait to cram a 170B model into 2 x 16GB of VRAM. Even the CPU spillover layers will probably feel snappy.

    • @PromptEngineer48 · 3 months ago

      😅😁👑

  • @ipsdon · 3 months ago +1

    This is counterintuitive. How do we explain it, when a quantized version is normally less accurate than the full version? Quantization reduces datatype precision, and the paper is telling us a 3-state quantization can be nearly as good as full-precision floating point? That defies the laws of physics!

    • @PromptEngineer48 · 3 months ago

      I think the answer is here: "BitNet b1.58 is based on the BitNet architecture, which is a Transformer that replaces nn.Linear with BitLinear."
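
For readers curious what "replacing nn.Linear with BitLinear" could look like at inference time, here is a rough, hypothetical sketch: ternary weights stored with a single scale, and activations quantized to 8 bits with a per-tensor absmax scale. It is not the official implementation, it omits the normalization the paper applies before quantization, and the class and function names are made up for illustration.

```python
import torch

def absmax_quantize_activations(x: torch.Tensor, bits: int = 8):
    """Per-tensor absmax quantization of activations to a signed 8-bit range (simplified)."""
    qmax = 2 ** (bits - 1) - 1                          # 127 for 8 bits
    scale = qmax / x.abs().max().clamp(min=1e-5)
    return (x * scale).round().clamp(-qmax, qmax), scale

class BitLinearInference(torch.nn.Module):
    """Illustrative drop-in replacement for nn.Linear at inference time.

    Weights are stored as ternary values {-1, 0, +1} plus one scale; activations are
    quantized to 8 bits on the fly. A matmul with ternary weights needs only additions
    and subtractions, which is where the efficiency claim comes from.
    """
    def __init__(self, weight_fp: torch.Tensor):
        super().__init__()
        gamma = weight_fp.abs().mean()
        self.register_buffer("w_ternary", (weight_fp / (gamma + 1e-5)).round().clamp(-1, 1))
        self.register_buffer("w_scale", gamma)

    def forward(self, x):
        x_q, x_scale = absmax_quantize_activations(x)
        y = torch.nn.functional.linear(x_q, self.w_ternary)   # ternary matmul
        return y * (self.w_scale / x_scale)                    # rescale back to real units

# Usage: wrap the weights of an existing linear layer
lin = torch.nn.Linear(16, 8, bias=False)
bit_lin = BitLinearInference(lin.weight.data)
out = bit_lin(torch.randn(2, 16))
```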

    • @ipsdon · 3 months ago +1

      @PromptEngineer48 There are only 3 things in the world that can approximate an arbitrary function, and a NN is one of them. Unless one increases the depth and width of the NN tremendously with 3-state values, you need the range of the data type to have enough 'resolution'. If this is true, those guys have found something that will be worth hundreds of billions of dollars in savings.

    • @PromptEngineer48 · 3 months ago

      True

    • @cristianpercivati5097 · 3 months ago +1

      Exactly! Right on the spot!

  • @brianj7204 · 3 months ago +1

    Guys, I think I'm shocked...

  • @biskero · 3 months ago

    Is there a model to test?

    • @PromptEngineer48 · 3 months ago +1

      No, unfortunately.

    • @JohnSmith762A11B · 3 months ago +1

      Well, there is the handful of LLaMA models they trained (the paper mentions 7B, 13B and 70B), though I don't think any of those are public yet. You would also need custom software to run them, too.

    • @PromptEngineer48 · 3 months ago +1

      @@JohnSmith762A11B Correct.

    • @biskero · 3 months ago

      @JohnSmith762A11B Yeah, I thought about that. I am sure there will be a model to test soon, at the speed AI is going! Looking forward to it, since I am experimenting with the RPi and AI.

  • @TheGalacticIndian · 3 months ago +1

    SHOCKING!😋

    • @PromptEngineer48 · 3 months ago +1

      Yes

    • @TheGalacticIndian · 3 months ago +1

      @PromptEngineer48 I love the fact that you included the 'shocks entire industry' meme creators are using right now 👍

    • @PromptEngineer48 · 3 months ago

      Yes, it's trending now!

  • @boonkiathan · 3 months ago +2

    1-bit ternary

    • @PromptEngineer48 · 3 months ago

      Let's Go o o !!

    • @user-wr2cd1wy3b · 3 months ago

      or 1.58 bits binary

    • @PromptEngineer48 · 3 months ago +1

      @user-wr2cd1wy3b It's 1.58 bits actually!

    • @user-wr2cd1wy3b · 3 months ago

      @PromptEngineer48 How does {-1, 0, 1} = 1.58 bits? Encoding the three values naively as 00, 01, 10 would take two bits; can you explain how it works?

    • @PromptEngineer48 · 3 months ago +2

      Yes, that comes from the entropy equation: a ternary value carries log2(3) ≈ 1.58 bits of information. Take the name of the paper and ask ChatGPT how the 1.58 value is derived.
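
The arithmetic behind the 1.58 figure, as a tiny check, plus the usual packing intuition:

```python
import math

# Information content of one ternary weight: log2(3) ~= 1.58 bits.
print(math.log2(3))        # 1.584962500721156

# Storage intuition: 3**5 = 243 <= 256, so five ternary weights fit in one byte,
# i.e. 8 / 5 = 1.6 bits per weight with a simple packing scheme.
print(3 ** 5, 8 / 5)
```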

  • @marcfruchtman9473 · 3 months ago +1

    Thank you for the video. Note that this paper only applies its 1-bit methods to a certain part of the LLM process; it must still use floating point for the gradients: "BitNet employs low-precision binary weights and quantized activations, while maintaining high precision for the optimizer states and gradients during training."
    Nevertheless, the results are very good.
    (Be careful, overusing the "Shocked" in the title thing is causing a backlash amongst your regular viewers and may result in... oh crap, the subscribe button un-subbed... WTH, this is shocking.) Ok ok, I was slightly exaggerating... but please... you are like the 4th one in a row using the same technique... it's getting to the point where if I see "Shocked | Stunned | Awed | Do this" I seriously think about NOT looking. Yea yea, it helps the algorithm... and simultaneously, it irritates your core viewers.

    • @PromptEngineer48 · 3 months ago

      Good point! I will keep that in mind.

  • @hanslick3375 · 3 months ago +1

    They are calling it BitNet? Why not call it Skynet right away?

    • @PromptEngineer48 · 3 months ago +1

      Nice one

    • @hanslick3375 · 3 months ago +1

      @PromptEngineer48 Thanks, let's hope the similarities are coincidental 😬

  • @chrismachabee3128 · 3 months ago +1

    As a skeptic, I am listening, and I really find this difficult to take in. -1, 0, 1 is all that is needed to best the best of the LLMs now? Really? Today is March 2nd, and there is zero news anywhere on this humongous breakthrough. OK, it's Sunday, but I would think there would have been notice that this has even been in experimental trials leading to these wondrous results. So, unfortunately my friends, it's going to take a lot more than a paper to convince me. When it is duplicated and proven I will stand amazed with everyone else, but right now this sounds like dreaming out loud. We will see this week whether this great breakthrough is deserving of notice or whether it's just some data scientists' breakthrough in the lab. We will see.

  • @savire.ergheiz · 3 months ago +1

    Meh, the smallest 0.5B is pretty much useless; it can't even do simple math.