Gemma 2 - Google's New 9B and 27B Open Weights Models

  • Published on Nov 4, 2024

Comments • 44

  • @Nick_With_A_Stick
    @Nick_With_A_Stick 4 months ago +21

    27b takes up 15GB in 4-bit ❤. Although llama 8b smashes both on HumanEval with 62.2, which Google conveniently left out of the chart in the paper. But then again llama 8b's HumanEval drops to 40 at 4-bit, and there was a new code benchmark, I think BigCodeBench, that I saw on twitter showing llama 3 70b actually sucks at coding and was *potentially* *allegedly* trained on HumanEval, probably by accident with an 8T token dataset 🤷‍♂️ Elo doesn't really lie, with the exception of GPT-4o; people just like it because it makes pretty outputs. The way it formulates its outputs is really visually appealing (for example it uses a ton of markdown, like lines separating the title, and big font and small font at certain times), which 100% launches the scores to the moon, because Claude Sonnet 3.5 is significantly better, provided my main use case is coding.

    • @blisphul8084
      @blisphul8084 4 months ago +2

      The IQ1_S quant is only 6GB, so it can fit on an 8GB GPU like the RTX 3060 Ti. No need for H100s here. Though based on HumanEval, I'll stick with Dolphin Qwen2 for now.

    • @Nick_With_A_Stick
      @Nick_With_A_Stick 4 months ago

      @@blisphul8084 for coding I like Codestral with Continue.dev, a VS Code extension, but then again every model sucks in comparison to Sonnet 3.5's one-shot code ability. And for some reason it actually kind of loses performance at multi-shot; if you are in a convo with it editing long code it occasionally messes up, but it rarely ever makes an error if you start a new convo and re-ask the question with the code.
      Side note, I wish Eric had done his fine-tuning on top of the Qwen instruct model using LoRA. It would help combine the strengths of both datasets.

    • @BrianDalton-w1p
      @BrianDalton-w1p 3 months ago

      Claude Sonnet 3.5 still makes frequent coding mistakes though

    • @Nick_With_A_Stick
      @Nick_With_A_Stick 3 months ago

      @@BrianDalton-w1p absolutely, but its one-shot coding is 3x better than GPT-4o. For example, if you asked it to write a standard SFT training script for llama 7b, Claude 3.5 will get it done in maybe 2-3 tries; GPT-4o literally fails after 10 attempts, and I can't keep going since I'm not paying for some overfit model that can't fix its own bugs. I heard from a few people that GPT-4o is a 40b parameter model, which would make sense, but I still believe it's a continued pretrain of standard GPT-4 with multimodal tokens like those used in Meta's Chameleon model. Let's say OpenAI continued pretraining GPT-4 on only code; it would smash every model by a TON, but then it would be stupid in every other area. This is why DeepSeek Coder V2 is so good.
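
For context, a quick back-of-the-envelope check of the "15GB in 4-bit" figure from the top of this thread (a rough sketch only; real 4-bit files also carry quantization scales and some higher-precision tensors, which is where the extra GB or two comes from):

```python
# Rough size estimate for a 27B-parameter model stored at 4 bits per weight.
params = 27e9
bytes_per_param = 4 / 8                    # 4 bits = 0.5 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.1f} GB")              # ~13.5 GB before overhead -> ~15 GB on disk
```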

  • @supercurioTube
    @supercurioTube 4 months ago +10

    All quantizations are available for Gemma2 9b and 27b in Ollama, but the 27b has an issue: a tendency to never stop its output.

    • @falmanna
      @falmanna 4 months ago

      The same happened to me with the 9B 4-bit K_M quant.

    • @volkovolko
      @volkovolko 4 months ago +1

      I got the issue in my video on my channel.

  • @jbsiddall
    @jbsiddall 4 months ago

    Great video Sam! This is my first look at Gemma 2.

  • @unclecode
    @unclecode 4 months ago +6

    Thanks, Sam! You know, it started with 7b as a trend, then Meta made it 8b, and now Google has 9b. I wish they'd compete in the opposite direction. 😄 Btw, I have an opinion. Let me share it and hear yours. I've noticed recently that proprietary models often train with a chain-of-thought style, to the point that it becomes annoying because it's hard to get the model to do otherwise. This approach ensures the model crosses benchmarks but gives it one personality that's hard to change.
    For instance, GPT-4o became a headache for me! It always follows one pattern, regenerating entire previous answers, even if the answer is just one word! It's annoying, especially in coding. Imagine you want a small change, but it regenerates the whole code. I constantly have to remind it not to regenerate the whole code, just show a part of it, and it's frustrating. This is clearly due to the training data. I don't see this issue with open-source models. The one proprietary model I like, Anthropic's, still feels controllable. I can shape its responses and keep it consistent.
    To me, this technique hides model weaknesses. It's easier to train a model to stick to one style, especially if the data is synthetically generated. Language models need a word distribution that's not overly adjusted, or they become biased. When they release a model as an instruct model with one level of fine-tuning, you still expect it to be unbiased. Fine-tuning it to take on another behavior would be tough.

    • @longboardfella5306
      @longboardfella5306 4 months ago

      Interesting. I've noticed the same thing with getting stuck - when it kept producing an incorrect word table, I couldn't get it to stop repeating it each time.

    • @samwitteveenai
      @samwitteveenai  4 months ago

      Definitely post-training (SFT, IT, RLHF, RLAIF etc.) has changed a lot in the last 9 months. All the big proprietary models and big-company open weights are now using synthetic data heavily. A big challenge with synthetic data is creating the right amount of diversity. This could explain some of what you are seeing. Also you might be seeing models that have been overly aligned with reward models etc. Anthropic has "Ant thinking" for their version of CoT and it is wrapped in XML tags; I think a lot of that gets filtered in their UI etc. The Gemma models clearly show Google has gone down the path of baking CoT into the models. For following system prompts well I think Llama is much better. I test models by asking them to be a drunk assistant. For some reason Llama can do that very well.

  • @olimiemma
    @olimiemma 4 months ago +4

    I'm so glad I found your channel, by the way.

  • @SwapperTheFirst
    @SwapperTheFirst 4 months ago +1

    fantastic news and great overview, Sam.

  • @toadlguy
    @toadlguy 4 months ago +3

    Running Gemma2 9B with Ollama on an 8GB M1 Mac, even though it is only 5.5GB (for the 4-bit quantized model), it immediately starts running into swap problems and outputs at about 1 token/sec. The llama3 8B (which is 4.7GB for the 4-bit quantized model) runs fine entirely in working memory even with lots of other processes running. So there must be something different about how the inference code is running (or Ollama's implementation).

    • @NoSubsWithContent
      @NoSubsWithContent 4 months ago

      Are you sure that 8GB isn't being partially used to run the system? It could also just be that the hardware is too old; I had 16GB and still couldn't run Qwen2 0.5B.

    • @samwitteveenai
      @samwitteveenai  4 months ago +1

      Apparently they had issues with it and are fixing them. For me it seemed fine - I just made the vid in the last few hours. Maybe try uninstalling and installing again.

  • @SirajFlorida
    @SirajFlorida 4 months ago +2

    Well, if Gemma 2 is just barely beating llama3 8b and it has an additional billion parameters, then I would leap to say that llama is the higher quality model. Not to mention the outstanding support for llama. I get the commercial license, but I kind of see the llama license as not allowing big tech to just plagiarize models. If only we had a llama 30-ish B. Oh dear Zuck, if you can ever hear these words, please give us the 30B. We love and appreciate all that you do!!!

    • @SirajFlorida
      @SirajFlorida 4 months ago

      Actually, I think he did... and it's a multimodal model called Chameleon. :-D

    • @onlyms4693
      @onlyms4693 4 months ago +1

      So can we freely use llama 3 for our enterprise chatbot support?

  • @TreeYogaSchool
    @TreeYogaSchool 4 months ago

    Thank you for making this video!

  • @jondo7680
    @jondo7680 4 months ago +1

    The 9b one is very interesting and promising.

  • @MeinDeutschkurs
    @MeinDeutschkurs 4 months ago +3

    dolphin, dolphin, dolphin!!!! ❤❤❤❤ I hope it’s being read!!!! 🎉🎉🎉🎉gemma2 seems to be a cool model for dolphin. Have I already mentioned it? Dolphin. Just in case! 😆😆🤩🤩

    • @Outplayedqt
      @Outplayedqt 4 months ago

      Dolphin-3.0-Gemma2 🙏🏼

    • @MeinDeutschkurs
      @MeinDeutschkurs 4 months ago

      @@Outplayedqt 🥰

  • @AdrienSales
    @AdrienSales 4 months ago +1

    I gave gemma:9b vs llama3:7b a try on function calling... and I got much better results with llama3. Did you give function calling a try?... Maybe there will be specific tuning for function calling.

    • @samwitteveenai
      @samwitteveenai  4 months ago +1

      AFAIK Google doesn't do any FT for function calling on the open weights models. I have been told it could be due to legal issues, which doesn't make a lot of sense to me. The base models can be tuned to do this though.

    • @AdrienSales
      @AdrienSales 4 months ago

      @@samwitteveenai Keeping an eye out for FTs in the ollama library over the next few days.

  • @dahahaka
      @dahahaka 4 months ago +3

    Damn, feels like Gemma came out last month

  • @micbab-vg2mu
    @micbab-vg2mu 4 months ago +2

    Thank you - I am waiting for Gemini 2.0 Pro :)

    • @samwitteveenai
      @samwitteveenai  4 months ago

      Give it a bit of time.

  • @strangelyproton
    @strangelyproton 4 months ago +1

    Hello, can you please tell me what's the best hardware to buy for running up to 70b models, not just for inference but also for instruction tuning?

    • @NoSubsWithContent
      @NoSubsWithContent 4 months ago

      With quantization I think you can get away with a single H100 (80GB). Using QDoRA will achieve nearly the same performance as full fine-tuning while still fitting within this constraint.
      For cost effectiveness you could try multi-GPU training with older cards; this is just harder to set up and takes a lot more understanding of the specs.
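
A minimal sketch of what a quantized DoRA (QDoRA-style) fine-tune like the one described above might look like with Hugging Face transformers + peft, assuming a Llama-style 70B checkpoint; the model ID and hyperparameters here are illustrative, not a tested recipe:

```python
# Hypothetical QDoRA-style setup: frozen 4-bit base weights + trainable DoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-70B"  # illustrative checkpoint

# Load the base model in 4-bit NF4 so the weights fit on a single 80GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# DoRA adapters (use_dora=True) on the attention projections; only these are trained.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_dora=True,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # a tiny fraction of the 70B total
```

The base weights stay frozen in 4-bit, so only the small adapter matrices carry gradients and optimizer state, which is what keeps a 70B fine-tune within one 80GB card.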

  • @TomM-p3o
    @TomM-p3o 4 months ago +1

    What I really wanted from Gemma is at least a 100k context window. It looks like that is not forthcoming.

    • @samwitteveenai
      @samwitteveenai  4 months ago +2

      Someone may do a fine-tune to get it out to that length. Let's see.

  • @imadsaddik
    @imadsaddik 4 months ago +1

    Thanks for the video

  • @AudiovisuelleDroge
    @AudiovisuelleDroge 4 months ago

    Neither the 9b nor the 27b Instruct supports a system prompt, so what were you testing?

    • @samwitteveenai
      @samwitteveenai  4 months ago

      You can just append them together, which is what I did there. You can see it in the notebook.

    • @flat-line
      @flat-line 4 months ago

      @@samwitteveenai What is system prompt support for, if we can just do it like this?

    • @samwitteveenai
      @samwitteveenai  4 months ago +1

      So on models that support a system prompt, it normally gets fed into the model with a special token added. If the model is trained for that, it can respond better to it (like the Llama models). If it doesn't have that, like Gemma, prepending like I did can still work well, but it is just part of the overall context.

    • @flat-line
      @flat-line 4 months ago

      @@samwitteveenai Thanks, this is informative. What are these special tokens? How do we learn about them?
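
As a rough illustration of the exchange above: Gemma 2's chat template only defines user/model turns, so a "system" instruction has to be folded into the first user message, while Llama 3's template has a dedicated system role wrapped in special tokens. A minimal sketch using Hugging Face chat templates (the model IDs are just examples; the exact token strings come from each model's tokenizer config):

```python
from transformers import AutoTokenizer

system_msg = "You are a terse assistant that answers in one sentence."
user_msg = "What is quantization?"

# Gemma 2: no system role in its template, so prepend the instruction to the user turn.
gemma_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
gemma_prompt = gemma_tok.apply_chat_template(
    [{"role": "user", "content": f"{system_msg}\n\n{user_msg}"}],
    tokenize=False,
    add_generation_prompt=True,
)
# Wraps the turn in <start_of_turn>user ... <end_of_turn> special tokens.

# Llama 3: the template accepts a system role with its own header special tokens.
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama_prompt = llama_tok.apply_chat_template(
    [{"role": "system", "content": system_msg},
     {"role": "user", "content": user_msg}],
    tokenize=False,
    add_generation_prompt=True,
)
# Uses <|start_header_id|>system<|end_header_id|> ... <|eot_id|> around the system text.

print(gemma_prompt)
print(llama_prompt)
```

Printing the two prompts (or reading the chat_template field in a model's tokenizer_config.json on the Hub) is a simple way to see exactly which special tokens a given model was trained with.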

  • @romanbolgar
    @romanbolgar 2 months ago

    When will they finally start comparing models not by parameters but by capabilities?

  • @sammcj2000
    @sammcj2000 4 months ago +4

    Tiny little 8k context. Pretty useless for many things.