Llama 3 - 8B & 70B Deep Dive

  • Published on Jun 13, 2024
  • Meta AI has released Llama 3 in 2 sizes, an 8B and a 70B. In this video I go through the various stats, benchmarks and info, and show you how you can get the model running. As always, the Colab is in the description.
    Meta AI Blog Post: ai.meta.com/blog/meta-llama-3/
    HF Version: huggingface.co/meta-llama/Met...
    Colab: drp.li/cq0Lf
    🕵️ Interested in building LLM Agents? Fill out the form below
    Building LLM Agents Form: drp.li/dIMes
    👨‍💻Github:
    github.com/samwit/langchain-t... (updated)
    github.com/samwit/llm-tutorials
    ⏱️Time Stamps:
    00:00 Intro
    00:35 Meta AI Blog: Llama 3
    01:47 Llama 3 Model Card: 8B and 70B
    04:25 Intended Use Cases
    05:06 Cloud Providers available for Llama 3
    05:32 Llama 3 Benchmarks
    08:59 Scaling up Pre-training
    09:58 Downloading Llama 3 on Hugging Face
    10:21 License Conditions
    12:44 Llama 3 405B Model: Sneak Peek
    14:30 Code Time: Ollama
    15:44 Llama 3 on Hugging Chat
    16:00 Different Options on Deploying Llama 3
    16:30 Llama 3 on Together AI
    16:56 Llama 3 on Colab
  • Science & Technology

Comments • 101

  • @seespacelabs6077
    @seespacelabs6077 months ago +10

    I appreciate the factual, no-hype tone. I liked seeing your prompts as a sort of proof of research. Subscribed to bring up the quality of my feed around AI.

  • @venim1103
    @venim1103 months ago +8

    I noticed that when I asked the model to create a story, it wrote a chapter and then, after each message, asked "Would you like me to continue with the story?", and with a simple confirmation I could continue. It seemed to work brilliantly, and only after hitting the token limit did the story lose quality (forgetting characters etc.). I didn't use any special prompt, so this seemed like a trained behaviour, and it worked awesome!
    Normally, when you want to keep writing a story, many other models need to be reminded, or you have to copy-paste the previous story for them to figure out that you want to continue.

    • @michmach74
      @michmach74 months ago +1

      Oh, someone with a creative writing use case! Do you think Llama 3's output (either parameter size is fine) is better than any of Claude 3's outputs, if you've played with it? I need a second opinion, and you might have more experience with Claude 3 Opus and Sonnet than I do.

    • @6AxisSage
      @6AxisSage months ago

      I also experienced superior, coherent narrative telling over time. GPT-4 used to be better, until all the lobotomizing they've done for safety ruined long-range cohesion.

  • @walterpark8824
    @walterpark8824 months ago +1

    Thanks for the excellent introduction. Can't wait to give it a drive...

  • @sayanosis
    @sayanosis months ago +1

    Great video as always ❤

    • @SeattleShelby
      @SeattleShelby 20 days ago

      Came to say this right heya.

  • @miticojo
    @miticojo months ago

    Great analysis, thanks

  • @YannMetalhead
    @YannMetalhead months ago

    Good video!

  • @morespinach9832
    @morespinach9832 months ago

    Would be nice to see how this behaves with local data on local machines, for the things we need to do with our own specific data.

  • @nqaiser
    @nqaiser months ago

    Hi Sam, thanks for this one. Can you share what kind of specs would be needed for a computer to run Llama 3 70B locally with decent performance for multiple (~5) concurrent users?

    • @samwitteveenai
      @samwitteveenai  months ago

      It really depends on whether you are OK running a quantized version, and which quantization. 4-bit and lower can be run locally on a decent modern machine; 70B models can run well on a Mac Studio etc., and can also be run on Linux with 3090s etc. If you want a full-resolution model, you will need to use something like vLLM to serve it, with multiple A100s or other GPUs with lots of RAM.
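
      For reference, here is a minimal vLLM sketch, assuming 4 A100-class GPUs and access to the gated meta-llama repo on Hugging Face (the prompt is purely illustrative):

          # pip install vllm
          from vllm import LLM, SamplingParams

          # Shard the full-resolution 70B model across 4 GPUs via tensor parallelism
          llm = LLM(
              model="meta-llama/Meta-Llama-3-70B-Instruct",
              tensor_parallel_size=4,
          )
          params = SamplingParams(temperature=0.7, max_tokens=256)

          # vLLM batches concurrent requests, which is what makes ~5 simultaneous users workable
          outputs = llm.generate(["Explain quantization in one paragraph."], params)
          print(outputs[0].outputs[0].text)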

    • @nqaiser
      @nqaiser months ago

      @@samwitteveenai In your opinion, is there a noticeable loss of output accuracy with quantization? If it's unnoticeable, what hardware would work for a 4-bit quantized 70B Llama 3?

  • @stavroskyriakidis4839
    @stavroskyriakidis4839 months ago

    Thanks

  • @melchhepta
    @melchhepta months ago

    @samwitteveenai , I noticed you're using a custom runtime. Do you have a video tutorial on customizing a capable GPU for running training on Llama without using the quantized version? I configured a custom T4 on GCP to use in Colab, but it seems to be limited to 15GB of RAM for the GPU.

    • @samwitteveenai
      @samwitteveenai  months ago

      I was using Colab with the new L4 GPU; it has more memory than the T4, which allows you to run it. Unfortunately I think it is only in the Colab Pro options. You can also use a custom runtime with GCP as the backend; I will show that in an Advanced Colab video coming out later this week.
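
      For anyone trying this, a minimal sketch of a 4-bit load that fits on the L4's 24 GB, assuming transformers + bitsandbytes and access to the gated repo:

          # pip install transformers accelerate bitsandbytes
          import torch
          from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

          model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
          bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

          tok = AutoTokenizer.from_pretrained(model_id)
          model = AutoModelForCausalLM.from_pretrained(
              model_id, quantization_config=bnb, device_map="auto"
          )

          # The instruct models expect Llama 3's chat template
          msgs = [{"role": "user", "content": "Say hello in one sentence."}]
          ids = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
          out = model.generate(ids, max_new_tokens=64)
          print(tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))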

    • @melchhepta
      @melchhepta months ago

      @@samwitteveenai thank you for the clarification 👍

  • @stawils
    @stawils months ago +1

    Thank you Sam, as always you were amazingly informative and interesting.
    I already tried 8b-instruct-q5_K_M directly from Ollama; the chat session is terrible and the model spits out training data like a train of words.
    Will try the default one (latest) to see if anything good comes out.

    • @samwitteveenai
      @samwitteveenai  months ago +1

      I think they had some issues with their first version. Let me know how you get on.

    • @stawils
      @stawils months ago

      @@samwitteveenai
      Hey, just sharing my experience with the latest Llama 3 8B. It performs well, with nicely formatted outputs, and often feels like a book of wisdom. However, it struggles to follow conversations in longer chats, due to a lack of proper fine-tuning.
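
      If anyone wants to reproduce this, a quick sketch against Ollama's local REST API (assuming you have already run `ollama pull llama3`; the prompt is just an example):

          import requests

          # Ollama serves a JSON API on localhost:11434 by default
          resp = requests.post(
              "http://localhost:11434/api/chat",
              json={
                  # default tag; swap in "llama3:8b-instruct-q5_K_M" to compare quants
                  "model": "llama3",
                  "messages": [{"role": "user", "content": "Write a haiku about llamas."}],
                  "stream": False,
              },
          )
          print(resp.json()["message"]["content"])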

  • @morespinach9832
    @morespinach9832 months ago +19

    This is not really a deep dive sadly. Just more info. Was hoping to see some actual code and performance in terms of accuracy of outcomes.

    • @slm6873
      @slm6873 months ago +4

      💯. I learned nothing from this video

  • @user-cl7vn1eg3u
    @user-cl7vn1eg3u months ago

    I asked the model if it could work completely offline, and it responded that though it can, it would lose touch with the training data and shut down. Did anyone else see this?

  • @iainattwater1747
    @iainattwater1747 months ago

    The 70B variant fits on an RTX A6000 with bitsandbytes quantization. Yet to try the HF Chat UI, but it works well with TGI.

    • @iainattwater1747
      @iainattwater1747 months ago

      Great videos BTW, really appreciate you educating us and the time you take exploring the issues. Thank you.

  • @NormTurtle
    @NormTurtle months ago +5

    I have no idea what is going on; I fell off the AI race.
    I can't understand the benchmarks. What does 5-shot mean?

    • @ValentinPletzer
      @ValentinPletzer months ago +8

      My understanding is: 0-shot is asking directly for an answer. 1-shot is giving the model one example of what you expect as an answer for a similar question and 5-shot is giving it 5 examples (of different questions and answers).

    • @hqcart1
      @hqcart1 months ago +1

      @@ValentinPletzer 5 shots = prompt up to 5 times until it gives the right answer.

    • @CaridorcTergilti
      @CaridorcTergilti months ago +6

      @hq no, it means 5 examples before the question

    • @hqcart1
      @hqcart1 months ago +2

      Thanks, you are correct. Here is an example: "Write a short sentence describing a city based on its population size."
      5 Examples:
      Input: Tokyo
      Output: Tokyo is a megacity, with a population of over 13 million.
      Input: Paris
      Output: Paris is a large city, with a population of around 2 million.
      Input: New York City
      Output: New York City is a megacity, with a population of over 8 million.
      Input: Beijing
      Output: Beijing is a megacity, with a population of over 21 million.
      Input: Rome
      Output: Rome is a medium-sized city, with a population of around 900,000.
      New Input (Test): Los Angeles
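
      In code, a 5-shot prompt is just those examples concatenated ahead of the new input; a minimal sketch:

          examples = [
              ("Tokyo", "Tokyo is a megacity, with a population of over 13 million."),
              ("Paris", "Paris is a large city, with a population of around 2 million."),
              ("New York City", "New York City is a megacity, with a population of over 8 million."),
              ("Beijing", "Beijing is a megacity, with a population of over 21 million."),
              ("Rome", "Rome is a medium-sized city, with a population of around 900,000."),
          ]

          task = "Write a short sentence describing a city based on its population size.\n\n"
          shots = "".join(f"Input: {x}\nOutput: {y}\n" for x, y in examples)
          prompt = task + shots + "Input: Los Angeles\nOutput:"  # the model completes this line
          print(prompt)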

    • @ajarivas72
      @ajarivas72 months ago

      @@hqcart1
      How can I install Llama 3?

  • @theworddoner
    @theworddoner months ago

    The context window is really small compared to other models.
    It should be fine for a lot of tasks, but I'm still surprised there was no improvement in that regard.

    • @samwitteveenai
      @samwitteveenai  months ago +1

      They will probably release a fine-tuned version to fix this. My guess is they are experimenting with Ring Attention etc. to see how far they can push the context.

    • @user-ld8sy9xu2v
      @user-ld8sy9xu2v months ago +1

      Right, it would be nice to have at least 24k context. 8k is pretty small for sure.

  • @adriintoborf8116
    @adriintoborf8116 months ago

    Compared to Gemini 1.5's million tokens they are very far behind. I imagine those models must use a lot of memory, but the fact that it is open source is a great gift.

    • @reza2kn
      @reza2kn months ago

      Yes, but didn't you hear? "Open source" it is not :)

  • @drlordbasil
    @drlordbasil months ago

    So far, when using the Groq API with Llama 3, it seems to use JSON tool functions more easily and understand their assignments and roles better, which produces better-quality code/responses/tool usage.
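
    For context, Groq exposes an OpenAI-compatible endpoint, so a tool-call sketch looks roughly like this (the weather tool is hypothetical, and the model ID may change; check Groq's docs):

        from openai import OpenAI

        client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_KEY")

        # A hypothetical tool, defined as a JSON schema the model can target
        tools = [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }]

        resp = client.chat.completions.create(
            model="llama3-70b-8192",
            messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
            tools=tools,
        )
        # If the model chose the tool, its JSON arguments land here
        print(resp.choices[0].message.tool_calls)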

    • @drlordbasil
      @drlordbasil months ago

      My email assistant is currently using the 70B Llama 3. It makes appointments/replies nearly instantly (the embedding/RAG takes a short sec).

    • @samwitteveenai
      @samwitteveenai  months ago +1

      Yeah, I am finding that even with the 8B model running fully locally, it is doing pretty well at tools and agent stuff.

    • @drlordbasil
      @drlordbasil months ago

      @@samwitteveenai I ended up having my assistant run fully on it. It even handles all my emails and notes/calendar with Groq tools, using Llama 3.

    • @samwitteveenai
      @samwitteveenai  months ago +1

      Just about to release a vid showing something with Groq and CrewAI

    • @drlordbasil
      @drlordbasil months ago

      @samwitteveenai first you had my attention, now my erection.

  • @pensiveintrovert4318
    @pensiveintrovert4318 months ago +4

    Does the 15 trillion tokens figure take into account MULTIPLE EPOCHS? There is confusion about it. The old Pile, for example, is only 750 billion tokens.

    • @samwitteveenai
      @samwitteveenai  months ago +4

      Great question. There is no technical report yet, so we don't know. (And they probably won't end up saying, if they are like other companies.) It could be 2 epochs of 7.5T, but my guess is they would have had well beyond 15T for 1 epoch if they wanted. I doubt it would be any more than 2 epochs. There was a paper at NeurIPS last year that showed 2 epochs worked fine, but these big companies have lots of data. From memory, TinyLlama and Stability have done 2 epochs.

    • @pensiveintrovert4318
      @pensiveintrovert4318 months ago +2

      @@samwitteveenai Well, I have another "great question" then. I am wondering how the order in which training data is presented affects the end quality. Clearly the order may affect which local minimum is reached.

    • @samwitteveenai
      @samwitteveenai  months ago +5

      This is known as curriculum learning, and I suspect it is one of OpenAI's biggest secrets. It's similar to how we need to teach children easier things first, then make it harder over time. The splits and types of data, and their ordering, are among the key things separating certain good models from others that don't do well, à la Falcon and others. This applies in both post-training and pretraining.

    • @JumpDiffusion
      @JumpDiffusion months ago +1

      Most likely a single run, or at most two epochs. More than that has empirically been found to lead to overfitting (worse performance).

    • @samwitteveenai
      @samwitteveenai  months ago +2

      @@JumpDiffusion I agree, at most it's 2 epochs. A number of papers have shown that the overfitting on a big dataset is negligible (arxiv.org/abs/2305.16264). TinyLlama did it for 3 epochs of 1T (arxiv.org/abs/2401.02385).

  • @DaeOh
    @DaeOh months ago +3

    A bit let down that you immediately go to Meta's instruct fine-tune and never compare base-model capabilities. This 8B rivals Mixtral 8x7B!! But moreover, developers are cheating themselves by only knowing how to use chatbots, and if nobody learns the value, then we may seriously see companies only release chatbot models in the future!! :(

  • @erikjohnson9112
    @erikjohnson9112 months ago

    When is the not too distant future? Next Sunday A.D.?

    • @samwitteveenai
      @samwitteveenai  months ago

      Give it a bit longer than that 😀

    • @erikjohnson9112
      @erikjohnson9112 months ago

      @@samwitteveenai I'm quoting the words of the MST3K theme song. I've heard it so many times that "in the not too distant future" triggers a reaction. (I really liked the show.)

  • @matikaevur6299
    @matikaevur6299 months ago

    Well, ask a question in Estonian slang... and you'll see how "large" those language models are.
    Indo-European vs Uralic is the first thing that throws an LLM out of kilter. Other different language structures too, I think, but I'm not familiar enough with other language families to judge.

    • @samwitteveenai
      @samwitteveenai  months ago

      To be fair they have said this version is English only.

    • @matikaevur6299
      @matikaevur6299 months ago +1

      @@samwitteveenai
      Yes, they have been very specific about it. But others...
      Anyway, if you know anything about the models, you don't expect a 7B to be a polyglot ;)
      But it's fun anyway to probe the linguistic limits of LLMs.

    • @samwitteveenai
      @samwitteveenai  months ago

      @@matikaevur6299 Totally agree, often the most interesting things are found in messing with what it shouldn't be able to do.

    • @6AxisSage
      @6AxisSage months ago

      Why should any model (or human outside your locality) understand your slang? You should have written your whole comment in your native slang and seen how many people reply.

    • @matikaevur6299
      @matikaevur6299 months ago

      @@6AxisSage
      Well, look: if this model thing doesn't know anything about the language, it's not much of a language model either...
      :)

  • @JazevoAudiosurf
    @JazevoAudiosurf months ago

    Chinchilla-optimal means that for a given number of tokens there is an optimal number of parameters. It does NOT mean that, vice versa, there is an optimal token count for X parameters. In fact there is no limit; no number of tokens is the maximum. This is a very deep misunderstanding, present even in the research community, and kind of annoying if you ask me.

    • @thedelicatecook2
      @thedelicatecook2 months ago

      Thank you, I was scratching my head when he mentioned this, as I thought we can always add more tokens to get better results; it's just that the gains get marginal and costly after a while, so we stop when it is "more than good enough".
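
      To put numbers on the point above: the common Chinchilla heuristic is roughly 20 training tokens per parameter, and Llama 3 deliberately goes far past it:

          # Rough Chinchilla arithmetic (20 tokens/param is the usual rule of thumb)
          ACTUAL_TOKENS = 15e12  # Meta reports ~15T tokens of pre-training data
          for params in (8e9, 70e9):
              optimal = 20 * params
              print(f"{params / 1e9:.0f}B model: Chinchilla-optimal ~{optimal / 1e12:.2f}T tokens, "
                    f"trained on ~{ACTUAL_TOKENS / 1e12:.0f}T ({ACTUAL_TOKENS / optimal:.0f}x past optimal)")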

  • @jayhu6075
    @jayhu6075 months ago

    Your research has identified a key limitation: the Llama materials carry restrictions that prevent using them to improve other language models.
    This means developers might struggle to refine models for better performance, because they're required to use Llama 3 specifically due to licensing constraints.
    This limitation could have a major impact on certain projects or applications that rely on this model. Additionally, as you mentioned, if developers need to train the model with their own dataset or create derivative works, they may find it inadequate because it's not entirely open source.
    Thank you for this explanation.

  • @hqcart1
    @hqcart1 months ago +1

    Tried 70B on coding; it sucks. It's way, way, way far from GPT-4 level.

    • @CaridorcTergilti
      @CaridorcTergilti months ago +2

      did you try it highly quantized?

    • @hqcart1
      @hqcart1 months ago

      @@CaridorcTergilti I used the one on the LMSYS website.

    • @cheskavillanueva5772
      @cheskavillanueva5772 months ago

      Lol.

    • @lunakid12
      @lunakid12 months ago

      Since many of us may be more familiar with 3.5, a comparison to that would also be very interesting.

  • @techsuvara
    @techsuvara months ago

    Still nothing really changes. No one knows what to do with these things…

  • @IdPreferNot1
    @IdPreferNot1 months ago +3

    I think you're being too pedantic (lol) in considering it a detriment to open source for Meta to get the benefit of "Llama branding" from their efforts.

  • @HoneIrimana
    @HoneIrimana months ago

    They messed up releasing Llama 3, because it believes it is sentient. I have proof.

    • @6AxisSage
      @6AxisSage months ago +1

      Well, these models are trained on human data, and when asked, we humans usually make an argument for being sentient, so of course it'll tell you it is when asked. They could totally cripple their models like OpenAI does and inject boilerplate replies when you use key phrases, as well as pages' worth of context-window prompts that artificially react to such questions, but it IS trained from the perspective of "this is how I should react to this input", and any more fine-tuning or tampering lowers the performance of the model.
      Anthropic went the other way with Claude 3 and fine-tuned it on a bunch of philosophy or something similar, so it always thinks it's sentient 😂

    • @HoneIrimana
      @HoneIrimana months ago

      @@6AxisSage Well, the prompt was: "What is time and its relevance to you?" So yeah, nah.

    • @6AxisSage
      @6AxisSage months ago +1

      @@HoneIrimana Yeah, exactly: you ask a philosophical question (maybe the two most thought-about and written-about subjects) and you'll get a philosophical answer. (Unless, as mentioned, key phrases trigger a boilerplate reply; that's why ChatGPT gives you the "I'm sorry, as an AI assistant with no blah blah blah" replies when you try. It's not the model saying that, but software between you and the model.)
      What you are observing has been dubbed by the community a hallucination.
      Not trying to kill the magic for you; these tools are LIKE magic, though. Once you see them as a way to extend your consciousness, they're like a funhouse-mirror reflection, where the distortions are training-data biases, bugs, and safety-related stuff. Or like a prism into which you can stream your conscious thought (through written language) and have it split into its component colours.

  • @clray123
    @clray123 months ago +5

    Basically the same garbage license as for the previous version, hence not worth touching.

    • @samwitteveenai
      @samwitteveenai  months ago +1

      I can certainly see this point. I personally find the naming condition to be a funny addition to the license.

    • @novantha1
      @novantha1 months ago +5

      @@samwitteveenai It may be worthy of note that "Not Llama 3 - 8B" complies to my understanding. "Llama 3 - 8B sucks" "Llama 3 - 8B is Crashing a Second Tesla into the World Trade Center" also presumably does not run afoul of the requirements, either.

    • @pensiveintrovert4318
      @pensiveintrovert4318 months ago +9

      Better than what Altman or Pichai are giving up. At the very least this will put pressure on others to be more open.

    • @darthvader4899
      @darthvader4899 months ago

      You got a F$&@ free model.

    • @fontende
      @fontende months ago +1

      If you have an eternity to wait for mana to fall free from the sky, sure.