Extending Llama-3 to 1M+ Tokens - Does it Impact the Performance?

  • Published on Jun 7, 2024
  • In this video, we look at the 1M+ context version of the best open LLM, Llama-3, built by Gradient AI.
    🦾 Discord: / discord
    ☕ Buy me a Coffee: ko-fi.com/promptengineering
    |🔴 Patreon: / promptengineering
    💼Consulting: calendly.com/engineerprompt/c...
    📧 Business Contact: engineerprompt@gmail.com
    Become Member: tinyurl.com/y5h28s6h
    💻 Pre-configured localGPT VM: bit.ly/localGPT (use Code: PromptEngineering for 50% off).
    Signup for Advanced RAG:
    tally.so/r/3y9bb0
    LINKS:
    Model: ollama.com/library/llama3-gra...
    Ollama tutorial: • Ollama: The Easiest Wa...
    TIMESTAMPS:
    [00:00] LLAMA-3 1M+
    [00:57] Needle in Haystack test
    [02:45] How it's trained?
    [03:32] Setting Up and Running Llama3 Locally
    [05:45] Responsiveness and Censorship
    [07:25] Advanced Reasoning and Information Retrieval
    All Interesting Videos:
    Everything LangChain: • LangChain
    Everything LLM: • Large Language Models
    Everything Midjourney: • MidJourney Tutorials
    AI Image Generation: • AI Image Generation Tu...
  • Science & Technology

Comments • 43

  • @engineerprompt
    @engineerprompt  months ago +5

    CORRECTION: There is a mistake in the long-context test in the video [the haystack test toward the end (the Tim Cook and Apple question)]. If you set the context length in an ollama session and then exit it, you have to set the context length again in the new session; parameters set in one session do not persist across sessions. An oversight on my end, and thanks to everyone for pointing it out. (A minimal example is sketched after this thread.)

    • @antonvinny
      @antonvinny months ago +1

      So did it work correctly after setting the context length?
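
For reference, a minimal sketch of the fix described in the pinned correction, assuming a local ollama server on the default port with the llama3-gradient model already pulled: since parameters do not persist across sessions, the context length has to be supplied again in every new interactive session (by repeating the parameter-setting step) or, as below, with every API request.

```python
# Minimal sketch: the context window must be set per session/request;
# it is not remembered after the previous ollama session ends.
# Assumes a local ollama server (default port 11434) and `ollama pull llama3-gradient`.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(prompt: str, num_ctx: int = 256_000) -> str:
    """Send one prompt, explicitly setting the context length each time."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": "llama3-gradient",
            "prompt": prompt,
            "stream": False,
            # 'options' apply only to this request; a fresh session without
            # them falls back to the default (much smaller) context window.
            "options": {"num_ctx": num_ctx},
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask("Who is the CEO of Apple?"))
```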

  • @john_blues
    @john_blues months ago +4

    The AI went from scholarly professor to unintelligible drunk quite quickly.

  • @user-cl7vn1eg3u
    @user-cl7vn1eg3u months ago +8

    I've been testing it. It has a hallucination issue when large text is put in. However, the writing is good, so even the hallucinations are interesting.

    • @maxieroo629
      @maxieroo629 months ago +1

      Have you tried lowering the temperature?

  • @abadiev
    @abadiev months ago

    Need more information about this model. How does it work with AutoGen or other agent systems?
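
Not something covered in the video, but as a rough sketch: recent ollama versions expose an OpenAI-compatible endpoint, so most agent frameworks (AutoGen included) can use the local model by pointing their OpenAI-style client at it. The snippet assumes ollama is serving llama3-gradient on the default port.

```python
# Sketch: reach the local model through ollama's OpenAI-compatible endpoint,
# which is the usual way to plug it into agent frameworks such as AutoGen.
# Assumes `pip install openai` and an ollama server with llama3-gradient pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string; not checked locally
)

reply = client.chat.completions.create(
    model="llama3-gradient",
    messages=[{"role": "user", "content": "In one sentence, what is a 1M-token context useful for?"}],
)
print(reply.choices[0].message.content)
```

An agent framework would reuse the same base_url/model pair in its model configuration; whether the long context holds up under multi-turn agent traffic is exactly the open question here.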

  • @sergeaudenaert
    @sergeaudenaert months ago +7

    Thank you for the video. When you exited and reran the model, shouldn't you also have reset the context window to 256K?

    • @engineerprompt
      @engineerprompt  months ago

      That's a valid point. I thought (mistakenly) that it persists for the LLM, but it seems you actually have to set it for each session.

  • @supercurioTube
    @supercurioTube months ago

    Thanks a lot for this showcase! That test you've done is fantastic:
    "A glass door has 'push' on it in mirror writing. Should you push or pull it?
    Please think out loud step by step."
    I've tried it with several llama3 8B variants, down to the llama3:8b-instruct-q4_1 quantization, which finishes quickly with a spot-on: "So, to answer the question: You should pull the glass door."
    I can reproduce the infinite output you get with llama3-gradient:8b-instruct-q5_K_M.
    So something is indeed broken in this fine-tune for larger contexts. I was hoping to leverage a larger context in an application with llama3, but I guess this won't be the model for that. (A reproduction sketch follows this thread.)

    • @supercurioTube
      @supercurioTube months ago

      I've also tried the full fp16 from Ollama.
      It does seem to stop consistently, but it answers incorrectly:
      "Conclusion: Even though the word is written backwards when looking from within your reflection, I should still try opening the glass doors by pushing them like they would say."

    • @engineerprompt
      @engineerprompt  months ago +1

      There are 16K and 64K versions that are fine-tuned. It might be interesting to look into those.

    • @supercurioTube
      @supercurioTube months ago +1

      @engineerprompt Thanks for the suggestion, I will 😌
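
A small reproduction sketch of the comparison in the thread above: run the same glass-door prompt against the two tags mentioned, capping generation so a non-terminating model cannot hang the script. It assumes a local ollama server with both tags pulled; num_predict is ollama's limit on generated tokens.

```python
# Sketch: run the same reasoning prompt against two quantizations and see
# which one terminates on its own and what it concludes.
import requests

PROMPT = (
    "A glass door has 'push' on it in mirror writing. Should you push or pull it? "
    "Please think out loud step by step."
)

TAGS = [
    "llama3:8b-instruct-q4_1",             # reported above: stops, answers "pull"
    "llama3-gradient:8b-instruct-q5_K_M",  # reported above: runs on forever
]

for tag in TAGS:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": tag,
            "prompt": PROMPT,
            "stream": False,
            # Hard cap on output so a model that never emits an end-of-turn
            # token still returns something to inspect.
            "options": {"num_predict": 512},
        },
        timeout=600,
    )
    r.raise_for_status()
    body = r.json()
    print(f"--- {tag} ---")
    print(body["response"][-400:])                       # the conclusion is usually near the end
    print("tokens generated:", body.get("eval_count"))   # ~512 means it hit the cap
```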

  • @smartduck904
    @smartduck904 months ago

    So I guess this will not run on a GTX 1080 TI?

  • @henkhbit5748
    @henkhbit5748 months ago

    Thanks for the update. Has anybody tried doing "real" RAG with multiple documents? And can't you access it using Groq?

    • @engineerprompt
      @engineerprompt  months ago

      You can look into localGPT :)

  • @ikjb8561
    @ikjb8561 months ago +2

    Due to the autoregressive nature of LLMs, the chance of producing an error compounds exponentially with every passing token. Be careful what you wish for.
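
The compounding-error point can be made concrete with a toy calculation: if each generated token is independently acceptable with probability p, the chance that an n-token output is error-free is p^n, which shrinks geometrically with length. The independence assumption and the numbers below are purely illustrative.

```python
# Toy illustration of error compounding in autoregressive generation:
# if each token is acceptable with probability p (assumed independent),
# the chance of a fully error-free sequence of n tokens is p**n.
p = 0.999  # hypothetical per-token reliability

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    # the last value underflows to 0.0 in double precision -- effectively impossible
    print(f"{n:>9} tokens: P(no error) = {p**n:.3e}")
```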

  • @jeffwads
    @jeffwads months ago +5

    Yes, without multiple needle runs, the test is pretty weak.
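
For context, a sketch of what multiple needle runs could look like: place the needle at several depths in the filler text and score retrieval across the set, rather than trusting a single placement. The filler, needle, and pass criterion here are placeholders, not the benchmark the model authors used.

```python
# Sketch of a multi-run needle-in-a-haystack check: vary where the needle sits
# in the context and score retrieval across runs instead of a single attempt.
import requests

FILLER = "The sky was clear and the market was quiet that day. " * 2000
NEEDLE = "The secret passphrase is 'blue-giraffe-42'."
QUESTION = "What is the secret passphrase? Answer with the passphrase only."

def build_haystack(depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + "\n" + NEEDLE + "\n" + FILLER[cut:]

hits = 0
depths = [0.1, 0.3, 0.5, 0.7, 0.9]
for depth in depths:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3-gradient",
            "prompt": build_haystack(depth) + "\n\n" + QUESTION,
            "stream": False,
            "options": {"num_ctx": 64_000},  # must cover the haystack plus the question
        },
        timeout=600,
    )
    r.raise_for_status()
    found = "blue-giraffe-42" in r.json()["response"]
    hits += found
    print(f"depth {depth:.1f}: {'retrieved' if found else 'missed'}")

print(f"retrieved {hits}/{len(depths)} needles")
```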

  • @Vadinaka
    @Vadinaka months ago

    May I ask which system you are using to run this?

    • @engineerprompt
      @engineerprompt  months ago

      I am using an M2 Max with 96GB to run this.

  • @user-gp6ix8iz9r
    @user-gp6ix8iz9r months ago +1

    Can you do a review of AirLLM? It lets you run a 70B model on 4GB of VRAM.

    • @engineerprompt
      @engineerprompt  months ago

      Haven't seen that before. Will explore what it is.

    • @HassanAllaham
      @HassanAllaham months ago

      Does it let us run a model of that size without a GPU, i.e., on CPU only?

  • @hoblon
    @hoblon months ago +2

    You need to set the context size in each session, not just once. That's why the needle test failed.

    • @JoeBrigAI
      @JoeBrigAI months ago +3

      The setting isn't persistent? Major oversight in the video if that's the case.

    • @hoblon
      @hoblon months ago +1

      @JoeBrigAI They persist within a session. Once you enter /bye, that's it.

    • @engineerprompt
      @engineerprompt  months ago +1

      That is true. I thought it worked otherwise. I've added a pinned comment to highlight this.

  • @unclecode
    @unclecode months ago

    Interesting, this one didn't bring a ladder to the party for the joke, haha. As for the model not stopping, it's probably related to RoPE (Rotary Position Embedding). If someone messed with that, generation could go on forever. In any case, quantization definitely affects the model's behavior. (See the sketch after this thread.)

    • @engineerprompt
      @engineerprompt  months ago

      Haha, that's true. Pleasantly surprised by the joke :) That's actually a good point about RoPE.
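
To make the RoPE point more concrete: long-context fine-tunes typically raise the rotary base (rope theta), which stretches the positional wavelengths so positions far beyond the original training length still get distinct, slowly varying encodings; if that base does not match what the rest of the model expects, degenerate or never-ending output is a plausible failure mode. Below is a small sketch of how the base changes the per-dimension wavelengths; the larger base value is hypothetical, not the exact one Gradient used.

```python
# Sketch: RoPE gives each pair of hidden dimensions a rotation frequency
#   inv_freq[i] = 1 / base**(2*i / dim)
# Raising `base` (rope theta) stretches the longest wavelengths, which is the
# usual lever in long-context fine-tunes.
import math

def rope_wavelengths(dim: int = 128, base: float = 500_000.0):
    """Wavelength in tokens of each rotary frequency for a given base."""
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

# Llama-3's default base is 500,000; the second value is a made-up larger base.
for base in (500_000.0, 8_000_000.0):
    w = rope_wavelengths(base=base)
    print(f"base={base:>12,.0f}  shortest={w[0]:6.1f} tokens  longest={w[-1]:16,.0f} tokens")
```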

  • @GetzAI
    @GetzAI months ago

    you need to pick up an M4 Mac Studio when it comes out ;)

  • @vertigoz
    @vertigoz months ago +1

    Phi-3 128K got worse than the 4K version when I had it analyze a program I gave it.

  • @R0cky0
    @R0cky0 months ago

    13:16 It appears the LLM was suffering from schizophrenia at that moment 😅

  • @8eck
    @8eck months ago

    100+ GB of VRAM for a 4-bit quantized model? 🙄 Are you sure that was the quantized one?
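
The 100+ GB figure is plausible even with 4-bit weights once the KV cache is counted. A rough back-of-envelope, assuming Llama-3-8B's architecture (32 layers, 8 KV heads of dimension 128) and an un-quantized fp16 cache; the exact memory reported in the video may differ by runtime.

```python
# Back-of-envelope: at ~1M tokens the KV cache, not the 4-bit weights,
# dominates memory. Assumes Llama-3-8B (32 layers, 8 KV heads, head dim 128)
# and an fp16 KV cache; actual usage depends on the inference runtime.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                       # fp16
ctx = 1_048_576                           # ~1M tokens

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # keys and values
kv_cache_gib = per_token * ctx / 1024**3
weights_gib = 8e9 * 0.5 / 1024**3         # ~8B params at 4 bits each

print(f"KV cache per token : {per_token / 1024:.0f} KiB")    # 128 KiB
print(f"KV cache @ 1M ctx  : {kv_cache_gib:.0f} GiB")        # ~128 GiB
print(f"4-bit weights      : {weights_gib:.1f} GiB")         # ~3.7 GiB
```

So quantization keeps the weights small; it is the attention cache at 1M tokens that pushes the requirement past 100 GB (runtimes that quantize or offload the KV cache can reduce this).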

  • @acekorneya1
    @acekorneya1 months ago +1

    The issue with all these "benchmarks" is that they are all lies. We need a better, real benchmark for LLMs, because what we get from the people who make these models is all lies. They don't perform well when it comes down to doing real work, or anything in real production. They all suck compared to closed models. It's like the people who benchmark them show very cherry-picked examples.

    • @ritpop
      @ritpop months ago

      Yes, some models are good for daily use and in some cases better than ChatGPT's GPT-3.5, but I've never used one that comes close to GPT-4. And in some use cases GPT-3.5 is still better than Mistral, in my own experience. So they really should publish real benchmarks.

  • @farazfitness
    @farazfitness months ago

    Lmao, 64GB of VRAM. I'm using an RTX 4070, which only has 8GB of VRAM.

    • @engineerprompt
      @engineerprompt  months ago

      :)

    • @kecksbelit3300
      @kecksbelit3300 27 days ago

      How did you manage to pick up an 8GB 4070? Even the Founders Edition has 12GB.

    • @farazfitness
      @farazfitness 27 days ago

      @kecksbelit3300 I'm using an Acer Predator Helios Neo 16 laptop.

  • @jamesvictor2182
    @jamesvictor2182 months ago

    Why are you using ollama and not llama.cpp directly?

  • @HappySlapperKid
    @HappySlapperKid months ago

    64GB of VRAM 😂