How the Gemma/Gemini Tokenizer Works - Gemma/Gemini vs GPT-4 vs Mistral

  • Premiered 24 Feb 2024
  • In this video, we go under the hood of the Gemini, Gemma-7B, and Gemma-2B tokenizers. We look at the large vocabulary, the impact it has on the size of the model, and how Google has put a focus on people, places, culture, languages, and things over an efficient vocabulary and frequent sub-words. Chris also introduces his new tokenizer benchmark test, dataset, and tokenizer visualizer tools.
    GitHub
    ---------------
    github.com/chrishayuk/tokeniz...
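The "impact on the size of the model" mentioned in the description comes largely from the input embedding table, which grows linearly with vocabulary size. A rough sketch of that arithmetic, using approximate public figures as assumptions (the exact vocabulary and hidden sizes below are not taken from the video):

```python
# Rough sketch: how vocabulary size drives embedding-table size.
# Figures are approximate public numbers, used here as assumptions.
models = {
    # name: (vocab_size, hidden_dim)
    "gemma-7b": (256_000, 3072),
    "mistral-7b": (32_000, 4096),
    "gpt-4 (cl100k, ~100k vocab)": (100_000, None),  # hidden dim not public
}

def embedding_params(vocab_size, hidden_dim):
    """Input embedding table alone: one hidden_dim vector per token."""
    return vocab_size * hidden_dim

for name, (vocab, dim) in models.items():
    if dim is None:
        print(f"{name}: vocab={vocab:,}, hidden dim not public")
        continue
    params = embedding_params(vocab, dim)
    print(f"{name}: vocab={vocab:,}, embedding params ≈ {params / 1e6:.0f}M")
```

With these assumed sizes, Gemma-7B's embedding table alone is roughly 786M parameters versus about 131M for Mistral-7B, which is why a 256k vocabulary is such a large design commitment.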
  • Science & Technology

Comments • 10

  • @Aberger789
    @Aberger789 4 months ago +2

    Well, it's 2am, and I can't wait to watch your other videos. I am building some RAG implementations with scientific journals from PDF, and feeling like I'm going in circles. Taking a step back and considering the bigger concepts is helping. Great format for learning, I really appreciate your time!

    • @chrishayuk
      @chrishayuk 4 months ago

      glad you're enjoying it, you might wanna check out my RAG video, and listen to my stoopid poems

  • @cybermanaudiobooks3231
    @cybermanaudiobooks3231 5 months ago +2

    Great video. Companion piece to Andrej Karpathy's most recent. Very insightful. Thanks!

    • @chrishayuk
      @chrishayuk 5 months ago

      Thank you, glad it’s useful. This one was a video I’ve been trying to get right for a while

  • @smithnigelw
    @smithnigelw 5 months ago +2

    Thanks Chris. Very interesting how they have chosen the vocabulary. For representation of programs in Python, how do they tokenise the white-space? I’m looking forward to the video on embedding.

    • @chrishayuk
      @chrishayuk 5 months ago

      it's a similar approach to llama, because not every language separates words using whitespace. i'll maybe cover that in a future video. i will update the programming languages in the dataset; i didn't have time to merge all the other versions back in (where python was covered)
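The reply above refers to the SentencePiece convention that Llama-family and Gemma tokenizers share: whitespace is not treated as a hard separator but is encoded into the tokens themselves as a metasymbol (▁, U+2581), which is why languages without spaces tokenize the same way and why Python indentation survives. A minimal sketch of that convention (an illustration only, not the real SentencePiece implementation, which also prepends ▁ to word-initial pieces):

```python
# Minimal sketch of SentencePiece-style whitespace handling.
# Assumption: this mimics the convention; it is not the real library.
META = "\u2581"  # "▁", the SentencePiece whitespace metasymbol

def to_sp_pieces(text):
    """Replace spaces with the metasymbol so whitespace lives inside tokens."""
    return text.replace(" ", META)

def from_sp_pieces(pieces):
    """Invert the mapping: metasymbols become spaces again."""
    return pieces.replace(META, " ")

s = "def main():    pass"
encoded = to_sp_pieces(s)
assert from_sp_pieces(encoded) == s  # lossless round trip, runs of spaces kept
print(encoded)  # def▁main():▁▁▁▁pass
```

Because each space becomes its own visible symbol, runs of spaces (Python indentation) round-trip exactly rather than being collapsed the way a whitespace-splitting tokenizer would.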

  • @reza2kn
    @reza2kn 5 months ago +2

    This is wonderful! The dataset alone is super useful to have, and the video walkthrough was really awesome for someone who's just trying to understand what's what here :D Please keep doing what you're doing! One thing I have been interested in is visualizing the entire vocabulary inside a tokenizer to actually see what's inside, but in an easy-to-explore way. I tried word clouds and they didn't work at all. Do you have any ideas?
    I'm also super interested in fine-tuning models to teach them another language and in using agents, but not just looking at code for 30 minutes. Specific, real-world use-cases with applied examples. I think YouTube is really lacking that at the moment.
    P.S.: Cool glasses :)
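One low-tech alternative to word clouds for the vocabulary-exploration question above is to bucket tokens by simple properties (length, leading ▁, script) and inspect the buckets. A sketch over a toy vocabulary (a real run would load the tokenizer's actual vocab file; the tokens below are stand-ins):

```python
from collections import Counter

# Sketch: summarize a tokenizer vocabulary by token length instead of a
# word cloud. The tiny vocab below is a stand-in for a real vocab file.
toy_vocab = ["the", "▁the", "ing", "▁to", "tion", "a", "▁", "▁London", "漢"]

length_hist = Counter(len(tok) for tok in toy_vocab)
for length in sorted(length_hist):
    count = length_hist[length]
    print(f"len {length}: {'#' * count} ({count})")
```

Sorting the same buckets by count, or splitting them by Unicode script, gives a browsable summary of what a 256k vocabulary actually spends its slots on.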

    • @chrishayuk
      @chrishayuk 5 months ago +2

      thank you, glad it's useful. you might find my next video on embeddings useful for visualization (no spoilers :). as for fine-tuning, i recently downloaded a lot of english-welsh translations and was planning to do a video on that. i was going to use llama2-7b as i know it doesn't do welsh. i might do it with gemma, but i'm not sure if it does welsh already. regardless, i'll be doing a language fine-tune video soon

  • @garyhamilton2104
    @garyhamilton2104 5 months ago +1

    Commenting cuz I know Chris will give me a heart :)

    • @chrishayuk
      @chrishayuk 5 months ago

      because i love you all