Pride and BPE: How We Solved Tokenization but Got It Wrong

  • Published on Oct 17, 2024

Comments • 9

  • @anaindreias9835
    @anaindreias9835 14 days ago

    What a coincidence! I started working on my MSc thesis at EPFL this week, and was going through your papers on BPE and tokenisation formalisation. This video was suggested on my feed. Only now noticed you were one of the authors! 😮 Thank you so much for your contributions to the field!

  • @mailailuan
    @mailailuan 5 months ago +1

    This was very enlightening, thank you!

  • @argh44z
    @argh44z 5 months ago +1

    useful talk, thanks

  • @GeoffLadwig
    @GeoffLadwig 5 months ago +1

    Great video, thanks. It does seem like it is hard to consider tokenization independent of embedding - at least when using tokenization for transformers.

    • @zouharvi
      @zouharvi  5 months ago

      Indeed, though we didn't have enough paper space or manpower to look into this. I suppose a good tokenization would lead to the most spread-out embeddings of individual subwords? Or maybe some other measure.
      Have you worked on tokenization yourself?

    • @GeoffLadwig
      @GeoffLadwig 5 months ago +1

      @@zouharvi No, just watching and using it. I have been trying to understand how tokenization impacts RAG - which seems tightly linked to embedding.

  • @hunterkudo9832
    @hunterkudo9832 5 months ago +2

    The volume could have been higher.
    Feedback for the next video.

  • @GeoffryGifari
    @GeoffryGifari 5 months ago +1

    Hmmm what if there's no "best" way to tokenize, but there are only case-dependent optimal ones?

    • @zouharvi
      @zouharvi  5 months ago +2

      For MT, which is representative of multilingual NLG tasks (to some extent), Rényi entropy seems to be a good predictor. However, it does not take into account the nuances of the particular task. For example, for LLMs that have to solve math word problems, there needs to be special consideration of how numbers are tokenized. If we just threw it all at BPE as text, we'd get 10000 as a single token, but 10004 as 10 @@00 @@4. Some LMs solve this by enforcing single-digit tokenization of numbers. The bottom line (as you insinuate) is that the goodness of a tokenization indeed diverges between tasks.
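
      [Editor's note: a minimal sketch of the "single-digit tokenization" workaround mentioned in the reply above. This is a hypothetical pre-tokenization step, not the speaker's or any specific LM's implementation: every digit is isolated before BPE merges run, so 10000 and 10004 receive structurally parallel tokenizations.]

      ```python
      import re

      def split_digits(text: str) -> str:
          """Put whitespace around every digit so a downstream BPE tokenizer
          cannot merge digits into multi-digit tokens like '10000'."""
          spaced = re.sub(r"(\d)", r" \1 ", text)
          # Collapse the extra whitespace introduced around adjacent digits.
          return re.sub(r"\s+", " ", spaced).strip()

      print(split_digits("10000 vs 10004"))
      # -> "1 0 0 0 0 vs 1 0 0 0 4"
      ```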