What a coincidence! I started working on my MSc thesis at EPFL this week, and was going through your papers on BPE and tokenisation formalisation. This video was suggested on my feed. Only now noticed you were one of the authors! 😮 Thank you so much for your contributions to the field!
This was very enlightening, thank you!
useful talk, thanks
Great video, thanks. It does seem like it is hard to consider tokenization independent of embedding - at least when using tokenization for transformers.
Indeed, though we didn't have enough paper space nor manpower to look into this. I suppose a good tokenization would lead to most spaced-out embeddings of individual subwords? Or maybe some other measure.
Have you worked on tokenization yourself?
@zouharvi No, just watching and using it. I have been trying to understand how tokenization impacts RAG - which seems tightly linked to embedding.
The volume could have been higher. Feedback for the next video.
Hmmm what if there's no "best" way to tokenize, but there are only case-dependent optimal ones?
For MT, which is representative of multilingual NLG tasks (to some extent), Rényi entropy seems to be a good predictor. However, it does not take into account the nuances of a particular task. For example, for LLMs that have to solve math word problems, special consideration is needed for how numbers are tokenized. If we just threw it all as text at BPE, we'd get 10000 as a single token, but 10004 as 10 @@00 @@4. Some LMs solve this by enforcing single-digit tokenization of numbers. The bottom line (as you suggest) is that the goodness of a tokenization indeed diverges between tasks.
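To make the single-digit idea concrete, here is a toy pre-tokenization pass (just a sketch of the general trick, not any particular LM's actual implementation; the function name `split_digits` is made up for illustration):

```python
import re

def split_digits(text: str) -> list[str]:
    # Insert spaces around every digit before the subword tokenizer runs,
    # so "10000" and "10004" are both broken into uniform single-digit tokens
    # instead of depending on which digit strings happen to be in the vocabulary.
    return re.sub(r"(\d)", r" \1 ", text).split()

print(split_digits("solve 10004 + 10000"))
# ['solve', '1', '0', '0', '0', '4', '+', '1', '0', '0', '0', '0']
```

With this pass in front, BPE never gets the chance to merge digit sequences, so arithmetic-heavy inputs are tokenized consistently regardless of the number's frequency in the training corpus.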