Pride and BPE: How We Solved Tokenization but Got It Wrong

  • Published on Oct 17, 2024

Comments • 9

  • @anaindreias9835
    @anaindreias9835 14 days ago

    What a coincidence! I started working on my MSc thesis at EPFL this week, and was going through your papers on BPE and tokenisation formalisation. This video was suggested on my feed. Only now noticed you were one of the authors! 😮 Thank you so much for your contributions to the field!

  • @mailailuan
    @mailailuan 5 months ago +1

    This was very enlightening, thank you!

  • @argh44z
    @argh44z 5 months ago +1

    useful talk, thanks

  • @GeoffLadwig
    @GeoffLadwig 5 months ago +1

    Great video, thanks. It does seem like it is hard to consider tokenization independent of embedding - at least when using tokenization for transformers.

    • @zouharvi
      @zouharvi  5 months ago

      Indeed, though we didn't have enough paper space or manpower to look into this. I suppose a good tokenization would lead to the most spread-out embeddings of individual subwords? Or maybe some other measure.
      Have you worked on tokenization yourself?

    • @GeoffLadwig
      @GeoffLadwig 5 months ago +1

      @@zouharvi No, just watching and using it. I have been trying to understand how tokenization impacts RAG - which seems tightly linked to embedding.

  • @hunterkudo9832
    @hunterkudo9832 5 months ago +2

    The volume could have been higher.
    Feedback for the next video.

  • @GeoffryGifari
    @GeoffryGifari 5 months ago +1

    Hmmm what if there's no "best" way to tokenize, but there are only case-dependent optimal ones?

    • @zouharvi
      @zouharvi  5 months ago +2

      For MT, which is representative of multilingual NLG tasks (to some extent), Rényi entropy seems to be a good predictor. However, it does not take into account the nuances of the particular task. For example, for LLMs that have to solve math word problems, there needs to be special consideration of how numbers are tokenized. If we just threw it all at BPE as text, we'd get 10000 as a single token, but 10004 as 10 @@00 @@4. Some LMs solve this by enforcing single-digit tokenization of numbers. The bottom line (as you insinuate) is that the goodness of a tokenization indeed diverges between tasks.
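
      [Editor's note: a minimal sketch of the "single-digit tokenization" workaround mentioned in the reply above. This is a hypothetical pre-tokenization step, not the speaker's or any specific LM's implementation: every digit is isolated before BPE merges run, so 10000 and 10004 receive structurally parallel tokenizations.]

      ```python
      import re

      def split_digits(text: str) -> str:
          """Put whitespace around every digit so a downstream BPE tokenizer
          cannot merge digits into multi-digit tokens like '10000'."""
          spaced = re.sub(r"(\d)", r" \1 ", text)
          # Collapse the extra whitespace introduced around adjacent digits.
          return re.sub(r"\s+", " ", spaced).strip()

      print(split_digits("10000 vs 10004"))
      # -> "1 0 0 0 0 vs 1 0 0 0 4"
      ```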