Code LLM from Scratch Part 1 Tokenization
- Published on Jan 3, 2025
- This video session demystifies the world of natural language processing (NLP) and Large Language Models (LLMs)! We'll start by exploring the building blocks of text analysis, tackling tokenization methods like vocabulary building and Byte Pair Encoding (BPE).
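As a rough illustration of the BPE idea mentioned above, here is a minimal sketch of the merge-learning loop. It assumes a whitespace-split toy corpus and character-level starting symbols; the function names (`train_bpe`, `most_frequent_pair`, `merge_pair`) are hypothetical and not taken from the session's notebook. Production tokenizers typically work on bytes and add pre-tokenization rules that are left out here.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a toy corpus (illustrative only)."""
    # Start from characters: each word is a tuple of single-character symbols.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = train_bpe("low low lower lowest", num_merges=2)
print(merges)
print(words)
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls the vocabulary size and, in turn, how many tokens a given text is split into.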
Next, we'll delve into the creation of training data, where we'll break down concepts like context windows, batching, sliding windows, and managing the maximum number of tokens.
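To make the sliding-window idea concrete, the sketch below builds next-token training pairs from a list of token IDs and groups them into batches. It is a simplified illustration under stated assumptions: the helper names (`make_training_pairs`, `make_batches`) and the parameters `context_length`, `stride`, and `batch_size` are hypothetical, and the token IDs are plain integers standing in for a real tokenizer's output.

```python
def make_training_pairs(token_ids, context_length, stride):
    """Slide a window over the tokens; the target is the input shifted by one."""
    inputs, targets = [], []
    # Stop early enough that every target window has context_length tokens.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[start:start + context_length])
        targets.append(token_ids[start + 1:start + context_length + 1])
    return inputs, targets


def make_batches(inputs, targets, batch_size):
    """Group (input, target) windows into batches; the last one may be smaller."""
    batches = []
    for i in range(0, len(inputs), batch_size):
        batches.append((inputs[i:i + batch_size], targets[i:i + batch_size]))
    return batches


token_ids = list(range(10))  # stand-in for real token IDs
inputs, targets = make_training_pairs(token_ids, context_length=4, stride=4)
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

With a stride equal to the context length the windows do not overlap; a smaller stride produces overlapping windows and therefore more training examples from the same text.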
This session lays the foundation for understanding how machines learn from language. Stay tuned for the next session, where we'll uncover the magic of self-attention and embeddings!
After the session, you will be able to answer the following questions:
1. GPT-3.5 usage costs $0.80 per million tokens (input/output). Is that a lot or a little? Which model is cheaper or more expensive? Understanding tokenization can help you decide.
2. Llama 3 has a context length of 8K, versus 120K for ChatGPT. What does that mean?
3. Every LLM supports a parameter called Max Tokens when generating text. What does it mean?
4. How can you optimize your cost while using LLMs? You need to understand tokenization.
5. Indic LLMs (Tamil, Telugu, etc.): how do they really differ from each other? You will realize that it is all about tokenization.
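The arithmetic behind questions 1 to 3 can be sketched in a few lines. This is a back-of-the-envelope illustration, not a pricing tool: the price is the $0.80 per million tokens quoted above, "8K" is read as 8192 tokens (an assumption), the token counts are made up, and the helper names are hypothetical. It also assumes the prompt and the generated tokens share one context window, which is how decoder-only LLMs generally work, though providers differ in exactly how they apply the Max Tokens setting.

```python
def estimate_cost_usd(num_tokens, price_per_million_usd):
    """Cost of processing `num_tokens` at a flat per-million-token price."""
    return num_tokens / 1_000_000 * price_per_million_usd


def fits_in_context(prompt_tokens, max_new_tokens, context_length):
    """Check whether the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_new_tokens <= context_length


# Hypothetical workload: 250,000 tokens at the quoted $0.80 per million.
print(estimate_cost_usd(250_000, 0.80))

# Hypothetical request against an 8K (assumed 8192-token) context window.
print(fits_in_context(prompt_tokens=7_000, max_new_tokens=1_000, context_length=8_192))
print(fits_in_context(prompt_tokens=7_500, max_new_tokens=1_000, context_length=8_192))
```

Because billing and context limits are both counted in tokens, a tokenizer that splits the same text into fewer tokens makes that text cheaper to process and lets more of it fit in the window.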
Reference Notebook:
www.kaggle.com...
By Detoxio AI (detoxio.ai)
Follow us on:
Substack - detoxioai.subs...
Twitter - / detoxioai
YouTube - / @detoxioai
LinkedIn - / detoxio-ai