Highly recommended for anyone who wants to understand open-source LLMs inside and out.
Haven't watched the full video yet, but thanks for the promising content. Please keep it going.
Would like to see more of the environment set up and the debugging process.
Very excited for this!!! Weekend is going to be fun!
Might you consider creating a Discord guild? I'd love to hang with the people that are watching these videos!
Hi! I am considering it, will let you know with a public post when it's online 🤖🦾
Yep, such great people
Great idea man!!
Thanks
Thank you for your support!
amazing
Why is the context window size limited? Is it because these models are based on transformers, and for a given transformer architecture, long-distance semantic relationship detection is bounded by the number of words in the context?
We need one more video explaining how to download the weights and run inference, because it is not clear.
Hi! To download the LLaMA weights, you need to request access to it by using the following link: ai.meta.com/resources/models-and-libraries/llama-downloads/
Meta will send you an email with the details on how to download the model.
Thanks! I learned a lot from your excellent video.
No comments.... Need to learn many things... Thank you very much for creating such interesting and helpful content...
I am fortunate that I found your channel.
Would love to see lighter-weight LLMs trained on custom datasets. Thanks for the video! This channel is a gold mine.
Shouldn't start_pos be cur_pos - 1 (in the model forward call within the completion function)? If you look at the KV caching, you are never caching keys and values for position 0.
I have a lot of questions to ask:
What is in the checklist.chk file?
What is in the consolidated.00.pth file?
What is in tokenizer.model?
I get an error here:
model = LLaMA.build(
checkpoints_dir='llama-2-7b/',
tokenizer_path='tokenizer.model',
load_model=True,
max_seq_len=1024,
max_batch_size=3,
device=device
)
Please kindly explain and guide me. Thank you!
Marked for my next watch. Thanks for producing high-quality videos for the series. Hope you have fun in China.
Thanks for explaining all of these concepts. Keep up the good work 😎
Great video. Would you do some videos on increasing the context length of models?
With any BERT-based model or decoder model. 🎉
Can somebody help explain why, when calculating theta, we are not including the -2? E.g., theta = theta ** (-2 * theta_numerator / head_dim)
Very good video. You have a knack for conveying complex content in an understandable format. Thank you, and keep up the great work!
Does it answer our questions just like ChatGPT does??? Anyone, please answer me.
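For anyone else wondering: in the standard LLaMA-style implementation the -2 is still there, just split up. The exponent 2(i-1) is built directly by arange(0, head_dim, 2), and the minus sign is absorbed by taking a reciprocal. A minimal sketch with toy values (reusing the names from the comment above; not the repo's exact code):

import torch

head_dim = 8
base = 10000.0

# Paper formula: theta_i = base^(-2(i-1)/d) for i = 1..d/2
i = torch.arange(1, head_dim // 2 + 1, dtype=torch.float32)
theta_paper = base ** (-2.0 * (i - 1) / head_dim)

# Common code formula: arange(0, head_dim, 2) already yields 2(i-1),
# and the reciprocal supplies the minus sign in the exponent.
theta_numerator = torch.arange(0, head_dim, 2, dtype=torch.float32)
theta_code = 1.0 / (base ** (theta_numerator / head_dim))

print(torch.allclose(theta_paper, theta_code))  # True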
Thank you so much for sharing this, it was really well done!
If I want to use this code for training, what needs to be changed?
You are a hidden gem: great explanations of both theoretical and technical concepts.
This is the way!
Great video, very educational.
Thank you very much for your efforts
Great video! One question though: in th-cam.com/video/oM4VmoabDAI/w-d-xo.htmlsi=TBFoV5Kj0lnbNaee&t=4272, why do we have to have the "gamma" of all ones? I compared the code with and without self.weight, and the outputs are the same.
Oh, forgive me, dumb question. For anyone else who's wondering about it: self.weight is learnable.
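For reference, a minimal RMSNorm sketch (my assumption of the usual LLaMA-style implementation, not necessarily line-for-line the video's code). Since gamma starts as all ones, a freshly initialized layer scales by 1, which is why the outputs match before training:

import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # gamma: initialized to ones (a no-op scale), but registered as a
        # Parameter, so gradient descent will update it during training.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)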
Thank you
Thanks! Thank you!
Do you have a Discord channel?
Wow. Now I get this trick.
Great content as usual! Thanks
Thank you for such a detailed analysis of the architecture and implementation features of the model! You are very good at presenting information!
A suggestion for all your videos is to increase the font size or the zoom level. They are kind of unreadable.
Thanks for your feedback! I'll keep that in mind 🤗
Great content!
Oh boy, this is an amazing video!
Please do Mistral!
This is hardcore machine learning engineering!
Great content!
Great video @Umar.
I think on line 47, the transformation goes from (B, Seq_Len, H, Head_Dim) -> (B, Seq_Len, H, Head_Dim/2, 2).
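For anyone visualizing that step, a tiny sketch with toy shapes (hypothetical values, not the repo's exact code): consecutive pairs of features are grouped so each pair can be viewed as one complex number for the rotary embedding.

import torch

B, seq_len, H, head_dim = 1, 3, 2, 8
x = torch.randn(B, seq_len, H, head_dim)

# (B, Seq_Len, H, Head_Dim) -> (B, Seq_Len, H, Head_Dim/2, 2)
x_pairs = x.reshape(B, seq_len, H, head_dim // 2, 2)
# Each trailing pair (real, imag) becomes one complex entry:
x_complex = torch.view_as_complex(x_pairs)

print(x_pairs.shape)    # torch.Size([1, 3, 2, 4, 2])
print(x_complex.shape)  # torch.Size([1, 3, 2, 4])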
As always the PDF slides and the source code are available on GitHub: github.com/hkproj/pytorch-llama/
Prerequisites:
1) Transformer explained: th-cam.com/video/bCz4OMemCcA/w-d-xo.html
2) LLaMA explained: th-cam.com/video/Mn_9W1nCFLo/w-d-xo.html
Let's say an LLM application has a context window of 4,000 words. It also supports historical chats. So a user can effectively send more than the allowed number of words over the course of a conversation, and yet get answers related to the previous history? How does this work?
watch again
Thanks for your lecture, and I have a question. What happens if start_pos grows beyond the cache size (max_seq_len)? If this code does not handle such a situation, what kind of additional modification do we need?
Incredible explanation!
Great video ❤
I tried loading the model on an M1 Mac with 8 GB of RAM, but it seems it requires more memory (I am guessing 28 GB).
Thank you so much for sharing!
Awesome work, boss!
Hi, I want to fine-tune the model. In that case, will it be required to get rid of the KV caching?
What are the system requirements to run inference for this model? By the way, it's a great video.
Can I use the open-source LLaMA 2 model for a lifetime, or can I code along with you and use the model?
EXCELLENT! I would like to see the same series with LLaVA.
Anyone know how to execute the code on a CUDA 4090 GPU? I faced an out-of-memory error.
Great video
So what about the dataset used in this video?
Thank you very much, Umar, for the efforts here. One question: will there be any PPO and fine-tuning on top of this in the next videos?
Thanks!
Thank you!
Wouldn't it be 'cur_pos - 1' for the start_pos argument (line 81 in inference.py, 2:45:58)?
Agreed.
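To make the indexing concrete, here is a hypothetical sketch of the decode loop being discussed, with a dummy stand-in for the model (toy values, not the repo's exact code). Feeding the token at index cur_pos - 1 with start_pos = cur_pos - 1 means the keys/values for position 0 get cached as well:

import torch

vocab_size, total_len = 32, 6
tokens = torch.zeros(1, total_len, dtype=torch.long)

def model_forward(token, start_pos):
    # A real model would cache this token's K/V at start_pos and attend
    # over cached positions 0..start_pos; here we just return fake logits.
    return torch.randn(1, 1, vocab_size)

for cur_pos in range(1, total_len):
    logits = model_forward(tokens[:, cur_pos - 1 : cur_pos], cur_pos - 1)
    next_token = torch.argmax(logits[:, -1], dim=-1)
    tokens[:, cur_pos] = next_token

print(tokens)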
Thanks!
Thank you Diego for your support!
Please can we get the training code too?
Thank you!
Amazing work Umar.
🎉🎉
Umar bhai, your tutorials on transformer architectures and open-source LLMs are truly remarkable. As a Pakistani, seeing your expertise in deep learning is incredibly inspiring. Have you ever considered creating Urdu versions of your content? It could make your valuable knowledge more accessible to a wider audience. Your contributions are invaluable to the global tech community. Keep up the fantastic work! Huge fan of your work. May ALLAH bless you with health and success!
He's Italian, I doubt he knows Urdu.
@@azain47 Oh, I thought he was Pakistani, but never mind. It's really good to see a Muslim working and sharing his knowledge and expertise with the rest of the world; generally we don't see Muslim people making great content regarding CS.
@@sharjeel_mazhar be the change you wish to see, brother.
55:44 "I could have also written the code and not tell you and not tell you anything but I like to give proof to what i do " Wow thank you for going that extra mile we really appreciate it.
Where do you apply the causal mask?
And the sliding window attention? Thank you.
The causal mask is not needed since the KV cache is used.
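A small sketch of why (toy shapes assumed): with a KV cache, each decode step has exactly one query token, so the score matrix is (B, H, 1, start_pos + 1). The lone query only sees keys at positions 0..start_pos, which are all in its past, so a causal mask would change nothing. The mask only matters when several query positions are processed in parallel, e.g. during prompt prefill.

import math
import torch

B, H, head_dim, start_pos = 1, 2, 4, 5
q = torch.randn(B, H, 1, head_dim)               # the single new token's query
k = torch.randn(B, H, start_pos + 1, head_dim)   # cached keys for positions 0..start_pos

# Shape (B, H, 1, start_pos + 1): nothing here is "in the future".
scores = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(head_dim), dim=-1)
print(scores.shape)  # torch.Size([1, 2, 1, 6])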
Is LLaMA 2 an encoder-only or decoder-only model?
People call it "decoder-only" because it resembles the decoder of the Transformer, but it lacks the cross-attention. Technically, it's the encoder of the Transformer plus a linear layer and softmax. But commonly, people call LLaMA a "decoder-only" model and BERT an "encoder-only" model.
@@umarjamilai Thanks a lot for your prompt reply. And an amazing video!
It's an honor to be among the 23,500 viewers who watched this video. Thank you so much, Umar Jamil, for your content!
Thank you for the wonderful lecture. I'm wondering why you use torch.matmul / transpose in the video, but torch.einsum in the slides. They are mathematically equivalent, but what about their efficiency? Which one runs faster?
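For what it's worth, a quick numerical check with toy shapes (assumed, not the slides' exact dimensions): the two formulations give identical results, and einsum expressions like this typically lower to the same batched matmul kernels, so the difference is mostly readability rather than speed.

import torch

B, H, S, D = 2, 4, 8, 16
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)

# matmul/transpose form vs. einsum form of the attention scores:
scores_matmul = torch.matmul(q, k.transpose(-2, -1))
scores_einsum = torch.einsum("bhsd,bhtd->bhst", q, k)

print(torch.allclose(scores_matmul, scores_einsum, atol=1e-5))  # True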