Highly recommended for anyone who wants to understand open-source LLMs inside and out.
Haven't watched the full video yet, but thanks for the promising content. Please keep it going.
Would like to see more of the environment set up and the debugging process.
Very excited for this!!! Weekend is going to be fun!
Might you consider creating a Discord guild? I'd love to hang with the people that are watching these videos!
Hi! I am considering it, will let you know with a public post when it's online 🤖🦾
Yep, such great people
Great idea man!!
Thanks
Thank you for your support!
amazing
Why is the context window size limited? Is it because these models are based on transformers, and for a given transformer architecture, long-distance semantic relationship detection is bounded by the number of words in the context?
We need one more video explaining how to download the weights and run inference, because it is not clear.
Hi! To download the LLaMA weights, you need to request access to it by using the following link: ai.meta.com/resources/models-and-libraries/llama-downloads/
Meta will send you an email with the details on how to download the model.
Thanks! I learned a lot from your excellent video.
No comments.... Need to learn many things... Thank you very much for creating such interesting and helpful content...
I am fortunate that I found your channel.
Would love to see lighter-weight LLMs trained on custom datasets. Thanks for the video! This channel is a gold mine.
Shouldn't start_pos be cur_pos - 1 (in the model forward call within the completion function)? If you look at the KV caching, you are never caching keys and values for position 0.
I have a lot of questions to ask:
What is in the checklist.chk file?
What is in the consolidated.00.pth file?
What is in tokenizer.model?
I get an error here:
model = LLaMA.build(
checkpoints_dir='llama-2-7b/',
tokenizer_path='tokenizer.model',
load_model=True,
max_seq_len=1024,
max_batch_size=3,
device=device
)
Please kindly explain and guide me. Thank you!
Marked for my next watch. Thanks for producing high-quality videos for the series. Hope you have fun in China.
Thanks for explaining all of these concepts. Keep up the good work 😎
Great video. Would you do some videos on increasing the context length of models?
With any BERT-based model or decoder model. 🎉
Can somebody help explain why, when calculating theta, we are not including the -2? E.g., theta = theta ** (-2 * theta_numerator / head_dim)
Very good video. You have a knack for conveying complex content in an understandable format. Thank you, and keep up the great work!
Does it answer our questions just like ChatGPT does??? Anyone, please answer me.
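For anyone else wondering: in the standard LLaMA-style implementation the -2 is still there, just split up. The exponent 2(i-1) is built directly by arange(0, head_dim, 2), and the minus sign is absorbed by taking a reciprocal. A minimal sketch with toy values (reusing the names from the comment above; not the repo's exact code):

import torch

head_dim = 8
base = 10000.0

# Paper formula: theta_i = base^(-2(i-1)/d) for i = 1..d/2
i = torch.arange(1, head_dim // 2 + 1, dtype=torch.float32)
theta_paper = base ** (-2.0 * (i - 1) / head_dim)

# Common code formula: arange(0, head_dim, 2) already yields 2(i-1),
# and the reciprocal supplies the minus sign in the exponent.
theta_numerator = torch.arange(0, head_dim, 2, dtype=torch.float32)
theta_code = 1.0 / (base ** (theta_numerator / head_dim))

print(torch.allclose(theta_paper, theta_code))  # True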
Thank you so much for sharing this, it was really well done!
If I want to use this code for training, what needs to be changed?
You are a hidden gem: great explanations of both theoretical and technical concepts.
This is the way!
Great video, very educational.
Thank you very much for your efforts
Great video! One question though: in th-cam.com/video/oM4VmoabDAI/w-d-xo.htmlsi=TBFoV5Kj0lnbNaee&t=4272, why do we have to have the "gamma" of all ones? I compared the code with and without self.weight, and the outputs are the same.
Oh, forgive me, dumb question. For anyone else who's wondering about it: self.weight is learnable.
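For reference, a minimal RMSNorm sketch (my assumption of the usual LLaMA-style implementation, not necessarily line-for-line the video's code). Since gamma starts as all ones, a freshly initialized layer scales by 1, which is why the outputs match before training:

import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # gamma: initialized to ones (a no-op scale), but registered as a
        # Parameter, so gradient descent will update it during training.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)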
Thank you
Thanks! Thank you!
Do you have a Discord channel?
Wow. Now I get this trick.
Great content as usual! Thanks
Thank you for such a detailed analysis of the architecture and implementation features of the model! You are very good at presenting information!
A suggestion for all your videos is to increase the font size or the zoom level. They are kind of unreadable.
Thanks for your feedback! I'll keep that in mind 🤗
Great content!
Oh boy, this is an amazing video!
Please do Mistral!
This is hardcore machine learning engineering!
Great content!
Great video @Umar.
I think on line 47, the transformation goes from (B, Seq_Len, H, Head_Dim) -> (B, Seq_Len, H, Head_Dim/2, 2).
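For anyone visualizing that step, a tiny sketch with toy shapes (hypothetical values, not the repo's exact code): consecutive pairs of features are grouped so each pair can be viewed as one complex number for the rotary embedding.

import torch

B, seq_len, H, head_dim = 1, 3, 2, 8
x = torch.randn(B, seq_len, H, head_dim)

# (B, Seq_Len, H, Head_Dim) -> (B, Seq_Len, H, Head_Dim/2, 2)
x_pairs = x.reshape(B, seq_len, H, head_dim // 2, 2)
# Each trailing pair (real, imag) becomes one complex entry:
x_complex = torch.view_as_complex(x_pairs)

print(x_pairs.shape)    # torch.Size([1, 3, 2, 4, 2])
print(x_complex.shape)  # torch.Size([1, 3, 2, 4])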
As always the PDF slides and the source code are available on GitHub: github.com/hkproj/pytorch-llama/
Prerequisites:
1) Transformer explained: th-cam.com/video/bCz4OMemCcA/w-d-xo.html
2) LLaMA explained: th-cam.com/video/Mn_9W1nCFLo/w-d-xo.html
Let's say an LLM application has a context window of 4,000 words. It also supports historical chats. So a user can effectively send more than the allowed number of words over the course of a conversation, and yet get answers related to the previous history? How does this work?
watch again
Thanks for your lecture, and I have a question. What happens if start_pos grows beyond the cache size (max_seq_len)? If this code does not handle such a situation, what kind of additional modification do we need?
Incredible explanation!
Great video ❤
I tried loading the model on an M1 Mac with 8 GB of RAM, but it seems it requires more memory (I am guessing 28 GB).
Thank you so much for sharing!
Awesome work, boss!
Hi, I want to fine-tune the model. In that case, will it be required to get rid of the KV caching?
What are the system requirements to run inference for this model? By the way, it's a great video.
Can I use the open-source LLaMA 2 model for a lifetime, or can I code along with you and use the model?
EXCELLENT! I would like to see the same series with LLaVA.
Anyone know how to execute the code on a CUDA 4090 GPU? I faced an out-of-memory error.
Great video
So what about the dataset used in this video?
Thank you very much, Umar, for the efforts here. One question: will there be any PPO and fine-tuning on top of this in the next videos?
Thanks!
Thank you!
Wouldn't it be 'cur_pos - 1' for the start_pos argument (line 81 in inference.py, 2:45:58)?
Agreed.
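To make the indexing concrete, here is a hypothetical sketch of the decode loop being discussed, with a dummy stand-in for the model (toy values, not the repo's exact code). Feeding the token at index cur_pos - 1 with start_pos = cur_pos - 1 means the keys/values for position 0 get cached as well:

import torch

vocab_size, total_len = 32, 6
tokens = torch.zeros(1, total_len, dtype=torch.long)

def model_forward(token, start_pos):
    # A real model would cache this token's K/V at start_pos and attend
    # over cached positions 0..start_pos; here we just return fake logits.
    return torch.randn(1, 1, vocab_size)

for cur_pos in range(1, total_len):
    logits = model_forward(tokens[:, cur_pos - 1 : cur_pos], cur_pos - 1)
    next_token = torch.argmax(logits[:, -1], dim=-1)
    tokens[:, cur_pos] = next_token

print(tokens)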
Thanks!
Thank you Diego for your support!
Please can we get the training code too?
Thank you!
Amazing work Umar.
🎉🎉
Umar bhai, your tutorials on transformer architectures and open-source LLMs are truly remarkable. As a Pakistani, seeing your expertise in deep learning is incredibly inspiring. Have you ever considered creating Urdu versions of your content? It could make your valuable knowledge more accessible to a wider audience. Your contributions are invaluable to the global tech community. Keep up the fantastic work! Huge fan of your work. May ALLAH bless you with health and success!
He's Italian, I doubt he knows Urdu.
@@azain47 Oh, I thought he was Pakistani, but never mind. It's really good to see a Muslim working and sharing his knowledge and expertise with the rest of the world; generally we don't see Muslim people making great content regarding CS.
@@sharjeel_mazhar be the change you wish to see, brother.
55:44 "I could have also written the code and not tell you and not tell you anything but I like to give proof to what i do " Wow thank you for going that extra mile we really appreciate it.
Where do you apply the causal mask?
And the sliding window attention? Thank you.
The causal mask is not needed since the KV cache is used.
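A small sketch of why (toy shapes assumed): with a KV cache, each decode step has exactly one query token, so the score matrix is (B, H, 1, start_pos + 1). The lone query only sees keys at positions 0..start_pos, which are all in its past, so a causal mask would change nothing. The mask only matters when several query positions are processed in parallel, e.g. during prompt prefill.

import math
import torch

B, H, head_dim, start_pos = 1, 2, 4, 5
q = torch.randn(B, H, 1, head_dim)               # the single new token's query
k = torch.randn(B, H, start_pos + 1, head_dim)   # cached keys for positions 0..start_pos

# Shape (B, H, 1, start_pos + 1): nothing here is "in the future".
scores = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(head_dim), dim=-1)
print(scores.shape)  # torch.Size([1, 2, 1, 6])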
Is LLaMA 2 an encoder-only or decoder-only model?
People call it "decoder-only" because it resembles the decoder of the Transformer, but it lacks the cross-attention. Technically, it's the encoder of the Transformer plus a linear layer and softmax. But commonly, people call LLaMA a "decoder-only" model and BERT an "encoder-only" model.
@@umarjamilai Thanks a lot for your prompt reply. And an amazing video!
It's an honor to be among the 23,500 viewers who watched this video. Thank you so much, Umar Jamil, for your content!
Thank you for the wonderful lecture. I'm wondering why you use torch.matmul / transpose in the video, but torch.einsum in the slides. They are mathematically equivalent, but what about their efficiency? Which one runs faster?
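For what it's worth, a quick numerical check with toy shapes (assumed, not the slides' exact dimensions): the two formulations give identical results, and einsum expressions like this typically lower to the same batched matmul kernels, so the difference is mostly readability rather than speed.

import torch

B, H, S, D = 2, 4, 8, 16
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)

# matmul/transpose form vs. einsum form of the attention scores:
scores_matmul = torch.matmul(q, k.transpose(-2, -1))
scores_einsum = torch.einsum("bhsd,bhtd->bhst", q, k)

print(torch.allclose(scores_matmul, scores_einsum, atol=1e-5))  # True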