Highly recommended for anyone who wants to understand open-source LLMs inside and out.
No comments... I need to learn many things. Thank you very much for creating such interesting and helpful content...
I am fortunate that I found your channel.
Would love to see lighter-weight LLMs trained on custom datasets. Thanks for the video! This channel is a gold mine.
Haven't watched the full video yet, but thanks for the promising content. Please keep it going.
Would like to see more of the environment setup and the debugging process.
Very good video. You have a knack for conveying complex content in an understandable format. Thank you, and keep up the great work.
55:44 "I could have also written the code and not tell you and not tell you anything but I like to give proof to what i do " Wow thank you for going that extra mile we really appreciate it.
You are a hidden gem; great explanation of both the theoretical and technical concepts.
As always, the PDF slides and the source code are available on GitHub: github.com/hkproj/pytorch-llama/
Prerequisites:
1) Transformer explained: th-cam.com/video/bCz4OMemCcA/w-d-xo.html
2) LLaMA explained: th-cam.com/video/Mn_9W1nCFLo/w-d-xo.html
Thank you for such a detailed analysis of the architecture and implementation features of the model! You are very good at presenting information!
Very excited for this!!! Weekend is going to be fun!
Might you consider creating a Discord guild? I'd love to hang with the people that are watching these videos!
Hi! I am considering it, will let you know with a public post when it's online 🤖🦾
Yep, such great people
Great idea man!!
Marked for my next watch. Thanks for producing high quality video for the series. Hope you have fun in China.
Thanks for explaining all of these concepts. Keep up the good work 😎
Umar bhai, your tutorials on transformer architectures and open-source LLMs are truly remarkable. As a Pakistani, seeing your expertise in deep learning is incredibly inspiring. Have you ever considered creating Urdu versions of your content? It could make your valuable knowledge more accessible to a wider audience. Your contributions are invaluable to the global tech community. Keep up the fantastic work! Huge fan of your work. May ALLAH bless you with health and success!
He's Italian; I doubt he knows Urdu.
@@azain47 Oh, I thought he was Pakistani, but never mind. It's really good to see a Muslim working and sharing his knowledge and expertise with the rest of the world; we generally don't see many Muslims making great content related to CS.
@@sharjeel_mazhar be the change you wish to see, brother.
It's an honor to be among the 23,500 viewers who watched this video. Thank you so much, Umar Jamil, for your content.
EXCELLENT! I would like to see the same series with LLaVA.
This video is a gold mine!!!
this is hardcore machine learning engineering!
Thank you so much for sharing this, it was really well done!
Thank you for the informative video. I have a question about 34:19: where is the -2 in the theta calculation?
Thanks! I learned a lot from your excellent video.
Great video. Would you do some videos on increasing the context length of models?
With any BERT-based model or decoder model. 🎉
Thanks for your lecture, and I have a question. What happens if start_pos becomes larger than the cache size? If this code does not handle that situation, what kind of additional modification would we need?
Thank you Umar very much for the efforts here. One question: will there be any PPO and fine-tuning on top of this in the next videos?
At 1:13:21 I don't understand why `head_dimension = dimension (i.e. 4096) // n_heads (32)`. Isn't the whole dimension supposed to be passed through 32 different attention blocks, which are then combined to form the multi-head attention?
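A minimal sketch (illustrative, not the video's exact code) of why head_dim = dim // n_heads: the 4096-dimensional embedding is not fed whole to each of the 32 heads; it is split so that every head works on its own 128-dimensional slice, and the per-head outputs are concatenated back to 4096 before the output projection.

import torch

dim, n_heads = 4096, 32
head_dim = dim // n_heads                # 128: each head sees only a slice of the embedding

x = torch.randn(1, 10, dim)              # (batch, seq_len, dim)
# split the last dimension into 32 heads of 128 values each
heads = x.view(1, 10, n_heads, head_dim).transpose(1, 2)    # (batch, n_heads, seq_len, head_dim)
# ... attention is computed independently per head ...
out = heads.transpose(1, 2).reshape(1, 10, dim)             # concatenated back to (batch, seq_len, 4096)
print(out.shape)                         # torch.Size([1, 10, 4096])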
Great video
So what about the dataset used in this video?
If I want to use this code for training, what needs to be changed?
Why is the context window size limited? Is it because these models are based on Transformers, and for a given Transformer architecture the detection of long-distance semantic relationships is bounded by the number of words / the context length?
What are the system requirements to run inference with this model? By the way, it's a great video.
Great content as usual! Thanks
Hi, I want to fine-tune the model. In that case, will it be necessary to get rid of the KV caching?
Can I use the open-source LLaMA 2 model for a lifetime, or can I code along with you and use the model?
Let's say an LLM application has a context window of 4000 words. It also supports chat history, so the user can effectively send more than the allowed number of words over a conversation and still get answers related to the earlier history. How does this work?
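A rough sketch of one common strategy, assuming the application simply truncates old turns (the function name and the character-based counter below are hypothetical stand-ins): only the most recent messages that still fit the window, plus the new prompt, are actually sent to the model, so the model itself never sees more than its context length allows.

def fit_history(messages, new_prompt, max_tokens=4000, count_tokens=len):
    # count_tokens=len counts characters here; a real app would use the tokenizer's length
    budget = max_tokens - count_tokens(new_prompt)
    kept = []
    for msg in reversed(messages):       # walk from most recent to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break                        # older turns are dropped (or summarized elsewhere)
        kept.append(msg)
        budget -= cost
    return list(reversed(kept)) + [new_prompt]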
Incredible explanation!
A suggestion for all your videos is to increase the font size or the zoom level. They are kind of unreadable.
Thanks for your feedback! I'll keep that in mind 🤗
Thank you very much for your efforts
Great video, very educational.
Shouldn't start_pos be current_pos - 1 (in the model forward call within the completion function)? If you look at the KV caching, you are never caching the keys and values for position 0.
Can somebody help explain why, when calculating theta, we are not including the -2, e.g. theta = theta ** (-2 * theta_numerator / head_dim)?
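On the missing -2: a small sketch of how the paper's formula and the usual code line up. The paper defines theta_i = 10000^(-2(i-1)/d); in code, arange(0, head_dim, 2) already supplies the factor of 2, and taking the reciprocal (a division) supplies the minus sign, so no explicit -2 appears.

import torch

head_dim = 128
# exponents 0, 2, 4, ..., head_dim - 2, i.e. the "2(i-1)" of the paper
theta_numerator = torch.arange(0, head_dim, 2).float()
# paper form: 10000 ** (-2(i-1)/d)
theta_paper = 10000.0 ** (-theta_numerator / head_dim)
# common code form: the reciprocal hides the minus sign
theta_code = 1.0 / (10000.0 ** (theta_numerator / head_dim))
print(torch.allclose(theta_paper, theta_code))   # True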
Thank you for the wonderful lecture. I am wondering why you use torch.matmul / transpose in the video but torch.einsum in the slides. They are mathematically equivalent, but what about their efficiency: which one runs faster?
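For what it's worth, a tiny illustrative comparison (not a benchmark): both forms compute the same attention scores, and torch.einsum generally lowers to the same batched matrix multiply, so any speed difference is usually negligible aside from a small dispatch overhead.

import torch

q = torch.randn(2, 8, 16, 64)    # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 8, 16, 64)

scores_matmul = torch.matmul(q, k.transpose(-2, -1))             # (2, 8, 16, 16)
scores_einsum = torch.einsum('bhqd,bhkd->bhqk', q, k)            # same result

print(torch.allclose(scores_matmul, scores_einsum, atol=1e-5))   # True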
Amazing work Umar.
This is the way!
Please can we get the training code too?
I have a lot of questions to ask:
What is in the checklist.chk file?
What is in the consolidated.00.pth file?
What is in tokenizer.model?
I got an error here:
model = LLaMA.build(
    checkpoints_dir='llama-2-7b/',
    tokenizer_path='tokenizer.model',
    load_model=True,
    max_seq_len=1024,
    max_batch_size=3,
    device=device
)
Please kindly explain and guide me, thank you.
Do you have a Discord channel?
Thank you so much for sharing!
Wow. Now I get this trick.
Wouldn't it be 'cur_pos - 1' for the start_pos argument (line 81 in inference.py, 2:45:58)?
Agreed.
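A simplified illustration of why the value passed as start_pos matters (this is only the indexing pattern, not the video's exact code): the cache is written at cache[:, start_pos : start_pos + seq_len], so if the very first forward call used start_pos = 1 instead of 0, row 0 of the cache would never be filled.

import torch

max_seq_len, head_dim = 8, 4
cache_k = torch.zeros(1, max_seq_len, head_dim)   # (batch, seq_len, head_dim); heads omitted for brevity

def write_cache(keys, start_pos):
    seq_len = keys.shape[1]
    cache_k[:, start_pos:start_pos + seq_len] = keys

# first token: if start_pos were 1 here instead of 0, row 0 would stay all zeros
write_cache(torch.ones(1, 1, head_dim), start_pos=0)
print(cache_k[0, 0])   # tensor([1., 1., 1., 1.]) -> position 0 is cached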
I tried loading the model on an M1 Mac with 8 GB of RAM, but it seems it requires more memory (I am guessing around 28 GB of RAM).
Does it answer our questions just like ChatGPT does? Could anyone please answer me?
Thanks! Thank you!
Does anyone know how to run the code on a CUDA 4090 GPU? I ran into an out-of-memory error.
Where do you apply the causal mask?
And the sliding-window attention? Thank you.
The causal mask is not needed since the KV cache is used: at inference time only one new token is processed per step, and it can only attend to positions already in the cache, so there is no future position to mask out.
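A small sketch of that reasoning, assuming one token is processed per step as in the video: the single query row attends only to keys already in the cache (past positions plus the current one), so there is nothing in the future to mask. A causal mask is still needed whenever several new tokens are fed in one pass, e.g. a multi-token prompt.

import torch
import torch.nn.functional as F

head_dim = 64
q = torch.randn(1, 1, head_dim)           # one new token -> a single query row
k_cached = torch.randn(1, 6, head_dim)    # 5 past positions + the current one (already appended to the cache)
v_cached = torch.randn(1, 6, head_dim)

scores = q @ k_cached.transpose(-2, -1) / head_dim ** 0.5   # (1, 1, 6): only present/past positions exist
out = F.softmax(scores, dim=-1) @ v_cached                  # no causal mask required
print(out.shape)                          # torch.Size([1, 1, 64])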
Oh boy, this is an amazing video!
awesome work boss
Is LLaMA 2 an encoder-only or a decoder-only model?
People call it "Decoder-only" because it resembles the Decoder of the Transformer, but it lacks the Cross Attention. Technically it's the Encoder of the Transformer plus a Linear Layer and Softmax. But commonly, people call LLaMA a "Decoder-only" model and BERT an "Encoder-only" model.
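To make the distinction concrete, here is a toy sketch of a decoder-only-style block (simplified: LayerNorm instead of RMSNorm, PyTorch's stock attention instead of LLaMA's): it contains only self-attention plus a feed-forward network, with no cross-attention over an encoder's output, which is what a full Transformer decoder block would add.

import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    """Self-attention + FFN only; an encoder-decoder Transformer's decoder block
    would also contain a cross-attention layer over the encoder outputs."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention only, no cross-attention
        return x + self.ffn(self.norm2(x))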
@@umarjamilai Thanks a lot for your prompt reply. And amazing video
Great Content
great content !
Thanks
Thank you for your support!
We need one more video explaining how to download the weights and run inference, because it is not clear.
Hi! To download the LLaMA weights, you need to request access to them using the following link: ai.meta.com/resources/models-and-libraries/llama-downloads/
Meta will send you an email with the details on how to download the model.
great video❤
amazing
Please do Mistral.
Thanks!
Thank you Diego for your support!
watch again
thank you
🎉🎉
Great video! One question though: in th-cam.com/video/oM4VmoabDAI/w-d-xo.htmlsi=TBFoV5Kj0lnbNaee&t=4272, why do we have the "gamma" of all ones? I compared the code with and without self.weight, and the outputs are the same.
Oh, forgive the dumb question. For anyone else wondering about it: self.weight is learnable, so it only equals all ones at initialization; training updates it.
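For anyone comparing with and without self.weight at initialization, a minimal RMSNorm sketch in the common formulation (not necessarily the video's exact code): gamma starts as all ones, which is why the outputs match right after initialization, but it is a trainable parameter and moves away from ones during training.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # gamma: learnable, initialized to ones

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)                # equals plain RMS normalization only while weight == 1

norm = RMSNorm(16)
print(norm.weight.requires_grad)   # True -> the optimizer will update it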
Hey @umarjamilai, is there any way to deploy this LLM?
Thank you
Thanks!
Thank you!