One mistake I made in the video was when writing:
forward_expansion = 4
I thought this would be the factor by which the feed-forward layer expands the hidden size (similar to how the original paper defines it), but it's actually the total number of nodes. That means the linear layer maps to only 4 nodes in total, which obviously isn't right. I think the skip connections save us here, so we still get decent performance, but remember to change this!
To how many?
@@tenthings9351 The default value is good to go; he's already using the defaults for most of the values. I think it's 2048.
This is a great video, definitely the best tutorial on Transformers I've seen. Instantly subscribed and turned on notifications!
I really appreciate you saying that. For that you will surely be blessed with healthy gradients! 😁
@@AladdinPersson hahaha let us hope so! Keep up the good videos!
I felt exactly the same and did exactly the same.
I feel really thankful for your tutorial; it gave me a base to upgrade my tech stack! I confidently recommend this tutorial to people around me who need to learn a deep learning framework, whether TensorFlow or PyTorch. Anyway, I appreciate your good explanation!
I really appreciate you saying that, and it's what makes me want to make more of these videos 🙏
Thank you for the awesome walkthrough! The most up-to-date tutorial I have ever seen so far.
Appreciate the kind words, it means a lot! :)
@@AladdinPersson I have a question about the positional embeddings that you used in your model. In Attention Is All You Need and a bunch of other tutorials they define positional encoding/embedding (is there a difference?) as a sin/cos function. Here, you use PyTorch's nn.Embedding; could you please clarify why it can be used instead of sin/cos positional embeddings, or recommend a link where it's explained?
Thank you!
@@Nastenka1029 That's a good point! There was a similar question on the Transformer from scratch video, so I'll reuse a bit from my response there. What you're referring to is using position embeddings vs position encodings (the latter is what they used in the paper; there is a difference), and you can read a bit more about them here: www.peterbloem.nl/blog/transformers.
I didn't look into positional embeddings vs positional encodings in too much detail, so I'm no expert on this part, but from my understanding it doesn't matter too much which one we choose. The position encoding works for arbitrary-length sequences, whereas with these embeddings we need to set a max_length.
The reason I chose these embeddings rather than doing it the way they did in the original paper is that it simplified the implementation quite a bit and didn't seem to cost us very much. This was one simplification I made that's different from the original transformer paper, which I perhaps incorrectly didn't state clearly in the video.
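To make the difference concrete, here is a minimal sketch of both options (sizes and names are illustrative, not taken from the video):

import math
import torch
import torch.nn as nn

embed_size, max_len = 512, 100

# Learned positional embeddings (what the video uses): one trainable vector per
# position index, which is why a max_len has to be fixed up front.
pos_embedding = nn.Embedding(max_len, embed_size)
positions = torch.arange(0, 10).unsqueeze(1)        # (seq_len, 1) for a length-10 sequence
learned_pos = pos_embedding(positions)              # (10, 1, embed_size)

# Fixed sinusoidal encodings (what the paper uses): no parameters, any length works.
def sinusoidal_encoding(seq_len, embed_size):
    pe = torch.zeros(seq_len, embed_size)
    pos = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                        # (seq_len, embed_size)

fixed_pos = sinusoidal_encoding(10, embed_size)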
@@AladdinPersson Thank you for your answer, it's really good to know!
And thanks for the link, I'll check it out :)
Hi Aladdin! I watched your transformer from scratch video, read the blogpost you left in the description, then watched this video and implemented MT using the transformer we built before. This was quite a journey and now I feel much more confident in this topic, thank you very much!
Some feedback: I would have preferred if you had included utils.translate_sentence() in the video, since there are a lot of things you could demonstrate, such as encoder-side caching instead of sending sentence_tensor to the Encoder each time.
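For anyone looking for it, here is a rough sketch of what a greedy translate_sentence() can look like (argument names are illustrative; it assumes torchtext-style vocabs with .stoi/.itos and a model taking (src_len, batch) and (trg_len, batch) tensors like in the video). Note that it re-runs the full model every step; with direct access to the encoder you could encode the source once and only rerun the decoder, which is the caching idea mentioned above:

import torch

def translate_sentence(model, src_tokens, src_vocab, trg_vocab, device, max_length=50):
    # Convert source tokens to indices, shape (src_len, 1) = a batch of one sentence
    src = torch.LongTensor([src_vocab.stoi[t] for t in src_tokens]).unsqueeze(1).to(device)
    outputs = [trg_vocab.stoi["<sos>"]]
    for _ in range(max_length):
        trg = torch.LongTensor(outputs).unsqueeze(1).to(device)
        with torch.no_grad():
            logits = model(src, trg)                 # (trg_len, 1, trg_vocab_size)
        next_token = logits.argmax(2)[-1, 0].item()  # greedy pick for the last position
        outputs.append(next_token)
        if next_token == trg_vocab.stoi["<eos>"]:
            break
    return [trg_vocab.itos[i] for i in outputs]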
This is one of the amazing tutorials on transformers.
I am curious about how utils are made and what the specific functions look like. Can I find clues in your other videos?
Thank you for letting me know about the ignore_index in the nn.CrossEntropyLoss(ignore_index=k)!
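For reference, that is just this one-liner (the pad index value here is illustrative; in the video it's looked up from the target vocab):

import torch.nn as nn

pad_idx = 1  # e.g. the index of "<pad>" in the target vocab
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)  # padded positions contribute nothing to the loss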
Please make videos on BERT, ELECTRA, and other pretrained transformer models.
Hello, thank you for this amazing tutorial. Just one question: could you please specify the versions of torch and torchtext you are using in this tutorial?
For me it was torch 1.13.0 and torchtext 0.6.0. It took me a couple of days to get there too; I tried many versions.
Aladdin, do you happen to have the utils somewhere? I tried looking for them in the video description but I believe they are not there. I also didn't find them in your Git repo.
Hi Aladdin, where can we find the code of this tutorial? I checked the link to the github you mentioned, but wasn't able to find it.
Hello man, thanks for your work, but I have a question: you make src_mask based on <pad>, but you didn't take care of the <pad> of trg when you make trg_mask. Could you tell me why?
please make a video on implementation of Video Summarization With Frame Index Vision Transformer
Hi, I'm confused about one thing. During the training process you evaluate the same sentence in every epoch to see the progress. You call translate_sentence(), which first passes <sos> to the Transformer (to the decoder, I believe) and keeps outputting translated tokens until the output is <eos>. This totally makes sense for getting the translated output. However, I don't get why the single line of code 'output = model(inp_data, target[:-1, :])' can handle the training process. Why, in training mode, don't you need to generate the decoder output token by token until obtaining an <eos>?
Can I use only the encoder part of the PyTorch transformer module? If so, how can I use just the encoder part of the transformer?
Really great explanation in the video! You used the Multi30k dataset. I want to implement the same thing using my own dataset; is that possible, and if so, how?
Can someone explain to me why we use target[:-1, :] during model training? Thanks
Please tell me about the utils in the import section that you skipped over; where can I find it?
Which video has the creation of the Utils module? Looked through a number of videos and the Github, but can't find where translate_sentence is defined.
Thank you so much for your explanation. It would be very nice if you built a translation model without spaCy, because many less popular languages don't have spaCy models. A model that can work with a plain text file dataset would be great.
We could just as well create our own. For example, a very simple one could split each sentence wherever there's a space between the words; that should work for multiple languages, but it wouldn't be a very good tokenizer. I'm not familiar with the different tokenizers that extend to multiple languages; do you have a suggestion for this?
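A quick sketch of that naive whitespace tokenizer plugged into torchtext's Field (the field names just mirror the video; this assumes the old torchtext API):

from torchtext.data import Field  # torchtext <= 0.8 (torchtext.legacy.data in 0.9-0.11)

def simple_tokenize(text):
    # naive whitespace split: no language-specific rules, but it works for many languages
    return text.split()

german = Field(tokenize=simple_tokenize, lower=True, init_token="<sos>", eos_token="<eos>")
english = Field(tokenize=simple_tokenize, lower=True, init_token="<sos>", eos_token="<eos>")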
@@AladdinPersson I am also searching for such tokenizers, since my dataset is tab-separated. When I find one I will let you know; in the meantime, if you find something, please make a video on it. It would be very helpful to me. And as always, thank you for the awesome videos and the quick reply.
Nice explanation and informative content
Thanks a lot!
Thanks! Can you please write down what versions you are using? I am trying to debug issues here.
Pytorch
torchtext
Cuda version
GPU Used
...
and all the relevant libraries
Thank you🙏
Hi, I am facing a problem with (from torchtext.data import Field, BucketIterator); it says it can't find Field and BucketIterator. I checked, and there are no such objects in the package. Same problem with utils. Can you help me?
Same problem here. Field and BucketIterator are deprecated now; you probably need to downgrade torchtext to version 0.8.
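A sketch of how the import can be handled across versions (in torchtext 0.9-0.11 the old classes moved to torchtext.legacy; in 0.12+ they were removed entirely, so pinning an older torchtext with a matching torch is the only option there):

try:
    from torchtext.data import Field, BucketIterator          # torchtext <= 0.8
except ImportError:
    from torchtext.legacy.data import Field, BucketIterator   # torchtext 0.9-0.11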
Mate, I follow each of your tutorials on Seq2Seq (they are sick), but your BLEU scores are really high, almost hitting the SOTA. Have you checked whether the models are overfitting?
Thanks!
Hi, can you explain why you pass target[:-1, :] to the model and use target[1:] to calculate the loss?
He is trying to remove the EOS tokens with target[:-1, :] and the SOS tokens with target[1:].
target[1:] removes the SOS tokens quite accurately.
But target[:-1, :] doesn't seem to work. For example:
[
[1, 1, 1, 1, 1, 1, 1, 1],
[3, 4, 5, 6, 7, 5, 8, 5],
[4, 6, 7, 7, 9, 7, 5, 5],
[7, 2, 4, 6, 2, 3, 4, 2],
[2, 0, 2, 4, 0, 7, 2, 0],
[0, 0, 0, 2, 0, 2, 0, 0]
]
Consider SOS_TOKEN = 1, EOS_TOKEN = 2, PAD_TOKEN = 0.
Removing the last row doesn't remove all the EOS_TOKENs, so it looks broken.
For my personal projects, I did it like:
> tar[tar == EOS_TOKEN] = PAD_TOKEN
> tar = tar[:-1]
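A tiny runnable illustration of what the two slices do (toy tensor of shape (trg_len, batch_size); token ids chosen for illustration):

import torch

# SOS = 1, EOS = 2, PAD = 0
target = torch.tensor([[1, 1],
                       [5, 7],
                       [2, 8],
                       [0, 2]])

decoder_input = target[:-1, :]  # fed to the decoder; note the EOS (2) in column 0 is still there
expected = target[1:]           # compared with the output; the SOS row is gone

print(decoder_input)  # tensor([[1, 1], [5, 7], [2, 8]])
print(expected)       # tensor([[5, 7], [2, 8], [0, 2]])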
Hi!
First, thank you for this amazing tutorial.
I'm trying to adapt this code to perform text summarization; however, the model always predicts the same word even though the loss seems to be decreasing. Any idea how I could solve that?
Thanks in advance
Thanks a lot for the nice video!
When I run the for loop of the epochs it shows the following error:
AttributeError: 'Transformer' object has no attribute 'encoder'
Could you please help me to figure out what is wrong?
I have a question. You write target[:-1], which means that the target fed to the model is shifted to the right by one position, right?
Great video. Could you explain how the masks have to be created? I am confused about the dimensions of the mask.
It's a lower triangular matrix of ones, with the size depending on the sequence length; you can check out the source code: github.com/pytorch/pytorch/blob/9f743015bfbe2d70102a1e22c62ce5263e20ce89/torch/nn/modules/transformer.py#L129-L135
Also, in my transformer from scratch video I implement this function, if I remember correctly.
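For reference, a sketch equivalent to what generate_square_subsequent_mask produces (0.0 means the position may be attended to, -inf means it gets zeroed out by the softmax):

import torch

def generate_square_subsequent_mask(sz):
    mask = torch.tril(torch.ones(sz, sz))              # lower triangular matrix of ones
    mask = mask.masked_fill(mask == 0, float("-inf"))  # future positions -> -inf
    mask = mask.masked_fill(mask == 1, 0.0)            # allowed positions -> 0.0
    return mask

print(generate_square_subsequent_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])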
@@AladdinPersson Thanks .
Hey, can I use your previous transformer-from-scratch code with this training script? Will it work?
Yes, it works a little worse than nn.Transformer, but it's still awesome. You need to do some modification if you want to use the from-scratch one; here is what I did: github.com/dayangai/paper_implementation/blob/main/transformer_translation.ipynb
@@yizhouyang3376 Excellent work, mate 👍👍 I highly appreciate it. Can you please tell me how I can use a SentencePiece tokenizer and train the entire model on TPUs? Those are just the two parts I'm missing. I tried using PyTorch Lightning but wasn't able to structure my dataloader to fit into Lightning's code.
@@yizhouyang3376 Thank you very much!
great video !!!! as usual
How can we do inference with these kinds of networks? I mean the testing part, because the test phase for these networks is totally different from usual networks. I am doing time series forecasting using transformers and I have struggled with the test phase.
Great tutorial! Is there a way to visualize the attention heat map while using the built-in transformer PyTorch layer? We would need the attention weights for it. All the ones I found online use custom encoder/decoder layers to build the transformer and pass the attention weights out.
Thank you :) I'm not sure about that; my interpretation is the same as yours. I've only seen people visualizing attention when they have their own implementation.
Sir, I am getting NaN values from the transformer model output for a regression problem.
Could you please help, sir?
Pycharm doesn't have transformers model?
AttributeError: 'Transformer' object has no attribute 'encoder'
I'm facing this issue in your util function; can you please suggest what I did wrong?
Hi, I have a doubt. When we train a transformer model for machine translation, the network takes a batch of sentences as input (according to the batch size). But after training, when we use the model to translate the test sentences, we translate sentence by sentence, not as a batch. So how is it that during training a batch is given as input and during testing only one sentence? The configuration becomes different, right?
From my limited knowledge and understanding, I think these libraries handle the number of sentences automatically during testing as long as you send them in as a batch; even if it is a single sentence, send it as a batch. This doesn't cause an issue for the network itself because, when training, the loss gets scaled according to the batch size. (Only my understanding, I'm a noob btw 😅)
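A tiny sketch of the "send a single sentence as a batch of one" idea (token ids are made up; the model in the video expects input of shape (src_len, batch_size)):

import torch

token_ids = [2, 15, 87, 43, 3]                   # e.g. <sos> ... <eos> indices for one sentence

sentence_tensor = torch.LongTensor(token_ids)    # shape: (src_len,)
sentence_tensor = sentence_tensor.unsqueeze(1)   # shape: (src_len, 1) -> a batch of size 1

print(sentence_tensor.shape)                     # torch.Size([5, 1])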
This channel exists and YouTube didn't recommend it sooner?
Thank you for saying that, appreciate it 🙏
Hello, I'm using this tutorial to perform translation from English to ASL gloss (an American Sign Language intermediate representation), so I made some adjustments to the dataset reading process, but it shows me this error in the training part: 'Example' object has no attribute 'src', generated from this line: for batch_idx, batch in enumerate(train_iterator):
If you can help I will be so grateful.
What if we work with different data, like audio or time series data? What would the vocab sizes be in that case? I think we cannot use this architecture directly, but can anyone give an idea about that? Thanks.
Sir, I am getting an error while running this code...
The error says it can't import translate_sentence, bleu from utils...
Can you please explain to me how I can fix it?
Thank you in advance
Doesn't forward_expansion correspond to dim_feedforward in nn.Transformer? If so, then the number should be a lot higher than 4. Maybe 4 is supposed to be what you multiply by the embedding size?
Anyway, thanks for the very nice tutorial. It might be worth trying to initialize the encoder with a pretrained model, freeze it, and then just fine-tune the decoder.
I think you're absolutely right. When implementing the original model, they define a forward expansion factor rather than explicitly writing out the dimensions, so I think I was trying to stick to that convention. I'm just so surprised it worked so well even when training with the feed-forward dimension set to 4. That doesn't make any sense to me; I'll look into this some more when I find the time.
Perhaps I changed it when training the model but didn't mention it in the video but that seems unlikely
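A sketch of the fix being discussed, assuming an embedding size of 512: dim_feedforward in nn.Transformer is the absolute width of the feed-forward layer, so the expansion factor should multiply the embedding size rather than be passed in directly.

import torch.nn as nn

embedding_size = 512
forward_expansion = 4  # expansion factor, as in the original paper

model = nn.Transformer(
    d_model=embedding_size,
    dim_feedforward=forward_expansion * embedding_size,  # 2048, the paper's (and PyTorch's default) value
)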
Hi, can you please explain what the source pad index represents? I'm preprocessing my texts from scratch; I did everything, and all that's left is to figure out src_pad_idx.
src_pad_idx is the index of the <pad> token in the vocab; <pad> is what gets added to the shorter sentences so that every sentence in a batch has the same length.
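A toy, runnable illustration of how that pad index gets used for the source mask (the <pad> id here is made up; with torchtext it would be looked up via something like field.vocab.stoi["<pad>"]):

import torch

pad_idx = 1
# Toy source batch, shape (src_len, batch_size) = (5, 2); the shorter sentence is padded with 1
src = torch.tensor([[ 2,  2],
                    [10, 42],
                    [11,  3],
                    [ 3,  1],
                    [ 1,  1]])

# nn.Transformer wants src_key_padding_mask of shape (batch_size, src_len),
# True at positions that attention should ignore
src_padding_mask = src.transpose(0, 1) == pad_idx
print(src_padding_mask)
# tensor([[False, False, False, False,  True],
#         [False, False, False,  True,  True]])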
Hey, can you please build a YOLO object detection model from scratch? Like building the YOLO architecture from scratch using PyTorch (just like you built the ResNet architecture) and then training it on some dataset. I desperately need it for my project.
I have it on my list, working on another video playlist currently but it's coming sooner or later:)
Hey, can you upload a copy of the code to Git? I couldn't find it in the advanced folder of your repo.
github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
@@AladdinPersson Thanks a lot
Hi, I am getting a "RuntimeError: CUDA error: device-side assert triggered" error. Could you help me out?
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
in
24
25 # Forward prop
---> 26 output = model(inp_data, target[:-1, :])
27
28 # Output is of shape (trg_len, batch_size, output_dim) but Cross Entropy Loss
home\ubuntu\Projects\virtualEnvs\transformers\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
in forward(self, src, trg)
65
66 src_padding_mask = self.make_src_mask(src)
---> 67 trg_mask = self.transformer.generate_square_subsequent_mask(trg_seq_length).to(
68 self.device
69 )
RuntimeError: CUDA error: device-side assert triggered
Hey Shoeb, did you change the implementation any or are you running it as I implemented it (by copying/cloning from Github)?
@@AladdinPersson Hi Aladdin, yes I reused your code and made some tweaks. But I get this error with another dataset. I have prepared a dataset with English sentences and their lemmatized sentences. The sentences are of more or less equal length. My idea was to use the attention mechanism to predict the lemmatized sentences, hoping it maintains the context which is the motivation of the task (e.g. a sentence like "I cut wood with saw" lemmatized to "I cut wood with saw" (here saw instead of the lemmatized word "see" is appropriate)).
Here is my link to my GitHub: github.com/shoebjoarder/Lemmatization-using-Attention-Mechanism/blob/master/engLemAttention.ipynb
By the way, I changed the device to "CPU", the new error is "IndexError: index out of range in self".
I am really looking forward to learning about the problem I am having here. Hope you can help me out :).
@@shoebjoarder It's difficult for me to know without running the code and debugging (this can take some time). Reading through your notebook, it seems the issue is with the indexing in your Embedding layer. If I recall correctly, we used max_len in the Embedding for the transformer to gain positional understanding (word order matters), but a con of using these positional embeddings is that we have to cut everything at a certain max_len. In this case we set it to 100; are you sure no sentence is longer than 100 words?
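A small sketch for checking this (whitespace split as a rough proxy for the real tokenizer; the file name is just an example):

def longest_sentence(path):
    # returns the maximum number of whitespace-separated tokens over all lines
    with open(path, encoding="utf-8") as f:
        return max(len(line.split()) for line in f)

# e.g. if longest_sentence("train.en") exceeds the max_len of 100,
# filter or truncate those sentences (or raise max_len)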
Did you manage to solve it?
@@AladdinPersson Hi Aladdin, you were right, the length of the sentences in the dataset was the issue. First, I wrote a program to check the maximum number of words in a sentence of the dataset and saw it was 768! Then I manually checked the sentences, and there were indeed a few sentences longer than 100 words. Thank you for helping me locate the problem. You are a good mentor! :)
Dude, please stop voicing out the exact code you are typing. I can see what you are typing. Instead, say what it actually is doing and why you are doing it.
I'll try to improve thanks
Any particular point in the video where you felt that way? I just skimmed through it and felt that I did say outloud what I was typing but oftentimes I shared some additional thoughts on why we're doing it that way
@@AladdinPersson Definitely from the 6 minute mark (arguably the most important part of the video)
1. Very hard to know the purpose of the variables of the initialize function since all you are saying is "We're gonna do a 〇〇 and then we are going to do 〇〇"
2. When you move on to adding the Embedding layers, you just add them without giving any context for why we need them. I know you implemented Transformers from scratch in a previous video, but it would be so much more helpful if you made a connection between the role of these embedding layers, how they work in PyTorch, and the part in the previous video where you implement them from scratch.
3. You add self.device and say "Now we do self.device" instead of giving the constructive reasoning, something like "This keeps track of whether we are using the CPU or the GPU to train this model."
4. Then you assign a Transformer module without even touching on the role of the parameters passed to its constructor. Making some connections to the Transformer diagram from the Attention Is All You Need paper would have been really helpful.
These are just some of the points from the entire video.
These might sound like trivial points to you since you might already be working with PyTorch all the time but from the perspective of a beginner (which I assume is the target of the video), it is extremely hard to keep up with the content.
I hear you, I'll try to do better in the future. I'm definitely assuming a lot of previous knowledge, and it's definitely valuable advice to try to give a quick explanation of why we are doing certain things. The difficulty is finding the correct level, because explaining embeddings would perhaps be too basic, and some viewers might be confused about exactly why I'm explaining how they work in this video. I'll think some more about this, and I'm going to re-watch some of my old(er) videos to try to see how I can do better moving forward. Thank you for sharing your thoughts on this.
You aren't explaining anything; you are just writing and speaking out loud what you are writing.
Sorry you feel that way, I've gotten some feedback on this video already but you can share more detailed feedback if you'd like
@@AladdinPersson Sorry, I edited the question. I think it would be great if you started explaining things more, for example highlighting code and explaining in more depth the parts that were difficult for you the first time. I mean, I'm sure when you were learning this you had questions, so it would be great if you gave us the answers to those questions. I would also suggest using a drawing tool (sentdex sometimes uses one); drawing is best for explaining, and that's why Andrew Ng and Sal Khan get such great feedback from students.