How is it that you are so good at explaining?
Keep up the good work champ.
u r such a great engineer!
I found this vid sooo useful!!
Thanks!!!!
Thanks:)
one of the best channels evaaaaa
Thank you very much for your videos, please continue your work, many people need your videos
Awesome complete tutorial, thank you.
Thanks for the great tutorial :)
Hi Aladdin, thanks for the awesome tutorials.
Could you please elaborate on 27:51, this statement
outputs=model(imgs, captions[:-1])
Why are we ignoring the last row? The last row would mostly contain padded characters, and very few EOS indexes. Could you please explain how ignoring the last row works in this context?
Thanks
Maybe it's very late to reply, but just for completeness: in the code, a feature vector from the CNN is used as the first input to the LSTM. This makes the LSTM's input and output sequences one step longer than the caption. The cross-entropy loss won't line up unless either the last step of the output sequence is ignored (outputs[:-1]) or, as done in the code, the input sequence length is reduced by 1 (captions[:-1]).
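Concretely, the shapes work out like this (a rough sketch only; I'm assuming captions is (seq_len, N) as in the video, and model/imgs/criterion are the names from the training script):

    # features              -> (N, embed)               CNN output, used as timestep 0
    # embed(captions[:-1])  -> (seq_len - 1, N, embed)   last row dropped
    # LSTM input            -> (seq_len, N, embed)       1 + (seq_len - 1) timesteps
    # outputs               -> (seq_len, N, vocab)       same length as the full captions
    outputs = model(imgs, captions[:-1])
    loss = criterion(outputs.reshape(-1, outputs.shape[2]),  # (seq_len * N, vocab)
                     captions.reshape(-1))                   # (seq_len * N,)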
@@saurabhvarshneya4639 Hey, did you get a good model out of this?
If so, how many epochs did you run it for?
The thing is, the model I built wasn't making any real progress: there was no major change in the error, and print_examples() looked the same, even after like 10 epochs of training.
that was super helpful man, thanks
That was a very Aladdin tutorial, thank you!
Thank you :) Excellent video!
3:37 feed predicted words as input, the difference in connections between inference and training
Excellent explanation, thank you 👍
Since you feed the feature vector at timestep 0, at inference time we also only feed the feature vector at timestep 0, so we don't have to provide the start token in the test phase.
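For reference, a rough sketch of what that inference loop can look like (written from memory, not copied from the repo; attribute names like encoderCNN, decoderRNN and vocabulary.itos follow the video's model and are assumptions here):

    import torch

    def caption_image_sketch(model, image, vocabulary, max_length=50):
        # greedy decoding: the image feature is the ONLY input at step 0,
        # afterwards each predicted word is embedded and fed back in
        result = []
        with torch.no_grad():
            x = model.encoderCNN(image).unsqueeze(0)                  # (1, 1, embed)
            states = None
            for _ in range(max_length):
                hiddens, states = model.decoderRNN.lstm(x, states)    # one LSTM step
                output = model.decoderRNN.linear(hiddens.squeeze(0))  # (1, vocab)
                predicted = output.argmax(1)
                result.append(predicted.item())
                if vocabulary.itos[predicted.item()] == "<EOS>":
                    break
                x = model.decoderRNN.embed(predicted).unsqueeze(0)    # feed prediction back
        return [vocabulary.itos[idx] for idx in result]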
looking forward to new videos. awesome!
very nice tutorial. Awesome
Thank you!
Amazing tutorial!!
Can we do it using the transformer instead of LSTM?
Did you find any answer? Because I'm also searching for the same thing.
OK, I've figured it out. Excellent PyTorch tutorial.
where can I find it please
While training, you took the output of the LSTM only once, but while testing you used a for loop for generating the whole sentence. Why didn't you do the same thing while training? Can you please explain?
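Short answer: during training the whole ground-truth caption is already known (teacher forcing), so the LSTM can process all timesteps in one call; at inference the next input depends on the previous prediction, which forces a step-by-step loop. A tiny toy illustration (not the video's exact model, just the two calling patterns):

    import torch
    import torch.nn as nn

    embed_size, hidden_size, vocab_size, seq_len, batch = 8, 16, 10, 5, 2
    lstm = nn.LSTM(embed_size, hidden_size)          # sequence-first, like in the video
    fc = nn.Linear(hidden_size, vocab_size)

    # Training: the full (seq_len, batch, embed) input goes through in a single call
    inputs = torch.randn(seq_len, batch, embed_size)
    outputs, _ = lstm(inputs)                        # (seq_len, batch, hidden)
    scores = fc(outputs)                             # all timesteps scored at once

    # Inference: one step at a time, carrying the hidden state and feeding back
    x, states = torch.randn(1, 1, embed_size), None
    for _ in range(seq_len):
        out, states = lstm(x, states)
        next_token = fc(out).argmax(-1)              # greedy pick
        x = torch.randn(1, 1, embed_size)            # real model: embed(next_token)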
can we have a demo on visual question generation also?
I'm not familiar with visual question generation, will do some research on that! :)
You are Excellent... Thanks a lot...
Thanks a lot! One important question:
In the training loop the loss is calculated from the scores and the captions, which are the target.
There is no shifting of the target captions to the right. Without doing so, how does the model still learn to predict the next word? Is there an internal PyTorch method that does so implicitly? I tried to look and I don't understand how the loss can be calculated in this way such that the model learns to predict the next word.
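There is no hidden PyTorch mechanism; the shift comes from the image feature occupying the first timestep. A small alignment sketch (my reading of the code, stated as an assumption):

    # LSTM inputs :  image   <SOS>   w1    w2   ...         ([features] + embed(captions[:-1]))
    # LSTM outputs:  out0    out1    out2  out3  ...
    # targets     :  <SOS>   w1      w2    w3   ...  <EOS>   (the full captions tensor)
    # out_t is computed from the image plus caption tokens 0..t-1 only, so comparing
    # out_t against captions[t] is already next-word prediction; the extra image
    # timestep at the front does the shifting, no explicit right shift of the targets
    # is needed.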
Shouldn't the concat and unsqueezing happen on dim=1? The output of the fc layer is of shape (batch_size, embed_size), and I'm assuming the shape of the embedded captions is supposed to be (batch_size, num_words, embed_size). Unsqueezing features on dim 0 results in (1, batch_size, embed_size), which cannot be concatenated with the embedded captions on dim 0. Am I missing something?
In his data augmentation video he generated the caption tensor with shape (seq_len, batch_size); that's why he concatenates them at dim=0. Hope this clarifies your doubt.
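A quick shape check of that dim=0 concat (assuming (seq_len, N) captions and a sequence-first LSTM, as in the video):

    import torch

    N, embed_size, seq_len = 32, 256, 20
    features = torch.randn(N, embed_size)                  # output of the encoder's fc layer
    embeddings = torch.randn(seq_len - 1, N, embed_size)   # embed(captions[:-1])
    stacked = torch.cat((features.unsqueeze(0), embeddings), dim=0)
    print(stacked.shape)   # torch.Size([20, 32, 256]) -> (seq_len, N, embed_size)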
In the forward function of the decoder, you are giving the output from the LSTM, of shape (batch, seq, hidden), directly to the linear layer. I'm confused... doesn't the linear layer expect a flattened (2D) tensor?
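nn.Linear only needs the last dimension to match in_features; any leading dimensions such as (seq, batch) are simply broadcast over, so no flattening is required. A quick check with made-up sizes:

    import torch
    import torch.nn as nn

    linear = nn.Linear(512, 1000)        # hidden_size -> vocab_size (made-up numbers)
    hiddens = torch.randn(20, 32, 512)   # (seq_len, batch, hidden) from the LSTM
    scores = linear(hiddens)             # applied to the last dim only
    print(scores.shape)                  # torch.Size([20, 32, 1000])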
Awesome tutorial, followed it till the end. I have a question: where do we split the training and test set, and how, since there is both image data and caption data? Can you help me with that?
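The video doesn't split the data, but one common approach is to split by image (not by caption row) so that all ~5 captions of an image end up in the same split. A rough sketch, assuming a Flickr8k-style captions file with image and caption columns:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("captions.txt")            # columns: image, caption (assumed layout)
    images = df["image"].unique()
    np.random.default_rng(42).shuffle(images)   # shuffle image ids, not caption rows
    split = int(0.8 * len(images))
    train_imgs = set(images[:split])
    train_df = df[df["image"].isin(train_imgs)]
    test_df = df[~df["image"].isin(train_imgs)]
    train_df.to_csv("captions_train.txt", index=False)
    test_df.to_csv("captions_test.txt", index=False)
    # then point two FlickrDataset / get_loader calls at the two files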
Hey, thanks for this. These videos make it so much easier as well as give validation. I had one quick question though: when concatenating the features from the CNN to the embedding, we are actually adding one more timestep in front of the caption embeddings. Does this mean that at time step 0 the LSTM has the image features as input, at time step 1 the LSTM has token embeddings as input, and so on?
Yeah you're exactly right! Then we just compare those outputs to the full correct captions (including start token and end token).
I have a question: if we are training the model in batches, then we cannot use the logic of breaking the loop when it predicts the end token, since the end token position may vary for each caption within the batch. So what's the solution for that?
Thanks
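One common way to handle this (not shown in the video, just a sketch under assumptions) is to keep decoding the whole batch and track a per-sample finished mask instead of breaking the loop; decoder_step below is a hypothetical callable that embeds the tokens, runs one LSTM step and returns (logits, states):

    import torch

    def greedy_decode_batch(decoder_step, features, eos_idx, pad_idx, max_length=50):
        N = features.size(0)
        finished = torch.zeros(N, dtype=torch.bool)
        tokens, x, states = [], features, None
        for _ in range(max_length):
            logits, states = decoder_step(x, states)        # (N, vocab), assumed API
            predicted = logits.argmax(dim=1)                # (N,)
            predicted = torch.where(finished,               # samples already done
                                    torch.full_like(predicted, pad_idx),
                                    predicted)              # keep emitting <PAD> for them
            tokens.append(predicted)
            finished |= predicted.eq(eos_idx)               # mark newly finished samples
            if finished.all():                              # stop once the whole batch is done
                break
            x = predicted
        return torch.stack(tokens, dim=1)                   # (N, steps)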
Aladdin, that is an excellent video. Very easy to follow along.
You are setting which parameters are trainable in the forward function. Wouldn't this mean it is set again and again in each forward pass? I am wondering if you have thoughts on putting it into the constructor or in the configuration of the optimizer?
Thank you for the comment!
I definitely understand your point and I changed the requires_grad from the forward and put this in the train function after initializing the model. I initially thought the speed improvements would be more than they were. My tests show that the forward pass of the encoderCNN runs 1.6% faster with the change, although this is definitely an improvement. I updated the code with this change on Github.
Aladdin Persson Just to clarify. My intention was not to improve performance, but conceptual clarity. As the parameters do not change in between forward passes, as they are not dependent on the forward pass input, I would maintain them separately to communicate this to the reader. Allocating it with the training code, as you suggest, makes sense to me.
Keep it up.
Yeah, it definitely makes things cleaner and that's the primary goal for me. That it would be computationally inefficient just popped into my mind. Again, I appreciate your feedback and for taking the time to comment! :)
It can be run successfully. Thanks
Wowwww.....just awesome ❤️❤️❤️❤️
You really are my biggest supporter, first comment on all my videos ❤️
@@AladdinPersson I always will be, forever..... you taught me everything and I owe it to you.
Very good work. Please make some videos on medical imaging. Thanks
I am also interested in it
YOU'RE AWESOME
Thank you. Can you add a requirements.txt file so we know the versions of each library?
Hi Aladdin, thanks for the excellent tutorials. One question though: in the Encoder code, wouldn't it be better to set requires_grad for the layers in the __init__() method instead of the forward() method? I guess this assignment is required only once, during initialisation. You are anyway getting the train_CNN value in the __init__() method and not in the forward() method.
Yeah you're right about that, someone else brought this up as well and I updated the Github code shortly after:)
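For anyone curious, the change can look roughly like this (a sketch, not a copy of the updated repo; the fc-name check is my assumption):

    import torch.nn as nn
    import torchvision.models as models

    embed_size, train_CNN = 256, False
    inception = models.inception_v3(weights="DEFAULT")
    inception.fc = nn.Linear(inception.fc.in_features, embed_size)
    for name, param in inception.named_parameters():
        # done once at setup time: only the new fc layer is trainable,
        # unless we want to fine-tune the whole CNN
        param.requires_grad = train_CNN or "fc.weight" in name or "fc.bias" in name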
How to get the datasets?
Excellent video. Very well explained. Can you tell us what GPUs you used to train this model and how much time it took?
Also, is your code open on git?
I only trained for a couple of hours on a 1060 and the model was very small (256 hidden & embed size with 1 layer on the lstm). Performance can definitely become better and the goal of the video was only to convey the underlying ideas of image captioning. Everything is uploaded on Github: github.com/AladdinPerzon/Machine-Learning-Collection
Hi Aladdin, thanks for the video😍 I want to capture an image, check which class it belongs to, and speak that class in audio format, so basically image classification plus image-to-speech conversion. How do I tell the image's class, since the image doesn't contain any text? Can I use this same code? I'm a beginner, so...
Hi Aladdin , thanks so much for this awesome series of videos. Could you please explain how to use BERT instead of RNN in this model ? thanks in advance
thank you very much
Hi Aladdin, thanks for the video and GitHub link. I've gone through your code and entered it into Jupyter. The program gets through just about everything, and then right before it trains it stops and gives me the following error message: "ValueError: too many values to unpack (expected 2)". I'm really at a loss here. Just wondering if you could provide a recommendation? I know this has happened to a few other people, but it's odd that it's not a universal issue. I am using my own dataset. Thanks a lot in advance.
Great tutorial!!! But how to save model?
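A minimal way to save and restore it (model, optimizer, step and device here are the names from the usual training loop, so treat this as a sketch):

    import torch

    checkpoint = {
        "state_dict": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    torch.save(checkpoint, "my_checkpoint.pth.tar")

    # later, or on another machine
    checkpoint = torch.load("my_checkpoint.pth.tar", map_location=device)
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])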
I ran the same code but got an error; I've mentioned it below. Would you please tell me what this error is for?
"ValueError: Expected input batch_size (1152) to match target batch_size (1120)."
Thank you... what is your PC hardware? Can I run this code in real time?
Please make a video on attention in audio processing, e.g. speech emotion.
Hi, how do we check the model on custom data?
Where & when is the caption_image method getting called?
I am getting a ValueError at -> for idx,(imgs,caption) in tqdm.... -> too many values to unpack
Did you manage to solve it? Try removing the tqdm part and see if it works; when running it on my machine from the Github repo I don't obtain an error. Have you downloaded the Github repository, run the image captioning script that's on there, and still get the error? If so, what versions of PyTorch and tqdm are you using?
EDIT: Also you can try replacing it with enumerate(tqdm(train_loader)) and see if that works
@@AladdinPersson Thanks, actually it was my mistake: I accidentally deleted the other argument of get_loader. Thanks for replying though✌️
I like your color theme very much, could you tell me which theme you are using?
You'll find the theme and how to set it up in his first video of this playlist
Hey, did anyone get a good model?
Cause I did like 40 epochs and print_examples is giving me the same answer again and again. If anyone did get a good model, please do reply with how many epochs you ran to get it.
BTW awesome video, really helpful
Same here. I trained for 10 epochs and getting the same output for any image I give. Is there something to change in the code or is it just about insufficient training? Thanks in advance.
How to extend the code to check validation and testing accuracy
Hello Aladdin, I have a question. The number of images in the dataset is around 8090 and I selected a batch size of 32, so the total number of batches in each epoch should be 253. But when I load the data and check the length of the data loader it shows 1265. I don't understand this. Can you please explain if you have any idea? I have never seen this.
I hope this is not too late. Although there are 8090 images, there are also 40455 captions (about 5 captions for each image, not 1). This determines the length of the dataset, and when the batch size is 32, you get ceil(40455/32) = 1265 batches in total. I hope it will still be useful :)
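As a quick check of the arithmetic:

    import math
    # loader length = ceil(number of caption rows / batch_size), ~5 captions per image
    print(math.ceil(40455 / 32))   # -> 1265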
Why are the image and captions concatenated and sent to the LSTM?
Can you show us how to run inference with this model? You did not show the code.
I get an error related to spacy.... I installed it but still get the same error. It says it's deprecated.....
Where can I get the loader file?
6:44
In python-3.x you can just do
super().__init__()
getting this error------TypeError: relu(): argument 'input' (position 1) must be Tensor, not InceptionOutputs
Can someone send me link of MSVD dataset please?
It's been removed from website so if anyone has in drive or something then it'd be great
Could you please tell me, where can I get my_checkpoint.path.tar and the run files?
Did you find out?
How do I execute this in Colab?
Make a video about automated hair removal from dermoscopy images please
Hey, Thanks for this. Is the source code open? Where is the GitHub website address?
When I run the train.py file it runs 0-1265, then it starts at 0 and goes to 1265 again, and it keeps continuing.
Would you please explain how I can solve this problem?
BTW thanks for the PyTorch videos
Have you tried running the one on Github? It works for me
There are 1265 batches in an epoch. That's why it goes from 0 to 1265 and then back. If you change the batch size from 32 to something else you will see the 0 to 1265 range change as well.
Same with me, have you got the solution?
I am also stuck at it, the training is done repeatedly
Hello sir, can you guide me on how to run this code? I'm new to Python so still bad at it :(( (nice video anyway)
TypeError: relu(): argument 'input' (position 1) must be Tensor, not InceptionOutputs This is the error I get
The new version of pytorch/torchvision makes you use
...
self.inception = models.inception_v3(weights="DEFAULT")
...
return self.dropout(self.relu(features[0]))
...
because in training mode inception_v3 now returns an InceptionOutputs namedtuple (logits, aux_logits) instead of a plain tensor, so you have to index into it (features[0] is the same as features.logits).
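If it helps, here is one way the encoder can deal with that return type (a sketch on my part, not the official fix; InceptionOutputs is the namedtuple torchvision uses):

    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        def __init__(self, embed_size):
            super().__init__()
            self.inception = models.inception_v3(weights="DEFAULT")
            self.inception.fc = nn.Linear(self.inception.fc.in_features, embed_size)
            self.relu = nn.ReLU()
            self.dropout = nn.Dropout(0.5)

        def forward(self, images):
            features = self.inception(images)
            # in training mode the newer torchvision returns InceptionOutputs(logits, aux_logits)
            if isinstance(features, models.inception.InceptionOutputs):
                features = features.logits      # same thing as features[0]
            return self.dropout(self.relu(features))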
I created almost the same model, only changing the backbone to EfficientNet. Don't know why it's giving the same caption for every image :(
I would assume it just has to do with tweaking the hyperparameters and training longer. If I recall correctly this was the case for me as well before I found hyperparameters that worked. Although make sure you're outputting the feature vector from EfficientNet so the connection between the CNN and RNN makes sense
@@AladdinPersson How did you tune the parameters?
By random search?
It really looks like a difficult problem to tune to me.
I am thinking maybe I'll do a random search over 25 jobs for 2 epochs, then take the parameters with the lowest loss and try them.
Will that work?
@@haideralishuvo4781 Yeah, random search works. But let's get back to basics first: did you follow tip #1 of my common mistakes video and overfit a single batch first? Have you made sure your model can overfit a single batch of 1, 2, 32, 64, etc. before trying anything else? That would be what I would check first; when you find something that works there, it's most likely going to work fine for the entire training set too, and random search can help in that process.
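For reference, the overfit-a-single-batch check is roughly this (a sketch; train_loader, model, criterion, optimizer and device are the names from the video's training script and are assumed here):

    imgs, captions = next(iter(train_loader))
    imgs, captions = imgs.to(device), captions.to(device)

    for step in range(500):
        outputs = model(imgs, captions[:-1])
        loss = criterion(outputs.reshape(-1, outputs.shape[2]), captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            print(step, loss.item())   # should steadily head towards ~0 on this one batch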
I'm regretting not watching this video a few weeks ago... :'(
Hi, training completed but it's running a 2nd time by itself, why?
What do you mean it's running 2nd time?
@@AladdinPersson how much time is required to complete training?
@@balachakradharrocksagayara2610 Depends a lot on your hardware, I think this was trained for a couple of hours on a gtx 1060
@@AladdinPersson is it possible to train on one machine and copy the files to another machine?
@@balachakradharrocksagayara2610 just copy the checkpoint file to the desired machine and load the model with it
Could you provide your source code in this video? Thanks.
Hi, the Github link is in the description of the video although it's linking to the machine learning repository and there you can find Image Captioning. Here is the direct link for the code in the video: github.com/AladdinPerzon/Machine-Learning-Collection/tree/master/ML/Pytorch/more_advanced/image_captioning
@@AladdinPersson Copy it. Thank you very much.
How do I do this in TensorFlow?
u fuckin idiot go google it and watch the damn shit