marvelous implementation. it's much clearer than looking into the original code
@outliier 26:34 I suspect that the colors are off due to decoded_images.add(1).mul(0.5) in the visualization, which maps the colors from [-1, 1] to [0, 1], but is only applied to the decoded images and not the original images for some reason.
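A minimal sketch of the fix this comment implies, assuming both the real batch and the reconstruction live in [-1, 1]: rescale both with the same function before stacking them into one grid. The helper name `to_unit_range` and the random stand-in tensors are hypothetical, not taken from the video's code.

```python
import torch


def to_unit_range(images: torch.Tensor) -> torch.Tensor:
    """Map images from [-1, 1] to [0, 1] for display (hypothetical helper)."""
    return images.add(1).mul(0.5).clamp(0, 1)


# Random stand-ins for a real batch and its reconstruction, both in [-1, 1].
real_images = torch.rand(4, 3, 8, 8) * 2 - 1
decoded_images = torch.rand(4, 3, 8, 8) * 2 - 1

# Apply the same rescaling to both halves so their colors are comparable.
real_and_fake = torch.cat((to_unit_range(real_images), to_unit_range(decoded_images)))
print(real_and_fake.shape, real_and_fake.min().item(), real_and_fake.max().item())
```

If the real images are rescaled and the colors still look off, the mismatch is probably somewhere else, for example in the dataset normalization.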
Excellent Video. Thank you very much for making this video
Regarding the training part for the VQGAN at 24:24, from what I understand the following is happening:
1. VQGAN grads are zeroed, grads are then propagated over the Discriminator (because of g_loss) and over the VQGAN for the rest of the losses (and the g_loss); retain_graph = True is added in order to keep the previously computed forward pass values, otherwise calling backward again on the same losses would raise an error;
2. Discriminator grads are zeroed to remove what was previously added by the g_loss.backward(), and another backward call is done on the gan_loss to propagate the grads for the proper loss function (d_loss_fake and d_loss_real);
3. The optimizers are called one after another to update the weights with the accumulated values in the leaf-tensors .grad property.
One possible error might have occurred at step 2, which led to the bad reconstruction seen a few minutes later in the video. The gan_loss.backward() call propagates the following:
- d_loss_real which was computed by applying the Discriminator over the real images.
- d_loss_fake which was computed by using the disc_fake images *generated* by VQGAN. Here is where the issue might lie. The disc_fake images were obtained by a forward pass through the VQGAN model, so the computational graph still connects them to the VQGAN, and when gan_loss.backward() is called, d_loss_fake is propagated through both the Discriminator and the VQGAN. In turn, this will adjust the VQGAN's weights to also minimize the Discriminator's loss, which amounts to something along the lines of "Generate images such that the Discriminator will be able to easily tell that they are fake".
The VQGAN is still able to reconstruct the images, albeit not very well because of this perturbing loss propagation. That might be due to 2 factors:
- the reconstruction loss is still present
- the discriminator is turned off until the threshold is hit; only after that does the perturbation come into play.
A solution would be to:
- (less optimal) use two tensors for the fake images: disc_fake_1 = decoded_images and disc_fake_2 = decoded_images.detach(), which will not propagate grads through the VQGAN. Pass them both through the Discriminator, where disc_fake_1 will be used in g_loss to update the VQGAN and disc_fake_2 will be used in gan_loss to update the Discriminator.
- (better, as only a single pass through the Discriminator is required) before the gan_loss.backward() call, use self.vqgan.requires_grad_(False) => this will disable the accumulation of gradients in the VQGAN, so only the discriminator will receive values in its .grad property. After the backward() call, reactivate the grads with self.vqgan.requires_grad_(True).
I am a beginner in the field so I might be wrong in both my understanding and explanation.
Source:
- pytorch.org/docs/stable/notes/autograd.html#setting-requires-grad
I was having the same confusion about the grad propagation, then I saw your answer!
I think that this is not necessary, since opt_disc.step() should only modify the discriminator parameters (and opt_vq.step() should only modify the vqgan parameters).
But loss_fake.backward() should add grads on the layers of the generator. Not sure whether opt_vq.step() would then use both parts of the grad for its update or not @@NirDodge
@@csoRoBeRt Right, I see... While opt_disc.step() would not affect the vqgan weights, it looks like self.vqgan.requires_grad_(False) is needed so that gan_loss.backward() will not accumulate gradients on the vqgan, which would otherwise affect the update in opt_vq.step().
Great! But in solution 1, I think just changing line 56 'disc_fake = self.discriminator(decoded_images)' to 'disc_fake = self.discriminator(decoded_images.detach())' is ok.
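The effect discussed in this thread can be checked on a toy example. This is only a sketch under stated assumptions: two small `nn.Linear` layers stand in for the VQGAN and the Discriminator, and a logistic discriminator loss stands in for whatever loss the video uses. The names `generator`, `d_loss_fn`, and `grad_abs_sum` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: "generator" plays the VQGAN's role.
generator = nn.Linear(4, 4)
discriminator = nn.Linear(4, 1)


def grad_abs_sum(module: nn.Module) -> float:
    """Sum of |grad| over all parameters; missing grads count as zero."""
    return sum(p.grad.abs().sum().item() for p in module.parameters() if p.grad is not None)


def d_loss_fn(disc_real: torch.Tensor, disc_fake: torch.Tensor) -> torch.Tensor:
    """Logistic discriminator loss (an assumption, chosen for illustration)."""
    return F.softplus(-disc_real).mean() + F.softplus(disc_fake).mean()


real = torch.randn(8, 4)
fake = generator(torch.randn(8, 4))

# 1) No detach: the discriminator loss also writes grads into the generator.
d_loss_fn(discriminator(real), discriminator(fake)).backward()
leaked = grad_abs_sum(generator)

# 2) With detach: the graph is cut at `fake`, so the generator gets nothing.
generator.zero_grad()
discriminator.zero_grad()
d_loss_fn(discriminator(real), discriminator(fake.detach())).backward()
blocked = grad_abs_sum(generator)
disc_grad = grad_abs_sum(discriminator)

print(f"generator grad without detach: {leaked:.4f}")
print(f"generator grad with detach:    {blocked:.4f}")
print(f"discriminator grad:            {disc_grad:.4f}")
```

Note that if g_loss is computed from the same discriminator output, the detached tensor cannot be reused for it: the generator loss still needs a pass through the Discriminator on the non-detached images.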
Nice one 👍🏼👍🏼
Hello, I trained my model and I had a good result. I want to download an image and use the trained model (the file with the .pt extension) to see the image reconstructed.
@Outlier can you help me test a VQGAN model?
Dude, you are making a YT video, not a class presentation, you have all the time in the world to take your time and explain each module step by step. Especially since your implementation has quite a few bugs…
But overall, you did a decent job.
@@vinc6966 :(
Hey, I looked through your codebook. VQVAEs perform a one-hot encoding. Is that something from the paper or just something you personally included? Nice video.
Awesome! Thanks for the video!
Hi, your video is great! I can't find a second one as good as yours. I have a question I would like to ask: how do I add conditions in the form of pictures when training the second-stage transformer?
Hey, thank you! I answered your question on GitHub.
Very nice. Thanks for the video. QQ: 10:30 Why do you use the expanded version and not just (a-b)**2?
I wonder the same thing. I can only suppose that it is needed so as not to lose too much precision when calculating the square of the difference, since the (a-b) values can be very small.
z_flattened.shape = [1024, 256], embedding.shape = [1024, 256]. When you do (a-b)**2 elementwise you get shape [1024, 256], which means each of the 1024 features is only compared with the code vector in the same row, one dimension at a time. If we use the expanded form (a**2 + b**2 - 2ab) you get [1024, 1024] ==> for each of the 1024 features we get the distance to all 1024 code vectors (because of the matrix product in the 2ab term), and only then can we pick the nearest one. So you might think the [1024, 256] version would still give you the nearest features, yet that is not correct, because it never compares a feature against the whole codebook.
Try it yourself:
add in __init__ function ==> self.l2 = nn.MSELoss(reduction='none')
add in forward function => d=self.l2(z_flattened,self.embedding.weight)
@@MrXboy3x how about (a-b)*(a-b)^T
Never mind I was being stupid. However, there does indeed exist a way to do it more elegantly:
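import torch
embedding = torch.rand(256, 512)  # num_codes, latent_dim
z = torch.randn(10, 512, 16, 16)  # N, channel, h, w
z_flattened = z.permute(0, 2, 3, 1).reshape(-1, 512)  # N*h*w, channel (channels moved last before flattening)
diff = z_flattened[None, :, :] - embedding[:, None, :]  # broadcasts to (256, 2560, 512), no repeat needed
diff_squared = torch.sum(diff**2, dim=2)  # (256, 2560) squared distances
min_index = diff_squared.argmin(dim=0)  # nearest code per latent vector, shape (2560,)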
@@MrXboy3x this still does not make sense to me, because mathematically the function that gives f(a,b) the loss from the inputs a and b is the same no matter how you decompose it. So the gradients on the inputs, or the minimum index should be the same... Am I missing something ? I suppose you may make the argument one is more numerically stable than the other, but I heard the (a-b)^2 version is more numerically stable..
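A small check of this point, assuming a randomly initialised codebook: the expanded form and the direct (a-b)**2 form, when both are computed pairwise, give the same distance matrix up to floating-point error and the same nearest indices. The practical difference is memory. The direct form materialises an (N, K, D) tensor, while the expanded form only needs the (N, K) result of a matrix product. The sizes and variable names below are illustrative, not taken from the video.

```python
import torch

torch.manual_seed(0)

num_codes, latent_dim = 16, 8
codebook = torch.randn(num_codes, latent_dim)  # (K, D)
z_flattened = torch.randn(32, latent_dim)      # (N, D)

# Expanded form: ||a||^2 + ||b||^2 - 2 a.b, computed via one matrix product.
d_expanded = (
    z_flattened.pow(2).sum(dim=1, keepdim=True)  # (N, 1)
    + codebook.pow(2).sum(dim=1)                 # (K,), broadcasts to (N, K)
    - 2 * z_flattened @ codebook.t()             # (N, K)
)

# Direct form: broadcast to (N, K, D), square, and sum over D.
d_direct = (z_flattened[:, None, :] - codebook[None, :, :]).pow(2).sum(dim=2)

same_values = torch.allclose(d_expanded, d_direct, atol=1e-4)
same_indices = torch.equal(d_expanded.argmin(dim=1), d_direct.argmin(dim=1))
print(same_values, same_indices)
```

For the sizes in the comment above (2560 latent vectors, 256 codes, 512 dimensions), the intermediate tensor of the direct form holds 2560 * 256 * 512 floats, roughly 1.3 GB in float32, which is a plausible reason to prefer the expanded form.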
It is great code, but do you or anybody else have the link to download the flowers dataset?
Thanks. Just look for oxford flower dataset and you should find it
great video
Really great, very detailed.
Hello. Thank you for this tutorial. Can you add VQGAN+CLIP please?
How do I add CLIP to this code?
Thank you for the great explanation! Out of curiosity - what is the purpose of implementing the blocks (GroupNorm) as a separate class instead of using the predefined class in the torch (torch.nn.GroupNorm) ?
Hi! Do you have a profile on kaggle?
No I don’t :c
I should have learned pytorch
Hello. thank you so much for the video and source files. can you please add the test code or can you help me to create the test code
What kind of test code are you talking about? All the code is on github. Did you see that?
Cool
thank you
Hello, I trained for 500 epochs. But I only want to use the checkpoint from the 500th epoch to generate 5000 images. How can I do that?
Hello. First of all, many thanks for the video and source files. I want to develop a midjourney-like system to improve myself and I would like to ask you a few questions for your guidance.
With the process shown in the video, we redraw an existing image. When we build a system like Midjourney, at what stage would this be useful to us?
I have seen projects written with VQGAN and CLIP over colab, but I want to write a system myself. What would you recommend? Which systems do you think I should use?
Another question is, I remember doing faster tutorials with tensorflow. Would you suggest using tensorflow instead of pytorch?
Thank you.
Hey there, first of all I would recommend using PyTorch. There is a much bigger community in the generative field that is using PyTorch. Second, a VQGAN usually represents the first stage, which compresses the data and removes redundancies. You would then need to train a model that learns in this compressed space. That's what the transformer in the second stage is doing. I don't know exactly how Midjourney is doing it, but Stable Diffusion, for example, uses the same approach of first learning a VQGAN-like autoencoder and then learning a diffusion model in its latent space. So usually text-to-image tasks are done using transformers or diffusion models. You can watch my videos on diffusion models, maybe train them, and eventually combine them with VQGAN, which gives you latent diffusion (the method that Stable Diffusion is using). Let me know if you have further questions.
This guy is seriously awesome; nobody on Bilibili explains it even half as well as you do.
Thank you