TransGAN: Two Transformers Can Make One Strong GAN (Machine Learning Research Paper Explained)

  • Published 19 Jun 2024
  • #transformer #gan #machinelearning
    Generative Adversarial Networks (GANs) hold the state of the art when it comes to image generation. However, while the rest of computer vision is slowly being taken over by transformers and other attention-based architectures, all working GANs to date contain some form of convolutional layers. This paper changes that and builds TransGAN, the first GAN where both the generator and the discriminator are transformers. The discriminator is adopted from ViT ("An Image is Worth 16x16 Words"), and the generator uses pixel shuffle to successfully up-sample the resolution of the generated images. Three tricks make training work: data augmentation using DiffAugment, an auxiliary super-resolution task, and a localized initialization of self-attention. Their largest model reaches performance competitive with the best convolutional GANs on CIFAR-10, STL-10, and CelebA.
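    To make the up-sampling step concrete, here is a minimal sketch of how pixel shuffle turns a coarse token grid into a finer one by trading channels for resolution (PyTorch; the grid size and embedding dimension are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

# Transformer blocks see a flat sequence of h*w tokens with c channels each.
# Pixel shuffle trades channels for resolution: (h, w, c) -> (2h, 2w, c/4).

def upsample_tokens(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """(batch, h*w, c) tokens -> (batch, 4*h*w, c//4) tokens."""
    b, n, c = x.shape
    assert n == h * w and c % 4 == 0
    x = x.transpose(1, 2).reshape(b, c, h, w)   # tokens -> 2D feature map
    x = F.pixel_shuffle(x, 2)                   # (b, c//4, 2h, 2w)
    return x.reshape(b, c // 4, 4 * n).transpose(1, 2)

x = torch.randn(2, 8 * 8, 384)   # an 8x8 grid of 384-dim tokens
y = upsample_tokens(x, 8, 8)
print(y.shape)                   # torch.Size([2, 256, 96])
```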
    OUTLINE:
    0:00 - Introduction & Overview
    3:05 - Discriminator Architecture
    5:25 - Generator Architecture
    11:20 - Upsampling with PixelShuffle
    15:05 - Architecture Recap
    16:00 - Vanilla TransGAN Results
    16:40 - Trick 1: Data Augmentation with DiffAugment
    19:10 - Trick 2: Super-Resolution Co-Training
    22:20 - Trick 3: Locality-Aware Initialization for Self-Attention
    27:30 - Scaling Up & Experimental Results
    28:45 - Recap & Conclusion
    Paper: arxiv.org/abs/2102.07074
    Code: github.com/VITA-Group/TransGAN
    My Video on ViT: • An Image is Worth 16x1...
    Abstract:
    The recent explosive interest in transformers has suggested their potential to become powerful "universal" models for computer vision tasks, such as classification, detection, and segmentation. However, how much further can transformers go - are they ready to take on some of the more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs)? Driven by that curiosity, we conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing embedding dimension, and a patch-level discriminator that is also transformer-based. We then demonstrate that TransGAN benefits notably from data augmentations (more than standard GANs), a multi-task co-training strategy for the generator, and a locally initialized self-attention that emphasizes the neighborhood smoothness of natural images. Equipped with those findings, TransGAN can effectively scale up with bigger models and high-resolution image datasets. Specifically, our best architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones: TransGAN sets a new state-of-the-art IS of 10.10 and FID of 25.32 on STL-10, and reaches a competitive IS of 8.64 and FID of 11.89 on CIFAR-10, and an FID of 12.23 on CelebA 64×64. We also conclude with a discussion of the current limitations and future potential of TransGAN. The code is available at this https URL.
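    The "locally initialized self-attention" mentioned above can be pictured as a mask over the token grid that starts small and is widened during training until attention becomes global. A minimal sketch of such a mask, assuming PyTorch (window radius and grid size are illustrative, not the paper's schedule):

```python
import torch

# Early in training each token may only attend to a small window around
# itself on the token grid; the window grows until attention is global.

def local_attention_mask(h: int, w: int, radius: int) -> torch.Tensor:
    """Boolean (h*w, h*w) mask; True means attention is allowed."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)  # (h*w, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(dim=-1)
    return dist <= radius  # Chebyshev neighbourhood

mask = local_attention_mask(8, 8, radius=1)        # very local at first
scores = torch.randn(8 * 8, 8 * 8)                 # raw attention logits
scores = scores.masked_fill(~mask, float("-inf"))  # block distant pairs
attn = scores.softmax(dim=-1)
```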
    Authors: Yifan Jiang, Shiyu Chang, Zhangyang Wang
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

Comments • 88

  • @Ronnypetson
    @Ronnypetson 3 years ago +189

    Yannic is like that nurse that mashes the potato with a spoon and gives it to you so that you toothless nerds can get fed

    • @YannicKilcher
      @YannicKilcher  3 years ago +45

      This made me laugh so hard :D

    • @cerebralm
      @cerebralm 3 years ago

      LOL

    • @theoboyer3812
      @theoboyer3812 3 years ago +2

      That's a funny summary of what a teacher is

    • @cerebralm
      @cerebralm 3 years ago

      @@theoboyer3812 I heard concise explanation described as "you have to make IKEA furniture without using all the pieces, but it still has to be sturdy when you're done".

    • @swordwaker7749
      @swordwaker7749 3 years ago +1

      Ahh... more like a chef. The papers in their original form can be hard to digest without... some help. BTW, the paper is like dragon meat.

  • @finlayl2505
    @finlayl2505 3 years ago +42

    Relationship ended with conv nets, transformers are my best friend now

    • @hoaxuan7074
      @hoaxuan7074 3 years ago

      A Fast Transform fixed-filter-bank neural network trained as an autoencoder works quite well as a GAN. Noise in, image out. I guess with filter in the title...

    • @lunacelestine8574
      @lunacelestine8574 3 years ago

      That made my day

  • @dasayan05
    @dasayan05 3 years ago +32

    25:57 convolutions are for losers, we're all for locally applied linear transformations .. 😂

  • @wilsonthurmanteng9
    @wilsonthurmanteng9 3 years ago +5

    Hi Yannic, fast reviews as usual! I would just like your thoughts on the loss functions of the recent Continuous Conditional GAN paper that was accepted into ICLR 2021.

  • @puneetsingh5219
    @puneetsingh5219 3 years ago +10

    Yannic is on fire 🔥🔥

  • @tnemelcfljdsqkf9529
    @tnemelcfljdsqkf9529 3 years ago +2

    Thank you a lot for your work, it's helping me a lot! Which software are you using to take notes on top of the paper like this? :)

  • @rallyram
    @rallyram 3 years ago +6

    Why do you think they go with the WGAN gradient penalty instead of spectral normalization, as per Appendix A.1?

  • @xtli5965
    @xtli5965 3 years ago +3

    They actually updated the paper: they no longer use super-resolution co-training and locality-aware initialization, but instead use relative positional embeddings and a modified normalization. They also tried larger images with local self-attention to reduce the memory bottleneck. The most confusing part of this paper for me is the UpScale and AvgPool operations, since the outputs of a transformer are supposed to be global features, so it feels strange to directly upsample or pool them as we do with convolutional features.

  • @WhatsAI
    @WhatsAI 3 years ago

    Hey Yannic, love the video! May I ask what tools you are using to read this paper, highlight the lines, and record it? Thanks! :)

    • @CosmiaNebula
      @CosmiaNebula 2 years ago +1

      Likely a Microsoft Surface with a pen. Then any PDF annotator would work; even Microsoft Edge's PDF reader has that.
      As for screen recording, try OBS Studio.
      Would you make your own paper reading videos?

    • @WhatsAI
      @WhatsAI 2 years ago

      @@CosmiaNebula Thank you for the answer! I would indeed like to try that style of video sometime, but my initial question was mainly because I would love to use something similar in meetings to show explanations, math, etc.

  • @etiennetiennetienne
    @etiennetiennetienne 3 years ago

    Cool bag of tricks! Instead of this hardcoded mask, could it just be an initialization problem? I.e., is the probability of predicting a positional-encoding vector that agrees with a far-away vector simply low at the beginning of training?

  • @Snehilw
    @Snehilw 3 years ago

    Great explanation!

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    12:51 Wait... 3 samples for the 1 × 156 pixel upsampled patch of data is corigates between the r² (alpha) and r² (beta) + ... r² (omega) channel transformers, or even 156 layer architecture base to finitely decoding he was recreating themself upper to 9 samples, right?

  • @chuby35
    @chuby35 3 years ago +1

    Could this be used with VQ-VAE-2, so that the lower-res "images" fed into this TransGAN are actually the latent-space representations produced by VQ-VAE-2?

  • @MightyElemental
    @MightyElemental 1 year ago +1

    I was attempting to build a TransGAN for a university project and ended up with a very similar method. Only thing that was missing was the localized attention. No way was I gonna get that 💀

  • @tedp9146
    @tedp9146 3 years ago +1

    How exactly is the classification head attached to the last transformer head?

  • @MrMIB983
    @MrMIB983 3 years ago

    Great video

  • @romanliu4629
    @romanliu4629 3 years ago

    An arithmetic question: what is the parameter size of the linear transforms before the "Scaled Dot-Product Attention" that produce Q, K, and V when synthesizing 256^2 images?
    If we reverse the roles of the "flattened" spatial axes and the channel axis, how is that related to or different from a 1×1 convolution? And why flatten and reshape features and upscale images via pixel shuffle, which can disrupt spatial information and lead to checkerboard artefacts?
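    On the first question, a back-of-envelope sketch (the embedding dimension below is an assumption, not the paper's value): the Q/K/V projections act only on the channel dimension, so their parameter count does not grow with resolution; what explodes at 256^2 is the N×N attention matrix.

```python
# Hypothetical numbers for illustration.
C = 384                  # token embedding dimension (assumed)
qkv_params = 3 * C * C   # Q, K, V projections: resolution-independent
N = 256 * 256            # tokens if every pixel of a 256x256 image is a token
attn_entries = N * N     # ~4.3e9 attention scores per head: the real bottleneck
print(f"QKV params: {qkv_params:,}; attention entries: {attn_entries:,}")
```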

  • @florianhonicke5448
    @florianhonicke5448 3 years ago +2

    I like your jokes a lot. It is much easier for me to learn something when it is fun!

  • @raunaquepatra3966
    @raunaquepatra3966 3 years ago +1

    I don't get the point of data augmentations for generators. Isn't the number of input samples practically infinite? I mean, I can feed in as many random vectors and get as many samples as needed?
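    For context on this question: DiffAugment is not about getting more generator inputs; it applies the same differentiable augmentation to both real and generated images right before the discriminator, because the scarce resource is the real data. A minimal sketch with random translation only (the actual DiffAugment policy set also includes color and cutout transforms):

```python
import torch

def rand_translate(x: torch.Tensor, ratio: float = 0.125) -> torch.Tensor:
    """Randomly shift a (b, c, h, w) batch; gradients still flow through x."""
    b, _, h, w = x.shape
    sx = int(torch.randint(-int(w * ratio), int(w * ratio) + 1, (1,)))
    sy = int(torch.randint(-int(h * ratio), int(h * ratio) + 1, (1,)))
    return torch.roll(x, shifts=(sy, sx), dims=(2, 3))

real = torch.randn(4, 3, 32, 32)                      # limited real data
fake = torch.randn(4, 3, 32, 32, requires_grad=True)  # generator output
d_in_real, d_in_fake = rand_translate(real), rand_translate(fake)  # both go to D
```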

  • @simonstrandgaard5503
    @simonstrandgaard5503 3 years ago

    Impressive

  • @pastrop2003
    @pastrop2003 3 years ago

    For the generator network, do I understand correctly that when you use the example of a 4-pixel image that starts the generation and then say that every pixel is a token going into a transformer, you imply that each of these tokens has an embedding with dimensionality equal to the number of channels? I.e., if one starts with a 2x2 image with 64 channels, every pixel (token) has a 64-dimensional embedding going into the transformer?
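    In tensor shapes, that reading looks like this (a minimal sketch, assuming PyTorch):

```python
import torch

# A 2x2 "image" with 64 channels becomes 4 tokens of embedding dimension 64.
img = torch.randn(1, 64, 2, 2)           # (batch, channels, h, w)
tokens = img.flatten(2).transpose(1, 2)  # (1, 4, 64): 4 tokens, 64-dim each
print(tokens.shape)
```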

  • @tylertheeverlasting
    @tylertheeverlasting 3 years ago +2

    What would have been the issue with an ImageGPT-like generator? Would it be too slow to train due to serial generation?

    • @YannicKilcher
      @YannicKilcher  3 years ago +5

      Apparently, transformer generators have just been unstable for GANs so far

  • @minhuang8848
    @minhuang8848 3 years ago +10

    Dang, you learned some Chinese phonemes, didn't you? Pronunciation was pretty on point!

    • @dasayan05
      @dasayan05 3 years ago +31

      YannicNet has been trained for several years now on AuthorName dataset. No wonder output quality is good

    • @minhuang8848
      @minhuang8848 3 years ago +1

      @@dasayan05 That's the good-ass Baidu language models lol

  • @lucasferreira7654
    @lucasferreira7654 3 years ago

    Thank you

  • @nguyenanhnguyen7658
    @nguyenanhnguyen7658 2 years ago

    There is no high-res benchmark for TransGAN vs. StyleGAN2, so we do not know if it is worth trying.

  • @hk2780
    @hk2780 3 years ago +6

    So why should we not use convolutions if we use a locally applied linear function? I don't get the point of that. Also, why do they use the 16-crop thing? To be honest, it is almost the same as a convolution with a 16x16 kernel and stride 16. And then they say they don't use convolution. Well, they use the same thing a convolution does. Sounds like it becomes more of a con-artist thing.

    • @dl9926
      @dl9926 2 years ago

      But that would be so expensive, wouldn't it?
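    On the equivalence raised above: a ViT-style patch embedding is indeed the same linear map as a stride-16, 16x16-kernel convolution; with matching weights the two are numerically identical. A sketch (dimensions illustrative):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)

# "Convolution" view: 16x16 kernel, stride 16.
patchify = nn.Conv2d(3, 384, kernel_size=16, stride=16)
tokens_conv = patchify(img).flatten(2).transpose(1, 2)  # (1, 196, 384)

# "Crop + linear" view: cut 16x16 patches, flatten, project.
unfold = nn.Unfold(kernel_size=16, stride=16)
proj = nn.Linear(3 * 16 * 16, 384)
tokens_lin = proj(unfold(img).transpose(1, 2))          # (1, 196, 384)

print(tokens_conv.shape, tokens_lin.shape)
```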

  • @raunaquepatra3966
    @raunaquepatra3966 3 years ago

    In the super-resolution auxiliary task, how is the LR image calculated? In particular, how is the number of channels matched with the input channels of the network?
    E.g., SR image 64x64x3, LR image 8x8x3? (But the network needs 8x8x192.)

    • @raunaquepatra3966
      @raunaquepatra3966 3 years ago

      Couldn't find anything in the paper either 😔
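    One arrangement that makes those shapes work (a guess, not checked against the paper's code) is pixel unshuffle (space-to-depth), which loses no information and turns 64x64x3 into exactly 8x8x192:

```python
import torch
import torch.nn.functional as F

# Space-to-depth, factor 8: (1, 3, 64, 64) -> (1, 3*8*8, 8, 8) = (1, 192, 8, 8).
lr = torch.randn(1, 3, 64, 64)
packed = F.pixel_unshuffle(lr, 8)
print(packed.shape)  # torch.Size([1, 192, 8, 8])
```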

  • @user-qu2oz2ut2h
    @user-qu2oz2ut2h 3 years ago

    What if we change the feedforward layer in the transformer to another transformer?
    Like a nested transformer.

  • @array5946
    @array5946 3 years ago

    17:15 - is cropping a differentiable operation?

    • @udithhaputhanthri2002
      @udithhaputhanthri2002 2 years ago

      I think what he says is, if cropping is a differentiable operation, we can use it.

  • @timoteosozcelik
    @timoteosozcelik 3 years ago

    How do they give the LR image as input to Stage 2? What I've understood so far is that the number of channels decreases over the stages (which means Stage 2 will have more than 3 channels), but the LR image will only have 3 channels.

    • @chuby35
      @chuby35 3 years ago

      Probably using the same trick as for the upsampling, the other way around: scaling down the image by moving the information into more channels. (Since this aux task is only used for teaching the upsampling to work properly, I don't think the LR images lose any information; it is just rearranged into these "super-pixels".) But I haven't looked at the code yet, so my guess is as good as any. :)

    • @timoteosozcelik
      @timoteosozcelik 3 years ago

      @@chuby35 It makes sense, thanks. But directly applying what you said made me question the necessity (meaning) of Stage 3 in such cases. I checked the code, but couldn't see anything about that.

  • @Kram1032
    @Kram1032 3 years ago

    Can't help but think that the upsampling stuff is kinda like implicit convolutions...
    Not that it'd be particularly reasonable to not do this but it's setting up a similar localized attention type deal.

  • @user-on9bl2gi5h
    @user-on9bl2gi5h 1 year ago

    Sir, can you please provide the TransGAN training code and testing code?

  • @seonwoolee6396
    @seonwoolee6396 3 years ago +1

    Wow

  • @syslinux2268
    @syslinux2268 3 years ago +3

    What is your opinion on MIT's new "Liquid Neural Network"?

    • @YannicKilcher
      @YannicKilcher  3 years ago +2

      Haven't looked at it yet, but I will

    • @syslinux2268
      @syslinux2268 3 years ago +2

      @@YannicKilcher Similar to an RNN, but instead of scaling into billions of parameters, it focuses more on higher-quality neurons.
      Fewer parameters, but with results as good as or better than average-sized neural networks.

    • @dasayan05
      @dasayan05 3 years ago +1

      @@syslinux2268 Paper link? Is it public yet?

    • @shivamraisharma1474
      @shivamraisharma1474 3 years ago

      Do you mean the paper "Liquid Time-constant Networks"?

    • @syslinux2268
      @syslinux2268 3 years ago +1

      @@shivamraisharma1474 Yep. It's just too long to type.

  • @dasayan05
    @dasayan05 3 years ago +24

    1:11 "which bathroom do the TransGANs go to ?"

  • @phoenixwithinme
    @phoenixwithinme 3 years ago

    Yannic gets them, the ML papers, like in the targeted distance. 😂

  • @TaylorAlexander
    @TaylorAlexander 3 years ago

    Thanks for the video about this paper. Just what I was looking for. I will kindly suggest that the comment about the bathrooms would likely make some trans people uncomfortable. It is an unfortunate name for this research. Maybe best to leave it at that. Cheers and thanks for your work.

  • @ihoholko9522
    @ihoholko9522 3 years ago

    Hi, what program are you using for paper reading?

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    Discriminator sounds like more cancelingly. ×D

  • @xanderx8289
    @xanderx8289 3 years ago

    TransGen

  • @pratik245
    @pratik245 2 years ago

    AI papers are like news articles now... so many and so similar.

  • @nasenach
    @nasenach 3 years ago

    Actually, "FID score" is also kinda wrong, since the D already stands for distance here... OK, nerd out.

  • @siquod
    @siquod 3 years ago

    But what will transGANs do to the ganetics of organic crops if their pollen gets into the wild?

  • @UglyFatDwarf
    @UglyFatDwarf 3 years ago +1

    Yannic, we've become transformers inside and out now, man. Isn't there another enjoyable theoretical paper out there? You're making us sad.

  • @paiwanhan
    @paiwanhan 3 years ago

    Gan actually means the act of copulation in Mandarin. So TransGAN is even more unfortunate.

  • @SHauri-jb4ch
    @SHauri-jb4ch 3 years ago +1

    I like your videos, but if you read the word "trans" and your first association is that it has something to do with bathrooms, you should reflect on your prejudices. When you reach this many people, you should be aware that you 'could' have a diverse audience. Unfortunately, we are not that far along in ML yet, and in my opinion microaggressions like this contribute to it staying that way.

  • @bluestar2253
    @bluestar2253 3 years ago +2

    Convnets are dead, long live transformers! -- reminded me of the late 80s "AI is dead, long live neural nets!" Karma is a bitch.

  • @tarmiziizzuddin337
    @tarmiziizzuddin337 3 years ago +1

    "Convolutions are for losers".. 😅

  • @panhuitong
    @panhuitong 3 years ago

    "Convolutions are for losers"... feeling sad about that.

  • @JFIndustries
    @JFIndustries 3 years ago +4

    The joke about the name was really unnecessary

  • @FreakFolkerify
    @FreakFolkerify 3 years ago

    Can it turn my male dog into a female?

  • @circuitguy9750
    @circuitguy9750 3 years ago +2

    For the sake of your colleagues and students, I hope you realize how your "trans bathroom" joke is harmful, disrespectful, and unprofessional.

  • @GiannhsPanop
    @GiannhsPanop 2 years ago

    Very nice transphobic joke! #unsub