Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction

  • Published 20 Sep 2024

Comments • 17

  • @hjups · 5 months ago +6

    They didn't discuss a proof of linear scaling with size, only generation time. My guess is that their linear scaling comes from training the VQVAE in tandem with the generation model, which DiT does not do. The frozen VAE sets a minimum limit on the FID score, which I believe for ImageNet 256x256 is somewhere around 1.4 and would require perfect latents. Pixel-space models wouldn't have that issue, but they are much more expensive to train and run.
    That aside, VAR is a clever idea and the generation speed is impressive - I do wonder if they could achieve better results (perhaps with smaller models) if they combined the idea with MaskGiT. It would be a little slower (although a smaller model could make up for that), but it would allow for a self-correction step.

    • @keyutian · 3 months ago

      As the author of this work, I'd like to provide two more details XD: 1. VAR uses a frozen VQVAE too. Both VAR's VQVAE and DiT's VAE are trained only before the generation-model training phase. Once trained, the VQVAE/VAE is frozen.
      2. MaskGiT has no chance to do self-correction, as each token is only generated once. But VAR can potentially do this, because the tokens of each scale are eventually added together to get an output. So if some mistakes are made at early scales, late-scale generation can correct them thanks to the autoregressive nature.
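
      A minimal sketch (not code from the VAR repo) of the additive multi-scale decoding described above; the scale list, codebook size, and post-upsample conv `phi` are illustrative assumptions:

      ```python
      import torch
      import torch.nn.functional as F

      scale_sizes = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]  # assumed token-map side lengths per scale
      C, final_hw = 32, scale_sizes[-1]                # latent channels, final latent resolution
      codebook = torch.randn(4096, C)                  # stand-in for the frozen VQVAE codebook
      phi = torch.nn.Conv2d(C, C, 3, padding=1)        # small conv applied after upsampling

      f_hat = torch.zeros(1, C, final_hw, final_hw)    # running sum of per-scale contributions
      for k in scale_sizes:
          # pretend the transformer predicted k*k token indices for this scale
          idx = torch.randint(0, codebook.shape[0], (1, k, k))
          z_k = codebook[idx].permute(0, 3, 1, 2)      # (1, C, k, k) token embeddings
          up = F.interpolate(z_k, size=final_hw, mode="bicubic", align_corners=False)
          f_hat = f_hat + phi(up)                      # later scales add residual corrections

      print(f_hat.shape)                               # torch.Size([1, 32, 16, 16])
      ```

      Because each scale only adds a residual on top of the running sum, a coarse-scale mistake can still be compensated for by finer scales.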

    • @hjups · 3 months ago +1

      @@keyutian Thanks for responding!
      1) That was originally unclear from your paper, given how section 3 begins. But I see now that you split the training approach into sub-headings. That certainly makes things easier to train!
      2) That depends on how MaskGiT is sampled, correct? You can choose to re-mask an already un-masked token at each step, which would allow the model to self-correct. I do not believe MaskGiT did that, but I think the Token-Critic paper proposed such behavior (although maybe they also respected the unmasked tokens). Regardless, re-masking does work during sampling for those types of models.
      My point about self-correction was that image models (at least for diffusion) tend to be highly biased toward the low-resolution structure. If an error is generated at a smaller scale, then it tends to propagate forward regardless of the conditioning methodology (e.g. in super-resolution models using concat conditioning or cross-attention). Reducing those errors before propagation would likely yield better results.
      On another note, what are your thoughts on VAR's FID behavior compared to the image quality, especially in the context of diffusion models?
      I am wondering if the use of the quantized latent space is both a blessing and a curse. A VAE and a VQVAE may be trained to achieve a similar rFID, but in practice it would be impossible to achieve that level with a VAE due to the continuous nature of the latent space (except in the case of pure reconstruction). However, a VQVAE has a finite number of codebook entries, meaning that the rFID score could be achieved if those tokens were generated exactly, making the problem easier.
      However, VQVAEs suffer from fine-detail noise / artifacts (VAEs do too, but not to the same degree), which come from quantizing the continuous image space. This appears to be a tradeoff between structural generation and fine image quality (which can be desirable in certain application spaces); however, the tradeoff is not properly captured by most fidelity metrics.
      But I did notice that VAR does not handle fine detail well, especially in the online demo - it is particularly apparent in high-noise classes like "Fountain".
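
      A small illustration of the vector-quantization step under discussion (made-up sizes, not the actual VQVAE code): each continuous latent vector is snapped to its nearest codebook entry, so the generator's targets form a finite set rather than a continuum.

      ```python
      import torch

      codebook = torch.randn(4096, 32)          # V codebook entries of dimension C (made-up sizes)
      latents = torch.randn(16 * 16, 32)        # continuous encoder output, flattened to tokens

      dists = torch.cdist(latents, codebook)    # (256, 4096) pairwise Euclidean distances
      idx = dists.argmin(dim=1)                 # nearest codebook entry for each latent vector
      quantized = codebook[idx]                 # the discrete reconstruction / generation target

      print(idx.shape, quantized.shape)         # torch.Size([256]) torch.Size([256, 32])
      ```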

  • @alexandernanda2261 · 5 months ago +1

    Best explanation on YouTube

  • @fusionlee844 · 4 months ago +1

    Thanks, I've been looking for new generative structures for research these days.

    • @gabrielmongaras · 4 months ago

      Yea it's always nice to see new generative structures other than the standard autoregressive next-token prediction model or diffusion model!

  • @bruceokla · 5 months ago +1

    Nice work

  • @lawrencephillips786 · 4 months ago +1

    Where are the actual learnable NN parameters in Algorithms 1 and 2? In the interpolation step? Also, you depict r1, r2, and so on as sets of 1, 2, and so on tokens, but shouldn't it be the square (1, 4, and so on)?

    • @NgOToraxxx · 4 months ago

      In the encoder and decoder, plus a few in the convs after the upscaling (phi).
      Yes, the token count is a square for each resolution scale, although the tokens of a scale are predicted in parallel.
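
      For concreteness, a quick count under an assumed scale schedule (the exact list may differ from the paper's configs):

      ```python
      scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]              # assumed token-map side lengths
      tokens_per_scale = [k * k for k in scales]               # 1, 4, 9, 16, 25, 36, 64, 100, 169, 256
      print(tokens_per_scale)
      print("total sequence length:", sum(tokens_per_scale))   # 680
      ```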

    • @lawrencephillips786 · 3 months ago

      @@NgOToraxxx Thanks! What do you mean by the square being predicted in parallel? Do you mean this as a causal masking step, like with GPT?

    • @NgOToraxxx · 3 months ago

      @@lawrencephillips786 There's causal masking, yes, but instead of predicting just one token from all previous ones, it predicts the tokens of an entire scale from the previous scales. And a scale can attend to itself too (it's initialized by upscaling the previous scale). There are more details in issue #1 on VAR's GitHub repo.
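
      A hedged sketch of that block-wise causal masking, as I read the description above (not code from the VAR repo): tokens may attend to their own scale and to all earlier scales, but never to later ones.

      ```python
      import torch

      scales = [1, 2, 3, 4]                  # tiny example: token-map side lengths
      lens = [k * k for k in scales]         # tokens per scale: 1, 4, 9, 16

      # scale_id[i] = index of the scale that token i belongs to
      scale_id = torch.cat([torch.full((n,), s) for s, n in enumerate(lens)])
      # allow attention from token i to token j whenever j's scale is not later than i's
      attn_mask = scale_id.unsqueeze(1) >= scale_id.unsqueeze(0)   # (30, 30) boolean mask

      print(attn_mask.shape)                 # torch.Size([30, 30])
      print(attn_mask[0].sum().item())       # the lone 1x1-scale token sees only itself: 1
      ```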

  • @xplained6486 · 4 months ago

    Isn't this basically a diffusion model, but instead of noising they do blurring (through downsampling) and try to revert the blur (instead of the noise in a DM)? And the vector quantization is similar to the one from Stable Diffusion, as far as I understand. But how does it compare to the general concept of score matching?

    • @gabrielmongaras · 4 months ago +2

      I like to think of diffusion models as reversing some sort of transformation (like in the Cold Diffusion paper). That's kind of why I think of this as similar to a diffusion process. Where diffusion reverses the process of corrupting an image with noise, this model reverses the process of reducing an image's resolution. However, the objective is significantly different: in diffusion, we train the model to predict all the noise, whereas here we train it to autoregressively predict the next step. The nice thing about the diffusion objective is that it allows an arbitrary number of steps for generation; this model does not, since it's forced to predict the next scale at each step.
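
      A loose illustration of that step-count point (`denoise_step` and `predict_scale` are hypothetical stand-ins, not real APIs): a diffusion sampler's number of steps is a free choice at inference time, while VAR's is fixed by its scale schedule.

      ```python
      def diffusion_sample(model, x_T, num_steps):
          # num_steps can be chosen freely at inference time
          x = x_T
          for t in reversed(range(num_steps)):
              x = model.denoise_step(x, t)     # hypothetical per-step denoiser
          return x

      def var_sample(model, scales=(1, 2, 3, 4, 5, 6, 8, 10, 13, 16)):
          # exactly one prediction per scale: the number of generation steps is fixed
          tokens = []
          for k in scales:
              tokens.append(model.predict_scale(tokens, k))  # hypothetical next-scale predictor
          return tokens
      ```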

  • @marinepower · 4 months ago +2

    This paper literally makes no sense. The whole point of autoregressive modeling is to independently sample one token at a time, and condition future tokens on past tokens. This method breaks all that by having the model sample hundreds (perhaps thousands) of tokens independently in one inference step, despite all tokens being in one large joint probability space. The only way this works is if you overtrain your model to such an absurd degree that there is basically no ambiguity anywhere in your generation step, and then you can simply take the argmax over every single logit and have it work out.
    And, if you look at their methodology, that's exactly what you find. This model was trained for **350** epochs over the same data. That is **absurd**. So yeah, don't expect this method to work unless you wildly overtrain it. It has some good ideas (e.g. hierarchical generation), but the rest of its claims are dubious at best.

    • @keyutian · 3 months ago +1

      I strongly disagree with this.
      First, if "independent generation" makes no sense, then that's basically saying BERT, MaskGIT, UniLM, and other such models make no sense, which is obviously not true.
      If you think about it, when VAR/BERT/MaskGIT/UniLM generates a bunch of tokens in parallel, these tokens **can attend to each other**, which may alleviate that ambiguity to a large extent.
      Second, for that 350-epoch training, well, DiT was trained on the same data for **1400** epochs.
      It's common to train a generative model on ImageNet for hundreds of epochs, because ImageNet is a small dataset by today's standards.

    • @hjups · 3 months ago

      Most models trained on benchmark datasets are over-trained; this is due to a combination of the smaller dataset sizes (as keyutian mentioned) and the minimum number of training samples required to establish coherent image features. For reference, ImageNet contains around 1.28M images, whereas SD1.5 trained on a 128M-image subset of LAION-5B (100x more images). While adding more images would be ideal, it's not possible with a fixed benchmark, whereas Stability could easily add more images to the training set (which they did for SDXL). That said, if you need proof that this works in practice, look at Pixart-alpha. They trained on only 28M images (which included ImageNet pre-training for ~240 epochs to establish the initial image structures and features).

    • @marinepower · 3 months ago

      @@keyutian Just because it works in practice doesn't mean it's good science. I think the entire field of ML is plagued with techniques that just barely work and aren't theoretically justifiable, but people use them anyway. Maybe it's unfair to single this paper out since others (BERT, DiT, etc.) do it too, but until papers are held to a higher standard, nothing will change.
      One can easily think of ways to improve sampling that don't need to do this (e.g. multiple predictions per pixel along with confidence values, or iterative partial sampling, where we commit to a set amount of data each iteration, perhaps based on the predicted confidence values, and predict an image over a set number of steps). These are pretty basic things that can be done, and yet no one does them because it's easier to just follow the herd and train for an absurd number of epochs.