Visual Autoregressive Modeling

  • Published Feb 4, 2025

Comments • 10

  • @arashakbari6986 · 27 days ago

    Enjoyed watching this! Gave me a lot of motivation! Keep doing these comprehensive reviews, man.

  • @minifull_ · 21 days ago · +1

    The story on Chinese social media (I believe it's true since I have some friends interning at ByteDance) is that Keyu Tian randomly corrupted gradients in others' training pipelines by exploiting a bug in Huggingface (already fixed, but present in earlier versions; he used it to inject code that attacked others' training). The random gradient corruption was set to trigger only when the code ran on a cluster with over 256 GPUs, so it was pretty hard to debug. This wasted others' time and GPU hours, since under the attack their training runs were nonsense.

  • @wolpumba4099 · 1 month ago · +2

    *Visual Autoregressive Modeling: A New Approach to Image Generation*
    * *4:13** Introduction:* The paper "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" introduces a novel method for image generation.
    * *4:21** Key Idea:* Unlike traditional autoregressive models that predict the next token in a sequence, this approach predicts the next *scale* or *resolution* of an image.
    * *6:28** Context:* The first author, a former intern at ByteDance, is involved in a legal dispute with the company regarding alleged disruption of internal model training.
    * *9:13** Performance:* The model achieves state-of-the-art results on the ImageNet 256x256 benchmark, particularly in Fréchet Inception Distance (FID) and Inception Score, with significantly faster inference speed.
    * *15:06** Traditional Approach:* Current methods typically convert images into a 1D sequence of tokens using a raster scan order, feeding them into models like transformers.
    * *17:48** Proposed Method:* This paper introduces a hierarchical, multi-scale approach, akin to how convolutional neural networks (CNNs) process images, eliminating the need for positional embeddings used in traditional models.
    * *19:13** Analogy to CNNs:* The multi-scale approach is analogous to how CNNs use receptive fields to progressively aggregate information across layers, a concept inspired by the human visual system.
    * *23:59** Advantages:* This approach offers better results, a well-written paper, and a conceptually simple yet effective idea, contributing to its recognition as the best paper at a major conference.
    * *27:43** Tokenization:* Uses a standard VQ-VAE (Vector Quantized Variational Autoencoder) to convert images into discrete tokens.
    * *41:35** Core Innovation:* The main innovation lies in how these tokens are processed - not in a linear sequence, but in a multi-scale hierarchy.
    * *54:22** Implementation Detail:* Different resolutions of the token map are achieved through interpolation, a technique to estimate values between known data points.
    * *56:05** Key Takeaway:* This method demonstrates that simpler, more intuitive approaches can outperform complex ones, and it is likely to be widely adopted in various applications, including image and video generation.
    * *59:08** Efficiency:* Parallel processing at each resolution level, similar to how CNNs operate on GPUs, leads to a 20x speedup compared to traditional autoregressive models.
    * *1:01:49** Complexity Analysis:* The time complexity is reduced from O(n^6) for traditional models to O(n^4) for the new approach, making it more scalable.
    * *1:02:45** Shared Codebook:* Interestingly, the same vocabulary (codebook) of tokens is used across all scales, which is counterintuitive but contributes to the model's effectiveness.
    * *1:12:55** Scaling Laws:* The paper demonstrates scaling laws, meaning that increasing model size predictably improves performance, a crucial property for training larger and more powerful models.
    * *1:20:23** Conclusion:* The paper's success is attributed to both luck (choosing the right idea) and skill (well-written paper, good figures, and strong results).
    * *1:33:16** Complexity Proof:* The video discusses the mathematical proof of the model's time complexity, highlighting the clever use of geometric series to simplify the analysis.
    * *1:39:31** Limitations:* The discussion acknowledges the limitations of current language models in understanding and reasoning about the physical world, as exemplified by the "mosquito test."
    * *1:42:41** Future Work:* Potential future directions include improving the tokenizer, applying the method to text-to-image and video generation, and exploring its use in other domains beyond images.
    I used gemini-exp-1206 on rocketrecap dot com to summarize the transcript.
    Input tokens: 42402
    Output tokens: 868
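The coarse-to-fine scheme described in the summary above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scale schedule, the `predict` stand-in, and all function names here are hypothetical. It shows the two claims from the summary: each scale is generated in parallel conditioned on an interpolated (upsampled) previous scale, and the total token count across scales is a geometric series, so it stays O(n^2) and one attention pass over it costs O(n^4).

```python
def multiscale_sides(n, gamma=2):
    """Token-map side lengths from coarse (1x1) to fine (n x n),
    shrinking by a factor of gamma per scale (hypothetical schedule)."""
    sides = []
    s = n
    while s >= 1:
        sides.append(s)
        s //= gamma
    return sides[::-1]  # coarse to fine

def total_tokens(sides):
    """Total tokens across all scales: a geometric series bounded by
    n^2 * gamma^2 / (gamma^2 - 1), i.e. O(n^2) overall."""
    return sum(s * s for s in sides)

def upsample_nearest(tokmap, new_side):
    """Nearest-neighbor interpolation of a square token map (list of lists)."""
    old = len(tokmap)
    return [[tokmap[r * old // new_side][c * old // new_side]
             for c in range(new_side)] for r in range(new_side)]

def generate(sides, predict):
    """Coarse-to-fine generation: each scale's full token map is produced
    in one step, conditioned on the previous scale upsampled to the new size."""
    prev = None
    for side in sides:
        cond = upsample_nearest(prev, side) if prev is not None else None
        prev = predict(cond, side)  # all side*side tokens in parallel
    return prev
```

For n = 16 this gives sides [1, 2, 4, 8, 16] and 1 + 4 + 16 + 64 + 256 = 341 tokens, under the bound 256 · 4/3 ≈ 341.3 — contrast with plain token-by-token AR, which needs O(n^2) sequential steps over the same n^2 tokens.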

  • @Elikatie25 · 1 month ago

    3:18 Start of the stream

  • @xx1slimeball · 1 month ago · +2

    cool vid, cool paper

  • @MilesBellas · 1 month ago

    22:00
    Hubel and Wiesel implanted electrodes but didn't "murk" cats.
    Correct?
    MIT Course on CNN by EVA 6.S191 in 2018 = great video

  • @thivuxhale · 1 month ago

    1:48:20 If we already have enough innovations in research to reach AGI, would I make a bigger impact going into industry rather than research? It feels like in research these days you have a really small chance of creating something impactful and fundamental; most research is incremental.

  • @deathfighter1111 · 1 month ago · +1

    In equation 18, the notation is wrong: when passing from n_i to a_k, it should be a_i, so the summation uses the wrong index — the author made a mistake with the notation.

  • @EobardUchihaThawne · 1 month ago

    How does it handle image inputs? I saw in the figure they show the sequence as
    s e1 1 2 3 4 e2 1 2 .... 9 .... — is it flattening the image?
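The sequence in that figure appears to be each scale's token map flattened in raster order and concatenated, with s/e1/e2 acting as start/scale markers (so 1 2 3 4 is a 2x2 map and 1 .. 9 the following 3x3 map). A minimal sketch of that layout, with a hypothetical helper name:

```python
def flatten_scales(token_maps):
    """Concatenate raster-flattened square token maps, one contiguous
    block per scale (coarse to fine): a 2x2 map followed by a 3x3 map
    yields 4 + 9 = 13 tokens in a single sequence."""
    seq = []
    for tm in token_maps:
        seq.extend(tok for row in tm for tok in row)
    return seq
```

So the image is not flattened token-by-token across the whole resolution; each scale is flattened separately, and the model predicts one whole scale block at a time.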