Computer Vision Meetup: No "Zero-Shot" Without Exponential Data
ฝัง
- เผยแพร่เมื่อ 9 ก.พ. 2025
- Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during “zero-shot” evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets?
Through thorough experiments, we consistently find that, far from exhibiting “zero-shot” generalization, multimodal models require exponentially more data to achieve linear improvements in downstream “zero-shot” performance, following a sample inefficient log-linear scaling trend. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. Taken together, our study reveals an exponential need for training data which implies that the key to “zero-shot” generalization capabilities under large-scale training paradigms remains to be found.
Read the paper: arxiv.org/abs/...
#computervision #ai #artificialintelligence #machinevision #machinelearning #datascience #NeurIPS #NeurIPS2024