Haha, somehow I knew/feared this already... because every single VLM (even the best ones) I've tried in the past MISERABLY FAILED at describing the Venn diagram from the paper "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning" and at answering questions about it. Especially when it came to counting, colors, and intersections.
And yet, show it my old jetty bridge in need of repair, and it describes it in great detail: the surroundings, and the fact that it looks like it needs repairing?
Thank you for the video. I think some of the tests were very valid, but others might have significant issues.
When I asked GPT-4o to identify the circled letters in "Acknowledgement", "Subdermatoglyphic", and "tHyUiKaRbNqWeOpXcZvM", I randomly circled 4 letters in red and copy-pasted them over.
It got 100% correct.
However, it didn't handle "touching" and "overlapping" circles perfectly at first. For example, in one image it claimed that two overlapping circles were "touching" but "not overlapping". When I asked it why, it changed its mind and said that, on "re-examining", it confirmed they were both overlapping and touching.
Note that after the first bad attempt and correction, it did much better and identified 3 in a row.
So I think "one-shotting" the definitions of "overlapping" and "touching", without providing examples of what we consider touching vs. not touching and overlapping vs. not overlapping, is a bit of a stretch. Remember that technically two circles can only "touch" at an infinitely small point in space, which means the model has to widen that criterion for us humans. Without training, how much slack it allows for "touching but not overlapping" could be considerable.
Additionally, when I asked it to identify 3 different shapes filled with 3 different colors, it got 100% right.
When asked to count 3 overlapping circles vs. 5 overlapping circles, it got 100% correct in my test.
Unfortunately, GPT-4o got intersecting lines totally wrong, though. I could analyze this more later, but it will take time to discern why it gets this wrong.
Overall, I think GPT-4o has better "vision" than claimed... or at least has "improved" since this paper was released.
That's amazing that you put them through those tests!
@@1littlecoder LoL! It only takes a few minutes to make a simple diagram... but, thanks!
I've had much better luck myself, but I'm grateful to see these failures so I can be more cautious. Thank you.
4:35 The primary reason this task is easier is that there are only 2 possible answers, so random guessing alone already gets 50%.
After watching this video, I think we should forget AGI. If algorithmic improvements don't start happening at a blistering pace, we will soon end up in an AI winter as investors lose confidence in these large companies' ability to deliver a digital mind.
I tend to agree. But then, maybe research was focused on language for too long and needs to catch up on vision now...
AGI was planted in our minds by big companies, because they can get more funding from investors if they say we are getting close to AGI. It's all about money, baby.
Seems like LLMs never properly learned shapes and lines...
Human perception is based on lines and shapes; we are born with it.
Feeding in a bunch of labeled pictures is not a proper learning scheme.
I think the problem is just finding the proper learning techniques.
At times, model development seems to chase these tests. What they fail at now, they might soon be made to do tomorrow. But calling them blind is an overstatement!
True that❤
All it proves is that Claude 3.5 is the leading VLM and LLM
It took over a year before I swapped my main subscription from GPT to Claude; 3.5 Sonnet is the king right now.
Haha. Thanks for the video. Really needed someone to share my opinion with.
What if these tasks can be achieved after specific fine-tuning? And with this thought in mind, I wonder: if we look at the main idea of neural networks, it's just a bunch of mathematical operations with a very specific set of weights. If by fine-tuning we change those weights, doesn't that mean we are changing numbers that performed well for some other task, and now they won't? Is there any research done in this direction? Does fine-tuning disturb a model's previous capabilities?
That's also my question. Does fine-tuning mess up the previous capabilities of an LLM?
Yes, fine-tuning does degrade previous capabilities to some extent (this is known as catastrophic forgetting). I've heard about a way to get around that using another type of neural network; I don't remember the name exactly, but I think they're called Kolmogorov-Arnold networks.
@@free_thinker4958 ohh should check them out.
Failure modes aside, I am frequently completely amazed at what vision models *are* able to perceive! It still seems like magic to me. I think video and synthetic visual training data is going to improve vision model performance in the long term... Remember, this tech is the worst it will ever be!
Wow! I've been doing the grid counting test on a lot of vision models. They all suck at it, and they are selectively blind depending on their training data. For example, Phi-3 Vision is GREAT at OCR and handwriting recognition -- enough so that it and coming open-source vision models will disrupt the OCR market.
LLMs can't count
❤
Hey, I've been following you for a long time and wanted to ask this: I'm into robotics research, but I don't really see any future in India. Do you think people really hire undergrads in the AI field? Will the AI bubble burst?
Fascinating research! Combining all four models yields nearly perfect answers most of the time, indicating that at least one model typically provides the correct answer. Contrary to what I expected, uniting them results in almost always obtaining the right answer from at least one model. This suggests the potential for creating a semantic router: leveraging synthetic data to train the router to select the best model for each image and question, especially when embedding both images and text. Perhaps (not sure), by using this router with the four models collectively, high accuracy can be achieved without fine-tuning any vision language models, similar to the approach taken by LMSYS. Impressive findings - thank you for sharing!
Yeah, honestly, it's a fascinating takeaway for me how the models complement each other. That can make devs' lives easier.
@@1littlecoder Very true. One other thing: while skimming the paper (just the first 20 pages), I noticed they didn't cover the various image encoding techniques used in the 4 models, which I find crucial. Additionally, there is no detailed analysis of how model architecture influences performance on these tasks. Tbh it's too soon to say anything; their appendix alone is 40 pages :)) I'll need to delve into the entire paper. It's a nice research idea.
You don't know the correct answer. The failures, if combined, are even more miserable: majority vote gets only 4 of 42 answers correct, with 9 ties and 29 wrong.
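For reference, "majority vote" over the four models' answers can be sketched roughly like this (the tie handling is my assumption of how ties were counted):

```python
from collections import Counter

def majority_vote(answers):
    """answers: one answer string per model (e.g. four VLM outputs).
    Returns the most common answer, or None when the top answers tie."""
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place -> no majority winner
    return ranked[0][0]
```

So with four models, two right answers can still lose to two agreeing wrong ones, which is how combining failures can be worse than the best single model.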
@@dove8998 What I mean is that we generate a synthetic dataset where we know the correct answer, then train a semantic router on top of that. The router takes an image and text as input and outputs which model to choose for a higher chance of getting the correct answer.
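A minimal sketch of that router idea (the model names, toy embeddings, and nearest-centroid rule are all placeholders; a real router would use learned image+text embeddings and a proper classifier):

```python
MODELS = ["gpt-4o", "gemini-1.5", "sonnet-3.5", "sonnet-3"]  # placeholder names

def train_router(dataset):
    """dataset: list of (embedding, best_model) pairs from a synthetic set
    where the correct answer (and hence the best model) is known.
    Returns one centroid embedding per model."""
    sums, counts = {}, {}
    for emb, model in dataset:
        acc = sums.setdefault(model, [0.0] * len(emb))
        for i, x in enumerate(emb):
            acc[i] += x
        counts[model] = counts.get(model, 0) + 1
    return {m: [x / counts[m] for x in s] for m, s in sums.items()}

def route(centroids, emb):
    """Route a query to the model whose centroid is closest to its embedding."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda m: sq_dist(centroids[m], emb))
```

The point is just that routing is an ordinary supervised problem once the synthetic labels exist; no VLM gets fine-tuned.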
Isn't this mainly a training problem? I don't believe the models are inherently unable to do these tasks.
Using Sonnet 3.5, I've been very impressed with the practical applications.
Seems like LLMs never properly learned shapes and lines...
Human perception is based on lines and shapes; we are born with it.
Feeding in a bunch of labeled pictures is not a proper learning scheme.
I think the problem is just finding the proper learning techniques.
It could also be a lack of training data. We humans take in petabytes of data through our eyes; I don't think LLMs have enough training data either.