Haha, somehow I knew/feared this already... because every single VLM (even the best ones) I've tried in the past MISERABLY FAILED at describing the Venn diagram from the paper "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning" and at answering questions about it. Especially when it came to counting, colors, and intersections.
And yet, show it my old jetty bridge in need of repair, and it describes it in great detail: the surroundings, and the fact that it looks like it needs repairing?
Thank you for the video. I think some of the tests were very valid, but others might have significant issues.
When I asked GPT-4o to identify the circled letters in "Acknowledgement", "Subdermatoglyphic", and "tHyUiKaRbNqWeOpXcZvM", I randomly circled 4 letters in red and copy-pasted them over.
It got 100% correct.
However, it didn't handle "touching" and "overlapping" circles perfectly at first. For example, in one image it claimed that two overlapping circles were "touching" but "not overlapping". When I asked it why, it changed its mind and said that, on "re-examining", it confirmed they were both overlapping and touching.
Note that after the first bad attempt and correction, it did much better and identified 3 in a row.
So I think "one-shotting" the definitions of "overlapping" and "touching", without providing examples of what we consider touching vs. not touching and overlapping vs. not overlapping, is a bit of a stretch. Remember that technically two circles can only "touch" at an infinitely small point in space, which means the model has to widen that criterion for us humans. Without training, how much slack it allows for "touching but not overlapping" could be considerable.
Additionally, when I asked it to identify 3 different shapes filled with 3 different colors, it got 100% right.
When asked to count 3 overlapping circles vs. 5 overlapping circles, it got 100% correct in my test.
Unfortunately, GPT-4o got intersecting lines totally wrong, though. I could analyze this more later, but it will take time to discern why it gets this wrong.
Overall, I think GPT-4o has better "vision" than claimed... or at least has "improved" since this paper was released.
That's amazing that you put them through those tests!
@@1littlecoder LoL! It only takes a few minutes to make a simple diagram... but, thanks!
I've had much better luck myself, but I'm grateful to see these failures so I can be more cautious. Thank you.
4:35 The primary reason this task is easier is that there are only 2 possible answers, so random guessing alone already gets 50%.
After watching this video, I think we should forget AGI. If algorithmic improvements don't start happening at a blistering pace, we will soon end up in an AI winter as investors lose confidence in these large companies' ability to deliver a digital mind.
I tend to agree. But then, maybe research was focused on language for too long and needs to catch up on vision now...
AGI was planted in our minds by big companies, because they can get more funding from investors if they say we are getting close to AGI. It's all about money, baby.
Seems like LLMs never properly learned shapes and lines...
Human perception is based on lines and shapes; we are born with it.
Feeding in a bunch of labeled pictures is not a proper learning scheme.
I think the problem is just finding the proper learning techniques.
At times, model development seems to chase these tests. What they fail at now, they might soon be made to do tomorrow. But calling them blind is an overstatement!
True that❤
All it proves is that Claude 3.5 is the leading VLM and LLM
It took over a year before I swapped my main subscription from GPT to Claude; 3.5 Sonnet is the king right now.
Haha. Thanks for the video. Really needed someone to share my opinion with.
What if these tasks can be achieved after specific fine-tuning? And with this thought in mind, I wonder: if we look at the main idea of neural networks, it's just a bunch of mathematical operations with a very specific set of weights. If by fine-tuning we change those weights, doesn't that mean we are changing numbers that performed well for some other task, and now they won't? Is there any research done in this direction? Does fine-tuning disturb a model's previous capabilities?
That's also my question. Does fine-tuning mess up the previous capabilities of an LLM?
Yes, fine-tuning does degrade previous capabilities to some extent (this is known as catastrophic forgetting). I've heard about a way to get around that using another type of neural network; I don't remember the name exactly, but I think they're called Kolmogorov-Arnold networks.
@@free_thinker4958 ohh should check them out.
Failure modes aside, I am frequently completely amazed at what vision models *are* able to perceive! It still seems like magic to me. I think video and synthetic visual training data is going to improve vision model performance in the long term... Remember, this tech is the worst it will ever be!
Wow! I've been doing the grid counting test on a lot of vision models. They all suck at it, and they are selectively blind depending on their training data. For example, Phi-3 Vision is GREAT at OCR and handwriting recognition -- enough so that it and coming open-source vision models will disrupt the OCR market.
LLMs can't count
❤
Hey, I've been following you for a long time and wanted to ask this: I'm into robotics research, but I don't really see any future in India. Do you think people really hire undergrads in the AI field? Will the AI bubble burst?
Fascinating research! Combining all four models yields nearly perfect answers most of the time, indicating that at least one model typically provides the correct answer. Contrary to what I expected, uniting them results in almost always obtaining the right answer from at least one model. This suggests the potential for creating a semantic router: leveraging synthetic data to train the router to select the best model for each image and question, especially when embedding both images and text. Perhaps (not sure), by using this router with the four models collectively, high accuracy can be achieved without fine-tuning any vision language models, similar to the approach taken by LMSYS. Impressive findings - thank you for sharing!
Yeah, honestly, it's a fascinating takeaway for me how the models complement each other. That can make devs' lives easier.
@@1littlecoder Very true. One other thing: while skimming the paper (just the first 20 pages), I noticed they didn't cover the various image encoding techniques used in the 4 models, which I find crucial. Additionally, there is no detailed analysis of how model architecture influences performance on these tasks. Tbh it's too soon to say anything; their appendix alone is 40 pages :)) I'll need to delve into the entire paper. It's a nice research idea.
You don't know the correct answer. The failures, if combined, are even more miserable: majority vote gets only 4 of 42 answers correct, with 9 ties and 29 wrong.
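For reference, "majority vote" over the four models' answers can be sketched roughly like this (the tie handling is my assumption of how ties were counted):

```python
from collections import Counter

def majority_vote(answers):
    """answers: one answer string per model (e.g. four VLM outputs).
    Returns the most common answer, or None when the top answers tie."""
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place -> no majority winner
    return ranked[0][0]
```

So with four models, two right answers can still lose to two agreeing wrong ones, which is how combining failures can be worse than the best single model.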
@@dove8998 What I mean is that we generate a synthetic dataset where we know the correct answer, then train a semantic router on top of that. The router takes an image and text as input and outputs which model to choose for a higher chance of getting the correct answer.
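A minimal sketch of that router idea (the model names, toy embeddings, and nearest-centroid rule are all placeholders; a real router would use learned image+text embeddings and a proper classifier):

```python
MODELS = ["gpt-4o", "gemini-1.5", "sonnet-3.5", "sonnet-3"]  # placeholder names

def train_router(dataset):
    """dataset: list of (embedding, best_model) pairs from a synthetic set
    where the correct answer (and hence the best model) is known.
    Returns one centroid embedding per model."""
    sums, counts = {}, {}
    for emb, model in dataset:
        acc = sums.setdefault(model, [0.0] * len(emb))
        for i, x in enumerate(emb):
            acc[i] += x
        counts[model] = counts.get(model, 0) + 1
    return {m: [x / counts[m] for x in s] for m, s in sums.items()}

def route(centroids, emb):
    """Route a query to the model whose centroid is closest to its embedding."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda m: sq_dist(centroids[m], emb))
```

The point is just that routing is an ordinary supervised problem once the synthetic labels exist; no VLM gets fine-tuned.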
Isn't this mainly a training problem? I don't believe the models are inherently unable to do these tasks.
Using Sonnet 3.5, I've been very impressed with the practical applications.
Seems like LLMs never properly learned shapes and lines...
Human perception is based on lines and shapes; we are born with it.
Feeding in a bunch of labeled pictures is not a proper learning scheme.
I think the problem is just finding the proper learning techniques.
It could also be a lack of training data. We humans take in petabytes of data through our eyes; I don't think LLMs have enough training data either.