I tested it with 6 languages with the Talk to Gemini feature. It can seamlessly switch between languages and although the accent in some languages is not perfect, it works insanely well!
Turning an image into an open book is amazing. What we have today with language models is the natural progression of what started with RNNs, then transformers. Over time, things improved: better architectures, scaling laws, larger datasets, and now we have these sophisticated language models. It's a gradual evolution.
But for images, it's a completely different story. This isn't an extension of conventional image processing techniques like classification, object detection, or segmentation. It's something entirely new. The process essentially transforms an image into text, enabling us to dig in, ask questions, or extract information natively, using the same model that processes text and audio. Everything becomes a text sequence.
What's fascinating is that it bypasses all the classical image processing methods: no need for specialized data preparation, binarization, or other traditional steps. It's a totally different solution to the problem, redefining how we process and understand images. This shift is what truly amazes me; it's not just an improvement, but a fundamental change.
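The "ask questions of an image natively" workflow described above can be sketched with the `google-generativeai` Python SDK. This is a minimal illustration, not anyone's production setup: the model name, image filename, and the `build_request` helper are all placeholder assumptions, and running it requires your own API key.

```python
# Minimal sketch: ask a question about an image with the Gemini API.
# Assumes `pip install google-generativeai pillow` and a real API key in
# the GOOGLE_API_KEY environment variable. The model name and image path
# are illustrative placeholders, not taken from the thread.
import os


def build_request(image, question: str) -> list:
    # Gemini accepts mixed-modality content as an ordered list of parts;
    # here: the image first, then the text question about it.
    return [image, question]


def main() -> None:
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-2.0-flash-exp")
    img = Image.open("scanned_page.png")
    # One call handles the image and the question together -- no OCR,
    # binarization, or other classical preprocessing steps.
    response = model.generate_content(build_request(img, "What does this page say?"))
    print(response.text)


# Only reach the network when a key is actually configured.
if __name__ == "__main__" and "GOOGLE_API_KEY" in os.environ:
    main()
```

The same `generate_content` call accepts text-only or mixed text-and-image input, which is the "everything becomes one sequence" point the comment is making.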
I agree! Having worked with CNNs in the early days, even making a cat-vs-dog classifier felt like magic without hand-written features. This is a whole new level. A single model that can understand different modalities unlocks applications that were not possible before.
Very useful!
Would be nice to see how it compares to Sonnet 3.5. Gemini seems to score higher on various benchmarks, but I'd like to see real problem solving in different fields and how closely it follows instructions.
working on it :)
They cooked, and this time it's tasty.
Very interesting
Can you make a video on how to use the Gemini 2.0 API key for our own text-to-speech and speech-to-text conversation?
Multimodal V2LMs are the way
agree
Is it just me, or has this guy's voice changed?
Just tried ChatGPT with vision; it's so much better than this garbage it's not even funny. OpenAI stays winning.
Bot deployed by OpenAI?
@sillybilly346 No, it's just advanced voice mode with vision; it's f****** insane.
gemini > chatgpt now
Ya , we can hope 😂😂