People keep underestimating what AI can do and what it will do. I'd say it's good not to buy into the hype, but totally dismissing it, or treating it as NFT-style hype, is flawed.
Somewhat misleading research: it evaluates hallucinations only on simple tasks (summarise, report, needle in a haystack, etc.). I subscribed to Google's premium plan for access to the latest Gemini models after constantly running out of messages on Claude Pro, and ALL of Gemini's models are poor at reasoning. I still prefer building with Claude or GPT and implementing procedures that evaluate the groundedness of the generated answer, rather than using any of the Gemini models. This "research" feels like a "special olympics" competition organised by mom and pop.
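The groundedness check mentioned above can be sketched roughly like this. This is a minimal illustrative heuristic (word overlap between answer sentences and the source text), not any vendor's API; the function names and the 0.5 threshold are assumptions for the example.

```python
import re

def content_words(text):
    """Lowercase alphabetic tokens longer than 3 chars (crude stopword filter)."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def groundedness(answer, source, threshold=0.5):
    """Return (score, ungrounded_sentences) for an answer against its source.

    A sentence counts as grounded when at least `threshold` of its content
    words also appear in the source text. Hypothetical sketch, not a real
    evaluation library.
    """
    src = content_words(source)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    ungrounded = []
    for s in sentences:
        words = content_words(s)
        overlap = len(words & src) / len(words) if words else 1.0
        if overlap < threshold:
            ungrounded.append(s)
    score = 1 - len(ungrounded) / len(sentences) if sentences else 1.0
    return score, ungrounded

source = "Gemini 2.0 Flash is the model behind NotebookLM."
good = "NotebookLM is powered by the Gemini Flash model."
bad = "The answer cites a quarterly revenue figure nowhere in the text."
print(groundedness(good, source))   # high score, no flagged sentences
print(groundedness(bad, source))    # low score, sentence flagged
```

In a real pipeline one would swap the word-overlap heuristic for an NLI model or an LLM-as-judge prompt, but the overall shape (split the answer into claims, check each claim against the retrieved context, flag the unsupported ones) stays the same.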
NotebookLM uses Gemini 2.0 Flash experimental. Good to know.
Haha well said
The idea is great.
The benchmark is much needed.
But the whole research seems ungrounded ;)