I just attended a detailed anatomy-of-LLMs session... and it's just wow! Nobody else is covering these details. Thanks very much Chris ❤
Glad it was useful. I skipped a lot of details because I wanted to keep the focus on MHA vs GQA. I'll probably do some other videos on the other details.
Great video! I didn't understand it fully and had to watch it again, but I'm getting an idea of what is happening! Thank you!
It was quite a tough one to record, as I'm trying to avoid fully explaining the entire transformer architecture and attention (I'll do that in another video), while still showing enough of how this architectural change affects a model's output. It was a weird balance, and apologies if I didn't explain it enough.
LLaMA 2 70B uses GQA (only the smaller 7B and 13B versions use MHA)
fair point
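For anyone wondering what the MHA vs GQA difference actually looks like in code: in GQA, a group of query heads shares a single key/value head, whereas in MHA every query head gets its own. Here's a rough PyTorch sketch (the head counts and weight shapes are illustrative, not the real LLaMA 2 configuration):

```python
# Rough sketch of grouped-query attention (GQA), assuming PyTorch.
# n_kv_heads == n_q_heads gives plain MHA; n_kv_heads == 1 gives MQA.
import torch

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    B, T, _ = x.shape
    d_head = wq.shape[1] // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads sharing each K/V head

    # Project and split into heads: Q gets n_q_heads, K/V only n_kv_heads
    q = (x @ wq).view(B, T, n_q_heads, d_head).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, d_head).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, d_head).transpose(1, 2)

    # Broadcast each K/V head across its group of query heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    att = torch.softmax(q @ k.transpose(-2, -1) / d_head**0.5, dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, -1)

# Illustrative shapes: 8 query heads sharing 2 K/V heads
B, T, D, Hq, Hkv = 2, 8, 64, 8, 2
x = torch.randn(B, T, D)
wq = torch.randn(D, D)
wk = torch.randn(D, Hkv * D // Hq)
wv = torch.randn(D, Hkv * D // Hq)
print(grouped_query_attention(x, wq, wk, wv, Hq, Hkv).shape)  # (2, 8, 64)
```

The practical win is that the KV cache shrinks by the group factor, which is part of why the larger LLaMA 2 models adopted it.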
Interesting!
Claude 3.5 Sonnet is definitely great for code, much better than GPT-4o, and it has really helped me solve things that are well beyond my brain capacity in the last few days.
Totally agree, much better for code than GPT-4o
Brilliant!
This was very interesting
Glad you enjoyed it, definitely a fun rabbit hole
Excellent content! Thanks!
Glad you liked it!
Sorry, I watched this all the way through, but I don't think you ever gave much support for your claim that grouped-query attention caused what you and your GPT-4 prompt ranked as worse outputs. At best, you made a case for a correlation: many newer models that adopt techniques like GQA score worse under your metric.
Even if the correlation is real, how do you demonstrate that the cause is GQA and not other factors those same models have all adopted, such as fine-tuning on synthetic data or instruction tuning (e.g. perhaps the answers you are judging as worse are the result of optimising for certain LLM benchmark scores)?
Look at me when you talk to me, boy. Look AT ME
You shy away too much. Love it
Thanks, it really helped with my presentation
Hahaha, yeah, I'm really bad at that sometimes
Intel agencies are having their fill first. It's obviously being slowed down so three-letter agencies can get ahead of this.
Lol, I'm sure three-letter agencies are having their say, but I suspect it's not about MHA vs GQA. I'd love to hear that conversation if it were, though.
I believe 4o's judging is only 90% accurate
Interesting, where did you get that info from?