This is exactly the kind of breakdown of this paper I was looking for as I am just getting into the deep end of neural audio generation. The explanation of RVQ is very helpful. Thank you for making these videos!!
Hi, what I am still wondering is why we can't just use teacher forcing along the codebook dimension? Thanks.
What an amazing video, finally I fully understood the paper! Thank you so much!
Thank you so much! Your explanation is so clear, and it is exactly what I'm looking for.
Fantastic overview Gabriel, thanks for sharing
Glad you found it helpful!
Thanks! Just what I needed.
Great video! Is the transformer essentially predicting the indices of each codebook?
Amazing overview
Nicely Done!
nice illustration
@gabrielmongaras: Hi, thanks very much for the very detailed and easy-to-understand explanation. Just one thing: it isn't super clear to me from the video why the error compounds with the parallel pattern. Thanks again!
Great explanation of the paper and of the audio encoding and decoding. But where does the text prompt fit into this?
Where do the text tokens fit in?
MusicGen experiments with three different text encoders, but the method of conditioning the transformer is the same for all of them. Generally, when using a transformer, you can just condition it with a cross-attention block, as in the original "Attention Is All You Need" paper.
@@gabrielmongaras The model is supposed to accept text as input, right? When it generates music based on a text prompt, what is the model's input?
The text would be put through some sort of encoder transformer architecture, and the output of that encoder would then be used to condition the main audio generation model through cross-attention (see the short sketch after this thread). The audio is generated the same way, just with this extra conditioning.
@@gabrielmongaras OK, I understand the conditioning, but without an audio input in inference mode, what do we add the text-conditioning embedding to? Maybe a silly question, I don't know.
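To make the cross-attention conditioning described in this thread concrete, here is a minimal PyTorch sketch (my own illustration, not the official MusicGen code; the hidden size, head count, and tensor shapes are assumptions). The encoded text supplies the keys and values, the audio-token hidden states supply the queries, and at inference decoding starts from a start-of-sequence embedding rather than from any audio input, so the text embedding is consumed through attention rather than being added to an audio input.

import torch
import torch.nn as nn

d_model, n_heads = 512, 8                  # illustrative sizes, not the paper's

# Cross-attention block: audio hidden states attend to the encoded text prompt.
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Stand-in for a frozen text encoder (e.g. T5): one vector per text token.
text_emb = torch.randn(1, 12, d_model)     # (batch, text_tokens, d_model)

# At inference there is no audio yet: decoding begins from a single
# start-of-sequence embedding and grows autoregressively from there.
audio_hidden = torch.randn(1, 1, d_model)  # (batch, generated_steps, d_model)

# Queries come from the audio stream, keys/values from the text encoding,
# so the text conditions generation without being added to the audio tokens.
conditioned, _ = cross_attn(query=audio_hidden, key=text_emb, value=text_emb)
print(conditioned.shape)                   # torch.Size([1, 1, 512])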
Hey, if the frame rate is equal to 50*64, does that mean that each second the model needs to predict 50*64 values (without counting the dim)?
The model works with the codebook indices, no?
Where do you get the 64? If we have e.g. 4 codebooks, and there are 50 time steps per one second of audio, we need to predict 4 * 50 values per second (4 codebook indices per time step), I believe.
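As a quick sanity check on the numbers in this thread, here is the per-second arithmetic using the illustrative values above (4 codebooks, 50 time steps per second); each predicted value is just an integer index into a codebook, not a full embedding vector.

codebooks = 4        # number of residual codebooks, as in the example above
frame_rate = 50      # frames (time steps) per second of audio

# One discrete index is predicted per codebook per time step,
# so the transformer predicts frame_rate * codebooks indices each second.
indices_per_second = frame_rate * codebooks
print(indices_per_second)   # 200, i.e. 4 * 50 per the reasoning above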