Waiting for part 3 of the course!! Thanks!!
This was a great tutorial on applying Transformers for Symbolic Music Generation! Can you please share pointers on how it can be done for raw audio directly?
thanks a lot for providing this code for free! And for all the comments! really helpful ❤
Thank you very much Valerio, fantastic work.
amazing content, keep up the good work bro!
Once again, a wonderful video. I am always infinitely grateful to Valerio for making this knowledge more engaging and digestible. I have two questions that might be a bit naive:
1. In the 'sinusoidal_position_encoding' function, shouldn't the sines and cosines be interleaved?
2. I understand that during training, it would be redundant to use the same input for both the encoder and decoder. But, in this case, doesn't it create a discrepancy between training and inference? Because in the predictions at 51:05, the same input is used for both.
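On question 1: interleaving the sines and cosines would only permute the embedding columns, so the model can learn the same positional information from either layout. A minimal NumPy sketch of the concatenated layout, assuming the standard formula (the exact function in the video may differ in details):

```python
import numpy as np

def sinusoidal_position_encoding(num_positions, d_model):
    # Standard "Attention Is All You Need" angle rates.
    positions = np.arange(num_positions)[:, np.newaxis]   # (num_positions, 1)
    dims = np.arange(d_model // 2)[np.newaxis, :]         # (1, d_model // 2)
    angles = positions / np.power(10000.0, (2.0 * dims) / d_model)

    # Concatenated layout: first half sines, second half cosines.
    # The interleaved layout [sin, cos, sin, cos, ...] is just a
    # permutation of these columns, so it carries the same information.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```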
Awesome video Valerio!
Question: I'm using this Transformer architecture with model.save() and keras.models.load_model() so that I can quickly test the model on many different starting sequences. However, I'm running into some issues with the Transformer's call() function that we are overriding - it is preventing me from successfully calling load_model() because it says it is getting unexpected arguments to the call() function. All I'm trying to do is essentially, more or less, "pickle" (or save) this model to a file and then run it on many different starting note sequences to test it.
Have you run into this sort of issue before? Thanks!
Hey, I'm encountering the same issue, did you find out how to resolve it? Thanks!
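Not from the video, but a common workaround when a subclassed model's custom call() signature breaks keras.models.load_model() is to rebuild the architecture in code and save/restore only the weights. A hedged sketch; Transformer(...), the dummy shapes, and the call signature below are placeholders for whatever your own code uses:

```python
import tensorflow as tf

# After training, persist just the variables:
#   model.save_weights("transformer.weights.h5")

# Later, rebuild the model with the SAME hyperparameters used in training
# (Transformer is the class from the tutorial; arguments omitted here),
# run one forward pass so the variables get created, then restore them.
model = Transformer(...)                          # placeholder hyperparameters
dummy_enc = tf.zeros((1, 10), dtype=tf.int64)     # illustrative shapes only
dummy_dec = tf.zeros((1, 10), dtype=tf.int64)
_ = model(dummy_enc, dummy_dec, training=False)   # adapt to your call() signature
model.load_weights("transformer.weights.h5")
```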
Hi Valerio, thanks a lot for the awesome content!
A question about the train step function (around 1:00:00): could the target_input sequence be the same as the encoder_input one, and only the target_real sequence be shifted by 1 token?
I mean, to follow your example, I would do encoder_input = target_input = [1, 2, 3, 4] and target_real = [2, 3, 4, 5]? It would be more consistent with the inference phase, in which both the encoder and decoder inputs are the same? Thanks again!
It would be redundant. If you go for that approach, it may be best to just have a decoder-only architecture. We use the encoder to condition the generation on something different from the target input. In the simple case of the video, this coincides with the shifted value of the sequence.
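For readers following the train-step discussion, a toy sketch of the slicing in question (illustrative token values, not the video's exact train_step):

```python
import tensorflow as tf

seq = tf.constant([[1, 2, 3, 4, 5]])   # toy batch with one token sequence

target_input = seq[:, :-1]   # [1, 2, 3, 4] -> fed to the decoder (teacher forcing)
target_real  = seq[:, 1:]    # [2, 3, 4, 5] -> what the decoder is trained to predict

# The variant proposed above would also use encoder_input = seq[:, :-1].
# If the encoder sees exactly the decoder input, it adds no extra
# conditioning, which is why a decoder-only model would then suffice.
encoder_input = seq[:, :-1]
```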
Awesome video!
I believe you need to use Python 3.10 to get version 2.13 of TensorFlow; otherwise, the Keras text preprocessor is deprecated.
Why not use a decoder-only stack like GPT?
Hey @Valerio, how can we play the music generated by this model? At the end, the output is just in notation.
Thank you Valerio for this incredible course! I found it super inspiring. Just one question: why didn't you cover GANs? Did you simply not have time, or do you believe they are not as good as the methods you presented for music?
Time constraints are definitely a reason. Also, they are significantly less capable than transformers for music generation, and way harder to train ;)
Again, thank you Valerio for this innovative work, but what if the dataset is a MIDI file?
If the dataset is in MIDI format, you'll have to map it onto a textual encoding.
If you mean you only have 1 midi file, I'm afraid you can't do much :D
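A hedged music21 sketch of one way to map a (monophonic) MIDI file onto a textual encoding; the pitch-duration token format here is illustrative, not necessarily the one used in the video:

```python
from music21 import converter, note

def midi_to_tokens(path):
    """Turn a monophonic MIDI file into space-separated pitch-duration tokens."""
    score = converter.parse(path)
    tokens = []
    for element in score.flatten().notesAndRests:
        if isinstance(element, note.Note):
            tokens.append(f"{element.pitch.nameWithOctave}-{element.quarterLength}")
        elif isinstance(element, note.Rest):
            tokens.append(f"r-{element.quarterLength}")
    return " ".join(tokens)

print(midi_to_tokens("example.mid"))  # e.g. "C4-1.0 D4-0.5 r-0.5 E4-1.0"
```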
Hi Valerio, where can I listen to some music you have generated / created with your tools? I am curious.
The last things I've done in this space are not public-facing, unfortunately. Some of it has been for music tech companies. Other stuff is still in the making ;)
Good for understanding the theory, and that is it!!!... Don't bother with the coding part if you want to grasp it from experience. You won't be able to hear the output...
Fantastic "tutorial" without practical feedback.
?
If you ACTUALLY followed the tutorial, you would have ended up with a text sequence that is easily transformed into MIDI. In my case, I just transformed the text sequence to MIDI using Music21, so I can hear the output in any DAW.
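For anyone stuck at this step, a hedged sketch of the Music21 conversion described above, assuming pitch-duration tokens like "C4-1.0" (adapt the parsing to the encoding you actually used); the resulting MIDI file can then be opened in any DAW:

```python
from music21 import stream, note

def tokens_to_midi(token_string, out_path="generated.mid"):
    """Convert space-separated pitch-duration tokens into a MIDI file."""
    part = stream.Stream()
    for token in token_string.split():
        pitch, duration = token.split("-")
        if pitch == "r":                                   # rest token
            part.append(note.Rest(quarterLength=float(duration)))
        else:                                              # pitched note token
            part.append(note.Note(pitch, quarterLength=float(duration)))
    part.write("midi", fp=out_path)

tokens_to_midi("C4-1.0 D4-0.5 E4-0.5 r-1.0 G4-2.0")
```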