I forgot to mention that this model is trained like a normal transformer: since everything is causal, you should be able to train with the same efficient parallel technique the transformer uses, a single forward pass over an entire sequence of data.
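A minimal NumPy sketch of that idea (my own illustration, not code from the video): a linear SSM recurrence h_t = A h_{t-1} + B u_t, y_t = C h_t can be unrolled into a causal convolution with kernel K_k = C A^k B, so the whole output sequence can be computed in one parallel pass and matches the step-by-step loop.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 8                              # state size, sequence length
A = np.diag(rng.uniform(0.1, 0.9, N))    # toy diagonal state matrix, kept stable
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(L)

# Sequential (recurrent) pass, one step at a time
h = np.zeros((N, 1))
y_seq = []
for t in range(L):
    h = A @ h + B * u[t]
    y_seq.append((C @ h).item())
y_seq = np.array(y_seq)

# Parallel pass: precompute the causal convolution kernel K_k = C A^k B,
# then every output y_t is just a dot product with the inputs up to time t
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_par = np.array([np.dot(K[:t + 1][::-1], u[:t + 1]) for t in range(L)])

assert np.allclose(y_seq, y_par)   # both views give the same outputs
```

Because every y_t only depends on u_0..u_t, nothing about the parallel pass leaks future inputs, which is exactly the causality the comment relies on.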
I just opened your channel to ask you for a Mamba video, and here I see this video. You are awesome, dude. I can't express how much you contribute to my life. Thank you many times!!!
Please don't stop with this videos. They are extremely useful to go through with you. Much love
Wow Gabrial, great job!
I like your calm attitude and simple way of explaining this complex subject!
As an electrical engineer and a data scientist, I highly appreciate your content!
19:50 I think A is DxN because they use a diagonal matrix. They mention S4D, and that paper also has an example of a linear initialization: "A = -0.5 + 1j * np.pi * np.arange(N//2)  # S4D-Lin initialization". It's structured after all.
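To make that concrete, here is a runnable sketch of that S4D-Lin line (the shapes D and N are my own assumption for illustration): since A is diagonal, each of the D channels only needs to store its diagonal of complex poles, so a DxN-shaped array replaces a full NxN matrix per channel.

```python
import numpy as np

N, D = 64, 16   # state size and channel count (illustrative values)

# S4D-Lin initialization, as quoted from the S4D paper:
# one diagonal of N//2 complex poles with real part -0.5
A_diag = -0.5 + 1j * np.pi * np.arange(N // 2)

# Diagonal structure means a (D, N//2) array of per-channel diagonals
# is all that must be stored -- hence the DxN-style shape, not NxN
A = np.broadcast_to(A_diag, (D, N // 2)).copy()

assert A.shape == (D, N // 2)
assert np.all(A.real == -0.5)   # every pole starts in the stable half-plane
```

The negative real part keeps the continuous-time dynamics stable at initialization, and the linearly spaced imaginary parts spread the poles across frequencies.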
Thanks for the vid. I can't wait to see if it's overhyped or not, hehe. Tri Dao knows his attention mechanisms.
thx for doing this paper, I was a bit lost on state space models
I was a bit lost.. now I'm more lost. ;)
@acasualviewer5861 haha, I did watch some lectures by the first author though
It feels like we're getting closer and closer to concepts from physics.. 🙂
I think it's independent because you can diagonalize the state transition matrix and then each value only interacts with itself.
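A tiny NumPy check of that point (my own sketch, not from the video): in the eigenbasis of the state transition matrix, the recurrence acts elementwise, so each state coordinate evolves on its own with no cross-terms.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))      # a random (generically diagonalizable) state matrix
eigvals, V = np.linalg.eig(A)        # A = V @ diag(eigvals) @ V^{-1}

# In the eigenbasis, stepping the state is just an elementwise scaling:
# each coordinate z_i is multiplied by its own eigenvalue, independently
z = np.linalg.solve(V, rng.standard_normal(3).astype(complex))
z_next = eigvals * z                 # no interaction between coordinates

# The same step taken in the original basis, then mapped back, agrees
h_next = A @ (V @ z)
assert np.allclose(np.linalg.solve(V, h_next), z_next)
```

That elementwise form is why a diagonal(ized) A lets each state dimension be updated independently.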
24:28 Shouldn't A, B, and C be LxN, not LxD?
If all the matrices are learnable, I wonder why the authors use the HiPPO matrix to initialize A? What's the point?
I was actually wrong about the HiPPO "A" matrix being learnable. I think this matrix is actually static, which makes sense as it adds some basic structure to the model.