This is brilliant!! The way you have combined the encoder-decoder attention computation with self-attention is really cool; honestly, I have not come across anything like this in any of the blogs/write-ups. I have a doubt, prof: traditionally, to compute e_{tj}, we apply a tanh non-linearity on top of the linear transformation, right? Here, in the case of self-attention, although we are doing a linear transformation, we aren't applying any non-linearity. Can you please explain why that is? Thank you once again!!
Softmax is the only non-linearity in the whole set-up.
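To make that concrete, here is a minimal NumPy sketch contrasting the two score functions (the names W1, W2, v, W_q, W_k are illustrative, not the lecture's exact notation): in additive attention the tanh supplies an explicit non-linearity when computing e_{tj}, whereas in dot-product self-attention e_{tj} is a purely linear function of the inputs, and the softmax over the scores is the only non-linear step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                               # hidden size (illustrative)
T = 5                               # sequence length
H = rng.standard_normal((T, d))     # encoder states h_1 .. h_T
s_t = rng.standard_normal(d)        # decoder state at step t

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Additive (Bahdanau-style) attention: tanh provides an explicit non-linearity
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
v = rng.standard_normal(d)
e_additive = np.array([v @ np.tanh(W1 @ s_t + W2 @ h_j) for h_j in H])  # e_{tj}
alpha_additive = softmax(e_additive)

# Dot-product self-attention: only linear maps; softmax is the sole non-linearity
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
Q = H @ W_q                         # queries, one per position
K = H @ W_k                         # keys, one per position
e_dot = Q @ K.T / np.sqrt(d)        # e_{tj} = q_t . k_j / sqrt(d), linear in H
alpha_dot = softmax(e_dot[0])       # attention weights for position t = 0
```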