Dr. Mitesh, you are one of the finest lecturers. I have gone through CS224d, but you are much better.
Wword should have dimensions |V| x k: the hidden layer has dimension k x 1, so to produce an output layer with dimensions |V| x 1, the dimensions of Wword must be |V| x k. Therefore it should be the j-th column of Wcontext and the i-th row of Wword, not the i-th column of Wword. So when taking the word embeddings from such a model, the columns of Wcontext represent the word vectors and the rows of Wword represent the word vectors.
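A quick numpy sanity check of the shape argument above (the sizes, indices, and variable names here are made-up examples of mine, not taken from the lecture):

```python
import numpy as np

V, k = 10, 4                       # assumed vocabulary and embedding sizes
W_context = np.random.randn(k, V)  # k x |V|: the j-th COLUMN is a context vector
W_word = np.random.randn(V, k)     # |V| x k: the i-th ROW is a word vector

j = 3                              # index of the context word
h = W_context[:, j]                # hidden layer, shape (k,)
scores = W_word @ h                # output layer, shape (|V|,)
print(scores.shape)                # (10,) -- only works because W_word is |V| x k
print(np.allclose(scores[5], W_word[5] @ h))  # True: score i is a dot product with ROW i
```

Score i is the dot product of h with the i-th row of W_word, which is why the row, not the column, is the natural word embedding to read off from Wword.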
Thanks for the lecture. It really helped me in understanding the concepts behind word2vec.
the best explanation I have ever had
Has anybody else noticed that the corpus is the script of Interstellar?
Absolutely amazing clarity.
At 15:31 it should be the i_th row of the matrix, not the i_th column.
yes
ith column
@yelchurivenkatavasavamba3250 Why not the i_th row??
@yelchurivenkatavasavamba3250 You are wrong.. it is the i_th row only.
Yes! It should be ith row of Wword and jth column of Wcontext.
One can intuitively say that the context and the word will have similar vectors, without going into too much mathematics. Since the value of the softmax depends on the dot product of the context and word vectors, the numerator of the softmax is maximized when the cosine similarity is close to 1. As long as it is not close to 1, the network still has room for optimization. The side effect of continuing to optimize until the numerator reaches its highest value is that the two vectors are forced to come closer to each other.
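A rough numerical illustration of that intuition (toy vectors of my own choosing): as the angle between the context vector u_c and the word vector v_w shrinks, their dot product, and hence the softmax numerator exp(u_c . v_w), grows.

```python
import numpy as np

u_c = np.array([1.0, 0.0])                          # toy context vector
for angle in [90, 60, 30, 0]:                       # angle between u_c and v_w, in degrees
    theta = np.deg2rad(angle)
    v_w = np.array([np.cos(theta), np.sin(theta)])  # unit-length word vector
    dot = u_c @ v_w                                 # equals the cosine similarity here
    print(angle, round(float(dot), 3), round(float(np.exp(dot)), 3))
# exp(u_c . v_w), the softmax numerator, is largest when the cosine similarity is 1
```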
What do Wword and Wcontext contain?
I've seen the previous lectures but haven't understood the use of Wword and Wcontext.
Please explain.
Amazing explanation. The only thing is that I got confused near 32:27. As the title of the video suggests, it is a continuous bag of words. But at the marked time it was stated that the order does not matter, which makes it just a simple BOW instead of CBOW. Could someone please provide an explanation? Thanks in advance.
Overall the video was very clear.
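To the CBOW-vs-BOW confusion above: inside the context window the model simply sums (or averages) the context embeddings, so word order within the window genuinely does not matter; the "continuous" refers to the continuous-valued vector representations being learned rather than to word order. A small sketch of that order invariance, with made-up shapes and indices:

```python
import numpy as np

V, k = 10, 4
W_context = np.random.randn(k, V)

window = [2, 7, 5]                              # context word indices, in sentence order
h_fwd = W_context[:, window].sum(axis=1)        # hidden layer for this order
h_rev = W_context[:, window[::-1]].sum(axis=1)  # same words, reversed order
print(np.allclose(h_fwd, h_rev))                # True: the sum ignores word order
```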
Shouldn't we apply an activation function to the middle layer?
umm.. f(x)=x?
The selected column vector u_{c} of the matrix W_{context} is also an optimization parameter, right? So the gradient has to be computed for this term as well, doesn't it?
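For what it's worth, yes: both matrices are trained, so the loss is also differentiated with respect to the selected column u_c. A rough sketch of the two gradients for a single softmax step (shapes, indices, and names are my own assumptions, not the lecture's notation):

```python
import numpy as np

V, k = 10, 4
W_context = np.random.randn(k, V)
W_word = np.random.randn(V, k)
c, w = 3, 6                                    # context word index, true target word index

u_c = W_context[:, c]                          # hidden layer (selected column)
scores = W_word @ u_c
y_hat = np.exp(scores) / np.exp(scores).sum()  # softmax output
y = np.zeros(V); y[w] = 1.0                    # one-hot target

grad_scores = y_hat - y                        # dL/dscores for L = -log(y_hat[w])
grad_W_word = np.outer(grad_scores, u_c)       # gradient for every row of W_word
grad_u_c = W_word.T @ grad_scores              # gradient for the selected column u_c
```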
Please verify that at 15:31 it should be the i_th row, not the i_th column, otherwise the matrix multiplication does not make sense; so it is the i_th row of Wword and the j_th column of Wcontext. If this is not the case, then please reply with an explanation of where I'm wrong or what I'm missing.
Unless the values in h are normalized, in which case it would make sense that the i_th column of Wword is the representation.
The dimension of Wword is k x |V| here, so every target word is represented as a column vector, whereas the dimension of Wword was |V| x k in the previous slides.
@sriharsha8802 Thanks Harsha, I've come to realize that and have made a note in the margin of my notebook.
How can one implement the continuous bag of words model using the KNN algorithm (in Python)?
15:42 shouldn't it be i'th row of Wword?
Is it possible to test a word2vec model in languages other than English, such as Hindi or Urdu?
Yes, any language can be properly vectorized as long as there is sufficient training data (novels, textbooks, etc.). The language here is irrelevant because we are never looking at the word itself; we have merely assigned some random weights to a word and are trying to optimize them so that its neighbor gets a high probability in the output. The word in itself is irrelevant. You could potentially take a long list of related pictures (and arrange them so that their relation is maximized) and do the same thing to get a vector for a picture.
Can't we just add the one-hot representations of the input words and then do the forward pass, rather than taking the i'th and j'th columns of the weight matrix?
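Mathematically the two are the same thing: multiplying the matrix by a sum of one-hot vectors just picks out and adds the corresponding columns, so the column lookup is only a computational shortcut. A small sketch under made-up shapes:

```python
import numpy as np

V, k = 10, 4
W_context = np.random.randn(k, V)
context_ids = [2, 7]                               # the i'th and j'th context words

x = np.zeros(V)                                    # summed one-hot input vector
x[context_ids] = 1.0
h_onehot = W_context @ x                           # forward pass via the full multiplication

h_lookup = W_context[:, context_ids].sum(axis=1)   # direct column lookup and sum
print(np.allclose(h_onehot, h_lookup))             # True: the lookup just skips the zeros
```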
Why log(y_pred), and why not y_true * log(y_pred)?
y_true = 1 for the true word in the current scenario (the target is one-hot), so it reduces to log(y_pred).
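Worked out explicitly with toy numbers of my own: with a one-hot target, the full cross-entropy sum collapses to the single term at the true index.

```python
import numpy as np

y_pred = np.array([0.1, 0.7, 0.2])           # softmax output (toy values)
y_true = np.array([0.0, 1.0, 0.0])           # one-hot target: the true word is index 1

full_ce = -np.sum(y_true * np.log(y_pred))   # -sum_i y_true[i] * log(y_pred[i])
short_ce = -np.log(y_pred[1])                # only the true index contributes
print(full_ce, short_ce)                     # both equal -log(0.7)
```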