Watch more Research Paper Walkthroughs - th-cam.com/video/ykClwtoLER8/w-d-xo.html
Excellent explanation! Appreciate the efforts taken to teach a complex paper in less than 15 mins. That's a skill not all have.
Thank you for appreciating, Pratik.
Really appreciate the walkthrough and explanations, thanks.
Thank you for appreciating.
Great presentation, thank you! As someone else said, explaining a paper in such a short amount of time is a rare skill. Commendable!
I have quite a basic question. All of this involves representing a node as a vector so that we can apply linear algebraic operations to it for whatever analysis we wish to do, right? But the numbers in the vector representing a node aren't actually quantities but node ids. They might as well be names/words. So, how can we gain insights from doing math on them? Maybe I am missing something obvious. Please help, thanks!
Hey, thanks! :)
Node vectors are not ids.
Each node is initialised with a random vector, which becomes meaningful as the loss converges.
Well explained, thanks!
Thank you Sanjay. :)
Great, it could be even better if you gave some example of node2vec on an actual graph.
Thanks for the feedback. You can apply node2vec on any geometric data. You might want to see my video on “DeepWalk”, where I have explained the karate club graph. The same example holds here as well.
Cheers!
Are there other graph embedding implementations that focus on representing information of graph edges instead of nodes?
Yeah. You can look at “edge2vec”.
Also, one more interesting way to get edge representations, I think, is to use existing methods (such as node2vec, let’s say) over the transformed graph. A transformed graph is a graph where all the existing edges are considered as nodes and all the existing nodes are considered as edges, i.e. you basically switch the roles of nodes and edges.
^ btw just a thought!
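If it helps, here is a minimal sketch of that idea using networkx’s line_graph, which is one standard way to build such a transformed graph (the karate club graph is just a stand-in example):

```python
import networkx as nx

G = nx.karate_club_graph()   # any graph; karate club is just a stand-in example
L = nx.line_graph(G)         # "transformed" graph: each node of L is an edge of G,
                             # and two such nodes are linked when the original edges
                             # share an endpoint

# Running any node-embedding method (DeepWalk, node2vec, ...) on L now yields
# one vector per original edge instead of per original node.
print(list(L.nodes())[:5])   # edge tuples of G, e.g. (0, 1), (0, 2), ...
```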
@@TechVizTheDataScienceGuy Yeah I had the same thought as well. Will try implementing that and see how it works.
Thank you so much for the videos and for your response :)
Great! Thank you!
Amazing video! Can you share the implementation code of node2vec?
Thanks for appreciating. There are a couple of GitHub repos that already implement it. For example, you can check out the one at github.com/eliorc/node2vec
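For a quick start, here is a rough sketch along the lines of that repo’s README (parameter names are from the version I looked at and may differ in yours):

```python
import networkx as nx
from node2vec import Node2Vec   # pip install node2vec  (github.com/eliorc/node2vec)

G = nx.karate_club_graph()

# Precompute the transition probabilities and generate the biased walks
n2v = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, p=1, q=1, workers=2)

# Fit a skip-gram model on the walks (arguments are passed through to gensim's Word2Vec)
model = n2v.fit(window=10, min_count=1)

print(model.wv["0"][:5])            # embedding of node 0 (node ids become string tokens)
print(model.wv.most_similar("0"))   # nodes embedded closest to node 0
```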
Hi there 👍
Also, what is the "alias sampling" step in node2vec?
Alias sampling refers to sampling the next node from the neighbour set of the current node based on pi, the distribution that you get by multiplying the initial probability distribution (the edge weights w) with alpha, which depends on the p, q values.
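To make that concrete, here is a minimal sketch (not the authors’ code) of how pi is formed at one step of the second-order walk, with the graph as a plain adjacency dict; the node names and weights are made up:

```python
# pi(v, x) = alpha_pq(t, x) * w(v, x), where t is the previous node and v the current one.
# Graph is a plain adjacency dict: node -> {neighbour: edge_weight}.

def alpha(p, q, t, x, graph):
    """Search bias alpha_pq based on the hop distance d(t, x) in {0, 1, 2}."""
    if x == t:                      # d(t, x) == 0: step back to the previous node
        return 1.0 / p
    if x in graph[t]:               # d(t, x) == 1: x is also a neighbour of t
        return 1.0
    return 1.0 / q                  # d(t, x) == 2: move further away from t

def transition_probs(p, q, t, v, graph):
    """Normalized pi over the neighbours of the current node v."""
    unnorm = {x: alpha(p, q, t, x, graph) * w for x, w in graph[v].items()}
    z = sum(unnorm.values())
    return {x: val / z for x, val in unnorm.items()}

# Toy weighted graph
graph = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 1.0, "d": 3.0},
    "c": {"a": 2.0, "b": 1.0},
    "d": {"b": 3.0},
}
print(transition_probs(p=0.5, q=2.0, t="a", v="b", graph=graph))
# -> {'a': ~0.44, 'c': ~0.22, 'd': ~0.33}
```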
@@TechVizTheDataScienceGuy Yes, this is clear. However, how do we decide when to sample nodes based on pi? The way I am visualizing it, pi gets calculated for the "sentence"/random walk being generated, using w*alpha, for all the nodes in that walk. For weighted edges, w*alpha is the edge weight between two nodes multiplied by alpha (based on p and q), and it gets calculated like this for every node in that random walk?
So as per the algorithm in Section 3.2.3, the pi matrix is calculated prior to starting the random walks over the graph. At any stage of a random walk you look up the neighbors of the current node and refer to the pi matrix to sample the next node based on the distribution it holds, and then we repeat this process.
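And the "alias" part is just Walker’s alias method: because these per-(previous node, current node) distributions are fixed, they can be turned into small tables once, before the walks start, and each walk step then samples in O(1). A bare-bones sketch of the method itself (helper names are mine, not from the paper or any particular library):

```python
import random

def build_alias_table(probs):
    """Precompute (prob, alias) tables for O(1) sampling from a fixed discrete distribution."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:          # numerical leftovers
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """Draw one index from the precomputed tables."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# e.g. a pi distribution over three neighbours, precomputed once, sampled many times
prob, alias = build_alias_table([0.6, 0.3, 0.1])
samples = [alias_draw(prob, alias) for _ in range(10000)]
print([samples.count(i) / 10000 for i in range(3)])   # roughly [0.6, 0.3, 0.1]
```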
@@TechVizTheDataScienceGuy I assumed that during SGD the edge weights are going to change iteratively as the skip-gram objective is optimized, hence there could be a need to recalculate pi every iteration, since pi = w*alpha and w would change.
Okay, so SGD updates the weights of the neural network that we train after we get the random walks, which are generated by pi-based sampling at every step; it doesn’t touch the graph’s edge weights. It’s just like training skip-gram on sentences. As you can see in the algorithm, the SGD step is outside the scope of the walk computation.
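For instance, once the walks are generated you can feed them to a skip-gram trainer exactly like sentences; a small sketch with gensim (toy walks, made-up node ids):

```python
from gensim.models import Word2Vec

# Toy walks -- in practice these come from the pi-based sampling discussed above.
walks = [
    ["a", "b", "d", "b", "c"],
    ["c", "a", "b", "d", "b"],
]

# SGD here updates the node vectors (the NN weights), never the graph's edge weights.
model = Word2Vec(
    sentences=walks,
    vector_size=64,   # embedding dimension
    window=5,         # context size for skip-gram
    sg=1,             # skip-gram objective, as in the paper
    negative=5,       # negative sampling
    min_count=0,
    epochs=10,
)

vec_b = model.wv["b"]   # learned representation of node "b"
```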
How is the distance d calculated in the case of weighted edges?
They have written that d(t,x) is the shortest path and should be one of 0, 1, 2, so the shortest path seems to be calculated purely on hop count.
The weight of the edges comes into play when you calculate pi(v,x): w(v,x) is based on the edge weight, which serves as the initial probability distribution.
@@TechVizTheDataScienceGuy Okay, so here is one idea: I initialised the edge weights to the tf-idf scores between the two nodes (or "words"), and since the goal is to learn homophily I set p = 0.5 and q = 2; the latent representations of these nodes are then my word embeddings.
So in this case, higher tf-idf means these two nodes are important for / more related to each other, hence higher tf-idf also equates to higher edge weight. Does this imply more distance between the two nodes? That would negate my purpose, since I wanted the nodes which are highly related to each other to be "closer" to each other in the graph.
I am assuming here that the edge weight refers to a co-occurrence property between two nodes/words, like some sort of pointwise mutual information score. I didn’t get how you would put a tf-idf weight on edges based on the tf-idf of the nodes: firstly, what would the edge weight signify? Secondly, tf-idf is associated with an individual word, so putting it on an edge as a relation between two nodes won’t make sense. So ideally you can opt for semantic scores, co-occurrence scores, PMI, etc.
Considering we have some measure over an edge that captures how important the nodes are to each other, that gives us our initial distribution w. The authors mentioning that the shortest distance can only be 0, 1, 2 means they are focusing on hop length, but the final pi distribution that you get after multiplying w and alpha will still tend to give you more chances to sample high-edge-weight terms, hence learning representations that give more weightage to words associated with high edge weights. So the logic remains intact; it’s not doing the opposite of what we expect it to do. Btw, nice question. I hope I answered your query.
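A quick toy check of that last point (hypothetical numbers): two candidate next-nodes at the same hop distance, so alpha is identical for both and only the edge weight differs:

```python
# Hypothetical numbers: x1 and x2 are both at hop distance d(t, x) = 2 from the
# previous node t, so alpha = 1/q for both; only the edge weights w(v, x) differ.
q = 2.0
w = {"x1": 5.0, "x2": 1.0}                        # x1 has the higher (tf-idf-style) weight
unnorm = {x: (1.0 / q) * wx for x, wx in w.items()}
z = sum(unnorm.values())
pi = {x: round(val / z, 3) for x, val in unnorm.items()}
print(pi)   # {'x1': 0.833, 'x2': 0.167}: the strongly related word is sampled more often,
            # so the two end up closer in the embedding space, as intended
```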