NOTES/ Important Points :-
1.) Graphlets are undirected motifs.
2.) The purpose of graph representation learning is to allow automatic feature engineering by embedding each node into a d-dimensional space, such that similarity in the network(*) corresponds to closeness (cosine/Euclidean distance) in the embedding space.
3.) Images == grid graphs; text, audio == chain graphs. The adjacency-matrix representation of a graph is not invariant to permutations (of rows/columns), but the graph itself is.
4.) *We define this similarity in the network on the basis of random walks. Similarity(u,v) = probability of visiting node v on a random walk starting at node u. Similarity depends on the random walk strategy (R) that you choose to use.
5.) You run fixed-length random walks from each node to build its neighborhood set.
6.) You try to maximize the probability of each node's neighborhood set given that node. Here the probability of node v occurring in the neighborhood set of node u (which depends on the similarity between the embeddings of u and v) is defined as a softmax over the similarities (dot products) between the embedding of u and the embeddings of all other nodes. [ 26:20 ]
7.) We use negative sampling to approximate the softmax, whose denominator (a sum over all nodes) is expensive to compute (see the first sketch after these notes).
8.) In negative sampling, nodes are sampled randomly in proportion to their degree.
9.) But random walks from a node only encode similarity that corresponds to "nearness" of two nodes in the network. So how do we encode other types of similarity?
10.) Node2Vec :-
11.) We can have two different types of exploration: a DFS-like walk (macro view) or a BFS-like walk (micro view).
12.) More precisely, 2nd-order random walks (i.e., the walk has one unit of memory, which tells you which node you came from to reach the current node). You track the distance of each candidate next node from the previous node (not the walk's starting node), and then choose to step back towards it, move away from it, or stay at the same distance from it with unnormalized weights 1/p : 1/q : 1 respectively (see the node2vec walk sketch after these notes). [ 45:00 ]
13.) Depending on the values of p and q we can get embeddings where communities of nodes end up close together in embedding space, or where nodes with similar structural roles end up close together. [ 48:02 ]
14.) These embeddings are quite robust: removing/adding ~50% of edges leads to only a small drop in accuracy on the node-classification task.
15.) Node embeddings can be aggregated over a set of nodes in order to classify that set.
16.) It has been observed that Node2Vec performs better on the node-classification task than on link prediction.
17.) You must choose a definition of node similarity (e.g., random-walk neighborhood (possibly with p, q), plain neighborhood, etc.) that matches your application.
18.) EMBEDDING ENTIRE GRAPHS :-
19.) Useful for molecules, or wherever lots of small networks are available.
20.) Just compute ordinary node embeddings and sum all of them (see the graph-embedding sketch after these notes).
21.) Alternatively, add a super-node that connects to the sub-graph of interest; the embedding of the super-node serves as the embedding of the sub-graph.
22.) Anonymous walks :- These are transformations of random walks that are independent of which nodes actually occur on the path; they try to capture the local structure around a node through the repetition pattern along the walk. So A->B->C is transformed to 1->2->3, A->B->A to 1->2->1, and so on. The number of possible anonymous walks grows exponentially with the number of steps in the random walk (and with the number of distinct nodes reachable from a particular node).
23.) Approach 1 :- Graph feature vector = the vector of frequencies of the different possible anonymous walks (see the anonymous-walk sketch after these notes).
24.) How many walks are needed to get a good estimate? [ 1:08:10 ]
25.) Approach 2 :- Learn embeddings of all possible anonymous walks (of a fixed length) and concatenate/average them to get the embedding of the graph. To learn these anonymous-walk embeddings, try to predict the probability distribution over the next anonymous walk from a node, using the embeddings of the anonymous walks corresponding to the previous [delta] random walks. As in [ 1:14:02 ].
26.) Anonymous walks are based on the assumption that all important properties/structures of the graph can be recovered from the statistics of the random anonymous walks that can be undertaken from various nodes.
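A minimal Python sketch of points 5-8 above: fixed-length random walks build a node's neighborhood set, and negative sampling scores the true (node, neighbor) pair against a few degree-proportional negatives instead of the full softmax denominator. This is illustrative only: the graph is assumed to be a plain adjacency-list dict, and the function names, toy graph, and sizes are made up for the example, not taken from the lecture.

```python
import random
import numpy as np

def random_walk(adj, start, length):
    """Fixed-length unbiased random walk; the visited nodes form the
    neighborhood set N_R(start) of point 5."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(adj[walk[-1]]))
    return walk

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(z, u, v, negatives):
    """Points 6/7: score the true pair (u, v) against a few sampled
    negatives instead of normalizing over every node in the graph."""
    loss = -np.log(sigmoid(z[u] @ z[v]))
    for n in negatives:
        loss -= np.log(1.0 - sigmoid(z[u] @ z[n]))
    return loss

# Toy usage: 4-node graph, 8-dim embeddings, negatives drawn
# in proportion to degree (point 8).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
z = 0.1 * np.random.randn(4, 8)
walk = random_walk(adj, start=0, length=5)
deg = np.array([len(adj[n]) for n in sorted(adj)], dtype=float)
negatives = np.random.choice(len(adj), size=2, p=deg / deg.sum())
print(walk, negative_sampling_loss(z, walk[0], walk[1], negatives))
```

In practice the embeddings z would be trained by gradient descent on this loss, summed over all (node, neighbor) pairs collected from the walks.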
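A sketch of the 2nd-order (node2vec-style) walk from point 12. It only shows the 1/p : 1 : 1/q weighting relative to the previous node; the real implementation precomputes alias tables for efficiency, and the toy graph and names here are invented for illustration.

```python
import random

def node2vec_walk(adj, start, length, p, q):
    """Biased 2nd-order walk. Distances are measured w.r.t. the
    PREVIOUS node:
      back to prev            -> weight 1/p
      same distance from prev -> weight 1
      farther from prev       -> weight 1/q
    """
    walk = [start, random.choice(adj[start])]
    for _ in range(length - 2):
        prev, cur = walk[-2], walk[-1]
        neighbors = adj[cur]
        weights = []
        for x in neighbors:
            if x == prev:
                weights.append(1.0 / p)   # step back to where we came from
            elif x in adj[prev]:
                weights.append(1.0)       # stays at distance 1 from prev
            else:
                weights.append(1.0 / q)   # moves farther away from prev
        walk.append(random.choices(neighbors, weights=weights)[0])
    return walk

# Toy usage: low q -> outward, DFS-like (macro view);
# low p -> hovers near its origin, BFS-like (micro view).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
print(node2vec_walk(adj, start=0, length=10, p=1.0, q=0.5))
```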
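Point 20 in a couple of lines: assuming node embeddings are already available (e.g., trained as sketched above), the simplest whole-graph embedding is just their sum. The array shapes are made up for the example.

```python
import numpy as np

node_embeddings = np.random.randn(5, 8)        # 5 nodes, 8-dim embeddings (toy values)
graph_embedding = node_embeddings.sum(axis=0)  # point 20: sum them into one vector
print(graph_embedding.shape)                   # (8,)
```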
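A sketch of points 22-23: anonymizing a walk by order of first appearance, and building the Approach-1 graph feature vector as empirical frequencies of anonymous-walk types. Again, the adjacency-dict graph and function names are illustrative assumptions, not from the lecture.

```python
import random
from collections import Counter

def anonymize(walk):
    """Point 22: relabel nodes by order of first appearance,
    e.g. A->B->C becomes (1, 2, 3), A->B->A becomes (1, 2, 1)."""
    first_seen = {}
    out = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1
        out.append(first_seen[node])
    return tuple(out)

def anonymous_walk_features(adj, walk_len, n_walks):
    """Approach 1 (point 23): sample random walks, anonymize them,
    and return the empirical frequency of each anonymous-walk type."""
    counts = Counter()
    nodes = list(adj)
    for _ in range(n_walks):
        walk = [random.choice(nodes)]
        for _ in range(walk_len - 1):
            walk.append(random.choice(adj[walk[-1]]))
        counts[anonymize(walk)] += 1
    total = sum(counts.values())
    return {aw: c / total for aw, c in counts.items()}

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(anonymous_walk_features(adj, walk_len=4, n_walks=1000))
```

Point 24's question is then how large n_walks has to be before these frequencies become reliable estimates.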
Graphs are so general, explainable, convenient, and elegant. Also pretty difficult. : )
Pretty difficult indeed :D
I never knew Slavoj Zizek used to teach machine learning when he was young.
:D :D :D
Thicc Slovenian accent
Great lecture. 37:50 It is a misnomer to call this process "the finding of similarity between nodes" when in fact what you are doing is finding node groups. In a group, nodes are rarely similar to each other: you won't find 2 CEOs in a company, atom nuclei are surrounded by electrons rather than other nuclei, and so on.
@Machine Learning TV Can you please give some numbering to the lecture videos, so that we can figure out in which order we should watch them?
That is assignment number 3: create a program that goes through the lectures and sorts them based on the downloaded transcripts.
Great lecture, he knows how to explain complicated ideas, thanks a lot!
He is a such a great explainer
Great lecture, thanks a lot. The article is also very interesting and informative.
Amazing, great to see some good content again. Thank you, YT algorithm, keep it up.
Great lecture and Q&A sessions
Great lecture. But just one point: in the DeepWalk part, he frequently says that he wants to maximize the cost function, but really he wants to minimize it. In addition, his explanation of negative sampling is not completely correct; he should explain that the multi-class problem is changed into a binary (two-class) problem.
Hi Hossein,
Yes. It was indeed great lecture.
I also didn't understand the negative sampling part. Can you please elaborate on it?
Exactly. log(probability) ranges from negative infinity to 0; taking the negative of it gives a range of 0 to infinity. You want to minimize that instead.
What kind of data can have graph representations and what types of data can not be represented as a graph?
Great lecture, very informative and clear.
32:40 shouldn't this be: log of sum_k instead of sum_k of log?
What if we make the number of random walks also random? And simulate n random walks with a constant number of walks over m random constant numbers of walks?
Awesome explanation
Great lecture, props to Jure Leskovec :)
Great lecture. It helps me a lot.
This is a great talk.
Hi, I don't get the intuition behind softmax. Can someone please help me understand?
You can read the Wikipedia article; I think it is well explained there.
Just as the sigmoid is a smooth approximation to the sign/step function, the softmax is a smooth approximation to the argmax function. With argmax, the output is 1 at the index of the input with the maximum value and 0 otherwise (a one-hot output). With softmax, the output at the index with the maximum value is close to 1, but the second- or third-largest indices also get non-zero values based on their inputs; the softmax is a smooth distribution where every index has some value greater than zero. Because exponentials are used, if the maximum value is far greater than the others, it essentially behaves like argmax. This is used mainly so that the loss function is differentiable, which argmax is not (see the tiny sketch below).
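A tiny numeric sketch of that point (the scores are made up for illustration): softmax keeps every entry positive, and sharpening the inputs pushes the output towards the one-hot argmax.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())      # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))           # smooth: every index gets some probability
print(softmax(10 * scores))      # sharper inputs: output is close to one-hot argmax
```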
This guy is awesome
This is about learning *from* graphs but how do we learn the actual graph structures just from raw data? This is the question AI should be answering.
Is this guy from Switzerland ?
Surprisingly, I can follow his discussion. Or maybe I'm just tripping.
Where are the remaining class videos?
Watch the new video we uploaded yesterday.
@@MachineLearningTV Hello @MLTV, can we get all the videos of cs224w?
@@georgeigwegbe7258 snap.stanford.edu/class/cs224w-2018/ : find the "resource" part on this page; you can find the lecture notes and videos there.
Thanks a lot for sharing
Which video did you upload yesterday?
Great lecture, thanks a lot
awesome! thanks for sharing!
Misleading title, as no (generative) representation is discussed. E.g., "Supervised learning on graphs" or "Learning features of graphs" would have been more suitable, imho.
Random walking is somehow like the random firing of biological neurons
Ah nu Cheeki Breeki iv Damke
Lg