Hello, I have a project related to fake news detection using GNNs. I want to know how to process a raw dataset in PyTorch Geometric, for example gossipcop_fake.csv and gossipcop_real.csv.
Hey! Great video!! Can somebody explain to me though why you would use your own dataset if there is a dataset in PyTorch Geometric from the same link? Are there any drawbacks to using a predefined dataset?
Hi! This tutorial is mainly for when you want to use your own dataset (which doesn't yet exist). Of course you can use MoleculeNet from Pytorch Geometric, it is the same implementation :)
Amazing videos, I have a question. I have a history of illegal parking tickets (street address, time, day, year) and I want to predict the illegal parking activity in the future (for a specific day and hour). Could I do it with graph neural networks? If yes, would that be "node prediction" or "graph level prediction" or something else that I'm missing?
Hi Nick! Thanks :)
The first question you have to ask is what the nodes and what the edges are, if you want to model this with a GNN. Typically for traffic problems, the streets are treated as edges and specific sections of streets as nodes (crossings or subsections of longer streets). You could maybe treat the parking spots as nodes somehow. Alternatively, treat the streets as nodes and introduce an edge between two streets if they are connected.
The next thing is that you have a temporal component in your feature space. One way could be to have a global feature for the whole graph that specifies the point in time. You could use things like the day of the week or even the part of the day (morning, evening...) and attach it to each node.
With all that you would basically perform node-level prediction for the whole graph. That means you predict for every parking spot / street in your graph whether there is illegal activity going on at a specific point in time. Your dataset would then basically be a collection of snapshots of a street map, and a data point would be a graph for a specific point in time, labeled with all the illegal activity at the different parking spots / streets.
That's just one idea, certainly there are different ways to model this. I think the challenging part is to get a proper representation of the input space (of the street network). For this maybe have a look at this collection of road networks: networkrepository.com/road.php
Hope this helps, let me know if you need more :)
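A minimal sketch of the "collection of snapshots" idea, using plain Python lists instead of tensors. The street graph, the static per-spot feature, the scaling of the time features and the labels are all made-up assumptions for illustration, not data from the video.

```python
# Hypothetical fixed street graph: 3 parking spots in a row, edges in both directions.
edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]

# Hypothetical static feature per spot (e.g. 1.0 = metered, 0.0 = free).
static_features = [[1.0], [0.0], [1.0]]


def make_snapshot(day_of_week, hour, labels):
    """Build one data point: the same graph, with the time attached to every node."""
    time_features = [day_of_week / 6.0, hour / 23.0]  # scaled to [0, 1]
    x = [features + time_features for features in static_features]
    return {"x": x, "edge_index": edge_index, "y": labels}


# Two snapshots of the same street map at different points in time.
dataset = [
    make_snapshot(day_of_week=0, hour=9, labels=[0, 0, 1]),
    make_snapshot(day_of_week=5, hour=22, labels=[1, 0, 1]),
]
```

Each dictionary holds what would go into one graph object: the node feature matrix `x`, the shared `edge_index`, and one label per node in `y`.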
@@DeepFindr Thank you very much for your response, I hope my professors act like you. If I understood correctly, you mean:
- I have to draw a graph (for example, graph = g) with all the parking spaces as nodes.
- Each line of the dataset will always have the same graph (g) in "Y", and in "X" I will have something like a time point.
- Every line has the graph (g) in "Y", but the difference between the graphs is that in different cases some of the nodes will be labeled as "illegal" or "legal".
So after the training, if I give a time point in "X", the model will have to predict which of the nodes in the graph are illegal or not. Let me know if I understood you correctly. Thanks in advance
Hi! I hope they will :)
Yes, this would be my approach if I had to model it as a graph problem. Did your X and Y refer to features and labels like in classical machine learning? Then I would say X is the graph and the node features, and Y is just the labels (illegal/legal).
However, I am not sure if the overall problem is suitable for GNNs. Typically you would either have a different graph structure for different data points (like for molecules), or you would have information about other nodes in the graph and then infer on unlabeled nodes (like in a social network). Here you always have the same graph structure and always try to predict all nodes. Another way to formulate the problem might be to label a few nodes and then predict the others. For example, given that parking spots A and B have illegal activity, predict the label of parking spot C.
The first question when thinking about this is "what influences whether there is illegal activity". The strongest features are probably time and day of the week. On the weekend there might be more people around and therefore more illegal activity. Another feature is certainly the parking spot itself, because people usually don't do illegal things in crowded areas, and some spots will always have more offenses than others. Besides that there are many external factors that we cannot consider, such as whether a person is in a rush or has no money to pay the meter, etc. With that you could also treat it as a tabular dataset and use any scikit-learn model to fit it.
The only thing that would really justify the use of GNNs is some sort of interaction between the nodes. For example, if node A (= parking spot) has illegal activity, then you could use this information to predict the activity on node B. If they are close to each other, it might be that the illegal activity "motivates" someone else to also not pay.
Thus, there might be some patterns in the data for which GNNs make sense, but again only if you don't predict all nodes at the same time. Because if you do that, all nodes have the same features and it makes no sense to share information. Best regards :)
Hey Florian. I like your videos. Sorry to ask the question here, but what type of software do you use to produce your videos, if I may ask (referring to the other videos where you present theory)?
Hi! Thanks! At the moment I use active presenter in the free version. It's nothing fancy :) a couple of functionalities are missing in the free version so I might consider switching to movie maker or so. :)
Hello! Thanks for this series! It's so useful! In line 133, any reason why you reshaped it instead of transposing it? I feel like reshaping the coo matrix will smush some of the rows and columns together, and transposing it feels more intuitive to me. Thanks!
Hi! Thanks for the feedback, I appreciate it! Yes you are right, transpose is probably the safer option there. Actually I should adjust that on GitHub, thanks for pointing that out!
@@DeepFindr I have a doubt: for edges you are transposing, resulting in a 2 x a dimension, and for node features we have a b x constant dimension, where a and b vary in dim 1 and dim 0 respectively. Won't this cause problems? (I am running GNNExplainer on the whole graph and I'm bumping into tensor dim problems that I didn't face while training, hence this doubt.) As far as I understand, only dim 0 can vary, right?
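To make the difference concrete, here is a small sketch with plain Python lists. The two helpers are hypothetical stand-ins that mimic a row-major reshape and a transpose; the edge list is a made-up example, not the one from the video.

```python
def transpose(rows):
    """Swap rows and columns: [num_edges, 2] -> [2, num_edges]."""
    return [list(column) for column in zip(*rows)]


def reshape(rows, n_rows, n_cols):
    """Row-major reshape: flatten, then cut into n_rows chunks of n_cols values."""
    flat = [value for row in rows for value in row]
    if len(flat) != n_rows * n_cols:
        raise ValueError("shape does not match the number of values")
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Three edges stored as (source, target) pairs: 0->1, 1->2, 2->3.
edges = [[0, 1], [1, 2], [2, 3]]

transposed = transpose(edges)    # first row = sources, second row = targets
reshaped = reshape(edges, 2, 3)  # same shape, but values end up in other slots

print(transposed)  # [[0, 1, 2], [1, 2, 3]]
print(reshaped)    # [[0, 1, 1], [2, 2, 3]]
```

Both results have shape [2, 3], so nothing crashes, but the reshaped version describes the edges 0->2, 1->2 and 1->3, which are not the original bonds.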
Hi! Actually this is handled implicitly by pytorch geometric. Have a look at this article: pytorch-geometric.readthedocs.io/en/latest/notes/batching.html Edge_index will be extended in the second dimension.
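A rough sketch of what that batching does, written with plain lists. The function name and the two toy graphs are assumptions for illustration; the real logic lives inside PyTorch Geometric's data loader.

```python
def batch_graphs(graphs):
    """Stack node features along dim 0 and append shifted edges along dim 1."""
    x, sources, targets, batch = [], [], [], []
    offset = 0
    for graph_id, (features, (src, dst)) in enumerate(graphs):
        x += features
        # Shift node indices so they point into the stacked feature matrix.
        sources += [s + offset for s in src]
        targets += [t + offset for t in dst]
        # Remember which graph each node came from.
        batch += [graph_id] * len(features)
        offset += len(features)
    return x, [sources, targets], batch


# Graph 1: 2 nodes, 1 bond. Graph 2: 3 nodes, 2 bonds (both directions each).
g1 = ([[1.0], [2.0]], [[0, 1], [1, 0]])
g2 = ([[3.0], [4.0], [5.0]], [[0, 1, 1, 2], [1, 0, 2, 1]])

x, edge_index, batch = batch_graphs([g1, g2])
print(edge_index)  # [[0, 1, 2, 3, 3, 4], [1, 0, 3, 2, 4, 3]]
print(batch)       # [0, 0, 1, 1, 1]
```

The node features grow in the first dimension and the edge index grows in the second, so graphs of different sizes fit into the same batch.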
Great video indeed. Please help me with the graph formulation of my data, where nodes are locations (longitude and latitude, along with zip code etc.) and edges are the roads that connect them, with distance and traffic as weights. I want to convert this csv data into a graph. Is there any package like RDKit for extracting edge and node features?
Hi! No this needs to be done manually. I have a video coming out in the next days / week that explains this in detail. If you need earlier help pls send me an email :)
Hi, great video! You say you added edge features twice for each direction - meaning you're using directed graphs. Is there a reason for that? Since molecules are undirected graphs by nature.
Hi and thanks! Yes that's right. In Pytorch geometric it is required to add both directions separately. This allows the GNN to pass information in both directions. If you for instance only add [0,1] and not [0,1] and [1,0], then the model will only share information from node 0 to node 1, but not vice versa. This is also illustrated in the introduction example in the documentation. There it says for an undirected graph: "...Although the graph has only two edges we need to define four index tuples..." Best regards
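The effect can be seen in a tiny sketch of one message passing step with sum aggregation, in plain Python. The `propagate` helper and the node values are made up for illustration; the first row of the edge index is read as sources and the second as targets.

```python
def propagate(x, edge_index):
    """One round of sum aggregation: every target node adds up its sources' values."""
    sources, targets = edge_index
    out = [0.0] * len(x)
    for s, t in zip(sources, targets):
        out[t] += x[s]
    return out


x = [1.0, 10.0]                 # one scalar feature per node

one_way = [[0], [1]]            # only the edge 0 -> 1
both_ways = [[0, 1], [1, 0]]    # 0 -> 1 and 1 -> 0

print(propagate(x, one_way))    # [0.0, 1.0]
print(propagate(x, both_ways))  # [10.0, 1.0]
```

With only [0, 1], node 0 receives nothing from node 1. Adding the reverse edge lets the information flow in both directions, which is what an undirected bond needs.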
Amazing as usual, thanks for your effort. Just one small question: is this way of preparing data for molecules only? If I have another type of data, for example citations, should I use something else? And does the purpose of the experiment change this, for example if I have to do graph classification or node prediction?
Hi, Yes this example is only for molecules and graph level predictions. But generally for every dataset you need to construct these things such as node features, adjacency info etc. For example for a citation network you would put text features (bag of words or so) as the node feature vector. But the general structure of the code stays the same :) And if you have node level predictions you additionally need to define a mask that you can pass to PyTorch Geometric. But there are many examples available :)
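As a rough illustration of such a mask, here is a sketch in plain Python. The labels, predictions and the train/test split are invented; in practice the masks would be boolean tensors stored on the graph object.

```python
# Hypothetical node labels and model predictions for a graph with 5 nodes.
labels = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 0, 0]

# True = this node is used for training; the remaining nodes are held out.
train_mask = [True, True, False, False, True]
test_mask = [not m for m in train_mask]


def masked_accuracy(pred, y, mask):
    """Accuracy computed only on the nodes selected by the mask."""
    selected = [(p, t) for p, t, m in zip(pred, y, mask) if m]
    return sum(p == t for p, t in selected) / len(selected)


train_acc = masked_accuracy(predictions, labels, train_mask)
test_acc = masked_accuracy(predictions, labels, test_mask)
```

The whole graph is still passed through the model; the mask only decides which nodes contribute to the loss or the metric.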
Thank you, your videos are really helpful, I have a question if you don't mind. If we have another dataset let's say a traffic data of a network system, how can we decide on which nodes are to be connected?
Hi! Sure, could you quickly provide some more information? For a traffic network I could imagine that the nodes are intersections and the edges are the connections between them. You could have a look at this collection, they also have road datasets: snap.stanford.edu/data/. I'm not entirely sure what you are trying to model, but essentially you need to convert your traffic network into a graph somehow.
@@DeepFindr Hi, thank you for replying to me. I am trying to structure a botnet detection benchmark dataset (ISCX-2014) as graph data. The idea is to classify whether nodes participate in a botnet or not using GNNs.
Ahh okay, I understand now. First I thought you refer to actual road traffic networks. In that case I guess you have nodes (ip addresses) and their traffic, so connection pairs to other nodes. You could build the edges based on the traffic between two nodes and also add all sorts of information about the connection as edge features (how much traffic, connection type..). The nodes could additionally have node features that describe information about each of the peers. Also there is a paper that applies GNNs for botnet classification, which is called "Automating Botnet Detection with Graph Neural Networks", but probably you already know it.
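A minimal sketch of that construction in plain Python. The flow records, the IP addresses and the two edge features (total bytes, number of connections) are made-up assumptions and do not reflect the actual ISCX-2014 schema.

```python
from collections import defaultdict

# Hypothetical flow records: (source IP, destination IP, bytes transferred).
flows = [
    ("10.0.0.1", "10.0.0.2", 500),
    ("10.0.0.1", "10.0.0.2", 300),
    ("10.0.0.2", "10.0.0.3", 50),
]


def flows_to_graph(flows):
    """Aggregate flows per (source, destination) pair into directed edges."""
    total_bytes = defaultdict(int)
    n_connections = defaultdict(int)
    for src, dst, n_bytes in flows:
        total_bytes[(src, dst)] += n_bytes
        n_connections[(src, dst)] += 1

    # Map every IP address to a consecutive node index.
    hosts = sorted({host for pair in total_bytes for host in pair})
    index_of = {host: i for i, host in enumerate(hosts)}

    sources, targets, edge_attr = [], [], []
    for (src, dst), total in sorted(total_bytes.items()):
        sources.append(index_of[src])
        targets.append(index_of[dst])
        edge_attr.append([total, n_connections[(src, dst)]])
    return hosts, [sources, targets], edge_attr


hosts, edge_index, edge_attr = flows_to_graph(flows)
print(edge_index)  # [[0, 1], [1, 2]]
print(edge_attr)   # [[800, 2], [50, 1]]
```

Here the edges stay directed, since traffic from A to B is not the same as traffic from B to A. Node features per host would be built in the same order as `hosts`.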
@@DeepFindr Thank you so much that is so kind of you. I just wanted to say keep going you are really making an impact and helping fellow learners and students. KEEP it up
@@DeepFindr It would be such a nice video. I created the graph with NetworkX, which has a method to calculate PageRank, and I am trying to train a GNN to predict the rest of the PageRanks instead of calculating them the classical way. My problem is preprocessing the PageRank node parameters into a torch_geometric.data.Data object. It would be a nice project.
had the same question. @Sam did you get any solution so far? My guess is we need to read .gpickle (or whatever format you are using) in the same way he read the CSV file.
It is possible to convert pytorch geometric graph objects to networkx, if that helps. There are a couple of helper functions in the utils section of the documentation
NotCo is a startup that makes plant-based products like milk, mayonnaise, etc. They use an AI (called Giuseppe) to get the same features in plant-based products as in the real ones. Can you research that and do a video on it please?
@DeepFindr This is awesome. Helping me a lot. How can I call the class as I am using google colab? MoleculeDataset needs two arguments. Thanks. dataset = MoleculeDataset('what would be the root?', 'what would be file name?')
The root should be the folder where the directories "raw" and "processed" are present (as described in the video). The file name is the name of your file, e.g. dataset.csv
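A small sketch of how such a folder could be set up, for example in a Colab session. The temporary directory, the folder name "data" and the CSV columns are assumptions for illustration; the `MoleculeDataset` call is left as a comment because the class comes from the video's code.

```python
import tempfile
from pathlib import Path

# Hypothetical root folder; in Colab this could just as well be "data/".
root = Path(tempfile.mkdtemp()) / "data"

# The raw CSV goes into <root>/raw before the dataset is created.
raw_dir = root / "raw"
raw_dir.mkdir(parents=True)
(raw_dir / "dataset.csv").write_text("smiles,label\nCCO,0\n")

# With the class from the video, the call would then look like this:
# dataset = MoleculeDataset(str(root), "dataset.csv")
```

So `root` is the parent folder, and the file name is given relative to its "raw" subfolder.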
UPDATE: I realized that the shapes for the edge information didn't match when training the model.
I had to add the edge features twice for each bond (once for each direction) to fix this. I updated the code accordingly.
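A minimal sketch of that fix in plain Python. The toy molecule, the feature values and the helper name are made up; the point is only that the edge feature matrix needs one row per column of the edge index.

```python
# Hypothetical toy molecule: three atoms in a chain, two bonds.
bonds = [(0, 1), (1, 2)]
bond_features = [[1.0, 0.0], [2.0, 1.0]]  # one feature vector per bond


def to_bidirectional(bonds, bond_features):
    """Add every bond in both directions and repeat its features for each one."""
    sources, targets, edge_attr = [], [], []
    for (i, j), features in zip(bonds, bond_features):
        sources += [i, j]
        targets += [j, i]
        edge_attr += [list(features), list(features)]
    return [sources, targets], edge_attr


edge_index, edge_attr = to_bidirectional(bonds, bond_features)
print(edge_index)      # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(len(edge_attr))  # 4
```

Two bonds give four directed edges, so the edge features also need four rows.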
Thank you. Can you clarify whether we can use 2 graphs to predict an output? For example, given 2 molecules in SMILES format and a binary output (0/1) for whether there is an interaction or not. Your help would be appreciated.
Hi! Yes this is possible. The PyTorch Geometric docs show a custom data type for this called PairData. See more here: pytorch-geometric.readthedocs.io/en/latest/notes/batching.html?#pairs-of-graphs
@@DeepFindr Thanks a lot, I will go through this. If possible, please do upload a video on this complex concept 😅
@@DeepFindr I had a doubt: in a custom dataset, how do I incorporate the node values, edges and adjacency matrix of 2 molecules? I'm not getting the documentation on graph pairs. We need to prepare the data using the method shown in this video, right?
Kindly request you for assistance as I am stuck and really need some help please
Oh my! Helped my thesis so much. I don't understand how this can possibly be free. This is pure gold, the best and the easiest to follow after watching 10+ videos. Your channel is helping humanity and science and AI!! Thanks a lot!
Thank you for your kind words! I'm very happy if it is helpful :)
No words can describe how much I owe you!! The most valuable playlist on the entire YouTube, this helped me a lot... please upload more videos!!
I absolutely loved when you were explaining the parameters of the Data class, that you also visually added to what you were referring to by showing first the nodes, then the edges, etc. In my opinion this is a great way of teaching. Keep up the good work, this is invaluable.
I really enjoyed watching this series about creating a custom dataset for Graph Neural Network operations. I have learned a lot from this video. Thank you!
Watching this series is an adventure, thank you for helping us with quality content
Thank you for sharing this amazing work with us! Eagerly waiting for further updates!
This is the first video about custom datasets! Awesome! Other videos just work with already processed datasets or only talk about the process.
Thanks :)
Many thanks for this tutorial... Do you have any tutorial on how to create this csv? I mean, how to convert normal machine learning data to a graph?
Hi!
The csv simply consists of SMILES strings, which is one possible representation for molecules. I don't fully understand the question, could you clarify please :)
@@DeepFindr Imagine I have a CSV with some features that represent a car or a bike, where my classes are 1 for car and 2 for bike, like this:
Feature_1, feature_2, feature_3, classes
1, 4, 5, 1
1, 3, 2, 2
I would like to know how to convert it to a SMILES representation or any type of data a GNN could handle.
Or, if that's not possible, how can I extract this information from a graph database, or even create it manually?
What I would like to know is how I can create this graph representation from information I have in a database or in any other place.
Maybe an example: if I have no data, how can I create it by converting to this representation, to use in a GNN classifier?
I don't know if it is clear enough
Hi! I see.
To transform a database into graphs you need two things:
1. Nodes: you need to have entities for the graph
2. Edges: between these entities you need to have connections.
In a database this could be represented like this:
Car_ID, Connected_To_ID
1, 2
5, 7
...
This gives you cars and the information about how cars are connected for example. Here you could simply loop over all cars and connect the relations to other cars.
Let me know if you have further questions :)
PS: SMILES is only suitable for molecules and describes a molecular structure (which atoms and which connections)
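The loop described above could look roughly like this, using only the standard library. The two rows are the ones from the table above; treating the connections as undirected is an assumption, as is the mapping of car IDs to consecutive node indices.

```python
import csv
import io

# The relation table from above, as it might come out of a database export.
table = io.StringIO("Car_ID, Connected_To_ID\n1, 2\n5, 7\n")
rows = list(csv.DictReader(table, skipinitialspace=True))

# Every ID that appears anywhere becomes a node with a consecutive index.
node_ids = sorted({int(row[col]) for row in rows
                   for col in ("Car_ID", "Connected_To_ID")})
index_of = {car_id: i for i, car_id in enumerate(node_ids)}

# Loop over all relations and add each connection in both directions.
sources, targets = [], []
for row in rows:
    a = index_of[int(row["Car_ID"])]
    b = index_of[int(row["Connected_To_ID"])]
    sources += [a, b]
    targets += [b, a]
edge_index = [sources, targets]

print(node_ids)    # [1, 2, 5, 7]
print(edge_index)  # [[0, 1, 2, 3], [1, 0, 3, 2]]
```

`node_ids[i]` gives back the original car ID for node index `i`. The node features (one row per car, in the same order) would come from the feature table.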
Thanks! Your tutorial is much more clear than the docs' one.
Thank you sir for this amazingly well-structured video!
Hi do you have any citation for the code you're presenting? please let me know. Thanks
Thank you for your video. I have a question. After I created a Dataset in PyTorch Geometric, how can I save the Dataset and then load it again? Thank you.
If I run this on Google Colab or Kaggle notebook will this Processed folder be created?
how to add features next to edge features, for example smiles (converted to edge) + x2 + x3 to predict y?
Thank you for the great videos! I have one question regarding using PyTorch Geometric Temporal from the other videos. In the traffic forecasting videos, you mentioned that it is possible to create a custom dataset for the library PyTorch Geometric Temporal by the method from this video, but as far as I can understand in this way 'temporal' layer is missing. Can you suggest any simple way to use this method to create a custom data that also contains temporal layer in the nodes?
Hello! The basic idea to create the Dataset is the same as in this video. For the missing temporal component you need to decide how your individual (time series) graphs are added together. Basically you need to create a list of graph objects.
Pytorch geometric also has support for temporal graphs. The documentation has good examples:
pytorch-geometric.readthedocs.io/en/latest/modules/data.html?highlight=Temporal#torch_geometric.data.TemporalData
@@DeepFindr Ah I get it! I'll read through the documentation and try them out. Thank you very much :)
Let me know if I can help you :)
Thanks for the video. I have a question about constructing the dataset. I have some data in the Data format of PyG. I wonder how I can combine them into one dataset that contains all of them in order to apply a GCN? I watched the video, however, I could not understand how the final dataset includes all the individual objects that are defined in Data format.
Hi! In the video I created a "standard" dataset which doesn't hold a list of Data objects or something. Instead all of them were stored on the hard drive and loaded whenever needed, which is done with the get function.
If you want to put all the data objects together in one dataset, you need to have a look at the InMemory Dataset class:
pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html
This means you still have to have a class where you collect all your data objects. Let me cite the documentation here: "The real magic happens in the body of process(). Here, we need to read and create a list of Data objects and save it into the processed_dir. Because saving a huge python list is rather slow, we collate the list into one huge Data object via torch_geometric.data.InMemoryDataset.collate() before saving. The collated data object has concatenated all examples into one big data object and, in addition, returns a slices dictionary to reconstruct single examples from this object."
Hope this is what you are looking for :)
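The collate-and-slice idea from the quote can be sketched in a few lines of plain Python. This is a simplified stand-in, not PyG's implementation: it handles only node features, and both helper names are hypothetical.

```python
def collate(graphs):
    """Concatenate all node feature lists and record where each graph ends."""
    big_x, slices = [], [0]
    for features in graphs:
        big_x += features
        slices.append(len(big_x))
    return big_x, slices


def get(big_x, slices, idx):
    """Reconstruct the idx-th example from the collated object."""
    return big_x[slices[idx]:slices[idx + 1]]


# Three toy graphs with 2, 1 and 3 nodes (one feature per node).
graphs = [[[1.0], [2.0]], [[3.0]], [[4.0], [5.0], [6.0]]]

big_x, slices = collate(graphs)
print(slices)                 # [0, 2, 3, 6]
print(get(big_x, slices, 2))  # [[4.0], [5.0], [6.0]]
```

In the real class the same bookkeeping is kept for every attribute (node features, edge index, labels, ...), which is why the docs speak of a slices dictionary.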
Thank you so much for such a great video, I have learnt a lot !
thanks for sharing different types of how to load the dataset. I have a question though: is there a way to get the node features and edge_index without this mol_obj = Chem.MolFromSmiles(mol["smiles"])?
Hi! In a later video I also mention the deepchem featurizer as an alternative.
Do you have problems with Rdkit or why would you want to get rid of it? :)
@@DeepFindr I wanted to try out other datasets, Proteins and Enzymes, from the network repository.
So if I want to get the node features and edge_index, is there a different approach I should follow?
I have an error "dataset not defined". What am I doing wrong? Please help me...
So we are not providing any node identification label, right? So I'm assuming that whatever index a node has in the node feature matrix is how the node is identified, and that same index has to be used for the adjacency list. Is my understanding correct?
Yes exactly, the node indices are arranged according to the node feature matrix. The first "row" corresponds to the first element. In the edge_index a 0 corresponds to this node.
You can of course save a mapping of your preferred indexing and always map it back.
@@DeepFindr Thanks a lot for your time sir.
Hi, thanks for the video! I want to create my own dataset but with the TemporalDataset from Pytorch Geometric. Can you advise me, at which places the code must be modified to create a TemporalDataset?
Hi! Did you watch my video on how to convert a tabular dataset to a graph dataset? I used PyTorch Geometric Temporal for this, so I have no experience with the temporal datasets in plain PyG
Many Thanks @DeepFindr
Hi, thanks for the video. I would like to ask about the edge array shape. The DataLoader requires a size of [2, number of edges] and it must be in COO format. Could you tell me how to build this edge array? Thank you
Hello! Please have a look at my latest uploads. They are about how to get the edge index and other graph attributes.
:)
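For a quick picture of the [2, number of edges] layout, here is a sketch that converts a small adjacency matrix into COO form with plain lists. The matrix and the helper name are made up for illustration.

```python
def adjacency_to_coo(adjacency):
    """Collect the (row, column) positions of all non-zero entries."""
    sources, targets = [], []
    for i, row in enumerate(adjacency):
        for j, value in enumerate(row):
            if value != 0:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


# Symmetric adjacency matrix of a chain with 3 nodes: 0 - 1 - 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

edge_index = adjacency_to_coo(adjacency)
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

The first row holds the source nodes and the second row the target nodes. Wrapping the result in `torch.tensor(edge_index, dtype=torch.long)` gives the tensor that goes into the Data object.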
Hi,
I have been trying to make a dataset of protein-ligand complexes for a GCN that is intended to predict their binding sites. Would you mind explaining how to make a dataset for that?
Hi,
I would have a look at pair data as described here: pytorch-geometric.readthedocs.io/en/latest/advanced/batching.html :)
According to the documentation of torch_geometric.data.Data the node features should be of size (num_nodes, num_of_node_features) but what should you do if your node features are multi-dimensional, for example, image data?
Hi,
If you have images on the nodes there are several things you could do:
- convert the image into features using a pretrained CNN model e.g. Inception
- flatten the image including all channels. So (512,512,3) would go into one vector of size 512x512x3.
However I think this will not yield good results, as CNNs were especially designed to handle such inputs.
- aggregate over the dimensions that are "too much" with mean /max...
- model the dimensions as several 1D nodes that are connected with each other
But I get your point - you basically want to have tensors as nodes. To be honest I haven't seen any implementation in that direction so far. It would require the message passing functions to operate on data such as
(num_nodes, feature_dim_1, feature_dim_2, feature_dim_3... ).
Have you checked for papers in that direction? Maybe it's possible to create some custom components that are able to do this.
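Two of the options from the list, flattening and aggregating, can be sketched with nested lists. The tiny 2x2x3 "image" and both helpers are assumptions for illustration; with real images one would use tensor operations instead.

```python
def flatten(image):
    """Flatten an (H, W, C) nested list into one vector of length H*W*C."""
    return [value for row in image for pixel in row for value in pixel]


def channel_means(image):
    """Average over height and width, leaving one value per channel."""
    pixels = [pixel for row in image for pixel in row]
    n_channels = len(pixels[0])
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(n_channels)]


# Hypothetical 2x2 image with 3 channels sitting on one node.
image = [
    [[0, 10, 20], [2, 12, 22]],
    [[4, 14, 24], [6, 16, 26]],
]

print(len(flatten(image)))   # 12
print(channel_means(image))  # [3.0, 13.0, 23.0]
```

Either result is a flat vector, so stacking one per node gives the usual (num_nodes, num_node_features) matrix.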
Great video. May I ask a question: I have a KG (h, r, t) stored in a networkx graph G without any embedding (plain English words for nodes and relations). How should I load it in PyG to perform link prediction?
Hello!
There is a function in PyG that converts networkx into PyG graph objects (from_networkx): pytorch-geometric.readthedocs.io/en/latest/modules/utils.html
Is that what you are looking for?
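If you would rather do the conversion by hand, the core step is mapping the plain-text nodes and relations to integer ids. A rough sketch in plain Python follows; the example triples, the helper name `index_triples` and the edge attribute name "relation" are assumptions, not something from the video.

```python
# Map (head, relation, tail) string triples to integer ids.
# Triples and names below are hypothetical examples.
triples = [
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "acme"),
    ("alice", "knows", "bob"),
]

def index_triples(triples):
    node_ids, rel_ids = {}, {}
    sources, targets, edge_type = [], [], []
    for head, rel, tail in triples:
        # setdefault assigns the next free id the first time a name is seen
        h = node_ids.setdefault(head, len(node_ids))
        t = node_ids.setdefault(tail, len(node_ids))
        r = rel_ids.setdefault(rel, len(rel_ids))
        sources.append(h)
        targets.append(t)
        edge_type.append(r)
    return node_ids, rel_ids, [sources, targets], edge_type

node_ids, rel_ids, edge_index, edge_type = index_triples(triples)
print(node_ids, rel_ids, edge_index, edge_type)

# Starting from a networkx graph G, the triples could be collected like this
# (assuming the relation is stored in an edge attribute called "relation"):
#   triples = [(h, data["relation"], t) for h, t, data in G.edges(data=True)]
```

The two id dictionaries are worth keeping around, because you need them to translate predicted links back into words.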
@@DeepFindr thanks a ton! I will try this..seems like what i want.. :)
Could you please make a video on preparing dgl dataset for link prediction...it would be really helpful
@DeepFindr how do you prepare a custom dataset for heterogeneous graphs?
For that please have a look at this article: pytorch-geometric.readthedocs.io/en/latest/notes/heterogeneous.html
Super helpful! Great video. I have one question. It would be wonderful if you could nudge me in the right direction: I am trying to build a model to predict a property of a binary chemical mixture. So my data is basically in the form [SMILES1, SMILES2, target_value]. How do I construct the dataset so that I can handle the two graphs separately in my forward() and then only combine their embeddings? Just concatenating them would not work so easily, as global pooling would happen over all the nodes, even if they are not connected, right?
Hi!
Have a look at PairData here: pytorch-geometric.readthedocs.io/en/latest/notes/batching.html
I think this might be what you are looking for :) In the GNN you can then simply pass the separate graphs through different layers, so one for the "left" and one for the "right" molecule. Hope this helps :)
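On the pooling concern: as far as I understand the PairData docs, the loader can keep a separate batch vector per side (via follow_batch, giving x_s_batch and x_t_batch), so each molecule is pooled on its own. The plain-Python sketch below only mimics that idea; the feature values and the helper `mean_pool` are invented, and a real model would use PyG's pooling functions instead.

```python
# Pool "left" (s) and "right" (t) molecules separately, then concatenate per pair.
# All numbers are toy values; mean_pool is a stand-in for a real pooling layer.

def mean_pool(x, batch, num_graphs):
    """Average node features per graph, using a batch vector (node -> graph id)."""
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for feats, g in zip(x, batch):
        counts[g] += 1
        for k, value in enumerate(feats):
            sums[g][k] += value
    return [[s / c for s in row] for row, c in zip(sums, counts)]

# Two pairs in one batch. Left molecules have 2 and 1 nodes, right ones 1 and 3.
x_s = [[1.0], [3.0], [10.0]]
x_s_batch = [0, 0, 1]
x_t = [[2.0], [4.0], [6.0], [8.0]]
x_t_batch = [0, 1, 1, 1]

pooled_s = mean_pool(x_s, x_s_batch, num_graphs=2)
pooled_t = mean_pool(x_t, x_t_batch, num_graphs=2)

# One combined embedding per pair: [left embedding, right embedding]
pair_embedding = [s + t for s, t in zip(pooled_s, pooled_t)]
print(pair_embedding)
```

Because each side has its own batch vector, no node of the left molecule ever contributes to the pooled embedding of the right one.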
Exactly what I was looking for, thank you so much!
How do I create a custom dataset in which nodes have different feature sizes? Let's say one node has 5 features and a second node has 3 features. How can we create a dataset in this case? Do we have to use an embedding space?
For this you need to build a heterogeneous graph as described here: pytorch-geometric.readthedocs.io/en/latest/notes/heterogeneous.html
Great video. I have a question. What is the "filename" in __init__()? Thank you
Thanks! If I remember correctly this is the name of your dataset csv file
thank you very much. Have a nice day.@@DeepFindr
Hello, I have a project related to fake news detection using GNNs. I want to know how to process a raw dataset in Pytorch Geometric, for example gossip cop_fake.csv and gossip cop real.csv.
Hi, have you seen my videos on how to convert a tabular dataset into a graph? I think this is what you are looking for :)
Hey! Great video!! Can somebody explain to me, though, why you would use your own dataset if there is a dataset in Pytorch from the same link? Are there any drawbacks to using a predefined dataset?
Hi! This tutorial is mainly for when you want to use your own dataset (which doesn't yet exist).
Of course you can use MoleculeNet from Pytorch Geometric, it is the same implementation :)
@@DeepFindr oh I get it. Thanks!
Amazing videos, I have a question
I have a history of illegal parking tickets (street address, time, day, year) and I want to predict the illegal parking activity in the future (for a specific day and hour). Could I do this with graph neural networks? If yes, would that be "node prediction", "graph level prediction", or something else that I'm missing?
Hi Nick! Thanks :)
The first question you have to ask is what are the nodes and what are the edges, if you want to model this with a GNN.
Typically for traffic things, the streets are treated as edges and specific sections on streets as nodes (crossings or subsections of longer streets). You could maybe treat the parking spots as nodes somehow. Alternatively, treat the streets as nodes and introduce an edge between two streets if they are connected.
The next thing is that you have a temporal component in your feature space. One way could be to have a global feature for the whole graph, that specifies the point in time. You could either use things like the day of the week or even which part of the day (morning, evening...) and attach it to each node.
With all that, you would basically perform node-level prediction for the whole graph. That means you predict for every parking spot / street in your graph whether there is illegal activity going on at a specific point in time. Your dataset would then basically be a collection of snapshots of a street map, and a data point would be a graph for a specific point in time, labeled with all the illegal activity at the different parking spots / streets.
That's just one idea, certainly there are different ways to model this. I think the challenging part is to get a proper representation of the input space (of the street network). For this maybe have a look at this collection of road networks: networkrepository.com/road.php
Hope this helps, let me know if you need more :)
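A rough sketch of the "collection of snapshots" idea in plain Python. The three parking spots, their connections and the ticket records are invented, and the record layout (spot id, weekday, hour) is an assumption about how the ticket history could be encoded.

```python
# One fixed graph of parking spots, one labeled snapshot per point in time.
# Graph, tickets and field layout are hypothetical.
NUM_SPOTS = 3
EDGE_INDEX = [[0, 1, 1, 2], [1, 0, 2, 1]]  # spots 0-1 and 1-2 are neighbours (both directions)

# (spot_id, weekday, hour) for each recorded ticket
tickets = [(0, 5, 20), (2, 5, 20), (1, 1, 9)]

def build_snapshot(tickets, weekday, hour):
    # The time point is attached to every node as its feature vector
    x = [[weekday, hour] for _ in range(NUM_SPOTS)]
    # Node label: 1 if a ticket was issued at that spot at this time, else 0
    y = [0] * NUM_SPOTS
    for spot, t_weekday, t_hour in tickets:
        if (t_weekday, t_hour) == (weekday, hour):
            y[spot] = 1
    return {"x": x, "edge_index": EDGE_INDEX, "y": y}

snapshot = build_snapshot(tickets, weekday=5, hour=20)
print(snapshot["x"], snapshot["y"])
```

Each dictionary corresponds to one data point; x, edge_index and y would become the fields of one graph object.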
@@DeepFindr Thank you very much for your response, I hope my professors act like you.
if I understood correctly, you mean:
- I have to draw a graph (for example, graph = g) with all the parking spaces as nodes.
- Each line of the dataset will have the same graph (g) in "Y" every time. And in "X" I will have something like a time point.
- Every line has the graph (g) in "Y", but the difference between the graphs is that in different cases some of the nodes will be labeled as "illegal" or "legal".
So after the training if I go and give a time point in "X" the model will have to predict which of the nodes in the graph is illegal or not.
Let me know if I understood you correctly.
Thanks in advance
Hi! I hope they will :)
Yes this would be my approach if I had to model it as a graph problem.
Did your X and Y refer to features and labels like in classical machine learning? Then I would say X are: the graph and the node features, Y are just the labels (illegal/legal).
However, I am not sure if the overall problem is suitable for GNNs. Typically you would either have a different graph structure for different data points (like for molecules), or you would have information about other nodes in the graph and then infer on unlabeled nodes (like in a social network).
Here you always have the same graph structure and always try to predict all nodes.
Another way to formulate the problem might be to label a few nodes and then predict the others. For example, given that parking spots A and B have illegal activity, predict the label of parking spot C.
The first question when thinking about this is "what influences whether there is illegal activity". The strongest features are probably time and day of the week. On the weekend there might be more people around and therefore more illegal activity. Another feature is certainly the parking spot itself, because people usually don't do illegal things in crowded areas and some spots will always have more frauds than others.
Besides that there are many external factors that we cannot consider, such as whether a person is in a rush or has no money to pay the meter etc.
With that you could also treat it as a tabular dataset and use any scikit-learn model to fit it.
The only thing that would really justify the use of GNNs is if there is some sort of interaction between the nodes. For example if node A (= parking spot) has illegal activity, then you could use this information to predict the activity on node B. For example if they are close to each other, it might be that the illegal activity "motivates" someone to also not pay.
Thus, there might be some patterns in the data, for which GNNs make sense, but again only if you don't predict all nodes at the same time. Because if you do that, all nodes have the same features and it makes no sense to share information.
Best regards :)
Amazing video! Helped a lot!
Thank you very much, this is hugely appreciated mate
Thanks! :)
Hey Florian. I like your videos. Sorry to ask the question here, but what type of software do you use to produce your videos, if I may ask (referring to the other videos where you present theory)?
Hi! Thanks!
At the moment I use active presenter in the free version. It's nothing fancy :) a couple of functionalities are missing in the free version so I might consider switching to movie maker or so. :)
@@DeepFindr, alright, thanks :)
thank you so much sir for this video😊 please make a video on "how to setup different libraries and virtual environment before running of code"
Very helpful, thank you a lot!
Hello! Thanks for this series! It's so useful! In line 133, any reason why you reshaped it instead of transposing it? I feel like reshaping the coo matrix will smush some of the rows and columns together, and transposing it feels more intuitive to me. Thanks!
Hi! Thanks for the feedback, I appreciate it!
Yes you are right, transpose is probably the safer option there.
Actually I should adjust that on Github, thanks for pointing that out!
Hi! I just checked the code and realized that I already fixed this some time ago. So on Github I'm also using a transpose.
Best regards
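To illustrate why transposing is the safer choice, here is a small plain-Python comparison. It assumes the edges start out as a [num_edges, 2] list of pairs and that "reshape" means a row-major reshape to [2, num_edges]; both helpers are written out by hand for illustration.

```python
# Edges as [num_edges, 2]: each row is one (source, target) pair.
pairs = [[0, 1], [1, 2], [2, 3]]

def transpose(rows):
    """Swap rows and columns: [num_edges, 2] -> [2, num_edges]."""
    return [list(col) for col in zip(*rows)]

def reshape_row_major(rows, new_rows, new_cols):
    """Flatten row by row, then cut into new rows (what a plain reshape does)."""
    flat = [value for row in rows for value in row]
    return [flat[i * new_cols:(i + 1) * new_cols] for i in range(new_rows)]

transposed = transpose(pairs)
reshaped = reshape_row_major(pairs, 2, 3)

# Read the columns back as edges
edges_from_transpose = list(zip(*transposed))
edges_from_reshape = list(zip(*reshaped))
print(edges_from_transpose)  # the original edges
print(edges_from_reshape)    # sources and targets got mixed up
```

Both results have the right shape, which is why the mistake is easy to miss; only the transpose keeps each source paired with its own target.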
@@DeepFindr I have a doubt: for edges you are transposing, which results in a 2 x a dimension, and for node features we have a b x constant dimension, where a and b are the varying dimensions in dim 1 and dim 0 respectively. Won't this cause problems? (I am running GNNExplainer on the whole graph and I'm bumping into tensor dimension problems, which I didn't face while training, hence this doubt.) As far as I understand, only dim 0 can vary, right?
Hi! Actually this is handled implicitly by pytorch geometric. Have a look at this article: pytorch-geometric.readthedocs.io/en/latest/notes/batching.html
Edge_index will be extended in the second dimension.
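A hand-written sketch of what the batching article describes: the graphs are merged into one big disconnected graph, node indices of later graphs are shifted by the number of nodes seen so far, and the edge indices are concatenated along the second dimension. The helper `batch_graphs` and the two toy graphs are made up; PyG does this for you inside its DataLoader.

```python
# Mimic mini-batching of graphs by hand (toy example, not PyG's actual code).

def batch_graphs(graphs):
    """graphs: list of (num_nodes, [sources, targets]) tuples."""
    sources, targets, batch = [], [], []
    offset = 0
    for graph_id, (num_nodes, (src, dst)) in enumerate(graphs):
        # Shift node ids so they stay unique in the merged graph
        sources += [s + offset for s in src]
        targets += [t + offset for t in dst]
        # Remember which graph every node belongs to
        batch += [graph_id] * num_nodes
        offset += num_nodes
    return [sources, targets], batch

g1 = (3, [[0, 1], [1, 2]])  # 3 nodes, 2 edges
g2 = (2, [[0], [1]])        # 2 nodes, 1 edge

edge_index, batch = batch_graphs([g1, g2])
print(edge_index)  # still 2 rows, now 3 columns
print(batch)
```

So the first dimension of the edge index stays 2 and only the number of columns grows, while node features grow along dim 0.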
Great video indeed. Please help me with the graph formulation of my data, where nodes are locations (longitude and latitude, along with zip code etc.) and edges are the roads that connect them, with distance and traffic as weights. I want to convert this CSV data into a graph. Is there any package like RDKit for extracting edge and node features?
Hi! No this needs to be done manually. I have a video coming out in the next days / week that explains this in detail. If you need earlier help pls send me an email :)
@DeepFindr
did you get my reply? sorry for bothering you!
Hi! Sorry I didn't get the reply :) next week I will upload 2-3 videos on this topic. Best regards
did you upload these videos?@@DeepFindr
Hi, great video! You say you added edge features twice for each direction - meaning you're using directed graphs. Is there a reason for that? Since molecules are undirected graphs by nature.
Hi and thanks!
Yes that's right. In Pytorch Geometric it is required to add both directions separately.
This allows the GNN to pass information in both directions. If you for instance only add [0,1] and not [0,1] and [1,0], then the model will only share information from node 0 to node 1, but not vice versa.
This is also illustrated in the introduction example in the documentation. There it says for an undirected graph: "...Although the graph has only two edges we need to define four index tuples..."
Best regards
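A small sketch of adding both directions in plain Python, including the edge features, which then have to be added once per direction as well. The bond list, the feature values and the helper `make_undirected` are illustrative assumptions.

```python
# Represent an undirected molecule graph with two directed edges per bond.
# Bonds and feature values are toy data.

def make_undirected(bonds, bond_features):
    sources, targets, edge_attr = [], [], []
    for (i, j), feat in zip(bonds, bond_features):
        # one entry for i -> j and one for j -> i
        sources += [i, j]
        targets += [j, i]
        # the same bond features are stored once per direction
        edge_attr += [list(feat), list(feat)]
    return [sources, targets], edge_attr

bonds = [(0, 1), (1, 2)]          # two bonds between three atoms
bond_features = [[1.0], [2.0]]    # e.g. a bond type per bond

edge_index, edge_attr = make_undirected(bonds, bond_features)
print(edge_index)
print(edge_attr)
```

Two bonds give four index tuples and four feature rows, so the number of columns in the edge index and the number of rows in the edge features stay in sync.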
@@DeepFindr aaa i see, guess i missed that part. Thanks a lot!
Amazing as usual, thanks for your effort. Just one small question: this is the way to prepare molecule data only. If I have another type of data, for example citations, should I use a different approach? And does the purpose of the experiment change this, e.g. graph classification versus node prediction?
Hi,
Yes this example is only for molecules and graph level predictions.
But generally for every dataset you need to construct these things such as node features, adjacency info etc.
For example for a citation network you would put text features (bag of words or so) as node feature vector. But the general structure of the code stays the same :)
And if you have node level predictions you additionally need to define a mask that you can pass to pytorch geometric.
But there are many examples available :)
@@DeepFindr thanks a lot,
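For the mask mentioned above, a minimal plain-Python sketch: one boolean per node that says whether the node is used for training. The split fraction, the seed and the helper `split_masks` are assumptions; in PyG such masks are usually stored as boolean tensors on the graph object (for example as train_mask).

```python
import random

def split_masks(num_nodes, train_fraction=0.8, seed=0):
    """Return (train_mask, test_mask) as lists of booleans, one entry per node."""
    rng = random.Random(seed)
    order = list(range(num_nodes))
    rng.shuffle(order)
    cut = int(train_fraction * num_nodes)
    train_nodes = set(order[:cut])
    train_mask = [i in train_nodes for i in range(num_nodes)]
    test_mask = [not m for m in train_mask]
    return train_mask, test_mask

train_mask, test_mask = split_masks(num_nodes=10)
print(sum(train_mask), sum(test_mask))

# With PyTorch installed this could be attached to the graph, e.g.:
#   data.train_mask = torch.tensor(train_mask)
```

During training the loss is then computed only on the nodes where the training mask is True.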
Thank you, your videos are really helpful, I have a question if you don't mind. If we have another dataset let's say a traffic data of a network system, how can we decide on which nodes are to be connected?
Hi! Sure, could you quickly provide some more information?
For a traffic network I could imagine that the nodes are intersections and the edges are the connections between them. You could have a look at this collection, they also have road datasets: snap.stanford.edu/data/.
I'm not entirely sure what you try to model, but essentially you need to convert your traffic network into a graph somehow.
@@DeepFindr Hi, thank you for replying to me. I am trying to structure botnet detection benchmark data (ISCX-2014 dataset) into graph data. The idea is to classify whether nodes participate in a botnet or not using GNNs.
Ahh okay, I understand now. At first I thought you were referring to actual road traffic networks.
In that case I guess you have nodes (ip addresses) and their traffic, so connection pairs to other nodes. You could build the edges based on the traffic between two nodes and also add all sorts of information about the connection as edge features (how much traffic, connection type..).
The nodes could additionally have node features that describe information about each of the peers.
Also there is a paper that applies GNNs for botnet classification, which is called "Automating Botnet Detection with Graph Neural Networks", but probably you already know it.
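A rough sketch of the construction described above, in plain Python: IP addresses become node ids, every communicating pair becomes an edge, and the traffic is aggregated into edge features. The flow records and their field names (src, dst, bytes) are hypothetical and not taken from the ISCX dataset.

```python
from collections import defaultdict

# Hypothetical flow records; field names are assumptions for illustration.
flows = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 1200},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 300},
    {"src": "10.0.0.3", "dst": "10.0.0.1", "bytes": 50},
]

def flows_to_graph(flows):
    ip_to_id = {}
    # (src_id, dst_id) -> [total_bytes, num_flows]
    totals = defaultdict(lambda: [0, 0])
    for flow in flows:
        s = ip_to_id.setdefault(flow["src"], len(ip_to_id))
        d = ip_to_id.setdefault(flow["dst"], len(ip_to_id))
        totals[(s, d)][0] += flow["bytes"]
        totals[(s, d)][1] += 1
    pairs = sorted(totals)
    edge_index = [[s for s, _ in pairs], [d for _, d in pairs]]
    edge_attr = [totals[pair] for pair in pairs]
    return ip_to_id, edge_index, edge_attr

ip_to_id, edge_index, edge_attr = flows_to_graph(flows)
print(ip_to_id)
print(edge_index)
print(edge_attr)
```

Node features (for example per-host statistics) and the botnet labels would be added per node id in the same order.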
@@DeepFindr Thank you so much that is so kind of you. I just wanted to say keep going you are really making an impact and helping fellow learners and students. KEEP it up
Thanks man! Good luck with your project!
Damn I’m addicted
How to create your own dataset from networkx library?
Hi! Sorry I have never used that library, so I don't know.
@@DeepFindr It would make a really nice video. I created a graph with NetworkX, which has a method to calculate PageRank, and I am trying to train a GNN to predict the remaining PageRank values instead of calculating them the classical way. My problem is preprocessing the PageRank node parameters into a torch_geometric.data.Data object. It would be a nice project.
had the same question. @Sam did you get any solution so far? My guess is we need to read .gpickle (or whatever format you are using) in the same way he read the CSV file.
It is possible to convert pytorch geometric graph objects to networkx, if that helps. There are a couple of helper functions in the utils section of the documentation
Can we predict adverse drug events with a GNN?
Have a look at this paper: arxiv.org/abs/2004.00407
NOTCO is a startup that makes plant-based products like milk, mayonnaise, etc. They use an AI (it's called Giuseppe) to give plant-based products the same features as the real ones. Can you research that and do a video on it, please?
Interesting! I'll note it down and will have a look :)
Have you seen this video: th-cam.com/video/f8Hb7yjF1Hc/w-d-xo.html :)
Thank you so much... I will take a look
The last step takes too much time even in colab gpu :(
I mean after calling the object :(
What exactly do you mean? Which step?
@@DeepFindr hey its solved! I might have some issue with the function def. Now ok! Thanks for replying😀
OK perfect :)
@DeepFindr This is awesome. Helping me a lot.
How can I call the class as I am using google colab? MoleculeDataset needs two arguments. Thanks.
dataset = MoleculeDataset('what would be the root?', 'what would be file name?')
The root should be the folder where the directories "raw" and "processed" are present (as described in the video). File name is the name of your file, e.g. dataset.csv
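A small sketch of the folder layout for the root argument, using only the standard library. PyG's Dataset base class looks for a "raw" and a "processed" subfolder under root; the helper name, the temporary folder and the CSV columns are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def prepare_root(root, filename):
    """Create root/raw and root/processed and return the path for the raw CSV."""
    root = Path(root)
    (root / "raw").mkdir(parents=True, exist_ok=True)
    (root / "processed").mkdir(parents=True, exist_ok=True)
    return root / "raw" / filename

# A temporary folder is used here so the sketch leaves nothing behind;
# in Colab you would use a real folder such as "data/".
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    csv_path = prepare_root(root, "dataset.csv")
    csv_path.write_text("smiles,label\nCCO,0\n")  # hypothetical columns
    layout = sorted(p.name for p in root.iterdir())
    raw_files = [p.name for p in (root / "raw").iterdir()]

print(layout, raw_files)

# The dataset from the video would then be created roughly like this:
#   dataset = MoleculeDataset("data/", "dataset.csv")
```

The CSV goes into the raw folder, and the processed folder is where the converted graph files end up.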
Next part please :(
Today or tomorrow! :) sorry for the delay!!
+