GNN Project #3.1 - Graph-level predictions

DeepFindr

Views: 24,711

Published on 10 Dec 2024

Comments • 90

@casualgamer91 2 years ago ⁺⁶
Hi - just wanted to say thank you for making such fantastic educational content! I am amazed by your effort to teach the community - even more so when you mentioned you aren't a computational chemist by training! Your videos are helping me understand the nebulous (to me) nature of graph neural networks and actually applying it via code. Please keep up the good work!
@DeepFindr 2 years ago ⁺¹
Happy that you find it useful :)
@MaryamSadeghi-u6u 2 months ago
You have put a lot of time into creating these videos, and it is really valuable that after 3 years they are still very useful.
@vijayalaxmiise1504 5 months ago
Very nice explanations, Sir. GNNs explained very clearly for beginners like me. Thank you so much!
@raziehrezaei3156 2 years ago ⁺¹
Amazing video with clear and detailed explanations! Thank you so much!!
@sharzilkhan9150 3 years ago
Finally got it working to some extent. Thank you for your guidance.. :)
@borauyar 3 years ago ⁺²
Great video, very well explained. Thank you!
@ahmadzobairsurosh5832 2 years ago
Absolutely Amazing!
Thanks for such a great explanation
Cheers!
@DeepFindr 2 years ago
Thank you!
@florianhonicke5448 3 years ago
Good job explaining everything!
@LeilaHajiloo 2 years ago ⁺²
Great work! Thanks for sharing.
I have a quick question for you. I use the featurizer function instead but I'm getting an error from
train_dataset = MoleculeDataset(root="data/", filename="HIV_train_oversampled.csv")
It says:
... in to_pyg_graph(self)
edge_attr=edge_features,
pos=node_pos_features,
--> **kwargs)
def to_dgl_graph(self, self_loop: bool = False):
TypeError: type object got multiple values for keyword argument 'pos'
My f[0] value is GraphData(node_features=[75, 30], edge_index=[2, 162], edge_features=[162, 11], pos=[0]).
Also, I'm running the code in a Google Colab notebook.
@CLABEATLE 1 year ago
I am also running into this same issue! If anyone found a fix it would be great if they could share, as I am completely stuck with this :(
@mahsayazdani5887 1 year ago
I have the same issue too...
@thegimel 3 years ago ⁺⁴
Great video, thanks for making it. If you oversample positive samples, why do you still consider the sets to be imbalanced? Or did you not train/evaluate on the oversampled sets?
@DeepFindr 3 years ago ⁺²
Good point.
At first I didn't oversample and then I needed a weighted loss function.
Now that I oversample I could also remove the weights.
However the test data is still imbalanced and for that I want to put more emphasis on getting the positive class correctly. That's why I still weight the training to put more focus on the minority class.
In my experiments I found that oversampling on train data + keeping a weighted loss gives me better results on the test set.
But I also found that there is a bug in my code somewhere, because at the moment the class distribution for the train data is somehow still not even :D
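The combination described above (oversampling the training set and still weighting the positive class in the loss) can be sketched in plain Python. This is only an illustration under assumptions: the helper names `oversample_minority` and `weighted_bce` are made up for this note and are not taken from the project's code, and the toy labels are synthetic.

```python
import math
import random


def oversample_minority(labels, seed=0):
    """Return sample indices in which the minority class is resampled
    (with replacement) until both classes have the same count.
    Hypothetical helper, not from the video's repository."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def weighted_bce(probs, labels, pos_weight=1.0):
    """Mean binary cross-entropy in which the positive term is scaled
    by pos_weight, so errors on positives cost more."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total -= pos_weight * y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


# Synthetic toy labels: 9 negatives, 1 positive.
labels = [0] * 9 + [1]
balanced = [labels[i] for i in oversample_minority(labels)]

# The same prediction on a positive costs twice as much with pos_weight=2.
loss_plain = weighted_bce([0.5], [1], pos_weight=1.0)
loss_weighted = weighted_bce([0.5], [1], pos_weight=2.0)
```

Checking the class counts of `balanced` right after oversampling is also a cheap way to catch the kind of uneven distribution mentioned above.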
@kaan608 3 years ago
Thank you for the new update :)
@鲍灵杰 3 years ago
Thank you for your talk
@ZephyrineFreiberg 2 years ago
Thanks for your video. I'd like to ask if there are any papers about graph classification/regression tasks on graphs with edge properties?
@DeepFindr 2 years ago
Hi! I have a video about how to handle edge features in graph neural networks that talks about a couple of those :)
@juanete69 1 month ago
Why do you need to use weight if you have already oversampled the data?
@sanataj1383 1 year ago
Hi, can you make a video with a GNN model for classification of Alzheimer's disease?
@nicolasf1219 5 months ago
For some reason I have only 15 million parameters instead of 17 million. I double-checked everything. What could be the reason? Could it simply be because of the latest versions that I use?
@koolgal722 1 year ago
Hello, does this work for multi-class classification as well? E.g. in the SIDER data from MoleculeNet we have 27 classes. Please reply!
@jianxianghuang1275 3 years ago
Great talk!!!
@jerryjohnthomas4908 2 years ago
Could you give an idea of which explainability procedures you would use to explain the model in this video?
@DeepFindr 2 years ago
Hi! What about the GNN Explainer? I also have a video about it :)
@ziruisu5990 2 years ago
awesome video! thank you!
@sdwysc 2 years ago
precious.
@seza1231 2 years ago
Hi, I don't know if you'll see this, but I want to make the model predict.
How could I do it?
@gabrielpeter4544 3 years ago
Great content! Learning a lot on this topic :)
@juanete69 1 month ago
Why do you say that x1+x2+x3 is a concatenation?
Isn't it an element-wise addition?
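The difference the question points at can be shown with a small plain-Python sketch. The vectors are made-up values. Note that on Python lists `+` concatenates, whereas on PyTorch tensors `+` adds element-wise and concatenation is done with `torch.cat`; which of the two the video's code uses is not visible in this thread.

```python
x1 = [1.0, 2.0]
x2 = [10.0, 20.0]
x3 = [100.0, 200.0]

# Concatenation: the vectors are placed one after another,
# so the result is three times as long.
concatenated = x1 + x2 + x3  # on Python lists, "+" concatenates

# Element-wise addition: matching positions are summed,
# so the result has the same length as each input.
summed = [a + b + c for a, b, c in zip(x1, x2, x3)]
```

The practical consequence is the input size of the next layer: concatenating three vectors of size d needs a layer that accepts 3d features, while summing them keeps it at d.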
@soodabehghaffari2323 2 years ago
After you save the data as *.pt files (as a cached dataset) for reloading, how did you reload the data? How did you create the train and test data from all the *.pt files?
@DeepFindr 2 years ago
Hi! There is a get function in the Dataset class. Please have a look at the latest code on Github: github.com/deepfindr/gnn-project/blob/main/dataset.py
The test and train sets are also separated there.
@riyajatar6859 2 years ago
Great video.
Can you make some videos on GNN for textual data also.
That would be great fun
@DeepFindr 2 years ago
Hi! That would be Transformers in the end, or not?
Or how exactly is your text data represented?
@ajwadakil6892 2 years ago
Hello, can I get the commit of this video, as the commits on your repo seem to be only for the final model that uses Graph Transformers. Also you mentioned in an earlier video that if edge_attr is included in a layer, then we can use edge features, as per documentation, I have seen it included. Has it been included in the current implementation and can edge features be used with GAT right now?
@DeepFindr 2 years ago ⁺¹
Hi!
You can get previous commits easily on Github:
github.com/deepfindr/gnn-project/commit/d91f4d272294698c410e0caeee0f333e2a55efd7#diff-fada037ad086638e65c7ae77e3d223963e9afaa26326aab0ea718f4013176e43
This is a previous commit of the model file.
The library has changed a lot in the meantime, and GAT now also supports edge features:
pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GATConv
Good luck and let me know if you have further questions :)
@nerdinvestdor 3 years ago
Quick question: given that the dataset is already oversampled, is there still a need to use a weighted loss to push the positive samples even further? Also, is graph attention going to need more data for training compared to a simpler layer?
@DeepFindr 3 years ago
Hi! Nope, not needed, you are right. First I tried the weighted loss and after that oversampling. As a result I was still using the same loss function but I'm pretty sure that the unweighted version produces the same results. I think this is already adjusted in the code, I'll check later.
About the second question I'm not sure - GAT will most likely have more weights than GCN for example, as it has more matrices to fit (I didn't check this, but it should be true).
Generally, the more weights a NN has, the more data is required to get a good fit. But there is a lot of discussion around the required number of samples for a specific network size.
@alvinanil 3 years ago
Great video and great work!!
Can you make some videos on temporal link prediction on dynamic graphs?
@TheFlamencor 1 year ago
Hello! This is amazing content! I wanted to ask about the number of parameters in your model. With so many (millions), don't you worry about overfitting? Can't the model just "memorize" the data with so many degrees of freedom?
Thank you again!
@lyt743 3 years ago
Did you ever end up adding edge features? How hard do you think it would be (for me) to take your code, change the operator to GINConv, add edge features, and get the model to train?
@DeepFindr 3 years ago
Hi! Have a look at the next video (graph transformer), I've added edge features there :)
Let me know if this is what you are looking for
@lyt743 3 years ago
@DeepFindr It is, thanks! Another question: when I tried to run the code from GitHub (train.py), I got an error in the line: % self.top_k_by_n because it says top_k_every_n = 0 and it can't take a modulo by 0. The only change I made is disabling mlflow, which I think is unrelated. Do you know what the issue could be?
@DeepFindr 3 years ago
Hi, in the config.py for HYPERPARAMETERS simply change the top_k_every.. to 1 or 2 or whatever number except for 0 :)
Alternatively, in line 191 in train change HYPERPARAMETERS to BEST_PARAMETERS.
:)
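Why a value of 0 fails, and one way to guard against it, can be sketched as follows. This is a hypothetical helper written for this note: the name `should_apply_pooling` and the decision to treat 0 as "never pool" are assumptions, and the repository's actual logic around `top_k_every_n` is not shown in this thread.

```python
def should_apply_pooling(layer_index, top_k_every_n):
    """Return True when a pooling step is due after this layer.

    Hypothetical guard: `layer_index % 0` raises ZeroDivisionError,
    so a non-positive setting is interpreted as "never pool".
    """
    if top_k_every_n <= 0:
        return False
    return layer_index % top_k_every_n == 0


# With the guard, a config value of 0 no longer crashes.
never = should_apply_pooling(2, 0)
every_second = [should_apply_pooling(i, 2) for i in range(1, 5)]
```

Without such a guard, the fix suggested above (setting the value to any number other than 0) avoids the error directly.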
@clayouyang2157 3 years ago
If I want to extract advanced features from my original graph, can I use hierarchical pooling so that the middle layers represent something that summarizes parts of the original graph?
@DeepFindr 3 years ago
Hi, can you explain it a bit more precisely? :) I don't fully understand the question. Thx!
@clayouyang2157 3 years ago
@DeepFindr Do you have a Twitter account? It would be a better way to describe what I want to say.
@DeepFindr 3 years ago
You can write an email to deepfindr@gmail.com :)
@alejandroochoa9541 3 years ago
Hi! I'm trying to implement my own model on the same dataset, but from OGB. I'm currently struggling to make the model generalize, since basically all the predictions are "inactive". I think that is due to the imbalance of the dataset, so I used a weighted CrossEntropyLoss, but the model is still failing: the loss barely decreases and the test accuracy does not change. Did you face the same problems?
@DeepFindr 3 years ago ⁺¹
Hi!
Yes I was facing the same issues. Did you try oversampling?
In 3.2 I have tried some additional things. In the end the model was not good but at least sort of worked.
I think also that a hyperparameter search really helps, because a lot of parameters need to be tweaked.
In the papers people always report a ROC of 0.8 and higher, but this does not reflect the precision and recall, which is why I doubt that their results are much better :D
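That a high ROC AUC can coexist with poor precision and recall on imbalanced data can be illustrated with a small plain-Python sketch. The scores and labels below are synthetic toy values chosen for this note, not results from the video or from any paper, and the 0.5 threshold is an assumption.

```python
def roc_auc(scores, labels):
    """Probability that a random positive is scored above a random
    negative (ties count half); equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(scores, labels, threshold=0.5):
    """Precision and recall of the positive class at a fixed threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic data: 95 inactive, 5 active molecules.
# Positives are ranked above most negatives, but below the threshold.
labels = [0] * 95 + [1] * 5
scores = [0.1] * 90 + [0.6] * 5 + [0.4] * 5

auc = roc_auc(scores, labels)
precision, recall = precision_recall(scores, labels)
```

Here the ranking is mostly right, so the AUC is high, yet not a single active molecule is predicted as active at the chosen threshold. Reporting precision and recall next to ROC AUC makes this visible.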
@alejandroochoa9541 3 years ago
@DeepFindr I'm now trying to oversample the active ones thanks to your video, but since I'm using the PyG version of the dataset, I'm not sure how to do that because the dataset is already converted into a PyG Dataset object. But anyway, I will try to manually adjust the hyperparameters. And yeah, I think you did a great job. What I can see is that it is not as easy as it looks.
@sebbecht 2 years ago
Hey DeepFindr, first, I think the level of production on these videos is awesome! I hope you have success with this channel. I am curious what you think of the number of parameters in your model in relation to your dataset size. The original size is 80K molecules, but even with oversampling, is it enough to warrant a 17M parameter model? For simple ANNs some say there is a rule of thumb that you want 50x more samples than parameters, and while I know you can't transfer that rule to other domains directly, there still seems to be a large discrepancy in your case. For vision I have trained a CNN with >20M parameters on 140K images, but here we also have access to many more data transformations. If you have found any resources on understanding this I would much appreciate a read :)
@DeepFindr 2 years ago ⁺¹
Hi! First of all thanks for the kind words.
I agree, the gap between model parameters and train samples might be a bit large in this case.
I have however trained a similar (molecule GNN) model with around 10M parameters and only 25k samples, that performed extremely well. Other network architectures with less layers and less neurons didn't achieve this performance.
So I think it really depends on the problem and as molecules represent a rather complex input space, it might be good to choose a more complex network architecture.
I've also heard of rules of thumb but in practice my approach is to simply run a hyperparameter search to see what works best. :)
@sebbecht 2 years ago
@DeepFindr Thanks for the reply! One thing I wonder is: how do we know when it's too many then? Optimized performance could also mean overfitting and reduced generalizability? Are the benchmark datasets big and diverse enough to identify these things? I'll have to dig further :)
@DeepFindr 2 years ago
You can always log the test error on a holdout set to find out if your model overfits.
You will also quite quickly find out if your model is too big - usually it doesn't learn or takes very long to converge.
What you can do is to train 10 different network sizes and compare their test error (on unseen data) to find out how size corresponds with performance.
A commonly used dataset with all sorts of properties is ZINC, which has up to 230 million molecules. I typically use ZINC250k. It is quite diverse but I cannot express it in numbers. For this you would need to find a metric for diversity.
I recently uploaded a series on uncertainty quantification. I think this can also be helpful to find out if a specific model is good - and in addition to that also make it tell us when it's not good (for which data inputs).
Generally, I think the number of parameters should not be significantly higher than the number of samples, as it usually adds no benefit.
@MrFerdidos 2 years ago
Hi :D First of all thanks for such an amazing series of videos! I'm currently trying to implement your lesson to my own dataset. I'm also doing graph classification exclusively using: node_attributes, edge_indices (COO format) and label (binary). The sizes of such tensors are respectively [15120, 1], [2, 562486] and [1].
However, I come up with the error of tensors having different number of dimensions (2 and 1) upon model training. Did you also find this issue? I understand it comes from the nature of the original node features and edge list, but at the moment idk if it's a problem with the Dataset class definition or the architecture of the GNN. Is this issue intrinsic to GNN? I'd really appreciate some guidance here! :)
@DeepFindr 2 years ago
Hi! Is there any stack trace in which part of your code this error occurs?
Do your nodes have only a single node feature?
Also, have you tried expanding the dimension of the label to [1, 1]?
It's hard to say without seeing the code :) You can also send it to deepfindr@gmail.com if that's allowed for your project. Best regards
@MrFerdidos 2 years ago
@DeepFindr Wow, even without seeing the code! Indeed, my nodes have a single node feature and I had not expanded the dimension of the label. Actually, why is that [1,1]? The first position of this array should be the actual label (0/1), but what about the second position? Or the other way round?
The error occurs in the _collate method, more specifically: value = torch.cat(values, dim=cat_dim or 0, out=out)
I truly believe it was the problem with the labels...
@DeepFindr 2 years ago
Hi, often the label has an additional batch dimension.
The batching for graphs usually works by building a large disconnected graph. This graph contains all nodes and edges, but the individual graphs don't share connections.
For the labels of each individual graph a batch dimension is added, as the full graph is also a batched variant.
Did it work with [1, 1]?
@DeepFindr 2 years ago ⁺¹
Ah, and there are no additional values.
It's just that instead of labels like
[1, 0, 1, 0, 1, 1, ...]
you have an array like
[[1, 0, 1, 0, 1, 1, ...]]
@MrFerdidos 2 years ago
@DeepFindr Yep, it still didn't solve the problem. I guess the problem comes from the dimensions of the node features. I sent you an email. Thanks a lot ☺️
@mdtokitahmid2970 3 years ago
Thanks a lot :D
@sharzilkhan9150 3 years ago
What does embedding size do?
@DeepFindr 3 years ago ⁺¹
This is just a hyperparameter. For some problems you might choose 32, for others a larger size like 256.
Generally, the bigger the embedding size is, the more "space" the model has to store information. It's like the latent vector in an autoencoder.
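The cost of that extra "space" can be made concrete by counting weights. The sketch below assumes a generic fully connected layer that maps an embedding to an embedding of the same size; it is not the model from the video, and `linear_params` is a helper made up for this note.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters of a fully connected layer:
    one weight per input/output pair, plus one bias per output."""
    return in_features * out_features + (out_features if bias else 0)


# Parameter count of a single embedding-to-embedding layer
# for the two sizes mentioned above.
small = linear_params(32, 32)
large = linear_params(256, 256)
```

Because the weight matrix grows with the square of the embedding size, going from 32 to 256 multiplies the parameters of such a layer by roughly 62, which is one reason the embedding size is worth including in a hyperparameter search.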
@sharzilkhan9150 3 years ago
@DeepFindr thank you
@sharzilkhan9150 3 years ago
@DeepFindr Thank you for your response.. I have another query. I want to use dense layers and add a regularizer. Is it possible to add them in this model or replace linear layers with dense layers?
@DeepFindr 3 years ago
Hi, where do you want to add them? At the end of the network? Or inside the GNN layers?
Also, what do you mean by "replace linear layers by dense layers"? These are typically the same things :)
@sharzilkhan9150 3 years ago
@DeepFindr Actually I am experimenting, so if I can get an example of a regularizer both in the GNN layers and at the end of the network, that would be great.
As for the other query regarding dense layers.. thanks for the clarification.. :)
@mdtokitahmid2970 3 years ago
Next part 🙂🙂🙂 maybe on new molecule design🙂
@DeepFindr 3 years ago ⁺¹
Yes :) The next video comes in a couple of days. And then I start with molecule generation :)
Sorry for the wait :D
@mdtokitahmid2970 3 years ago
@DeepFindr Planning to start a research project with these ideas after my exams :p It would be really great. Thanks a lot😍
@DeepFindr 3 years ago
@mdtokitahmid2970 nice! When do you have your exams? :)
@mdtokitahmid2970 3 years ago
@DeepFindr ah next month :’)
I am from the CS department but my interest lies in computational biology. Although I'm only in my 2nd term now, research work fascinates me 🥰
@DeepFindr 3 years ago ⁺¹
Great! Good luck with your exams :)
@chientruong926 3 years ago
Hello, could you please upload the source code for this video? I'm a beginner, so this version seems easier for me to understand and learn from. Thank you so much and best wishes!
@DeepFindr 3 years ago
Hi :) the source code is on Github under the link in the video description :)
@chientruong926 3 years ago
@DeepFindr Yes, thank you. I will check it again.
@clayouyang2157 3 years ago
On new molecule design: I studied it before, but only superficially. I don't understand how to get a molecule with the specific properties you want.
@DeepFindr 3 years ago
Hi, yes that's more difficult. I will explain it in the next video. Give me one more week for the upload :)
@clayouyang2157 3 years ago
@DeepFindr Looking forward to it
@stanislavshubin3447 3 years ago
+
@SirVampyr 2 years ago
Hey, I've tried to use your code on GitHub and have failed repeatedly and I don't know why.
It fails with a "ZeroDivisionError" and I have no idea why it does that.
Do you have any clue what could be the problem?
@DeepFindr 2 years ago
Hi! In which file and line do you encounter the error?
@jhanvisaraswat6976 1 year ago
TypeError Traceback (most recent call last)
in ()
----> 1 train_dataset = MoleculeDataset(root="data/" , filename = "HIV.csv")
4 frames
/usr/local/lib/python3.10/dist-packages/deepchem/feat/graph_data.py in to_pyg_graph(self)
149 for key, value in self.kwargs.items():
150 kwargs[key] = torch.from_numpy(value).float()
--> 151 return Data(x=torch.from_numpy(self.node_features).float(),
152 edge_index=torch.from_numpy(self.edge_index).long(),
153 edge_attr=edge_features,
TypeError: torch_geometric.data.data.Data() got multiple values for keyword argument 'pos'
did anyone face this issue?
@DeepFindr 1 year ago
Hi! This is a bug as far as I know. You can find a Github issue about it.
Simply delete the pos property like
del data.pos
And it should work :)
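The Python mechanism behind "got multiple values for keyword argument 'pos'" can be reproduced without any library. The sketch below uses hypothetical stand-ins (`make_data`, `extra`) rather than the deepchem or PyTorch Geometric API; it only mirrors the pattern visible in the traceback above, where `pos` is passed explicitly and the unpacked keyword arguments apparently contain a `pos` entry as well.

```python
def make_data(x, pos=None, **kwargs):
    """Hypothetical stand-in for a constructor that accepts `pos`
    plus arbitrary extra keyword arguments."""
    return {"x": x, "pos": pos, **kwargs}


# Extra attributes that happen to contain 'pos' as well.
extra = {"pos": [0]}

# Passing pos explicitly AND via **extra supplies it twice.
try:
    make_data(x=[1.0], pos=None, **extra)
    message = ""
except TypeError as err:
    message = str(err)

# Removing the duplicate key first (the same idea as deleting the
# pos property) lets the call go through.
extra.pop("pos", None)
data = make_data(x=[1.0], pos=None, **extra)
```

Which object the property has to be deleted from in the real code depends on the installed library versions, so this is only a sketch of the cause, not a verified patch.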
@jhanvisaraswat6976 1 year ago
@DeepFindr Oh, I didn't know this. I will check it and see. Amazing content BTW.
