Very informative and clear. Thank you for your effort!
The following are the steps for the self-learning algorithm.
1. Train a supervised classifier on the labelled data.
2. Use the resulting classifier to make predictions on the unlabelled data.
3. Add the most confident of these predictions to the labelled data set.
4. Re-train the classifier on both the original labelled data and the newly obtained pseudo-labelled data.
5. Repeat steps 2-4 until no unlabelled data remain.
There are two hyperparameters to set: the maximum number of iterations and the number of unlabelled examples to add at each iteration.
One issue with self-learning is that if we add many examples with incorrect predictions to the labelled data set, the final classifier may end up worse than the classifier trained only on the original labelled data.
I hope this answer may help someone interested in semi-supervised learning.
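The steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the code from the video: the synthetic dataset, the logistic regression model, and the batch size `k` are all assumptions chosen just to make the loop runnable.

```python
# Self-learning sketch: train, pseudo-label the most confident
# unlabelled examples, re-train, repeat until nothing is left.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_lab, y_lab = X[:50], y[:50]   # small labelled pool (step 1)
X_unlab = X[50:]                # treat the rest as unlabelled
k = 25                          # examples pseudo-labelled per iteration

clf = LogisticRegression(max_iter=1000)
while len(X_unlab) > 0:
    clf.fit(X_lab, y_lab)                 # steps 1 and 4: (re-)train
    proba = clf.predict_proba(X_unlab)    # step 2: predict probabilities
    conf = proba.max(axis=1)              # confidence = max class probability
    idx = np.argsort(conf)[-k:]           # step 3: k most confident examples
    X_lab = np.vstack([X_lab, X_unlab[idx]])
    y_lab = np.concatenate([y_lab, proba[idx].argmax(axis=1)])
    X_unlab = np.delete(X_unlab, idx, axis=0)

print(X_lab.shape[0])  # every example is now labelled or pseudo-labelled
```

With 250 unlabelled examples and k = 25, the loop runs exactly ten iterations; in practice you would also cap the number of iterations, as noted above.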
Thanks a lot for the addition
What are X_train1 and y_train1? You use them, but they were never defined.
Why is there 22 in acc = np.empty(22)? I mean, can we put some lower number instead of 22?
I am stuck at re-training on the labelled and pseudo-labelled data.
Very informative and crystal clear!
Thank you so much
Very clear and vivid explanation Sir.
Thanks a lot Subhasree. The other semi-supervised techniques are intense; maybe some other time. Keep watching, and do share with people who you feel will be interested.
was very helpful thank you good sir
Thanks so much
Thanks Sir, for the video which was very easy to understand.
However, I was thinking: if the labelled dataset contains samples of only 2 classes (no sample of a possible 3rd class) and the unlabelled data contains a sample of that 3rd class, then I think the classifier trained on the labelled data cannot predict it properly, and its confidence for both known classes would be low. Can any strategy be adopted in this case?
Very nice Sir as always.
Thanks Wazib
Thank you so much!!! Very nice and clear explanation
Thanks a lot, glad it was helpful, please share with your colleagues, friends.
Can anyone please provide the link to the dataset?
Can you please tell the formula/equation of the predict_probability function? How is unlabeled data used in this function?
Any classifier gives probabilities; from the unlabeled data, we can see where we are more confident and then add those examples to the training set. There is a trade-off, though: if you add only the ones you are super confident about, they will add observations similar to those already present in the training set, so not much diversity. On the other hand, if you compromise on confidence, you can add noise.
@@SaptarsiGoswami Thanks a lot for your response. I am confused about what confidence is. Is it a user-defined parameter to find similarity, for example, finding the majority class of training labelled instances which have a similarity measure of confidence (e.g. 75%) or higher?
Your help will save me weeks of effort. Thanks in advance.
@@hamidawan687, well, confidence is a term I have used here. Let's say it's a three-class classification problem, and the output for one observation is (0.7, 0.2, 0.1), which gives the class probabilities of the three classes. The maximum here is 0.7, and at a crude level I can say this is the confidence of the decision. If in another case we get the output (0.4, 0.3, 0.3), the max is 0.4; of course, my confidence is low there.
@@SaptarsiGoswami
Got your point. By confidence, you mean what I would call the majority class probability over the labelled training instances. Thank you very much for your kind, positive response. Actually, I have been searching for implementation details of pseudo-labelling. This is one of the few techniques I know of; another is Expectation Maximization (EM), but I'm still unable to find what it exactly means. At least your response has clarified my concept of confidence. Thanks again.