To identify the columns with only a single value, would it have been easier just to check the variance of that column? A variance of 0 indicates that the column contains only a single value.
Also, scikit-learn provides a library function to automate this: sklearn.feature_selection.VarianceThreshold(). However, using this function could mess up the column indexes.
Very true! That's a good idea. Definitely easier than typing out that whole dictionary comprehension. Thanks for the tip! :)
I would avoid using VarianceThreshold unless performing feature selection. What you could do instead is get the variance of each column with df.var() and then check which are equal to zero:
single_valued_columns = df.columns[df.var() == 0]
df = df.drop(single_valued_columns, axis=1)
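A minimal runnable sketch of that idea (the toy DataFrame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [7, 7, 7],        # single-valued column: zero variance
    "c": [0.1, 0.2, 0.3],
})

# Zero variance means the column holds only one value.
# numeric_only=True guards against non-numeric columns, where var()
# is undefined; for those, a check like df.nunique() == 1 works instead.
single_valued_columns = df.columns[df.var(numeric_only=True) == 0]
df = df.drop(columns=single_valued_columns)
```

Note that unlike VarianceThreshold (which returns a bare array), this keeps the DataFrame's column labels intact, which is the "messed up column indexes" concern above.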
Is it necessary to have 50% of both target values? Can't we just oversample the minority class and undersample the majority class so that the target column has a 70:30 or 60:40 ratio? I think that would give better results. Correct me if I am wrong, I don't have much practical experience 😅 Btw, love your videos, I have started watching your every upload ❤️
I highly recommend trying it out and seeing! It may be so. The results should dictate which approach to use.
In theory, we want both classes to have equal representation so that the model is used to seeing both kinds of training examples in equal quantity, but theory does not always rule. Practice will reveal the truth.
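If you want to experiment with that, here is a minimal sketch of oversampling the minority class to an arbitrary target ratio with plain pandas (the toy dataset and the 60:40 split are made-up assumptions; for 50:50, just sample up to the majority count):

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 negatives, 2 positives.
df = pd.DataFrame({"target": [0] * 8 + [1] * 2})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Oversample the minority with replacement until the split is roughly
# 60:40, i.e. the minority count is 40/60 of the majority count.
target_size = int(len(majority) * 40 / 60)
minority_up = minority.sample(n=target_size, replace=True, random_state=0)

balanced = pd.concat([majority, minority_up])
```

Trying a few ratios this way and comparing validation scores is exactly the "practice will reveal the truth" experiment.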
If you were to find out which features are most important to predict fails, how would you do it?
Great question! There are many ways to do this.
One way is to use a model with interpretability built in.
For example, with logistic regression, you get to see the actual feature contributions just by looking at the weights learned by the model (since there is only one weight per feature).
Another interpretable model would be the decision tree. Once you've built the tree, you can look at its structure to see how the model arrives at its predictions.
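To make the logistic regression case concrete, here is a sketch using a standard scikit-learn dataset (the dataset choice is illustrative, not from the thread); scaling the features first puts the learned weights on a common footing so they can be compared as importance scores:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Example dataset with 30 named features.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# One weight per feature: a larger |weight| means a larger
# contribution to the prediction.
importance = sorted(zip(X.columns, model.coef_[0]),
                    key=lambda pair: abs(pair[1]), reverse=True)
```

`importance[0]` is then the feature the model leans on most.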
Another way you can gauge feature importance is by using explanation metrics such as LIME or Shapley values.
LIME (Local Interpretable Model-Agnostic Explanations) is a way to get a sense of how your model is making its predictions by building a linear (interpretable) model that approximates your model.
Shapley values measure the marginal contributions of each feature with respect to the final output.
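LIME and SHAP live in their own packages (`lime` and `shap`), but a closely related model-agnostic option ships with scikit-learn itself: permutation importance, which shuffles one feature at a time and measures how much the score drops. A sketch (the random forest and dataset are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and record the score drop;
# a big drop means the model depends heavily on that feature.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=5, random_state=0)
# result.importances_mean[i] is the average score drop for feature i.
```

Like Shapley values, this works for any fitted model, though it attributes importance per feature rather than per prediction.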
If any of this is confusing, please let me know! :)
@gcdatkin how would you decide which to use between those methods?
Well it depends on what you are looking for in a model.
If accuracy is extremely important to you, you would be better off using a non-interpretable model, because, generally, models with high interpretability have lower accuracy relative to other models.
For example, neural networks and random forests are really accurate, but it is very difficult, if not impossible, to interpret their results directly.
In this case, using explanation metrics would better suit your needs.
If accuracy is secondary and not as important as interpretability, then opting for a simple linear/logistic regression or a decision tree might be your best option.