Thank you, StandardScaler really helped me increase accuracy.
But here, for encoding, we used LabelEncoder, whereas the sklearn documentation states it should be used only for the labels (the target), not the features.
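For comparison, the docs point to OrdinalEncoder as the feature-oriented counterpart; a rough sketch (the toy DataFrame is just for illustration):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({"cap-shape": ["x", "b", "x"], "odor": ["n", "p", "n"]})

# OrdinalEncoder works column-wise on 2D input, unlike LabelEncoder (1D labels)
encoder = OrdinalEncoder()
data[data.columns] = encoder.fit_transform(data)
print(data)  # each column now holds integer codes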
Thanks a lot for the EDA; you were a great help.
No problem, Pravinkumar! :)
Can you make a video on a thyroid disease classification ML problem? It would be very helpful for us.
Hey, why did you use LabelEncoder instead of OneHotEncoder?
Likely to minimize the number of features generated; OneHotEncoder would create a new column for every unique value in every feature.
Great. I have a question, though: can we use clustering to classify this mushroom dataset?
I'm new to machine learning. If I want to use this for image classification, given an image of a mushroom, how can I use the model to predict the label of that image?
P.S.: I know it's been 3 years, but if anyone can help, please help.
You should deal with the missing values in the EDA process.
Hi Gabriel, do you know how to do what you just did when transforming categorical data to numeric, but in MATLAB?
I'm a little rusty with my MATLAB syntax, but I can give it to you in pseudocode:
If y is a vector of categorical values,

Let unique_vals = empty list of strings
for each element in y:
    if the element is not in unique_vals:
        Add the element to unique_vals

Let encoded_vals = a list (same length as unique_vals) of unique integers
for each element in y (indexed by i):
    for each element in unique_vals (indexed by j):
        if y[i] == unique_vals[j]:
            Set y[i] = encoded_vals[j]
The above should be easily implemented in MATLAB if you are familiar with the language.
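For reference, here is a direct Python translation of that pseudocode (Python being the language used in the video), which may help when porting to MATLAB:

# Manual label encoding: map each unique categorical value to an integer.
y = ["b", "x", "b", "s", "x"]  # toy categorical vector for illustration

unique_vals = []
for element in y:
    if element not in unique_vals:
        unique_vals.append(element)

# Assign each unique value a unique integer (here, its position in the list)
encoded_vals = list(range(len(unique_vals)))

for i in range(len(y)):
    for j in range(len(unique_vals)):
        if y[i] == unique_vals[j]:
            y[i] = encoded_vals[j]

print(y)  # [0, 1, 0, 2, 1]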
Note: This type of encoding is not recommended for the dataset in this video. When I recorded this video, I was not aware of better encoding schemes. The problem with using this kind of encoding on the columns in the mushroom dataset is that, after encoding, the model assumes an order between the values in the columns (since the values are mapped to integers in a continuous range). So the model will assume that certain categorical values are "higher" than others (for example, it will assume a value of "x" is larger than a value of "b" in the cap-surface column).
One-hot encoding is a way to work around this problem. It encodes each unique value as its own column. I know that we achieved 100% accuracy on this dataset, but in general I recommend you look into it further if you will be encoding categorical variables with no inherent order.
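For example, with pandas (a quick sketch; the column values are just illustrative):

import pandas as pd

# One-hot encoding: each unique value becomes its own 0/1 column,
# so no artificial order is imposed between categories.
df = pd.DataFrame({"cap-surface": ["x", "b", "s", "x"]})
one_hot = pd.get_dummies(df["cap-surface"], prefix="cap-surface").astype(int)
print(one_hot)
#    cap-surface_b  cap-surface_s  cap-surface_x
# 0              0              0              1
# 1              1              0              0
# 2              0              1              0
# 3              0              0              1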
Cheers!
@gcdatkin Thank you Gabriel, you're a massive help!
Why didn't you check the cross-validation score?
Hello, I want at least 80% accuracy but I'm only getting 72% on image classification. Broadly, where could the issue be?
Is it common to get 100% accuracy, or did my model overfit?
Is this the right way of preprocessing, sir, at 5:50?
for i in data.columns:
    data[i] = encoder.fit_transform(data[i])
You are right! I should have instead used
for column in data.columns:
    data[column] = encoder.fit_transform(data[column])
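For completeness, a runnable version with the encoder defined would look something like this (assuming data is a pandas DataFrame as in the video; the toy data here is just for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({"cap-shape": ["x", "b", "x"], "class": ["p", "e", "e"]})

encoder = LabelEncoder()
for column in data.columns:
    # fit_transform re-fits the encoder on each column, so one
    # encoder instance can be reused across all columns
    data[column] = encoder.fit_transform(data[column])

print(data)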
Thanks!
Thanks, the feedback helps. 👍
How do I create a CSV file on my own?
Is Naive Bayes included?
Thank you, it is so useful
Hi!
In the preprocessing step, the for loop is used to encode each column in one go by iterating.
Does this for loop have any effect on mappings_dict?
I mean, is it necessary to create mappings_dict and call mappings.append(mappings_dict) inside the for loop, or can we create them outside the for loop and will it still work?
I actually thought the for loop was only for encoding each column, so I created mappings_dict and the append operation outside the for loop, and when I printed mappings it only gave me the mappings for the last column.
Check my reply on Kaggle :)
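The short version, for anyone who can't find that reply: the dict must be built inside the loop, because the encoder's learned mapping is overwritten on every fit. A sketch using the variable names from the question (which I'm assuming match the video's code):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({"cap-shape": ["x", "b"], "odor": ["n", "p"]})  # toy example

encoder = LabelEncoder()
mappings = []
for column in data.columns:
    data[column] = encoder.fit_transform(data[column])
    # Must happen here: encoder.classes_ is replaced on each fit, so
    # building/appending the dict after the loop captures only the last column.
    mappings_dict = {str(cls): idx for idx, cls in enumerate(encoder.classes_)}
    mappings.append(mappings_dict)

print(mappings)  # [{'b': 0, 'x': 1}, {'n': 0, 'p': 1}]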
What about implementing the same algorithms using only pure Python, without external libraries?
That's a great idea Lucas!
I assume you mean writing my own implementations of logistic regression, support vector machine classification, and neural networks.
I have coded my own implementations in the past, but the problem is that sklearn and other libraries are so highly optimized, and use so many numerical-computation "tricks," that my own versions would be much slower and probably would not perform as well.
However, you make an interesting point. I may make a video in the future on how to code these models from scratch, although it doesn't have much practical application, since there are already so many refined and optimized options available.
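If anyone wants to experiment, here is a bare-bones sketch of what a from-scratch logistic regression might look like in pure Python (plain batch gradient descent, no regularization, toy data for illustration):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    # Fit weights and bias by batch gradient descent on the log loss
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            pred = sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias)
            error = pred - yi  # derivative of log loss w.r.t. the logit
            for j in range(n_features):
                grad_w[j] += error * xi[j]
            grad_b += error
        for j in range(n_features):
            weights[j] -= lr * grad_w[j] / len(X)
        bias -= lr * grad_b / len(X)
    return weights, bias

def predict(X, weights, bias):
    return [int(sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias) >= 0.5)
            for xi in X]

X = [[0.0], [1.0], [2.0], [3.0]]  # one feature, separable at 1.5
y = [0, 0, 1, 1]
weights, bias = train_logistic_regression(X, y)
print(predict(X, weights, bias))  # [0, 0, 1, 1]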
Good effort.
Love your vids, but why are you always scaling categorical variables?
What's the benefit of normalizing nominal data? Why are you doing it?
Not all models benefit from scaled data (tree models, for example), but for models that do (such as logistic regression), I would recommend scaling all the columns. The reason is that we want all features to be treated equally by the model.
For example, in logistic regression, each feature gets a single weight that is learned by the model. If the features take different ranges of values, the model will have to adjust the weights accordingly. If you are using L2 regularization (which is enabled by default), this becomes even more of a problem.
As long as all the columns take on a similar range of values, there shouldn't be a problem. By centering the data at 0 and ensuring all columns have the same variance, we can guarantee the model will give each feature equal attention before weighting.
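Concretely, that is what StandardScaler (mentioned in another comment here) does; a quick sketch with toy data:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [100.0, 200.0, 300.0]})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

print(scaled.mean())       # approximately 0 for every column
print(scaled.std(ddof=0))  # exactly 1 for every column (population std)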
@gcdatkin So it's standardized! Thanks for your reply. I'm new to machine learning :)
luv it 💜
Good work.
How can I contact you?
You can reach out to me at
gcdatkin@gmail.com
Or on LinkedIn:
www.linkedin.com/in/gcdatkin
Hi, can I get the code? It'll be a great help.
Of course! The link to the code is in the description of all my videos :)
Here is the link for this one:
www.kaggle.com/gcdatkin/deadly-mushroom-classification-100-accuracy