Put this vid on YouTube today for ppl using machine learning and text classification for the first time in WEKA. My other vid was a general intro (23 min) and this is for text classification (59 min).
0:00 Introduction (5 minutes)
5:06 TextDirectoryLoader (3 minutes)
8:12 StringToWordVector (19 minutes)
27:37 AttributeSelect (10 minutes)
37:37 Cost Sensitivity and Class Imbalance (8 minutes)
45:45 Classifiers (14 minutes)
59:07 Conclusion (20 seconds)
Want to skip to a specific part?
- Section 1 -
5:49 TextDirectoryLoader Command (1 minute)
- Section 2 -
6:44 ARFF File Syntax (1 minute 30 seconds)
8:10 Vectorizing Documents (2 minutes)
10:15 WordsToKeep setting/Word Presence (1 minute 10 seconds)
11:26 OutputWordCount setting/Word Frequency (25 seconds)
11:51 DoNotOperateOnAPerClassBasis setting (40 seconds)
12:34 IDFTransform and TFTransform settings/TF-IDF score (1 minute 30 seconds)
14:09 NormalizeDocLength setting (1 minute 17 seconds)
15:46 Stemmer setting/Lemmatization (1 minute 10 seconds)
16:56 Stopwords setting/Custom Stopwords File (1 minute 54 seconds)
18:50 Tokenizer setting/NGram Tokenizer/Bigrams/Trigrams/Alphabetical Tokenizer (2 minutes 35 seconds)
21:25 MinTermFreq setting (20 seconds)
21:45 PeriodicPruning setting (40 seconds)
22:25 AttributeNamePrefix setting (16 seconds)
22:42 LowerCaseTokens setting (1 minute 2 seconds)
23:45 AttributeIndices setting (2 minutes 4 seconds)
- Section 3 -
28:07 AttributeSelect for reducing the dataset to improve classifier performance/InfoGainAttributeEval evaluator/Ranker search (7 minutes)
- Section 4 -
38:32 CostSensitiveClassifier/Adding cost sensitivity to a base classifier (2 minutes 20 seconds)
42:17 Resample filter/Example of undersampling majority class (1 minute 10 seconds)
43:27 SMOTE filter/Example of oversampling the minority class (1 minute)
- Section 5 -
45:34 Training vs. Testing Datasets (1 minute 32 seconds)
47:07 Naive Bayes Classifier (1 minute 57 seconds)
49:04 Multinomial Naive Bayes Classifier (10 seconds)
49:33 K Nearest Neighbor Classifier (1 minute 34 seconds)
51:17 J48 (Decision Tree) Classifier (2 minutes 32 seconds)
53:50 Random Forest Classifier (1 minute 39 seconds)
55:55 SMO (Support Vector Machine) Classifier (1 minute 38 seconds)
57:35 Supervised vs Semi-Supervised vs Unsupervised Learning/Clustering (1 minute 20 seconds)
Since all text data is turned into numbers and categories after sections 1-2, most of sections 3-5 are useful in both text classification and other data analysis in WEKA.
The Classifiers section introduces you to six (but not all) of WEKA's popular classifiers for text mining: 1) Naive Bayes, 2) Multinomial Naive Bayes, 3) K Nearest Neighbor, 4) J48, 5) Random Forest and 6) SMO.
Each StringToWordVector setting is shown, e.g. tokenizer, outputWordCounts, normalizeDocLength, TF-IDF, stopwords, stemmer, etc. These are ways of representing documents as document vectors.
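To make the document-vector idea concrete, here is a minimal Python sketch of three of the representations mentioned above: word presence, word counts and a TF-IDF score. It is not WEKA's code. The function names are made up for illustration, the tokenizer simply keeps runs of letters, and the TF-IDF formula is the textbook count * log(N / document frequency), which may differ from the exact transforms StringToWordVector applies.

```python
import math
import re
from collections import Counter


def tokenize(text, lowercase=True):
    """Split text into alphabetic tokens, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return re.findall(r"[A-Za-z]+", text)


def vectorize(docs, mode="presence", stopwords=frozenset()):
    """Return (vocabulary, vectors) for a list of document strings.

    mode is "presence" (0/1), "count" (raw frequency) or "tfidf"
    (count * log(N / document frequency), the textbook definition).
    """
    # One token counter per document, with stopwords dropped.
    counts = [Counter(t for t in tokenize(d) if t not in stopwords) for d in docs]
    vocab = sorted(set().union(*counts))
    # Document frequency: in how many documents each word occurs.
    doc_freq = {w: sum(1 for c in counts if w in c) for w in vocab}
    n_docs = len(docs)

    vectors = []
    for c in counts:
        if mode == "presence":
            vec = [1.0 if w in c else 0.0 for w in vocab]
        elif mode == "count":
            vec = [float(c[w]) for w in vocab]
        elif mode == "tfidf":
            vec = [c[w] * math.log(n_docs / doc_freq[w]) for w in vocab]
        else:
            raise ValueError(f"unknown mode: {mode}")
        vectors.append(vec)
    return vocab, vectors
```

Each document becomes one row of numbers with one column per vocabulary word, which is why the later sections work the same way on text as on any other numeric dataset. Note that under this formula a word that appears in every document gets a TF-IDF score of zero.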
Automatically converting 2,000 text files (plain text documents) into an ARFF file with TextDirectoryLoader is shown.
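For anyone curious what that conversion produces, the sketch below builds a similar ARFF file in plain Python from a folder that has one subfolder per class (for example pos and neg), each holding .txt files. It is an illustration, not TextDirectoryLoader itself. The function names are hypothetical, the default attribute names text and @@class@@ mirror what commenters report seeing in their generated files, the escaping rules are an assumption about what ARFF readers accept, and class folder names are assumed to contain no spaces or commas.

```python
from pathlib import Path


def arff_quote(text):
    """Wrap a string in single quotes, escaping characters that would break the line."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    escaped = escaped.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
    return "'" + escaped + "'"


def directory_to_arff(root, relation="reviews", text_attr="text", class_attr="@@class@@"):
    """Build ARFF text from root/<class name>/*.txt, one instance per file."""
    root = Path(root)
    # Each immediate subfolder name becomes one value of the nominal class attribute.
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_spec = "{" + ",".join(classes) + "}"
    lines = [
        f"@relation {relation}",
        "",
        f"@attribute {text_attr} string",
        f"@attribute {class_attr} {class_spec}",
        "",
        "@data",
    ]
    for label in classes:
        for path in sorted((root / label).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            lines.append(f"{arff_quote(text)},{label}")
    return "\n".join(lines) + "\n"
```

Because the attribute names are parameters, the same sketch can write a header such as ReviewText string and class {subjective,objective} by passing text_attr="ReviewText" and class_attr="class" with folders named subjective and objective.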
Additionally shown is AttributeSelect, which is a way of improving classifier performance by reducing the number of attributes in the dataset.
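As a rough picture of what an information-gain evaluator with a Ranker search does, the Python sketch below scores each attribute by how much knowing its value reduces the entropy of the class, then sorts attributes by that score. The function names and the tiny dataset layout (a dict of attribute name to column of values) are assumptions for illustration. WEKA's own evaluator also handles numeric attributes by discretizing them, which this sketch does not attempt.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Class entropy minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(columns, labels, top_n=None):
    """Return (name, gain) pairs sorted by gain, highest first; ties break by name."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    ranked = sorted(gains.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]
```

Keeping only the top-ranked attributes is the "reducing" step: words that appear equally often in every class score near zero and can be dropped without losing much.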
Cost-Sensitive Classifier is shown, which is a way of assigning different costs to different types of wrong guesses.
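One way to see how a cost matrix changes a prediction is the minimum-expected-cost rule, sketched below in Python. Given the base classifier's class probabilities for one instance, it picks the class whose expected cost is lowest. The function names, the nested-dict layout cost_matrix[actual][predicted] and the example costs are all assumptions for illustration. This is only one approach to cost sensitivity (reweighting the training instances is another), and it is not taken from WEKA's source.

```python
def expected_costs(probabilities, cost_matrix):
    """Expected cost of predicting each class.

    probabilities: {class: P(actual class)} from the base classifier.
    cost_matrix: cost_matrix[actual][predicted] is the cost of that guess.
    """
    classes = list(cost_matrix)
    return {
        predicted: sum(probabilities[actual] * cost_matrix[actual][predicted]
                       for actual in classes)
        for predicted in classes
    }


def min_expected_cost_class(probabilities, cost_matrix):
    """Pick the class with the lowest expected cost."""
    costs = expected_costs(probabilities, cost_matrix)
    return min(costs, key=costs.get)
```

With equal costs this reduces to picking the most probable class. If missing a pos instance costs five times as much as a false alarm, an instance with only a 30% chance of being pos is still labelled pos, because guessing neg would cost 0.3 * 5 = 1.5 on average against 0.7 * 1 = 0.7 for guessing pos.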
Resample and SMOTE are shown as ways of undersampling the majority class and oversampling the minority class.
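The two balancing ideas can be sketched in a few lines of Python. The sketch below is a simplification under stated assumptions: instances are (features, label) pairs, undersampling randomly trims every class to the size of the smallest one, and oversampling randomly duplicates instances up to the size of the largest one. The last function shows only the interpolation step behind SMOTE-style synthetic instances. Real SMOTE picks the second point from the nearest minority-class neighbours, which is left out here. All function names are hypothetical.

```python
import random
from collections import defaultdict


def group_by_class(instances):
    """Group (features, label) pairs by label."""
    groups = defaultdict(list)
    for features, label in instances:
        groups[label].append((features, label))
    return groups


def undersample(instances, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(instances)
    target = min(len(g) for g in groups.values())
    return [inst for g in groups.values() for inst in rng.sample(g, target)]


def oversample(instances, seed=0):
    """Randomly duplicate instances until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(instances)
    target = max(len(g) for g in groups.values())
    result = []
    for g in groups.values():
        result.extend(g)
        result.extend(rng.choice(g) for _ in range(target - len(g)))
    return result


def synthetic_point(a, b, rng):
    """Interpolate a new feature vector at a random spot on the segment from a to b."""
    gap = rng.random()
    return tuple(x + gap * (y - x) for x, y in zip(a, b))
```

Either kind of resampling belongs on the training data only. The test set should keep its original class balance so the reported accuracy stays honest.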
Introductory tips are shared throughout, e.g. distinguishing supervised learning (which is most of data mining) from semi-supervised and unsupervised learning, making identically-formatted training and testing datasets, how to easily subset outliers with the Visualize tab and more...
Would you help me? I am running the same experiment and followed the steps in the video, but most of the classifiers are inactive (greyed out). Thanks in advance.
Rana Alqaisi your data set is not ready for use. Prepare it properly
Thank you , it solved :)
+Rana Alqaisi I am running the same experiment but am unable to convert the text directory to .arff format. What should I do? Please reply as soon as possible.
You need a CSV or ARFF file to load data in WEKA. If it's a text file, you have to look at the structure of an ARFF file and convert it! Let me know if you need any help!
If I could give this video 10000 thumbs up I totally would. Brilliant work and great explanation of how all the different features actually work. You may have just saved me from failing a class. Thankyouthankyouthankyou.
@27:37 If anyone is stuck by the SMO being grayed out, I *think* the solution is to go to the Preprocess tab, click Edit... and remember to first right-click the @@class@@ Nominal header and select "Attribute as class". I'm learning just like you so I could be wrong though!
I really appreciate you uploading this video. You clarified several questions I had that my professor struggled to explain to his class.
This is the one of the besttt videos I have seen! Thank you so much!! It's crazy how powerful WEKA is!
This is a very, very good quick intro, especially on filters. WEKA has numerous data cleaning filters and parameters, and this was a good tutorial on the filters we use.
You have an error in the interpretation of False Positive and False Negative at 38:24. Look at page 164, Chapter 5 of the Third Edition of Witten, Frank and Hall. You seem to have transposed the two. I am of the camp that the true value ("actual class" or ground truth) goes at the top of the confusion table, but the literature is replete with such inconsistencies. So beware of confusing the confusion matrix.
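For readers untangling the two terms, the small Python sketch below counts them directly from paired actual and predicted labels, so the answer does not depend on which way a table is drawn. The function name is made up for illustration. As far as I know, WEKA's printed matrix labels its columns "classified as", which puts the actual class on the rows, but check the header of your own output rather than relying on that.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN and TN with respect to one class treated as positive."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, really positive
        elif p == positive:
            fp += 1  # predicted positive, really negative: false positive
        elif a == positive:
            fn += 1  # predicted negative, really positive: false negative
        else:
            tn += 1  # predicted negative, really negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
```

The "positive" or "negative" part of each name always refers to the prediction, and "false" means the prediction was wrong. A false positive is therefore a negative instance that was guessed positive.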
Thank you soo much! This is the MOST Informative and helpful video in WEKA!! Keep up the good work Brandon! We are learning so much from your videos.
Really excellent. This is exactly what I was looking for as I start out on my first text classification project. So helpful and very much worth the hour.
so much better and useful than Weka's official tutorial
this is a very good tutorial - great job at explaining things properly and slowly, thank you very much for the great work
I really appreciate the uploaded Weka Text Classification
Thank you Sir, this was the most helpful video I've seen so far..
Awesome tutorial. Thanks Brandon!
Thanks for sharing this Brandon! very useful video..very well explained.
excellent tutorial, really helpful- thanks Brandon!!!
thank you so much.. it is very helpful .. the best resource i could find for text classification :)
Thanks for sharing, excellent informative video
This video is extremely helpful. Thank you . :)
@10:08 When I change wordsToKeep from 1000 to 100 and click Apply, it doesn't update the model from 1000 to 100. Do I have to Undo every time? Or if I run 1000 initially, why can I not then run 100 and have it update?
This tutorial is sooo good
Great video, thank you. You have eased my worries about text mining.
Very nicely done!
This is an informative video on text data mining, but it can also be done with the WEKA Knowledge Flow. The Knowledge Flow lets users compare multiple models, and it can be used for batch processing and instance processing.
really appreciate the upload :)
Amazing video! Thanks.
thank you sir very clearly explained
Thanks for a great video. What is the significance of the text files containing one sentence for each line? Is this necessary and/or does this improve performance?
Thank you very much for the tutorial. Is it possible to use lemmatization in weka ?
Hi Bran, for some reason I get an error when I execute the command for converting the text files in the pos and neg folders (contained within one folder) into an .arff file. Please tell me where I could be going wrong.
+Brandon Weinberg - the video is fantastic. It gives a brief idea of sentiment analysis using WEKA. But I tried to do it myself, and I am getting an error during the command execution.
The command I tried is:
java weka.core.converters.TextDirectoryLoader -dir C:\Users\Sudip P\Desktop\Weka files\Training > C:\Program Files\Weka-3-7\data\IMDB.arff
The errors I am getting are:
weka.core.converters.TextDirectoryLoader.setSource(TextDirectoryLoader.java:398)
weka.core.converters.TextDirectoryLoader.setDirectory(TextDirectoryLoader.java:367)
weka.core.converters.TextDirectoryLoader.setOptions(TextDirectoryLoader.java:219)
weka.core.converters.TextDirectoryLoader.main(TextDirectoryLoader.java:658)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
java.lang.reflect.Method.invoke(Unknown Source)
weka.gui.SimpleCLIPanel$ClassRunner.run(SimpleCLIPanel.java:199)
at weka.core.converters.TextDirectoryLoader.setSource(TextDirectoryLoader.java:398)
at weka.core.converters.TextDirectoryLoader.setDirectory(TextDirectoryLoader.java:367)
at weka.core.converters.TextDirectoryLoader.setOptions(TextDirectoryLoader.java:219)
at weka.core.converters.TextDirectoryLoader.main(TextDirectoryLoader.java:658)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at weka.gui.SimpleCLIPanel$ClassRunner.run(SimpleCLIPanel.java:199)
I have tried with your dataset, but the problem still exists. I am using Windows Platform. Please help. Thanks in advance.
Hope it's not too late to help you. Try naming your folders without spaces between words.
Thank you so much! I was getting the same problem, but this solved it! :)
Thank you so much , it really helped a lot. you are awesome.
Thank you so much for this video!!!!!
Thank you for posting the video
Thank you so much for this informative tutorial. I received this error when I was attempting to load my text files using TextDirectoryLoader: "Problem settings base instance....". What is the cause of this error?
Anybody have any idea how to find the directory for your folders when you are converting the text to arff for windows?
Great video, thanks a lot
Hello, would you help me? I need to visualize the model, but I got this error: can't print SMO classifier.
Hi, I have to load a text directory in WEKA. I wrote java weka.core.converts.TextDirectoryLoader dir and replaced it with -dir "path" pointing to my file directory, but it still does not work.
Can you provide the exact steps for that?
Thank you very much, this is helpful.
Hi Brandon Weinberg, I want to detect outliers from a dataset using weka, Please tell if weka is capable of generating the response in another file from my dataset.
Can someone help me? When I try to apply AttSelection, WEKA gives me "Cannot handle numeric class". When I select some attributes I can see "NUM", but I have already applied StringToWordVector.
Hello, I have the same problem. Did you solve it?
This is solved by clicking on 'Edit...' at the top, right-clicking on the first column and selecting 'Attribute as class'. This moves the class from the first column to the last; having it in the first column was causing the error.
I find this video very hard to follow for the new user. He doesn't bother showing how to select the filters, etc. One has to hunt through them all to find what he's using. And then he doesn't bother to answer the questions caused by his confusing video. For more experienced users this might be of value.
Thanks for your hints. If you know a better tutorial, I would appreciate it if you shared the link.
This videoooo.... soooooo goood.
would be nice if he answered a couple questions from the comments. Doesn't appear so... My question is "was WEKA used in the Netflix Challenge". Thanks for posting.
Hello, how can I get access to the IMDB.arff dataset?
It is a very good video.
But I couldn't import my Excel data into WEKA. It is mostly environmental data (physicochemical data of streams) and data on small aquatic organisms. Please give me a hint on how I can import the data into WEKA. I tried many times to convert it into ARFF, but there is still no data when I open the file in WEKA.
Thank you in advance!!!
hey...i want to convert product reviews txt format file into arff file...
In your arff file attributes are---
ReviewText string
sentiment {pos,neg}
but in my arff file attributes are----
text string
@@class@@{rd}
i want them as----
ReviewText string
class {subjective, objective}
I am doing subjectivity and objectivity analysis on product reviews.....any help will be appreciated....
Excellent tutorial. Big thumbs up.
Thanks for this video. I am currently working on this topic. I am unable to convert the text directory to an .arff file. Please reply as soon as possible.
I have the dataset from the link you mentioned.
Hello! What is happening when you try to convert? I can try to help you
Hello!!! Please help me out.
I need a review text file in ARFF format. Please help me get it. It's very urgent. Please!!!
where can i download the dataset.
How do I classify only 1 text?
To Rana: it works even if its colour is grey. It's active, so try again.
Where can I get arff file?
you can make one by yourself
Has anyone here done Web log mining? Please, I need help. I don't know anything about this tool. I tried to learn, but what I want isn't on the web. Can anyone please guide me on how to preprocess log files and find results according to the attributes I require?
thanks for tutorial
Thanks very much
Classfier 1 text only, and this classifier using "class value"
THANKS !
Congratulations, BR.
Thanks
ty!
KSU student
Was here :/
Present sir
smo was not explained enough
This is a great video, thanks for demystifying this process!
How can I download this tutorial?