I worked with lots of Machine Learning Tools .. Orange is the best for educational purposes .. Thumbs UP
great! request for video implementation of probabilities classification using k-nearest neighbor (k-NN) after this.
I've been really enjoying your videos! Keep them coming 😃
Great! I still have one question: the analysis starts with a click on the "Paint Data" icon --> k-Means --> Scatter Plot.
How does it work without loading any data? I guess I'm missing something, but it's still not clear how to load my own data and save it so that Orange can use it for all its great analysis. Thanks!
How do I load my Excel survey data into Orange and have it analyzed into charts or diagrams?
Are there any requirements for the minimal number of samples for k-means to function well? If you consider the geometrical interpretation - then it works well with a small number of samples (as can be seen in your tutorial where you draw the points on the whiteboard).
Various sources on the Internet talk about how well clustering algorithms scale when the data set is large, but I didn't find any authoritative source that would say something like "K-means is awesome, as long as you have >N samples in your set".
Hi Orange, I can't find the k-means widget any more. Can you please help?
Hi, can you please explain why you drew 3 groups of data points? Is there any reason behind it?
Thanks
No reason at all. The idea is to have obvious groups that people can see, and then observe whether k-means finds them too. You can also observe how k-means behaves when we ask it for more or fewer than 3 groups. Try it yourself with a different number of groups, or with data where there are no groups at all, and see what k-means does. There's educational value in this.
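For anyone who wants to repeat that experiment outside the Orange canvas, here is a minimal sketch in plain Python. It is not Orange's or sklearn's implementation: `make_three_groups`, `kmeans`, and `inertia` are hypothetical helper names, the data is three made-up Gaussian groups, and the algorithm is a bare-bones Lloyd's iteration with random initial centroids and no restarts.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def make_three_groups(per_group=30, seed=42):
    """Made-up data: three well-separated 2D Gaussian groups."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    return [
        (rng.gauss(cx, 0.4), rng.gauss(cy, 0.4))
        for cx, cy in centers
        for _ in range(per_group)
    ]


def kmeans(points, k, iterations=100, seed=0):
    """Bare-bones Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k data points as starting centroids
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = updated
    return centroids, labels


def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(dist2(p, centroids[lab]) for p, lab in zip(points, labels))


points = make_three_groups()
inertia_by_k = {}
for k in (1, 2, 3, 4, 5):
    centroids, labels = kmeans(points, k)
    inertia_by_k[k] = inertia(points, centroids, labels)
    print(k, round(inertia_by_k[k], 2))
```

Asking for fewer groups than the data contains forces distinct groups to merge, and asking for more forces a visible group to be split. With a single random start the result for a given k can also be a poor local optimum, which is one reason practical implementations rerun the algorithm several times.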
Amazing work! Does the algorithm automatically normalize numerical attributes when doing distance computations and convert categorical attributes to binary?
We wrap k-Means from sklearn. k-Means does not normalize (the consensus is that it shouldn't) and it transforms categorical attributes with one-hot encoding.
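To make the two points in that reply concrete, here is a small illustrative sketch in plain Python. It is not the code Orange or sklearn runs: `one_hot` is a hypothetical helper, and the category values and numbers are made up. It shows what one-hot encoding produces for a categorical attribute, and how an unnormalized attribute with a large range can dominate the squared Euclidean distance that k-means minimizes.

```python
def one_hot(values):
    """One-hot encode a list of category labels (illustrative helper)."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, encoded


def dist2(p, q):
    """Squared Euclidean distance between two rows."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


categories, encoded = one_hot(["up", "down", "side", "up"])
print(categories)  # ['down', 'side', 'up']
print(encoded[0])  # [0.0, 0.0, 1.0]

# Any two different categories end up the same distance apart.
print(dist2(encoded[0], encoded[1]))  # 2.0

# Without normalization, the attribute with the larger range dominates.
a = (1000.0, 0.1)
b = (1010.0, 0.9)
share_of_first = (a[0] - b[0]) ** 2 / dist2(a, b)
print(round(share_of_first, 3))
```

If the attributes are on very different scales, rescaling them before clustering is a decision to make deliberately, since k-means itself will not do it.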
K-Means clustering is working on my dataset, but is there a way to extract the cluster groups from the application into Excel?
Of course. You can save the output with Save Data and use .xlsx format.
Thank you for the demo, it was a clear and straightforward tutorial. I have 2 questions:
1. how well will it deal with multi-dimensional data? (in your example, there are 2 only: x and y)
2. if I have some category data for one of my dimensions (e.g. {up, down, side}), will Orange be happy with that, or will I have to encode those values into numbers for the algorithm to work?
Bonus question #3 :-) once it found the clusters, what would be a user-friendly way to visualize them, given that they're multi-dimensional? In the case of a classifier you can render a decision tree and figure out what rule the algorithm followed to make a decision - is there some analog of that for clustering?
p.s. general remark - I have watched the tutorials so far and played with Orange a bit, I am very impressed with how it works and how easy it is to use.
1. k-Means is actually quite great with multidimensional data. It will find clusters regardless of the number of dimensions. 2 dimensions in the example are used for simplicity.
2. As mentioned in one of the comments below, Orange will one-hot-encode categorical data (it will transform them into numbers just like sklearn does).
Bonus 3. It is generally impossible to faithfully represent multidimensional data so that humans can understand it. You can try Scatter Plot, color by Cluster, and find the projection that makes the most sense. Another option is MDS, which is meant to reduce multidimensional data to a 2D representation; you can color by Cluster there as well. But the approach I prefer is Box Plot: set the subgroup to Cluster and explore the meaning of the clusters with 'Order by relevance'. It does not visualize the data in a projection, but it does let you figure out what each cluster means.
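The Box Plot idea in the last answer can be approximated in a script by summarizing each attribute per cluster and ranking attributes by how much the clusters differ on them. The sketch below is only an analogy: the scoring rule (spread of cluster means divided by the overall standard deviation) is an assumption chosen for illustration, not the measure Orange uses for 'Order by relevance', and the rows, labels, and attribute names are made up.

```python
from statistics import mean, pstdev


def cluster_profiles(rows, labels, feature_names):
    """Return {feature: {cluster: mean value of that feature in the cluster}}."""
    profiles = {}
    for j, name in enumerate(feature_names):
        by_cluster = {}
        for row, label in zip(rows, labels):
            by_cluster.setdefault(label, []).append(row[j])
        profiles[name] = {c: mean(vals) for c, vals in by_cluster.items()}
    return profiles


def rank_features(rows, labels, feature_names):
    """Rank features by a made-up score: range of cluster means / overall std."""
    profiles = cluster_profiles(rows, labels, feature_names)
    scores = {}
    for j, name in enumerate(feature_names):
        overall = pstdev([row[j] for row in rows])
        means = list(profiles[name].values())
        scores[name] = (max(means) - min(means)) / overall if overall > 0 else 0.0
    return sorted(scores, key=scores.get, reverse=True)


# Made-up rows with three attributes and cluster labels from some clustering.
feature_names = ["x", "y", "z"]
rows = [
    (1.0, 10.0, 5.0), (1.2, 50.0, 5.1), (0.9, 30.0, 4.9),
    (8.0, 20.0, 5.0), (8.3, 40.0, 5.2), (7.9, 60.0, 4.8),
]
labels = [0, 0, 0, 1, 1, 1]

ranking = rank_features(rows, labels, feature_names)
print(ranking)
print(cluster_profiles(rows, labels, feature_names)["x"])
```

Attributes at the top of the ranking are the ones whose typical values differ most between clusters, so they are the first place to look when trying to describe what a cluster means.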
Thank you for the reply. I have tinkered with it a bit and I stumbled upon a problem (using version 3.11). When I load my actual set of data from an Excel file, I seem to be unable to use any of the columns, though it looks fine in the `Data Table` widget. My first guess was that this is an issue with the `k-means` widget, but the same happens with the `Column selector` - nothing is listed. To illustrate my point, I made a GIF demo imgur.com/a/lSaCl
It works as expected with built-in data sets, or those that I paint with the data painting tool. Are there some additional constraints I should keep in mind?
Yes, because you have not selected anything in the Data Table. Hence the dashed line on the output. Always connect widgets to the data source. Data Table will only output the selection.
Indeed... silly me! Thank you :-)
Hello. I've used Orange for 2 years for education, and in recent months some functions have become very slow on all computers, such as k-means, and some general functions or widgets show red dots. What could be causing these changes? Thank you for Orange. Regards
K-means is, like most algorithms in Orange, wrapped from sklearn, so perhaps they changed it. Alternatively, there might be some preprocessing that takes time. It would be best if you told us which widgets specifically are slow so that we can investigate.
The best place to report: github.com/biolab/orange3/issues
Super! Thanks for your kind explanation.
I cannot find the widget for k-means clustering after installation. I have tried several ways. Can anyone help?
Please restart the system or Orange BI
Ma'am, please make a video on time series forecasting with ARIMA and VAR models.
Can't do. No timeseries experts here. :(
Here is the link for Time Series in Orange BI
th-cam.com/video/szLbgFRRl18/w-d-xo.html
Wow! Thnx. TIL abt Silhouette scoring
Amazing Tool 😊😊😊😊😊