Thank you, searched the web for a while, only to find very few instructions on K-Means clustering for SAS. Your video is very informative and helpful!
Thank you so much for your feedback! 👍
This is wonderful! I don't know how I missed seeing this when it was released. This is one to watch take notes! I am looking forward to applying it. Thanks, Cat!
We're so glad you enjoyed it!
I really liked the idea of using PCA along with clustering.
K-means is in such wide use that I wish people would understand that k-means will produce clusters whether or not the data are naturally clustered. As such, it is really a *partitioning* method. As an example, you can create 4 random uniform variables between 0 and 1 and cluster them with k-means, and you'll get an answer. I'm glad you at least cover the CCC, and I think value could be added by discussing whether the summary stats justify the division of the data into clusters.
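The point above is easy to demonstrate outside SAS. Here is a minimal pure-Python sketch (hypothetical data, plain Lloyd's algorithm — not any particular SAS procedure) showing that k-means returns a partition even when the data are pure noise:

```python
# Illustrative sketch: k-means happily partitions pure noise.
# Uses only the Python standard library.
import random
import math

random.seed(1)

# 100 observations on 4 uniform(0, 1) variables -- no real cluster structure.
data = [[random.random() for _ in range(4)] for _ in range(100)]

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm: returns (centroids, assignments)."""
    centroids = random.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assign

centroids, assign = kmeans(data, k=4)
sizes = [assign.count(c) for c in range(4)]
print(sizes)  # k-means reports 4 "clusters" even though the data are noise
```

The algorithm always converges to *some* partition; nothing in the output itself flags that the groups are artifacts of the method rather than structure in the data.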
As it pertains to this video, the iris example shows clusters 2 and 3, but I'm not convinced, without seeing a rotation that separates them, that they are really separate.
Thank you, Michael - we appreciate your feedback!
Thanks for watching, Michael, and for leaving a comment. I think you make some important points about clustering-- cluster analysis divides data into artificial groupings that are hopefully useful. There's no guarantee that they are useful, or that they approximate "true" groups. With the iris data, I know that there are 3 species, so it's a bit of a cheat to solve for 3 clusters. I think that it looks like one small compact cluster and one big, elongated cluster or, you split the one big elongated cluster and you get 3 compact clusters. Real business data rarely imitates these classic textbook examples. Have a great day!
@@SASUsers I do love SAS, so that's why I try to contribute to the discussions. I remember taking a traditional SAS course (in SEM) from Cat, and always liked her approach to difficult material.
@@catherinetruxillo3845 Can you really call it cheating if you use prior domain expertise to set the number of clusters? In fact, that's really the best way, if it's available. I do remember there's a good rotation to separate clusters 2 and 3.
@@michaeltuchman9656 Oh, I loved teaching that SEM class! It's been a few years since we offered it.
Perhaps the word "cheating" is too negatively valenced. Here's what I'm saying-- if you know the true groups, then there's not much reason to do clustering. To say that solving for 3 clusters recovered the 3 known species, on variables that were measured because they differentiate among species, is not terribly interesting. Where domain knowledge is ideal (and this, I think, was the point you made above) is when you think that there are 3 groups, but that group membership is not present in the data. In that case, solve for 3 groups, profile them, and if they describe the groups that you have theorized, then they could be useful.
I have used rotation with factor analysis, and less with clustering. One approach that is akin to rotation (although really it's a transformation based on an approximated within-cluster covariance matrix) is what PROC ACECLUS does. That might be what you're thinking? Or, you could transpose the rows and columns of the data table, perform factor analysis on the correlation matrix of that table (which is now equivalent to factoring observations instead of variables), and rotate the factors-- typically with Varimax rotation. This is a rotated Q-factor analysis, and it's one approach to clustering that was popular in the 1970s-80s. Some people still find that approach useful because you can have overlapping clusters, and individuals can belong to more than one cluster. In fact, individuals belong to all clusters with some correlation.
This has been a fun discussion! I should mention that the only way I know there's a comment on TH-cam is when I get over here to look-- it doesn't notify me of comments-- so it might take me awhile to see and respond to comments here. But don't let that stop you from dropping in to say hello!
As always, excellent presentation and demonstration Dr. Cat!
Glad you enjoyed it!
Very informative. Keep up the good work!
Thanks for sharing!
Thank you, such an insightful lecture, really enjoy it!
Awesome! Glad to hear you enjoyed it! 👍
Hey, join the conversation! How are you using k-means clustering right now? Leave a comment and let me know.
Just perfect!
Is there an approach or algorithm to figure out what the optimum number of groups might be given the data?
LaSupp, thank you for your inquiry! We are checking into this for you!
LaSupp, good question - In my view, the best solution 1) is explainable (when you profile the results, they make subject-matter sense) and/or 2) conforms to a theoretical number of classes (I think there are 4 breeds of pigeon, so I'll get 4 clusters) and/or 3) produces clusters that are compact and well-separated (the means are reasonably far apart, and most observations are close to just one cluster centroid).
If you don't have one or more of the conditions above, then you can rely on some statistical measures to evaluate the fit of a k-cluster solution. There are several popular metrics that you can compare across cluster solutions. Two that are easy to find in SAS are the Cubic Clustering Criterion (CCC) and the Pseudo-F statistic (PSF). You have to run the k-means clustering multiple times to get the CCC or PSF for each solution and compare them. There is also a nifty hybrid approach that combines hierarchical and k-means clustering. An example can be found in the PROC CLUSTER documentation here: go.documentation.sas.com/?docsetId=statug&docsetTarget=statug_cluster_examples03.htm&docsetVersion=15.1&locale=en
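To illustrate the compare-across-k idea, here is a minimal sketch (plain Python with made-up data, not SAS output) that computes the pseudo-F statistic -- the Calinski-Harabasz index, i.e. between-cluster mean square over within-cluster mean square -- for several candidate values of k:

```python
# Sketch: compare k-means solutions with the pseudo-F statistic (PSF),
# PSF = (between-cluster SS / (k-1)) / (within-cluster SS / (n-k)).
# Standard library only; the CCC is more involved, so only PSF is shown.
import random
import math

random.seed(2)

# Two well-separated 2-D blobs, so a good criterion should favor k = 2.
data = ([[random.gauss(0, 0.3), random.gauss(0, 0.3)] for _ in range(50)] +
        [[random.gauss(5, 0.3), random.gauss(5, 0.3)] for _ in range(50)])

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm: returns (centroids, assignments)."""
    centroids = random.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: math.dist(p, centroids[c]))
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assign

def pseudo_f(points, centroids, assign):
    """Between-cluster mean square over within-cluster mean square."""
    n, k = len(points), len(centroids)
    grand = [sum(col) / n for col in zip(*points)]
    within = sum(math.dist(p, centroids[assign[i]]) ** 2
                 for i, p in enumerate(points))
    between = sum(assign.count(c) * math.dist(centroids[c], grand) ** 2
                  for c in range(k))
    return (between / (k - 1)) / (within / (n - k))

# Run the clustering once per candidate k, as described above.
scores = {}
for k in range(2, 6):
    centroids, assign = kmeans(data, k)
    scores[k] = pseudo_f(data, centroids, assign)
print(scores)  # the largest PSF suggests the best k
```

This mirrors the workflow described above: fit each k separately, collect the criterion for each solution, and compare. In SAS, PROC FASTCLUS reports the PSF (and CCC) directly, so you would compare the printed values rather than compute them by hand.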
I hope this helps! Thanks for watching.
Thank You Madam
Thanks for tuning in!
If I have an ID variable that identifies each observation, is there a way to assign each ID a new variable (Cluster) that shows which cluster it belongs to, in Visual Analytics?
Dritan, we are checking on this for you!
Dritan, please look at time stamp 9:21 in the video, where the instructor talks about assigning a Cluster ID. This is a feature of Visual Statistics in Visual Analytics. If you need more help with assigning cluster IDs, we suggest that you work with SAS Tech Support, because this feedback area is not the appropriate place to resolve usage questions. To open a track with Tech Support, fill out the form at this link: 2.sas.com/6058HoDPs.
Great!!!
You didn't show how to cluster observations in University Edition.
We no longer offer University Edition. Are you new to SAS and looking to learn?