Very easy to understand for 1st timers. Great work. Appreciated.
Thanks mate!
Hello sir,
Which algorithm works well for customer segmentation w.r.t. Recency, Frequency, and Monetary value?
And is it necessary to apply all the algorithms (K-means, DBSCAN, hierarchical) to the dataset and then come to a conclusion?
Excellent video. Very well explained. Thank you so much.
You are welcome!
Thank you for this very useful video
nice, clear explanation, thank you.
Great! One question: what do you mean when you write "dist i=dist of the 5th neighbor of the ith data point"? What is the neighbor in this case? Thank you
dist = an array of n elements
dist[i] stores the distance of the 5th nearest data point from the ith data point
n = number of data points
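In case it helps, a minimal sketch of computing that dist array with scikit-learn (the toy dataset and the choice of 5 neighbours are placeholders, just for illustration):
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data, placeholder only
nn = NearestNeighbors(n_neighbors=6).fit(X)  # 6 = the point itself + its 5 nearest neighbours
distances, _ = nn.kneighbors(X)              # each row is sorted; column 0 is the point itself
dist = distances[:, 5]                       # dist[i] = distance to the 5th nearest neighbour of point i
dist_sorted = np.sort(dist)                  # plot this and look for the "elbow" to pick epsilon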
you made that easy! glad that i found you :)
Thanks a lot! :D
This was very helpful. Thank you!
You're very welcome!
Great! Can you please provide a more detailed explanation of the DBSCAN algorithm?
How do we get the exact values of the outliers in the dataset from this DBSCAN clustering? Thank you
Exact values of the outliers...meaning?
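If the question is how to pull out the points DBSCAN marks as outliers, here is a minimal sketch (the toy data and parameters are placeholders); scikit-learn labels noise points with -1:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.1, random_state=0)  # toy data
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
outliers = X[db.labels_ == -1]   # rows of X that DBSCAN labelled as noise
print(outliers)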
Hello Normalizer, I am wondering:
If DBSCAN doesn't handle higher dimensionality very well, does standardizing improve performance when there is a moderate degree of correlation between features/dimensions?
By the 5th neighbour, do you mean the 5th radially farthest point from the ith point? What if many points are lying at the 5th position?
A point can have any number of equidistant neighbors. The algorithm just checks how many points are inside the circle.
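A tiny numeric illustration of that count (made-up points and epsilon; note the point itself is included):
import numpy as np
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [2.0, 2.0]])  # toy points
eps = 0.5
p = X[0]
inside = np.linalg.norm(X - p, axis=1) <= eps   # which points fall inside the epsilon-circle around p
print(inside.sum())                             # 3, counting p itself; p is a core point if MinPts <= 3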
Please provide us the code to copy.
z surely can't refer to neighbours only, it must also include the point itself?
Yes, z includes the point itself. (Sorry for the late reply)
Thank you for the video with a clear explanation. Could you also show how to find optimal z and epsilon in sklearn?
Well explained content!!
Thanks a lot ❤️❤️
Thanks for the great video! I have two questions that I want to ask:
1. You said DBSCAN performs poorly for high-dimensional data. How many dimensions are considered high?
2. Why is it bad for high-dimensional data?
1. That's a very subjective question. For some datasets it's 100, for others it might be 1000. It depends on the distribution of the data.
2. Because we use Euclidean distance to find the neighbourhood points. Euclidean distance is bad for searching in higher dimensions because the epsilon-ball covers only a tiny fraction of the volume of its circumscribing hypercube!
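You can see that shrinking ratio numerically; a quick sketch comparing the volume of a unit-radius ball to its circumscribing hypercube (side length 2) as the dimension grows:
from math import pi, gamma
def ball_to_cube_ratio(d):
    # volume of a unit-radius d-ball divided by the volume of its circumscribing hypercube (side 2)
    return (pi ** (d / 2) / gamma(d / 2 + 1)) / (2 ** d)
for d in (2, 3, 10, 30):
    print(d, ball_to_cube_ratio(d))
# roughly: 2 -> 0.785, 3 -> 0.524, 10 -> 0.0025, 30 -> 2e-14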
@@NormalizedNerd
Thanks for answering!
1. Distribution of each feature? Can't we just normalize all features?
Very nice explanation. Thank you!!
Can you please make a video on HDBSCAN?
Thanks for the suggestion.
How can we input Excel or CSV data while using this algorithm?
Pretty easy...
import pandas as pd
df = pd.read_csv("path_to_csv_file.csv")
# then use iloc to select the columns for features (and target, if any) and put them in X and Y
Great video! Is there any function built into scikit-learn that can plot the clusters like the show_clusters function you have in this video?
IDK if scikit learn can do that but you can do a scatter plot using seaborn to indicate the clusters.
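For reference, a minimal seaborn sketch along those lines (the toy data and DBSCAN parameters are placeholders):
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # toy 2-D data
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)         # -1 marks noise points
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=labels, palette="deep")
plt.show()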
Very useful... So I have one doubt: assuming we have created the clusters, how do we create a buffer or outer polygon for those clusters?
Thanks!
You need something called convex hull.
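A small sketch of that idea with scipy's ConvexHull (the toy data and DBSCAN parameters are placeholders):
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy 2-D points
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
for k in set(labels) - {-1}:                                  # skip the noise label (-1)
    pts = X[labels == k]
    hull = ConvexHull(pts)
    ring = np.append(hull.vertices, hull.vertices[0])         # close the boundary polygon
    plt.plot(pts[ring, 0], pts[ring, 1])
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.show()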
@@NormalizedNerd thanks man... that's everything I need ...
OMG man it's working.....I have been searching in the wrong direction for over 1 week....this one word opened doors to all my answers😭😭... thanks again man....
@@cruzab3153 Haha...Yeah it happens. Happy to help :D
Thank you so much, what a great explanation! I have a question: can we use PCA before clustering with DBSCAN? If yes, which dimensionality should I use: before PCA (in this case I have 30 dimensions), or after PCA with 3 dimensions?
Yes, you can try to reduce the dimension using PCA and then cluster using DBSCAN.
@@NormalizedNerd Then for MinPts, if I fit the PCA dataframe to the DBSCAN algorithm, which one should I use: MinPts = 2*30 - 1 = 59 (original number of features) or MinPts = 2*3 - 1 = 5 (PCA features)? (This refers to the heuristic approach by the inventor of the DBSCAN algorithm, Martin Ester, 1996.)
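For what it's worth, a minimal sketch of the PCA-then-DBSCAN pipeline mentioned above (the dataset, component count, and DBSCAN parameters are placeholders, not recommendations):
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
X, _ = load_breast_cancer(return_X_y=True)       # a 30-dimensional example dataset
X_scaled = StandardScaler().fit_transform(X)     # scale the features before PCA
X_pca = PCA(n_components=3).fit_transform(X_scaled)
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X_pca)   # tune eps/min_samples on the data DBSCAN actually sees
print(np.unique(labels, return_counts=True))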
Very well explained. Can we get more use cases for DBSCAN for better understanding?
give this a read: datascience.stackexchange.com/questions/10063/for-which-real-world-data-sets-does-dbscan-surpass-k-means
Can you please do it with an image
Nice suggestion!