KD-Trees begins at 28:50
this series is a work of art. needs way more views.
The best explanation for Gaussian Process ever!
Most intuitive explanation of the topics in classroom
Love this. Probably the clearest explanation i have seen on GP online.
The previous video and the current one are the best material I watched on Gaussian Processes! Wonderful :)
definitely! I saw many, but this one is one of the best
Thank you very much! I've tried a couple of times to understand GPs but always gave up. Now I think they're much clearer to me. Very, very grateful.
Prof Kilian killin' it! Thanks prof for all the lectures. This course should be the first introduction to the machine learning world for everyone.
Omg. I love this lecture material. To the point, clear and the best!
awesome simulation of a beautiful application !
You make it look easy ! Thanks for the clear explanation of GP.
This helps me with my exam preparation, thank you.
To clarify, for 26:19: for a Gaussian process, each data point on the x-axis would be a queried test point, the grey region would be the standard deviation, and the points that we have not "queried" would be fitted according to their respective distributions, each of which is itself a Gaussian with its own mean and s.d.?
Exactly :-)
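For anyone who wants to see this exchange concretely, here is a minimal numpy sketch (my own toy example with an assumed RBF kernel and made-up data, not the lecture's code): it computes the posterior mean and standard deviation at a grid of test points, and the grey band in the plots corresponds to mean ± a couple of standard deviations at each test point.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 * lengthscale^2))
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

# toy training data (the points we have "queried")
X_train = np.array([-4.0, -1.0, 0.5, 3.0])
y_train = np.sin(X_train)

# test points: every location on the x-axis where we want a prediction
X_test = np.linspace(-5, 5, 200)

noise = 1e-2  # assumed observation-noise variance
K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
K_s = rbf_kernel(X_train, X_test)     # train-vs-test covariances
K_ss = rbf_kernel(X_test, X_test)     # test-vs-test covariances

K_inv = np.linalg.inv(K)
mean = K_s.T @ K_inv @ y_train                 # posterior mean at each test point
cov = K_ss - K_s.T @ K_inv @ K_s               # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))  # per-point std (the grey band)

# at every x in X_test the prediction is a Gaussian: N(mean[i], std[i]^2)
```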
9:30, so my test point y_test has 1) its own variance and 2) n correlations with respect to all observed data y1...yn; how do we then determine the distribution of y_test? How did you get the conclusion at 11:06? Thanks!
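For readers with the same question, the standard result being used (stated here from memory, assuming a zero-mean GP and noise-free observations for simplicity, not copied from the slides) is that conditioning a joint Gaussian on the observed outputs yields another Gaussian:

```latex
% Joint Gaussian over the training outputs y and the test output y_*:
\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},
\begin{bmatrix} K & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} \end{bmatrix}\right)
\quad\Longrightarrow\quad
y_* \mid \mathbf{y} \sim \mathcal{N}\!\left(\mathbf{k}_*^\top K^{-1}\mathbf{y},\;
k_{**} - \mathbf{k}_*^\top K^{-1}\mathbf{k}_*\right)
```

Here K holds the covariances among the observed y1...yn, the vector k_* holds the n correlations the commenter mentions, and k_** is the test point's own variance; plugging them into the conditioning formula gives the mean and variance of y_test.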
Best and simplest explanation of GPR.
Thank you for the lecture, very clear! Just one question: how does the Bayesian optimisation already have a mapped surface to start from?
initially that is just a flat surface, which is an uninformed prior.
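A tiny sketch of what that uninformed prior looks like in code (my own illustration, assuming a zero-mean GP with an RBF kernel): before any function evaluations the mean is a flat surface at zero, and any draw from the prior is just a random smooth function.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / lengthscale ** 2)

X = np.linspace(0, 10, 100)
K = rbf_kernel(X, X) + 1e-6 * np.eye(len(X))    # small jitter for numerical stability

prior_mean = np.zeros(len(X))                   # the "flat surface" before any data
samples = np.random.multivariate_normal(prior_mean, K, size=3)  # plausible surfaces under the prior
```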
Thank you very much, these lectures are really useful.
this is such a great lesson. Thanks!
Question: Regarding hyperparameter search via GP, I recall that the early steps of hyperparameter search involve determining the scale of the hyperparameter. How should we determine that scale? Should we use a GP for both the scale and the minimizing value at that scale, or use grid search to determine the scale and then a GP to find the hyperparameter value?
Thanks for both rigorous and enjoyable lectures :)
You keep running Bayesian optimization, which uses Gaussian processes; with more iterations it converges to smaller scales by itself.
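One common way to sidestep the scale question (my own illustration, assuming the scikit-optimize package is available; this is not code from the lecture) is to declare the hyperparameter on a log-uniform scale and let Bayesian optimization search that space directly, letting it narrow down the promising region on its own, as the reply above suggests.

```python
import numpy as np
from skopt import gp_minimize          # GP-backed Bayesian optimization
from skopt.space import Real

def validation_error(params):
    # Stand-in for "train the model and return validation error";
    # a synthetic curve with a minimum near learning rate 1e-3.
    lr = params[0]
    return (np.log10(lr) + 3.0) ** 2 + 0.1 * np.random.rand()

# one hyperparameter, searched on a log-uniform scale from 1e-6 to 1
space = [Real(1e-6, 1e0, prior="log-uniform", name="learning_rate")]

result = gp_minimize(validation_error, space, n_calls=25, random_state=0)
print("best learning rate:", result.x[0], "best error:", result.fun)
```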
Hi Prof,
In Bayesian optimisation, I assume the algorithm whose best hyper-parameters we are trying to find should be costly enough to evaluate; otherwise it makes no sense to run a GP on top of another algorithm.
It seems from your explanation that the covariance matrix is a simple kernel/distance matrix that does not take variable importance into account. (1) Does that cause any issues if there are variables with no significant predictive value? (2) Does it mean we have to be careful about variable selection? And (3) is there a way to incorporate feature importance into the kernel?
For the linear kernel that's not an issue (as your algorithm becomes identical to linear regression, where you learn a weight for each dimension); for non-linear kernels, however, that can indeed be a problem. One common trick is to multiply each feature dimension by a non-negative weight and learn these weights as part of the kernel parameters.
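A minimal sketch of that trick (often called an ARD, automatic relevance determination, kernel; my own toy code, not from the lecture): each feature dimension gets its own non-negative weight inside the squared-exponential kernel, and the weights can then be fit like any other kernel hyperparameter, e.g. by maximizing the marginal likelihood.

```python
import numpy as np

def ard_rbf_kernel(A, B, weights):
    """Squared-exponential kernel with a non-negative weight per feature dimension.
    A: (n, d), B: (m, d), weights: (d,). A large weight makes that feature matter more;
    a weight near zero effectively switches the feature off."""
    diff = A[:, None, :] - B[None, :, :]         # (n, m, d) pairwise differences
    sq = np.sum(weights * diff ** 2, axis=-1)    # weighted squared distances
    return np.exp(-0.5 * sq)

X = np.random.randn(5, 3)                        # 5 points, 3 features
w = np.array([1.0, 0.1, 0.0])                    # third feature is ignored entirely
K = ard_rbf_kernel(X, X, w)
```

Equivalently one can give each dimension its own length-scale; scikit-learn's GaussianProcessRegressor, for instance, will fit per-dimension length-scales if you pass an array-valued length_scale to its RBF kernel.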
Hm, isn't there maybe a way to do a low-dimensional "egg" search (if the data lies on a manifold there should always be some main directions)? So to start, make the ellipsoid stretched in just one dimension, and for the comparison distort the space so that the ellipsoid you're comparing with becomes a sphere... hm.
Awesome lecture! One question: are the projects available to the public? I have found the homeworks but no coding projects.
Sorry, I cannot post them. The projects are still used at Cornell University, and if they were public someone would certainly post solutions somewhere and spoil all the fun. :-(
For the hyperparameter search, wouldn't the Bayesian optimization approach be more likely to get stuck in a local minimum?
No, Bayesian optimization is global. The exploration component makes sure that you don’t get stuck.
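To make that exploration component concrete, here is a toy sketch of my own (assumed RBF kernel, made-up objective, not the professor's code) of Bayesian optimization with an upper-confidence-bound acquisition: the sigma term rewards regions that have not been sampled yet, which is what keeps pulling the search out of locally good areas.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls ** 2)

def gp_posterior(X_train, y_train, X_query, noise=1e-4):
    # Posterior mean and std of a zero-mean GP at the query points.
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf(X_train, X_query)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ y_train
    var = 1.0 - np.sum(K_s.T @ K_inv * K_s.T, axis=1)   # prior variance is 1 for this kernel
    return mu, np.sqrt(np.maximum(var, 1e-12))

def f(x):
    # stand-in for the unknown, expensive function we want to maximize
    return np.sin(3 * x) + 0.5 * np.sin(7 * x)

grid = np.linspace(0, 3, 300)          # candidate hyperparameter values
X_obs = np.array([0.3])                # one initial evaluation
y_obs = f(X_obs)

kappa = 2.0                            # exploration strength
for _ in range(15):
    mu, sigma = gp_posterior(X_obs, y_obs, grid)
    ucb = mu + kappa * sigma           # high where the mean is good OR the uncertainty is large
    x_next = grid[np.argmax(ucb)]      # the sigma term keeps proposing unexplored regions
    X_obs = np.append(X_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

print("best x found:", X_obs[np.argmax(y_obs)])
```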
Thank you Professor !!!
Living for that "YAY" 😂😂
This guy is brilliantly funny.
Important @8:00
One thing I would like to ask is, "what's the catch?" The algorithm seems great, but where would we not want to use GPR? Is it in situations where we would like to actually know what the function is? Or are there some situations where GPR won't work well?
Well, I wouldn’t recommend them for data that is very high dimensional (e.g. bag-of-words vectors, or images in pixel space). Also, when features are sparse, splitting along features becomes tedious and too restrictive, as almost all samples have zeros in almost all dimensions.
Are the homeworks available to the public?
www.dropbox.com/s/tbxnjzk5w67u0sp/Homeworks.zip?dl=0
Kilian Weinberger you must be an angel
Is it possible to use a B/B+ tree instead of a simple binary tree?
dead mouse got me
He clears his throat a lot
THANKS!
Are all these lectures dependent on previous ones?
Some more than others... but generally yes.
KD Tree starts from th-cam.com/video/BzHJ57QCdVo/w-d-xo.html