Thank you! I was reading papers about TPEs and it seems they all either are very high level in how the calculations work (the original paper seems to explain some things but not others at all) or like 70 pages long and with tons of less useful information added in… this is such a great video :)
That is a great question! Unfortunately, I do not know of such a rule of thumb. This really depends on the problem and the importance of the different hyperparameters. Normally, hyperparameter tuning is very expensive and we have to go with a small number of observations. TPE can more effectively explore the search space than, for example, random search. However, TPE is often warm-started with random search: you can for example test between 10 and 30 randomly selected configurations and use TPE to select new promising configurations.
Good question! No, we are computing a mixture for the good observations and a mixture for the bad observations independently of each other. For example, we put a Gaussian distribution over every observation for the good distribution, and combine these Gaussians for every good point to fit a "good" distribution. Likewise for the bad points. Hope this helps!
Thank you. Great question! It is called "Parzen Estimtor" because it builds on Parzen windows for building probabilistic models (stats.stackexchange.com/questions/244012/can-you-explain-parzen-window-kernel-density-estimation-in-laymans-terms). The "tree" part refers to the fact that the configuration space of hyperparameters has a tree structure.
Amazing!.Thank you.As a newbie, I would like to ask a few questions. What is the point of using Gaussian mixture for each hyperparameter in this paper? I know the distribution of good point and bad point is calculated by kernel density estimation(Parzen Estimator). But I don't understand how the mixture of gaussian is related to the distribution of good point and bad point. I probably don't know much about TPE yet so the question might be a little weird. I hope you can understand. Looking forward to your reply ! Thank you !
That's a good question! In TPE, we create two partitions: the good partition and bad partition. These contain the observations with either good or bad performance, respectively. As you mention, we do kernel density estimation for both partitions. Within kernel density estimation, we basically perform a weighted sum over all observations in a given partition, where the weight is given by the kernel. This summation naturally also occurs in a mixture of Gaussians (we sum up individual the probabilities of the individual Gaussians). So when each Gaussian corresponds to a given observation, we retrieve a summation over all data points, as in kernel density estimation. Hope this helps! :)
@@aixplained4763 Thank you for your answering! You help me a lot. I have another question.In Alois Bissuel Blog, he said 'One of the great drawback of tree-structured Parzen estimators is that they do not model interactions between the hyper-parameters. ' How do I understand this? Does it mean there is no relationship between hyperparameters? Looking forward to your reply ! Thank you!
@@JW-iw9fe TPE models the densities for hyperparameters separately (for every hyperparameter, we fit l(x) and g(x)). As a consequence, it does not know how the individual densities are affected by other hyperparameter values, because it models them in isolation. This is a limitation of this method as in reality, the optimal value of a hyperparameter depends on the values of other hyperparameters.
@@aixplained4763 You are so nice. Thank you so much! Your answer completely solved my confusion! Even with these drawbacks, is TPE used more in HPO? And can you tell me some of the most popular technologies in HPO? Thank you!
@@JW-iw9fe In spite of this limitation, TPE is quite commonly used as a surrogate model for computing the expected improvement of configurations. Some popular HPO methods include SMAC (uses random forests instead of TPE as surrogate model) and BOHB (does use TPE). In BOHB, however, they used a multidimensional TPE such that interactions between hyperparameters can be modeled. If you are interested, you can read more here: www.automl.org/automated-algorithm-design/algorithm-configuration/smac/ and arxiv.org/pdf/1807.01774.pdf
Thank you :) That's a great question! Hyperband has a theoretical guarantee to find the optimal configuration with probability 1 given an infinite budget (because it does a random search in the final bracket). Using TPE as a sampler changes this, as TPE can get stuck in local optima and never explore parts of the search space (thus potentially missing out on the optimal configuration). A workaround to this would be to choose a configuration at random with a small probability (and TPE otherwise) so that we retrieve the guarantees of random search. These theoretical guarantees are, however, not very useful in practice.
Your video shows the optimisation of a hyperparameter using TPE. When the TPE optimises a set of parameters, how does it work? optimising each hyperparameter one by one with the other parameters fixed, or all hyperparameters together? I hope you can answer this, I am confused.
That is a great question! When you have multiple hyperparameters you want to tune, you fit P(x|good) and P(x|bad) for every individual hyperparameter x (for example, we could do this for the learning rate and batch size). This means that every hyperparameter is optimized independently of the other hyperparameters and that interactions cannot be modeled. This is one of the main shortcomings of TPE, but can be addressed by using multivariable distributions like P(x_1, x_2|good), although that brings about new issues :)
@@aixplained4763 Thank you for your answer, does this mean that using TPE for hyperparameter optimisation will ignore the link between the hyperparameters. For example, when tuning for learning rate and batch size, the results they get will be significantly different between optimising learning rate before batch size using TBE and optimising batch size before learning rate?
@@icelover7166 Yes, exactly. TPE cannot learn the interaction effects (or links) between different hyperparameters. Concerning the latter example you mention: this can indeed be the case but it depends on how the tuning is performed. In the regular case, you perform a random search for a few iterations (a warm-up) where you select random hyperparameter configurations (one configuration = a certain value for the learning rate and batch size in our case). These can be used to fit the densities used by TPE, and for selecting promising configurations to try out next. If instead you decide to first tune the batch size and fixing all other hyperparameters, and afterward tune the learning rate, fixing all other hyperparameters, then the outcome can indeed be very different compared to the case described above. Hope this helps!
Thank you, you saved my life😢❤
Such a clear explanation! Out of the woods, at last. Thank you for the time you put in making this video.
The explanation was amazing and very clear, thank you so much :D
Thank you! I was reading papers about TPEs and it seems they all either are very high level in how the calculations work (the original paper seems to explain some things but not others at all) or like 70 pages long and with tons of less useful information added in… this is such a great video :)
Thank you for your kind comment :) Glad that it was helpful!
So well explained, thank you so much for sharing.
really nice explanation! thank you!
Very well explained
Any rule of thumb for how many observations needed for optimizing N hyperparameters?
That is a great question! Unfortunately, I do not know of such a rule of thumb. This really depends on the problem and the importance of the different hyperparameters. Normally, hyperparameter tuning is very expensive and we have to go with a small number of observations. TPE can more effectively explore the search space than, for example, random search. However, TPE is often warm-started with random search: you can for example test between 10 and 30 randomly selected configurations and use TPE to select new promising configurations.
@@aixplained4763 Thanks! And great video btw.
For the mixture of Gaussians, are we taking the mixture of both bad and good distributions for each observations?
Good question! No, we are computing a mixture for the good observations and a mixture for the bad observations independently of each other. For example, we put a Gaussian distribution over every observation for the good distribution, and combine these Gaussians for every good point to fit a "good" distribution. Likewise for the bad points. Hope this helps!
great explenation!!!!! thank you
Nicely done. Thank you. Do you know why it’s called Tree Parzen Estimators?
Thank you. Great question! It is called "Parzen Estimtor" because it builds on Parzen windows for building probabilistic models (stats.stackexchange.com/questions/244012/can-you-explain-parzen-window-kernel-density-estimation-in-laymans-terms). The "tree" part refers to the fact that the configuration space of hyperparameters has a tree structure.
Amazing!.Thank you.As a newbie, I would like to ask a few questions.
What is the point of using Gaussian mixture for each hyperparameter in this paper?
I know the distribution of good point and bad point is calculated by kernel density estimation(Parzen Estimator).
But I don't understand how the mixture of gaussian is related to the distribution of good point and bad point.
I probably don't know much about TPE yet so the question might be a little weird.
I hope you can understand. Looking forward to your reply !
Thank you !
That's a good question! In TPE, we create two partitions: the good partition and bad partition. These contain the observations with either good or bad performance, respectively. As you mention, we do kernel density estimation for both partitions. Within kernel density estimation, we basically perform a weighted sum over all observations in a given partition, where the weight is given by the kernel. This summation naturally also occurs in a mixture of Gaussians (we sum up individual the probabilities of the individual Gaussians). So when each Gaussian corresponds to a given observation, we retrieve a summation over all data points, as in kernel density estimation. Hope this helps! :)
@@aixplained4763 Thank you for your answering! You help me a lot. I have another question.In Alois Bissuel Blog, he said 'One of the great drawback of tree-structured Parzen estimators is that they do not model interactions between the hyper-parameters. ' How do I understand this?
Does it mean there is no relationship between hyperparameters?
Looking forward to your reply !
Thank you!
@@JW-iw9fe TPE models the densities for hyperparameters separately (for every hyperparameter, we fit l(x) and g(x)). As a consequence, it does not know how the individual densities are affected by other hyperparameter values, because it models them in isolation. This is a limitation of this method as in reality, the optimal value of a hyperparameter depends on the values of other hyperparameters.
@@aixplained4763 You are so nice. Thank you so much!
Your answer completely solved my confusion!
Even with these drawbacks, is TPE used more in HPO?
And can you tell me some of the most popular technologies in HPO?
Thank you!
@@JW-iw9fe In spite of this limitation, TPE is quite commonly used as a surrogate model for computing the expected improvement of configurations. Some popular HPO methods include SMAC (uses random forests instead of TPE as surrogate model) and BOHB (does use TPE). In BOHB, however, they used a multidimensional TPE such that interactions between hyperparameters can be modeled. If you are interested, you can read more here: www.automl.org/automated-algorithm-design/algorithm-configuration/smac/ and arxiv.org/pdf/1807.01774.pdf
Amazing course! What theorethically implications there exists when using TPE as a sampler and Hyperband as a prunner?
Thank you :) That's a great question! Hyperband has a theoretical guarantee to find the optimal configuration with probability 1 given an infinite budget (because it does a random search in the final bracket). Using TPE as a sampler changes this, as TPE can get stuck in local optima and never explore parts of the search space (thus potentially missing out on the optimal configuration). A workaround to this would be to choose a configuration at random with a small probability (and TPE otherwise) so that we retrieve the guarantees of random search. These theoretical guarantees are, however, not very useful in practice.
Your video shows the optimisation of a hyperparameter using TPE. When the TPE optimises a set of parameters, how does it work?
optimising each hyperparameter one by one with the other parameters fixed, or all hyperparameters together? I hope you can answer this, I am confused.
That is a great question! When you have multiple hyperparameters you want to tune, you fit P(x|good) and P(x|bad) for every individual hyperparameter x (for example, we could do this for the learning rate and batch size). This means that every hyperparameter is optimized independently of the other hyperparameters and that interactions cannot be modeled. This is one of the main shortcomings of TPE, but can be addressed by using multivariable distributions like P(x_1, x_2|good), although that brings about new issues :)
@@aixplained4763 Thank you for your answer, does this mean that using TPE for hyperparameter optimisation will ignore the link between the hyperparameters. For example, when tuning for learning rate and batch size, the results they get will be significantly different between optimising learning rate before batch size using TBE and optimising batch size before learning rate?
@@icelover7166 Yes, exactly. TPE cannot learn the interaction effects (or links) between different hyperparameters. Concerning the latter example you mention: this can indeed be the case but it depends on how the tuning is performed. In the regular case, you perform a random search for a few iterations (a warm-up) where you select random hyperparameter configurations (one configuration = a certain value for the learning rate and batch size in our case). These can be used to fit the densities used by TPE, and for selecting promising configurations to try out next.
If instead you decide to first tune the batch size and fixing all other hyperparameters, and afterward tune the learning rate, fixing all other hyperparameters, then the outcome can indeed be very different compared to the case described above. Hope this helps!
@@aixplained4763 This is very helpful. Thank you
Good shit