Consider this me asking nicely (BEGGING) for the non-linear regression/Bayesian video! :D Also, arm twist! Arm twist!
Noted!
Yeah, please do it
How have I not found this channel sooner?! Amazing stuff, binge-watching this channel.
Hello, I am a statistician. I live in Africa and really appreciate these lessons. So fun. Thanks to the teacher.
Here's another vote for a nonlinear regression analysis video. That approach made sense for my dissertation research (inverse problems with mechanistic time-series models), and I'm curious what your perspective is. It seems to me like weighted least squares can work well in many heteroscedastic contexts if you assume residuals are independent and have a constant CoV.
I agree, there are some heteroskedastic processes with parameters that can be estimated with weighted least squares.
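For what it's worth, here's a minimal sketch of that idea in R (simulated data and made-up names; assuming "constant CoV" means variance proportional to the squared mean, so weights of 1/fitted^2 are a natural choice):
set.seed(5)
x <- seq(0.5, 10, length.out = 80)
mu <- 5 * exp(-0.3 * x)                  # a mechanistic-style decay curve
y <- mu * (1 + rnorm(80, sd = 0.15))     # multiplicative noise, so roughly constant CoV
fit0 <- nls(y ~ a * exp(-k * x), start = list(a = 4, k = 0.2))     # unweighted pass
w <- 1 / fitted(fit0)^2                  # weights from the provisional fit
fit1 <- nls(y ~ a * exp(-k * x), start = coef(fit0), weights = w)  # weighted refit
summary(fit1)
A second reweighting pass (recomputing w from fit1) rarely changes much, but it is cheap to check.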
I look at non-linear regression and Bayesian regression as logically independent classes of models. They can both involve using more fundamental principles, rather than just grabbing a recipe off the shelf as the other extreme, which I think is a valuable skill for a statistician to have.
Well done! Looking forward to the GLM video. I still did not fully understand the link functions there.
FWIW the Wikipedia page on generalized linear models discusses the role of the link function explicitly.
I've already made a video on GLMs: th-cam.com/video/SqN-qlQOM5A/w-d-xo.html
Plotting the residuals can be very beneficial for learning about the performance of a predictive model. There is a common pitfall worth mentioning though. The distribution of the residuals is not in general the likelihood distribution.
Take for example an observation and an independent model prediction,
Y ~ Poisson(lambda)
and
X ~ Poisson(mu).
If you compute the residual Y - X, you will obtain a Skellam random variable rather than a Poisson random variable.
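A quick simulation sketch of the point (the rates are hypothetical):
set.seed(1)
lambda <- 4; mu <- 3
y <- rpois(1e5, lambda)   # observations, Poisson likelihood
x <- rpois(1e5, mu)       # independent Poisson predictions
r <- y - x                # residuals
mean(r)                   # ~ lambda - mu = 1
var(r)                    # ~ lambda + mu = 7; a Poisson would have variance equal to its mean
The mean-variance mismatch (and the negative values) is the giveaway that the residuals are Skellam, not Poisson.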
Hey Dustin! Speaking of non-linear data, what about a video on Generalized Additive (Mixed) Models? GA(M)Ms?! I'm sure it'd be sooooo useful for many of us!!!
Data itself is almost never linear, although in special cases it can be. For a relation (a subset of the Cartesian product of two sets) to be linear it must be a function (i.e. left-total and right-unique) satisfying degree-one homogeneity of scaling and additivity. The only data sets I have encountered that were linear were synthetic examples.
I've used GAMs with time-series data, and I can readily see GAMMs being similarly useful for hierarchical time series.
I take it back. I don't think any finite sample can be linear, since for any nonzero point there will exist a scalar multiple of it that is not in the data set.
Bottom of the heap has a nice video on it.
I would recommend Fractional Polynomial Models that identify the best transformations of the covariates, with the obvious risk of overfitting and ambiguity in the interpretation of the coefficients.
Oh neat, I didn't know that approach by name. I agree that overfitting is the largest risk with fractional polynomial models, since they're a natural superset of polynomial models.
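For anyone curious, a rough sketch of a first-degree fractional-polynomial search in base R (the data frame dat and columns y and x are hypothetical, and x must be positive; dedicated packages such as mfp automate the full multivariable procedure):
powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)   # the conventional FP1 power set
fits <- lapply(powers, function(p) {
  d <- dat
  d$xt <- if (p == 0) log(d$x) else d$x^p    # power 0 is taken to mean log(x)
  lm(y ~ xt, data = d)
})
best <- which.min(sapply(fits, AIC))
powers[best]                                  # selected transformation
summary(fits[[best]])
Selecting the power and then interpreting the coefficient as if it had been pre-specified is exactly where the overfitting and the interpretation ambiguity creep in.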
I have noticed the word "line" in "linear", but unfortunately the terminology is more complicated than Dustin presented. I'll give a couple of reasons:
The first is that they are not synonyms in mathematics. All lines are linear, but not all linear functions are lines. For example, the derivative operator is linear on the space of analytic functions, but it is not a line per se.
The second is that statisticians were focused on the parameters when they coined the term "linear model". Conventionally, "linear model" refers to a regression model whose conditional expectation is linear in the unknown parameters. This makes both the polynomial regression and the log-transformed regression model in the video special cases of linear models.
On the log transform, you say the estimate b is now on a log scale, yes. But that is not a problem for interpretation once you transform it back to where it came from: exponentiate that value and you can interpret it again. So there is no real "cost" there. But overall nice video, as always :D
I agree.
Also, since the logarithm is monotonic we can readily anticipate the direction of change in the conditional expectation when we consider a change in one of the predictors.
Supposing for example the conditional expectation
E[Y|X=x] = exp(m * x + b)
then it is straightforward to take the derivative with respect to x via the chain rule of calculus:
dE[Y|X=x]/dx = m * exp(m * x + b)
Thus we can calculate how much Y is changing on average with respect to a change in x by knowing m, x, and b.
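A small sketch of that interpretation under the same assumption that E[Y|X=x] = exp(m * x + b); the data and names are made up:
set.seed(2)
x <- runif(200, 0, 5)
y <- exp(1 + 0.4 * x + rnorm(200, sd = 0.2))
fit <- lm(log(y) ~ x)
m <- coef(fit)["x"]; b <- coef(fit)["(Intercept)"]
exp(m)               # multiplicative change in the fitted curve per one-unit increase in x, ~ exp(0.4)
m * exp(m * 2 + b)   # slope of exp(m*x + b) at x = 2, via the chain rule
m * exp(m * 4 + b)   # slope at x = 4 is larger, so the change per unit is not constant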
Except that it's not a constant change anymore, meaning we can't say "for every change in our predictor, there is an x-point change in y."
@@QuantPsych That's true.
Since not everything is linear (indeed, most things are not), it is wise not to recoil from introducing non-linearity into models when it is warranted.
Reading Kit Yates's "How to Expect the Unexpected", I came across the term "linearity bias". It is informally defined as a cognitive bias toward assuming that changes are linear. One concern I have about only (or predominantly) teaching models that are linear in the predictors is that it may inculcate or reinforce linearity bias in students. But I'm not well read in the psychology or education literature, so I can't say whether that concern has been addressed; it's just a concern for now.
It has been a long while since I have really thought about semi-partial correlation coefficients. But if memory serves, it does not in general equal the conditional correlation coefficient except under certain families of distributions. A standard sufficient condition for the partial correlation to equal the conditional correlation is joint multivariate normality (and, more generally, an elliptical joint distribution).
thanks!
Thanks for your work as always. I am getting into Bayesian statistics, so it would be great to see you go into Bayesian regression. Please please please!
It's on my to-do list :)
What package has the visualize() function? Great explanations, as usual!
flexplot
How do you interpret the coefficients of the polynomial model?
Hint: under mild assumptions, the conditional expectation of the outcome given a predictor will be monotonic in the coefficients.
It's not intuitive. It's the expected change in Y when the square of X increases by one unit. The only thing that's really intuitive is the sign (positive versus negative, indicating whether it's concave upward or downward, respectively). I usually don't bother interpreting it. I just look at the plot.
@@QuantPsych In the case of a quadratic polynomial the sign of the leading coefficient tells us about concavity/convexity. This works because a twice-differentiable function of a single variable is convex if and only if its second derivative is nonnegative on its entire domain; a similar result holds for concavity. Polynomials are always twice-differentiable, but many are neither convex nor concave over their entire domain. The second derivative of a quadratic is constant and equal to twice the leading coefficient (so it has the same sign), which is why the inference is straightforward in this case. The leading term of higher-degree polynomials cannot reliably be used this way.
For single-variable functions I'd give the same recommendation: just look at a plot. When you get into multivariable systems (which is typical of realistic systems), it is much more difficult to eyeball convexity/concavity. Trying to visually infer it from PCA plots or parallel-axis plots is unlikely to be reliable, for example. If you're lucky enough to have a function that is twice-differentiable in all its inputs, you can generalize the single-variable result: find the stationary points using the gradient, then use the Hessian to (1) determine which of those points are optima and (2) use the signs of the Hessian's eigenvalues at those points to assess local convexity/concavity/neither.
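A tiny sketch of the multivariable check, using a two-variable quadratic where the Hessian is constant, so the statement is global (for general functions it is only local):
# f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
H <- matrix(c(2, 1,
              1, 6), nrow = 2, byrow = TRUE)   # Hessian of f, computed by hand
eigen(H)$values                                # both positive, so f is convex everywhere
# Numerical alternative if you'd rather not differentiate by hand
# (assumes the numDeriv package is installed):
f <- function(v) v[1]^2 + v[1] * v[2] + 3 * v[2]^2
numDeriv::hessian(f, c(0, 0))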
In other words, it depends...@@QuantPsych
Great video! Any chance you'd be open to sharing a link to the dataset used, so we can re-create the exercise and try it ourselves? Thank you!
If the data are not available, you can readily simulate data suitable for these cases if you just need something to practice on.
Most of my datasets are here:
quantpsych.net/data/
That particular one is called depression_wide
Thank you for the link. Am I completely blind, or does the depression_wide set not contain any of the variables in the video (e.g., the cancer-related ones or rizz)? Sorry if I'm missing it somewhere.@@QuantPsych
The biggest limitation of polynomial regression is overfitting. Via the Stone-Weierstrass theorem we can say that, with enough polynomial terms, we can fit any continuous function on a bounded range as closely as we like. In fact, many functions (including the exponential function; wink wink) have a Taylor series, which is essentially a polynomial with infinitely many terms.
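A quick illustration of that risk with simulated data (a made-up example): in-sample fit keeps improving with the degree while held-out error eventually worsens.
set.seed(4)
n <- 60
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- sin(2 * dat$x) + rnorm(n, sd = 0.3)
train <- sample(n, 40)
for (d in c(1, 3, 5, 9, 15)) {
  fit <- lm(y ~ poly(x, d), data = dat[train, ])
  pred <- predict(fit, newdata = dat[-train, ])
  cat(sprintf("degree %2d: train R^2 = %.3f, test MSE = %.3f\n",
              d, summary(fit)$r.squared, mean((dat$y[-train] - pred)^2)))
}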
Why don't you use poly(var_name, n) instead, for orthogonal polynomials?
Because I can never remember how to do that.
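For reference, a minimal sketch of both spellings (hypothetical data frame dat with columns x and y): the raw-power and orthogonal versions give identical fitted values; poly() just builds uncorrelated columns, which can help numerically.
fit_raw  <- lm(y ~ x + I(x^2), data = dat)
fit_orth <- lm(y ~ poly(x, 2), data = dat)              # orthogonal basis (the default)
all.equal(fitted(fit_raw), fitted(fit_orth))            # TRUE: same predictions
fit_raw2 <- lm(y ~ poly(x, 2, raw = TRUE), data = dat)  # equivalent to x + I(x^2)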
Still don't know how to actually calculate polynomial regression
3:38
See line 25 of my R code
The phrasing "y=x^2 gives a polygon" must have been a brain fart. It happens. Polygons and polynomials are distinct mathematical concepts.
Yes, the brain did indeed fart.
You ARE hot. And entertaining. Great video. I will have to learn this stuff eventually... Subscribed.
Ha! Flattered again :)