Hi everyone, this video has been getting a lot of views lately so I just wanted to say thank you, and I really appreciate all the positive feedback. It’s great to see such a positive response, and I’m glad that so many people are enjoying linear regression! :)
I also appreciate the constructive criticism! A few of you have pointed out that the music is distracting, the motion is too repetitive, and the pace is a bit slow. I didn’t see that when posting the video, but I can totally see where you’re coming from, so I’ll definitely take that into account when making future videos. This was one of my earlier videos and I was still figuring things out. So I really appreciate your feedback, and I hope these videos will get better over time.
You are not only teaching the math but also how to think. Thank you very much for the great video.
Really inspiring, glad I discovered this channel. Waiting for the videos about Jacobians, translation, rotation, and quaternions!
the constant animation loop gets a bit annoying. reversing, stopping and changing the animation from time to time would be a solution (and your newer videos are even better anyway!)
I agree. Honestly I look back on this video and cringe at a few of the details, like how the animation loop goes on and on and is a bit nauseating, and the music is too loud. But you live and learn! 😅 When I first started making these videos I really had no idea what I was doing.
@RichBehiel 18:19 On the left-hand side you used the Laplace symbol instead of the nabla symbol.
But except that => great video! 👍
With regards to pacing, I want to say that I really enjoy your general presentation style. You're not simply reading a script and getting the perfect take, you're actually doing a "live" presentation and I really appreciate the way you ad lib or go off on little tangents. I burst out laughing in your buoyancy video when you read the integral "zndS" phonetically.
I really love how calmly you speak and how the lines you say feel unscripted. Makes it feel very personal.
You also speak so clearly and concisely. I was able to get the gist of this with only high school calculus!
This is making me like math again.
I’m very glad to hear that! :)
Unbelievably crisp explanation of gradient descent. It is remarkable to see it play out in those dimensions. Thank you
And he repeats the animation so we can assimilate what's going on instead of quickly switching to the next thing. Very relaxed explanation which is nice.
As an Applied Math (Stats/Probability Theory focused) major, this really got me excited!
As an Econ Major, you have no idea how much this helped me understand the behind the scenes of regression lines and everything I've done in Statistics this semester, I've learned soo many new techniques with equation manipulation so, thank you!
Glad to hear that! :)
Another beautiful way to get a linear regression formula is to take the vector space of all real-valued functions that are defined for the x values, choose the hypothetical ideal function that maps all of the x's to their y's, and orthogonally project that hypothetical function onto the subspace of linear functions. By defining the inner product as the cartesian dot product between the output of the functions due to the x values, you'll see that the distance the projection minimizes is the error between the linear function and ideal function.
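For readers who want to try the projection picture above, here is a minimal sketch (assuming NumPy; the data and variable names below are made up for illustration). It builds the Gram matrix of the basis {1, x} under the comment's inner product and solves for the projection coefficients, which reproduces the ordinary least-squares line:

```python
import numpy as np

# Hypothetical data standing in for the video's dataset.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, x.size)

# Basis of the subspace of linear functions, sampled at the data's x values.
phi = [np.ones_like(x), x]                                # phi_0 = 1, phi_1 = x

# Inner product <f, g> = sum_i f(x_i) g(x_i), as defined in the comment above.
G = np.array([[np.dot(p, q) for q in phi] for p in phi])  # Gram matrix
c = np.array([np.dot(p, y) for p in phi])                 # <phi_i, ideal function>

b, a = np.linalg.solve(G, c)   # projection coefficients: intercept b, slope a
print(a, b)
print(np.polyfit(x, y, 1))     # cross-check: same slope and intercept
```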
A rare video that's technically adept and, most importantly, not condescending or pedantic! Well done, from a chemist and educator =)
great video! I love how at the start you explain the equation of a straight line and by the end it's multivariable vector calculus
I cannot finish the video because your voice is SO charming and comforting and makes me feel so safe, I just cannot pay attention in the maths
Thank you! This is definitely one of YouTube's math gems! Ties so many ideas together. I would love for you to do a video on Fourier Epicycles. For reference, GoldPlatedGoofs ‘Fourier for the rest of us’ is a great starting point. I’m sure you could do a beautiful refined version showing how the Inner Product, Fourier, QM, function spaces and Art all come together in a beautiful way.
Thank you so much for sharing your videos!
Thanks for the kind comment, John! :) I touch on Fourier analysis in my upcoming video on relativistic QM, the Klein-Gordon equation. Hoping to upload it within a week.
Absolutely beautiful visualization! Simple, smart and intuitive.
This is a piece of art, a captivating blend of deep understanding of the matter, beauty of plain graphics, voice acting, matrices, and "simple" software.
I saw the calculus approach coming a mile away but it's great to see the linear algebra done so clearly. I need to take that again.
Thanks a lot for clearly explaining the concept of fitting a linear regression so beautifully.
Sooo cool! As a cs major struggling with a numerical analysis class, this helped me understand linear regression so much better.
Thanks man!
this was SO satisfying! hope to see many more explanations, such a great execution!
Thanks! :)
Really very helpful - and I'm no professional in any of these fields, but just an old technician who is being reminded of all those brain neurons that have lain dormant for decades.
Thank you! I've been playing with a spherical geometry problem and there's so much I've forgotten from my school days. This video reminded me of so many things, including ways of expanding my approaches to problem solving. Brilliant 👌
he just dropped the most beautiful linear regression video and thought we wouldn't notice
Oh this was so SATISFYING! I don't think I have ever seen regression explained this way. It's like parts of how I understand it are being so wonderfully articulated by someone who obviously knows the subject matter well. I have had to teach myself mathematics and statistics, and I've always been drawn to this intuitive and philosophical way of understanding it. Thank you for this!
Thanks for the kind comment, and I’m glad you enjoyed the video! :)
It's so beautiful! Thank you a lot! I hope your channel is gonna grow fast soon
Thank you! :)
I absolutely loved this video. Please do more videos on regression and machine learning as a whole.
Wow! This is perfection explaining/visualizing complexity and its beauty! ❤❤❤❤ 👏👏👏👏👏
Great video! Coming from a linear algebra heavy background I still think taking the singular value decomposition of X, inverting it, and multiplying by y to find b is a more elegant and simple approach especially for multiple linear regressions, but I imagine if you have more experience with physics this approach would be more familiar and easier to digest. Keep these videos coming!
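In case anyone wants to see that route concretely, here is a small sketch (assuming NumPy; the data are invented). It applies the pseudoinverse through the SVD of the design matrix and cross-checks against a library solver:

```python
import numpy as np

# Hypothetical data for illustration.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = -0.7 * x + 3.0 + rng.normal(0, 0.5, x.size)

X = np.column_stack([x, np.ones_like(x)])     # design matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
b = Vt.T @ ((U.T @ y) / s)                    # pseudoinverse of X applied to y

print(b)                                      # [slope, intercept]
print(np.linalg.lstsq(X, y, rcond=None)[0])   # library routine, same answer
```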
The parameter space is a super powerful concept. Especially in computer vision, where you can take a bunch of pixels and quickly detect all the lines they approximately form
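That line-detection idea is essentially a Hough transform: every pixel votes for all the lines that pass through it, and the votes accumulate in parameter space. A toy sketch (assuming NumPy; real implementations usually parameterize lines by angle and offset rather than slope and intercept, to avoid unbounded slopes):

```python
import numpy as np

# Toy "pixels": two noisy line segments plus random clutter (hypothetical data).
rng = np.random.default_rng(2)
t = np.linspace(0, 5, 40)
xs = np.concatenate([t, t, rng.uniform(0, 5, 30)])
ys = np.concatenate([1.0 * t + 0.5, -0.5 * t + 4.0, rng.uniform(-2, 6, 30)])
ys = ys + rng.normal(0, 0.05, ys.size)

# Accumulator over (a, b) parameter space.
a_grid = np.linspace(-2, 2, 200)
b_grid = np.linspace(-5, 8, 200)
acc = np.zeros((a_grid.size, b_grid.size), dtype=int)
for x, y in zip(xs, ys):
    b_vals = y - a_grid * x                      # the point's dual line b = y - a x
    j = np.searchsorted(b_grid, b_vals)          # crude nearest-bin lookup
    ok = (j > 0) & (j < b_grid.size)
    acc[np.arange(a_grid.size)[ok], j[ok]] += 1  # one vote per (a, b) cell

i, j = np.unravel_index(np.argmax(acc), acc.shape)
print("strongest line: a ≈", a_grid[i], ", b ≈", b_grid[j])
```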
I was really stuck on a practical. I had to make a graph of my readings; the book stated that I should get a straight line, but instead I got curves, which was really stressful. Thankfully I found your video.
It really helped❤
Thanks again
Great video! With those animations it would be wonderful to see an essay about Bayesian linear regression, since it is a quite different and powerful approach to a similar topic.
Great work! Thank you, and I'm looking forward to the linear regression and gradient descent videos you mentioned at the end of the video.
I love code and taught calc 3 a couple times, which is my favorite class, but I never learned about this topic in school (only heard its name a lot). That was really interesting.
Amazing! Thanks for showing us how to solve a math problem in a physics way. Even though this method is already used in today's AI, it is still very interesting to see it work outside AI. The conceptual journey you've taken reminds me of my attempts at machine proving (ATP), and it goes a long way toward eliminating the intimidation of numerical analysis. Thanks!
Great video! I absolutely love the visual and dynamical proofs in math.
I just wanted to add that there is a beautiful point-line duality between the two spaces:
While a dot in parameter space corresponds to a line in real space, a line in parameter space defines a family of lines in real space that all pass through the same point.
Moreover, if you map your datapoints to their corresponding dual lines, the center of mass of these lines will be a dual point to the best fit line of the data!
Hope you find this as cool as I do.
That’s really cool! I’ve read about that kind of thing in an intro to differential geometry book, but hadn’t connected the dots in the context of this video. Thanks for a very interesting comment :)
IIRC there *is* a way to leverage that outer product observation: If D is a matrix where each column is [xᵢ 1] and Y is another matrix where each row is [yᵢ] then the entire left Σ becomes DDᵀ and the entire right Σ becomes DY.
also (I think) this actually generalizes to linear equations with more terms by adding the data as more rows in D. And the data can also be functions of existing simpler terms (e.g. Nth powers of x to get polynomial fits, sin(nx)/cos(nx) to get discrete Fourier transforms, etc.).
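A quick sketch of that formulation (assuming NumPy; variable names are mine), including the polynomial generalization mentioned above:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 200)
y = 3.0 * x - 0.5 + rng.normal(0, 0.2, x.size)

# D has one column [x_i, 1] per data point, Y has one row [y_i] per data point.
D = np.vstack([x, np.ones_like(x)])         # shape (2, N)
Y = y[:, None]                              # shape (N, 1)

ab = np.linalg.solve(D @ D.T, D @ Y)        # (D Dᵀ) [a, b]ᵀ = D Y
print(ab.ravel())                           # [a, b]

# More basis rows in D give higher-order fits with the same two lines of algebra:
D3 = np.vstack([x**2, x, np.ones_like(x)])  # quadratic: rows x², x, 1
print(np.linalg.solve(D3 @ D3.T, D3 @ Y).ravel())
```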
Introducing the Jacobian could be a nice extension. The error contours around the best fit form an ellipse, which can make converging toward the best solution hard, as many of the gradient directions in the top half of your example are not pointed toward the best solution, simply toward that valley of best fit. Reshaping the gradients to make that ellipse a circle allows much quicker convergence.
Great idea! I’d love to do a video on that someday.
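A rough illustration of the reshaping idea from the comment above (my own sketch, assuming NumPy; not from the video): for least squares, the curvature of the error landscape is the Hessian 2XᵀX, and preconditioning the gradient with its inverse effectively turns the long, thin elliptical valley into a circle, so the descent reaches the minimum in a single step.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 100)              # x far from zero mean -> skewed elliptical valley
y = 1.5 * x + 2.0 + rng.normal(0, 1, x.size)
X = np.column_stack([x, np.ones_like(x)])

def grad(p):                              # gradient of E(a, b) = sum of squared residuals
    return 2 * X.T @ (X @ p - y)

H = 2 * X.T @ X                           # Hessian; its eigenvalue spread elongates the ellipse

p = np.zeros(2)
for _ in range(200):                      # plain gradient descent creeps along the valley
    p -= 1e-4 * grad(p)
print("plain gradient descent:", p)

p = np.zeros(2)
p -= np.linalg.solve(H, grad(p))          # preconditioned ("circularized") step, exact for a quadratic
print("preconditioned step:   ", p)
```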
Cool music Richard, it opens my mind and makes me understand things better! It's like combining hypnosis and a class;-) I wish my math teacher at school would have explained it to us in that way 🙂
This was amazing! So fun to watch and appreciate this concept.
Thanks, glad you enjoyed the video! :)
Yeah, this is a nice explanation. A neural network is just a more sophisticated version of line fitting with more parameters.
Such a great video! I had a lecture about this years ago in my engineering analysis class in undergrad, but I took such poor notes that I was never able to reproduce this function. Now as homework I'm going to take your process and solve for other functions like parabolas or cubics which will require me to use 3 and 4 dimensional parameter spaces. Thanks again for the great video!
That’s awesome, I love to hear that! Challenge for you: can you solve it for a general N-degree polynomial? Like with some kind of recursive algorithm. I actually don’t know if this is possible but it seems like a fun puzzle!
@@RichBehiel that would be a fun problem to solve! And even if it can't be solved, I'm sure proving or disproving the possibility of a solution would make a great paper!
This is beautifully explained and visualized! I'm glad to be on the first wagon for the ride of this video.
Thanks, I’m glad you liked the video! It’s one of my favorite mathematical concepts, so it’s great to see others enjoying it too :)
Why do your videos only get recommended to me at 1am, they send me straight down a rabbit hole 😂
Sorry 😂
Fantastic presentation!
12:16 "it should keep you up at night"
Very apropos considering it's almost 4:30 am right now and I've been watching your videos for hours 😅
very cool video, this connected some dots that I've been struggling to reconcile
electric potential actually helped me understand this omg
Beautifully explained, thank you. Liked and subscribed and looking forward to more.
I really like all you put into this video. It helps connect ideas in interesting ways. Thank you for including the Python code.
I've been waiting for this for too long 10:17
Satisfying video. Took me back to University.
Wow. Seen so many videos, read so many papers and books - but this one takes the cake. Would love to see you doing this but for more complex models with fixed effects and all sorts of other bells and whistles. Impressive!!
Like Sujal Gupta, I watched this video because I am studying machine learning. I have been studying simple linear regression for the past couple of weeks now! Just yesterday I started to think about how the Moore-Penrose pseudoinverse generalizes the idea of an inverse to situations where the matrix is not square. I call linear maps to a higher-dimensional space "embeddings" and linear maps to a lower-dimensional space "projections". For a square matrix, which is neither an embedding nor a projection but a linear operator in the same dimension, we can undo the linear mapping by finding the inverse X^-1. In the case of projections, there are many high-dimensional vectors that can be projected down to a given low-dimensional vector, so there is no unique inverse. However, we can solve the system Xb=y for b using the Moore-Penrose *pseudo* inverse: (X^T X)^-1 X^T. When we apply the Moore-Penrose pseudoinverse to the vector of response variables y, we project y onto the column space of X (the space spanned by the parameter directions), and reading off that projection in terms of the parameters gives our coefficients. That is the beauty of the Moore-Penrose pseudoinverse!
I code DNNs too. Um. I understood your words but not your point. Genuinely curious here.
So we can calculate the inv matrix. Take the reciprocal of the determinant and multiply it by the matrix with the diagonal swapped and the upper/lower negated. This spits out a new matrix with the property that if you multiply that by the original you get the identity (assuming linear independence).
Ok fine, all very useful. But what's that got to do with the price of fish?
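To connect the two comments above with something concrete, here is a minimal sketch (assuming NumPy; data invented). The point is that for a tall, non-square design matrix the pseudoinverse is exactly the least-squares fit, whichever way you compute it:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 30)
y = 4.0 * x + 0.3 + rng.normal(0, 0.1, x.size)

X = np.column_stack([x, np.ones_like(x)])      # 30 x 2: tall, so no ordinary inverse exists

b_formula = np.linalg.inv(X.T @ X) @ X.T @ y   # explicit (XᵀX)⁻¹Xᵀ y
b_pinv    = np.linalg.pinv(X) @ y              # library pseudoinverse (computed via SVD)
b_lstsq   = np.linalg.lstsq(X, y, rcond=None)[0]

print(b_formula, b_pinv, b_lstsq)              # all three give the same slope and intercept
```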
This was interesting! Thanks for sharing.
Great video! I never thought of parameter space with an 'Error Force'.
This is beautifully put together! What a great explanation!
Thanks! :)
Ok but now do it for non-linear regression 😅
Really enjoyed this, you're great at explaining stuff
wow. So well explained. Thank you
Great explanation, loving it so far. I'm majoring in applied math with a focus in numerical analysis, so this stuff is always fascinating haha. I noticed around 18:20, you started using delta instead of del. Thought it might be a typo but just wanted to check!
Yeah that’s a typo, sorry! 😅 Thanks for pointing that out.
Really beautiful class.
Just gotta remind myself this is why I must master linear algebra.
Mastering linear algebra is a great and enduring source of spiritual fulfillment 🙏
Lmao thank you for this, this video just came into my recommendations when I needed it most: I've been stressed these last few days just doing laboratory reports, where I have to use the regression line a lot 🛌 It made me hate it less
i just liked, subbed and commented :D i don't think i can be any more "violently complimentary" than that. this was excellent thanks!
Imagine you have a surface with a magnet. That's a game changer.
Understanding the concept of statistics doing physics is the correct way of UNDERSTANDING mathematics and PHYSICS. However physics has nothing to do with mathematics and mathematics has nothing to do with physics.
The magic of this is MODELING. Linear regression, the average, and the Gauss curve are concepts of fundamental use in statistical mechanics. Eventually higher mathematical physics will launch the student into the field of MODEL MAKING.
This video is a wonderful explainer! You've listed in the description that linear regression is "very useful in math, science, and engineering" to which I would like to add economics, which is what I am studying. This video and Jazon Jiao's work (th-cam.com/video/3g-e2aiRfbU/w-d-xo.html) are the best explanations of the concept that I have seen in video, lecture, or textbook form. I look forward to seeing what else you share on this channel!
It looks like you are trying to hypnotize your listener. 😂 Great explanation btw. Using physical arguments to explain a mathematical concept, I like that.
This video is genius. Subscribed.
This video is wonderful. How did you create the interactive visualization with the "Parameter Space" and "Real Space" subplots? I'd love to be able to create one on my own.
Thanks William! :) For this video I used Python, specifically matplotlib. You can use that by downloading Anaconda, which will install Python and some scientific modules, then call “from matplotlib import pyplot as plt”. After calling that line, you can use things like plt.figure() and plt.plot() to make a figure and plot things. In this case the parameter space and real space are two subplots in a figure. They’re refreshing at 60 frames per second in a loop which sets the dot’s position in the parameter space while making the line in the real space, based on the current a and b values. To turn on the error landscape, I also added some code to evaluate the error metric (objective function) at all points in the parameter space for each a and b. Then for the error force I calculated and plotted the negative gradient of that. For the part where the dot descends down the gradient, I used F = ma - kv with mass parameter m and friction-ish parameter k to make the dot roll down the hill and then stop at the optimal point.
I’ll be more careful in future videos to post the source code of the animations too. Well, at least for videos after the one I’m going to post this week; for that one, and the previous videos, I was very sloppy with the code and it wouldn’t be too helpful to see them. But there have been a few comments now about how these animations were made, so I figure the best answer is the code itself. In the future I’ll be better about writing cleaner animation codes and sharing them.
@@RichBehiel wow, you're awesome for such an in-depth reply to this. Thank you, I might try this on my own
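For anyone who wants to experiment before the source code is posted, here is a minimal sketch in the same spirit as the reply above (matplotlib and NumPy assumed; this is not the author's actual code, and the constants are hand-tuned): two subplots, a dot in parameter space pushed by the negative gradient with some damping, and the corresponding line redrawn in real space each frame.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 30)
y = 1.2 * x + 2.0 + rng.normal(0, 1.5, x.size)

def grad(a, b):                            # gradient of E(a, b) = sum of (a x + b - y)^2
    r = a * x + b - y
    return np.array([2 * np.sum(r * x), 2 * np.sum(r)])

fig, (ax_p, ax_r) = plt.subplots(1, 2, figsize=(10, 4))
ax_p.set(title="Parameter Space", xlabel="a", ylabel="b", xlim=(-1, 3), ylim=(-4, 8))
ax_r.set(title="Real Space", xlabel="x", ylabel="y")
ax_r.scatter(x, y, s=10)
dot, = ax_p.plot([], [], "o")
line, = ax_r.plot([], [])

state = {"p": np.array([-0.5, 6.0]), "v": np.zeros(2)}  # start on the hillside, at rest
m, k, dt = 1.0, 8.0, 1e-3                  # mass, friction, time step (hand-tuned)

def update(_):
    F = -grad(*state["p"]) - k * state["v"]   # "error force" plus damping
    state["v"] += (F / m) * dt
    state["p"] += state["v"] * dt
    a, b = state["p"]
    dot.set_data([a], [b])
    line.set_data(x, a * x + b)
    return dot, line

ani = FuncAnimation(fig, update, frames=600, interval=16, blit=True)
plt.show()
```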
Great video! 30 minutes felt like 5 :) Thanks!!!
Thanks, glad you enjoyed the video! :)
❤️❤️❤️This is Gold ❤️❤️❤️ Thank you
Great video! One question I still have by the end, is why did we square the residuals? How do we know that’s the best way to represent how well our line fits (compared to say taking the raw value or the square root or something else)?
Just adding the residuals is a bad idea because then positive and negative errors will cancel out, which obviously isn’t what we want. Basically we want a method that treats positive and negative errors the same (because, for linear regression they’re equally bad).
The absolute value of the residuals does work and gives you the Least Absolute Deviations or LAD method. However the absolute value isn’t differentiable at zero, making it quite tricky to work with. LAD also often has multiple solutions.
This leaves squaring the residuals (LS) as the next “obvious” choice, being a simple differentiable function that treats positive and negative errors equally.
There are situations where LS isn’t a good idea, for example if your data has significant outliers (we say that LS is not robust, whereas LAD is). There are also many other methods for linear regression, but a lot of those are significantly more complicated than LS or LAD.
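A small illustration of the robustness point (assuming NumPy and SciPy; data invented): least squares has a closed form, while least absolute deviations is handled here with a derivative-free minimizer, precisely because of the non-differentiability mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)
y[-1] += 40.0                               # one large outlier

# Least squares: closed form via the normal equations.
X = np.column_stack([x, np.ones_like(x)])
a_ls, b_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Least absolute deviations: no closed form, so minimize numerically.
lad = lambda p: np.sum(np.abs(p[0] * x + p[1] - y))
a_lad, b_lad = minimize(lad, x0=[a_ls, b_ls], method="Nelder-Mead").x

print("LS :", a_ls, b_ls)    # pulled noticeably toward the outlier
print("LAD:", a_lad, b_lad)  # stays close to slope 2, intercept 1
```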
Very good explanation 👍🏻👍🏻
Hello, Richard! Could you explain what you meant by error metric? Thanks
Nice video on OLS. I've often wondered, though, why lessons on regression focus on OLS rather than Deming regression; OLS seems objectively inferior, so basing so many projections on the inferior model is shooting our research methods in the foot from the start.
Good point. Frankly I think it’s because OLS is easier, and gets the job done in most situations. But I agree that there are times when Deming regression is better. Although someone who uses Deming would presumably have learned OLS first. OLS is also conceptually ideal for explaining how calculus can be used to minimize fit error, so it’s a good go-to image to have in mind when solving fancier optimization problems.
@@RichBehiel I completely understand. In fact, this subject is making me think about applied mathematics, because if we go deeper, it's not like linear regression in any form is the best way to actually model most data. So I'm thinking about dividing a function into splines to create a good fit. You can go too far and smoothly fit every point into a function, but then your function is skewed toward the data set, losing the ability to make good projections. It's an interesting puzzle (and I hated applied mathematics in college).
Well, there are quite a few advantages of OLS compared to total least squares fit.
For one, in any measurement where x is tightly controlled and y is the thing you want to learn about, OLS is the right tool. Because there are no (or only negligible) errors in x, the horizontal distance of data points to the prediction, dx, doesn't matter and must not be included in the fit.
And also it works much better with arbitrary functions than total least squares. For an arbitrary function I don't think there even is _any_ way to calculate the total least squares error. Only well behaved functions work, and even then you have to define the derivative to perform a total least squares fit.
😂 I really love this and wish my high school students would understand it so I could share it with them.
I love this video. I would have loved this back when I started learning optimization. 😍
Actually your code can be much faster... You should be using numpy to do all the sums, via np.sum. I have added a comment to the pastebin.
Also the "for loop" used when adding the "random errors" should also be done in numpy.
On my pc the "Calculate Best Fit" went from 0.450 [s] to 0.012 [s] for 1 million data points.
Plotting is still slow though. (You could probably optimize that too), so expect the full python code to take about 1-2 seconds with 1 million data points. Also the plot does not make much sense for a million data points.
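I don't have access to the pastebin, but the kind of vectorization described above presumably looks something like this sketch (NumPy assumed; names are mine). The four sums are all that the closed-form slope and intercept need, and there is no per-point Python loop anywhere:

```python
import numpy as np

N = 1_000_000
rng = np.random.default_rng(8)
x = rng.uniform(0, 10, N)
y = 3.0 * x + 1.0 + rng.normal(0, 2.0, N)    # vectorized "random errors", no for loop

Sx, Sy, Sxx, Sxy = x.sum(), y.sum(), np.sum(x * x), np.sum(x * y)

a = (N * Sxy - Sx * Sy) / (N * Sxx - Sx**2)  # closed-form least-squares slope
b = (Sy - a * Sx) / N                        # intercept from b = ȳ - a x̄
print(a, b)
```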
That was on the level of 3 blue 1 brown videos
Thanks! :) Grant is a role model for sure. The aesthetics of his videos are much better than mine though 😅 But I’ll get better over time.
I really loved how you put this video together! What did you use to animate and edit everything? It was really clean!
Thanks! :) I used matplotlib in Python.
Instead of calculating Dy, it might be better to calculate the distance a point is from the line (especially for smaller data sets, where Dy could be large but in fact the line could be very close).
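For reference, a small sketch of the two error measures being contrasted here (NumPy assumed): the vertical residual used in the video versus the perpendicular point-to-line distance the comment suggests.

```python
import numpy as np

def vertical_residuals(a, b, x, y):
    # The Δy used in the video: vertical offset from each point to the line y = a x + b.
    return y - (a * x + b)

def perpendicular_distances(a, b, x, y):
    # Distance from (x_i, y_i) to the line a x - y + b = 0.
    return np.abs(a * x - y + b) / np.sqrt(a**2 + 1)
```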
holy shit
I have genuinely never even come close to thinking about it like this
top marks, no notes
And you can fit to other curves with simple transforms of one or both axes, like log or exp.
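For example, fitting y = A·e^(kx) becomes ordinary linear regression after taking the log of y. A minimal sketch (NumPy assumed; data invented):

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 5, 60)
y = 2.5 * np.exp(0.8 * x) * np.exp(rng.normal(0, 0.05, x.size))  # multiplicative noise

# ln(y) = ln(A) + k x  ->  straight-line fit on (x, ln y)
k, lnA = np.polyfit(x, np.log(y), 1)
print("k ≈", k, "  A ≈", np.exp(lnA))
```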
That’s awesome! But what happens if we let N approach infinity where the data points are in a finite domain?
Do you have the blue dot following a Lissajous curve?
I forget what I did for that, I think I just had some sines and cosines of different frequency in x and y.
There is always something that bothers me when linear regression is approached that way: from the start you consider that x and y are of a different nature, i.e. the value of x is known perfectly and the error is on y. This is a pretty strong constraint. I am a metrology engineer and I saw in the comments that you are a metrology engineer too, so you are well aware that in the real world there are errors on both x and y. In which case the error could be the distance from the data point to the line, for example.
That’s true! And there are ways of doing regression with ds rather than dy. Although often x is more precise than y, for example if you have a sensor array or are sampling data at a fast and precise rate relative to the change in your signal.
For example, if we’re looking at a trend in some signal that drifts linearly over an hour, and sampling one datapoint per second, with error on the order of microseconds, then x is very precise in that context.
But you’re right that there are some cases where x and y might be similarly varying.
@@RichBehiel my world is more the relation between 2 voltages at different location in an analog network so the noise on both are of the same nature.
@@flexeos I think you are missing the big picture. In most of these data sets (in practice), x(i) is the data set corresponding to the independent variable, the one which you can actually control for much more easily, and y(i) is the data set corresponding to the dependent variable, and you want to understand y as a function of x, not the other way around, because the other way around is (in every scenario I have seen physicists, engineers, and any other applied S.T.E.M. worker deal with) very impractical and not useful. Now, are there circumstances which are more complicated? Of course there are, but they are the exception, and in those circumstances, the complexities involved are of such a nature that dealing with residuals, as the video does, is not the practical approach anyway.
@@angelmendez-rivera351 That is not my experience in practice. Let's say you want to measure a resistor. You inject a current I that you "control", usually using a digital-to-analog converter, you measure the voltage V across the resistor, and V/I is your resistance. Because the world is not perfect, if you want a better result you do the measurement with a bunch of Is, and the resistance is now the slope of the best line through the cloud of points (V, I). To have a better idea of the exact value of I, even though you set it digitally, you have to measure its actual value, since the translation between the digital value and the actual current is anything but perfect. So in practice you have a cloud of points (V, I) with the same kind of error (noise, offset, non-linearity...) on both V and I. If you assume that I is an independent variable you will end up with a bias. There was a math paper on that bias effect almost 100 years ago that I read, but I cannot find the reference just right now. If an electronic example seems too specific, take a typical example given to students, like annual income vs age in years. Age looks like an independent variable, but in reality by definition there is a 1-year uncertainty on it, which is not great, as the relative error bar is not even constant. Of course in such an example the required precision is not a big problem, so you can forget about those subtleties. But in metrology you are tracking a few parts per million. Not taking that into account would be like trying to design GPS without taking general relativistic effects into account (accuracy on location becomes > 10 km). My 2 cents.
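For readers following this thread, here is a sketch of the simplest errors-in-both-variables fit, the equal-variance Deming (orthogonal / total least squares) case, done by taking the principal direction of the centered point cloud (NumPy assumed; data invented, not from the video):

```python
import numpy as np

rng = np.random.default_rng(10)
t = np.linspace(0, 1, 200)
x = t + rng.normal(0, 0.03, t.size)                # noise on x ...
y = 2.0 * t + 0.5 + rng.normal(0, 0.03, t.size)    # ... and comparable noise on y

# Orthogonal fit: direction of largest spread of the centered cloud.
P = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(P, full_matrices=False)
dx, dy = Vt[0]
a = dy / dx
b = y.mean() - a * x.mean()

print("orthogonal fit:", a, b)
print("OLS fit:       ", np.polyfit(x, y, 1))  # slightly attenuated slope, as the thread discusses
```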
The beauty of: Linear Regression
Do I understand correctly that the "valley" in the error landscape is the set of all lines that pass through the point (x-bar, y-bar)?
Great question, and I’m actually not sure. Anyone know the answer?
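For what it's worth, here is a quick check (my own derivation, not from the video) suggesting the answer is yes. Minimizing the error over b for each fixed a puts you on the floor of the valley:

```latex
E(a,b) = \sum_i (a x_i + b - y_i)^2, \qquad
\frac{\partial E}{\partial b} = 2\sum_i (a x_i + b - y_i) = 0
\;\Longrightarrow\; b = \bar{y} - a\,\bar{x}.
```

So every point on the valley floor satisfies ȳ = a x̄ + b, i.e. it corresponds to a line through (x̄, ȳ), and conversely the line through (x̄, ȳ) with slope a is the valley-floor point at that a.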
Beautiful and surprised I never knew some of what you explained. I wanna add something irrelevant : you are so handsome!
Very nice video and a quite interesting and actually useful topic
I'd just like to say that the line wiggling around for most of the video was (to me) irritating, great work nonetheless
Thanks for your feedback! :) That’s not something I noticed, but now that I see it I totally get where you’re coming from. I’ll try to avoid having large repetitive motions in future videos.
Very cool animations
What about outliers? Naive least-squares method fails when dataset has even one "quite big" outlier...
Usually in that case, one would filter for outliers, or sometimes use absolute values for residuals instead of their squares, although in that case the regression algorithm is usually slower, which may or may not matter depending on the data rate.
@@RichBehiel Thanks for your response.
> filter for outliers
I'd like to see a video on this theme (concepts/methods). ;-)
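One simple version of "filter for outliers" (a sketch under my own assumptions, not necessarily the author's method; NumPy assumed): fit once, drop points whose residuals are larger than a few robust standard deviations, then refit.

```python
import numpy as np

def fit_line(x, y):
    return np.polyfit(x, y, 1)                    # [slope, intercept]

def fit_line_trimmed(x, y, n_sigma=3.0):
    a, b = fit_line(x, y)
    r = y - (a * x + b)
    mad = np.median(np.abs(r - np.median(r)))     # robust spread estimate
    keep = np.abs(r) < n_sigma * 1.4826 * mad     # 1.4826 * MAD ≈ sigma for Gaussian noise
    return fit_line(x[keep], y[keep])

rng = np.random.default_rng(11)
x = np.linspace(0, 10, 50)
y = 1.0 * x + 2.0 + rng.normal(0, 0.3, x.size)
y[10] += 25.0                                     # one wild point
print(fit_line(x, y))                             # skewed by the outlier
print(fit_line_trimmed(x, y))                     # close to slope 1, intercept 2
```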
Oh my god, you’re the tungsten guy! Tell me, great and mighty man of tungsten, how do you hold a tungsten cube in your mouth?
First, you must become accustomed to the intensity of its density. Then, carefully place the cube into your mouth. You will immediately develop a gigachad jawline as a result of resisting the immense gravitation.
(Don’t actually put a tungsten cube in your mouth. I just put that into the review for comedic effect 😅)
How did you code these interactive plots? Thanks
Could you make similar video about parabolic graphs?
I’d like to someday! The procedure is very similar, but ax^2 + bx + c instead of ax + b. It’s a 3D parameter space, but the same techniques work.
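A sketch of that 3-parameter case (NumPy assumed; data invented): the same "set the gradient to zero" recipe, just with a 3-column design matrix, cross-checked against np.polyfit.

```python
import numpy as np

rng = np.random.default_rng(12)
x = np.linspace(-3, 3, 80)
y = 1.5 * x**2 - 2.0 * x + 0.7 + rng.normal(0, 0.4, x.size)

# Design matrix for y ≈ a x² + b x + c: a 3-D parameter space (a, b, c).
X = np.column_stack([x**2, x, np.ones_like(x)])
abc = np.linalg.solve(X.T @ X, X.T @ y)

print(abc)                  # [a, b, c]
print(np.polyfit(x, y, 2))  # library cross-check, same ordering
```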
Is this method, or something similar applicable to non linear least squares? I did a project over Christmas using non linear least squares regression and this would’ve been super helpful 😅
The same concept of minimizing a least squares objective function by setting the gradient to zero applies to nonlinear least squares, but there are also extra steps involved.
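As one example of those extra steps (SciPy assumed; the model below is mine, not from the video): scipy.optimize.curve_fit still minimizes a sum of squared residuals, but it has to iterate on a local linearization rather than solving the normal equations once.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, k, c):                        # example nonlinear model
    return A * np.exp(-k * x) + c

rng = np.random.default_rng(13)
x = np.linspace(0, 5, 100)
y = model(x, 3.0, 1.2, 0.5) + rng.normal(0, 0.05, x.size)

popt, pcov = curve_fit(model, x, y, p0=[1.0, 1.0, 0.0])  # Levenberg-Marquardt by default
print(popt)                                   # ≈ [3.0, 1.2, 0.5]
```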
Is it possible to get a „second best“ valley? A pseudo best solution?
Not for linear regression, but for fits with more parameters yes. Gradient descent can sometimes get stuck in a local minimum, a valley other than the best one. If there’s an analytic solution, it might involve the roots of a polynomial or something, so you can have multiple values which are locally optimal. In that situation, the height of the objective function at each optimum can be quickly compared, since the list should be pretty short.
Wonderful presentation. But I have a doubt about the definition of the "Error" function: wouldn't it be nicer to define it as a mean square? Although the solution won't change, it still feels more satisfying 😌
That’s a good idea. I usually do that when reporting the number. Divide by number of data points and take square root. Sometimes it’s nice to turn that into a percentage too, but then you have to weight by the value at each point.
@@RichBehiel thanks 👍
Since you are not actually computing the error function, but merely minimizing it, adding unnecessary components like dividing by N and taking the square root is just a waste of energy for anyone trying to do the derivation.
man that's really beautiful
sub, like and comment for your effort, even if you don't make much on yt you are a great mathematician! And I am sure you will make it in life and be a help to humanity as a whole. thank you
Thanks for the kind comment! :)
Perfect video
In which programming language did you write this code?
Python, using matplotlib for the animations.
@@RichBehiel thanks
I would love it if you could show why the pseudoinverse recovers this method!
Thank you 🙏 ❤❤❤
Is the outer product matrix in the final formula always non-singular, so that it always has an inverse?
I believe so, but I’m not 100% sure actually. As a good exercise in math, you can explore if it might be noninvertible under some conditions, just set the determinant to zero and see what a dataset would have to be like in order for that to happen.
I’ve done millions, maybe billions, of linear regressions (on data streams) and have never run into this problem though.
@@RichBehiel Doing just the quickest amount of working out with a dataset of 3 values, I think the sum of outer products would only be singular if all the x values are the same, which obviously isn't going to happen. It's fairly easy to show that if we have a dataset like that, the matrix is singular (the first row of the matrix is just the second row multiplied by the common x value), though I'm not sure how you'd prove it the other way around (i.e. that the matrix is non-singular in all other cases).
That makes sense! Btw, these equations are equivalent to a force and torque balance, if the residuals are imagined as elastic springs, so physically it makes sense that it would only be singular if the x values are all the same, or something like that.
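To make the condition explicit (my derivation, not from the thread): the matrix in the final formula is

```latex
\sum_i \begin{pmatrix} x_i^2 & x_i \\ x_i & 1 \end{pmatrix}
= \begin{pmatrix} \sum_i x_i^2 & \sum_i x_i \\ \sum_i x_i & N \end{pmatrix},
\qquad
\det = N\sum_i x_i^2 - \Big(\sum_i x_i\Big)^2 = N\sum_i (x_i - \bar{x})^2 \;\ge\; 0,
```

with equality if and only if every x_i equals x̄. So the matrix is singular exactly when all the x values coincide, which confirms the observation above in both directions.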