Professor Strang ,thank you for an awesome lecture on Distance Matrices , structure of Neural Nets and the Learning Function. All these mathematical concepts improves my understanding of Machine Learning.
For BPP, deep learning and for the general structure of neural networks the following comments may be useful. To begin with, note that instead of partial derivatives one can work with derivatives as the linear transformations they really are. It is also possible to look at the networks in a more structured manner. The basic ideas of BPP can then be applied in much more general cases. Several steps are involved. 1.- More general processing units. Any continuously differentiable function of inputs and weights will do; these inputs and weights can belong, beyond Euclidean spaces, to any Hilbert space. Derivatives are linear transformations and the derivative of a neural processing unit is the direct sum of its partial derivatives with respect to the inputs and with respect to the weights; this is a linear transformation expressed as the sum of its restrictions to a pair of complementary subspaces. 2.- More general layers (any number of units). Single unit layers can create a bottleneck that renders the whole network useless. Putting together several units in a unique layer is equivalent to taking their product (as functions, in the sense of set theory). The layers are functions of the of inputs and of the weights of the totality of the units. The derivative of a layer is then the product of the derivatives of the units; this is a product of linear transformations. 3.- Networks with any number of layers. A network is the composition (as functions, and in the set theoretical sense) of its layers. By the chain rule the derivative of the network is the composition of the derivatives of the layers; this is a composition of linear transformations. 4.- Quadratic error of a function. ... --- Since this comment is becoming too long I will stop here. The point is that a very general viewpoint clarifies many aspects of BPP. If you are interested in the full story and have some familiarity with Hilbert spaces please google for papers dealing with backpropagation in Hilbert spaces. A related article with matrix formulas for backpropagation on semilinear networks is also available. For a glimpse into a completely new deep learning algorithm which is orders of magnitude more efficient, controllable and faster than BPP search in this platform for a video about deep learning without backpropagation; in its description there are links to a demo software. The new algorithm is based on the following very general and powerful result (google it): Polyhedrons and perceptrons are functionally equivalent. For the elementary conceptual basis of NNs see the article Neural Network Formalism. Daniel Crespin
Hats off for professor Gil Strang and MIT for this amazing classes. One question: I may missing something, the last part, where he says that you can obtain the matrix G from D with this formula, I think that is not correct. You don't have the squared norms of the vectors and professor Strang assumes that you have it on the diagonal of D but diagonal of D is all zeros, am I right? Or am I misunderstanding anything? Again thank you very much!
You are right, the diagonal enteries d_{i,i} are all zeros. He explains it a little complicated (also in his book!). I think the d_i and d_j are here not supposed as the diagonal enteries of D, but they produce the diagonal enteries of G in the way of formula at 42:53. I hope I gave your answer :)
The detail is given in the page 260 of the reference book. To make it breif, placing the first point x1 at the origin, namely x1 = zero vector, makes xi - x1 = xi. Therefore, the first column of the distance matrix D is the x1^2, x2^2 ..., xn^2, which is the diagonal of XTX and also the vector d. I recomend you to look up in the original book to fully get the main idea.
I don't think I understood what is the D matrix in the distance problem: I tried getting a similar term, but I'm not sure it's the same or correct: consider dᵢⱼ = |xᵢ - xⱼ | ² = xᵢ² + xⱼ² + 2xᵢxⱼ, we want to find an expression for X^T X which is xᵢxⱼ We can do rigid translation, so we can limit ourselves to xs which are centered, i.e. the average of the position is 0 i.e. =0. Now, if we average dᵢⱼ on one of the indices, let's pick j, we get ⱼ= xᵢ² + + 2 xᵢ = xᵢ² + + 0 We can denote the second moment =σ², so ⱼ = xᵢ² + σ². We can average dᵢⱼ again, this time over the i index, and we get < ⱼ >ᵢ = < xᵢ² + σ²>ᵢ = 2σ² We can use this to rewrite the xᵢ² terms using averages over dᵢⱼ xᵢ² + xⱼ² = ᵢ + ⱼ - ᵢⱼ = xᵢ² + σ² + xⱼ² + σ² - 2σ² And get 2xᵢxⱼ = ᵢ + ⱼ - ᵢⱼ - dᵢⱼ I think these averages over i and j correspond to some of the weird column tricks [11111][D], but I'm not sure.
He is certainly right that even the best would fare well to add anything. To have the pleasure of his lectures is more than gold.
Just once to see one of his lectures and not be amazed. He is simply awesome!
Again Prof. Gilbert Strang! Thank you very much!
I can’t say I will have to review and review from beginning to end many times. You are most clear in explanations.
I like your intelligent advanced lectures.they are very challenging..strang is the smartest.thank you.
Professor Strang ,thank you for an awesome lecture on Distance Matrices , structure of Neural Nets and the Learning Function. All these mathematical concepts improves my understanding of Machine Learning.
Thank you Prof Strang!
Very thanks MIT for sharing such a knowledge
For BPP, deep learning and for the general structure of neural networks the following comments may be useful.
To begin with, note that instead of partial derivatives one can work with derivatives as the linear transformations they really are.
It is also possible to look at the networks in a more structured manner. The basic ideas of BPP can then be applied in much more general cases. Several steps are involved.
1.- More general processing units.
Any continuously differentiable function of inputs and weights will do; these inputs and weights can belong, beyond Euclidean spaces, to any Hilbert space. Derivatives are linear transformations and the derivative of a neural processing unit is the direct sum of its partial derivatives with respect to the inputs and with respect to the weights; this is a linear transformation expressed as the sum of its restrictions to a pair of complementary subspaces.
2.- More general layers (any number of units).
Single unit layers can create a bottleneck that renders the whole network useless. Putting together several units in a unique layer is equivalent to taking their product (as functions, in the sense of set theory). The layers are functions of the of inputs and of the weights of the totality of the units. The derivative of a layer is then the product of the derivatives of the units; this is a product of linear transformations.
3.- Networks with any number of layers.
A network is the composition (as functions, and in the set theoretical sense) of its layers. By the chain rule the derivative of the network is the composition of the derivatives of the layers; this is a composition of linear transformations.
4.- Quadratic error of a function.
...
---
Since this comment is becoming too long I will stop here. The point is that a very general viewpoint clarifies many aspects of BPP.
If you are interested in the full story and have some familiarity with Hilbert spaces please google for papers dealing with backpropagation in Hilbert spaces. A related article with matrix formulas for backpropagation on semilinear networks is also available.
For a glimpse into a completely new deep learning algorithm which is orders of magnitude more efficient, controllable and faster than BPP search in this platform for a video about deep learning without backpropagation; in its description there are links to a demo software.
The new algorithm is based on the following very general and powerful result (google it): Polyhedrons and perceptrons are functionally equivalent.
For the elementary conceptual basis of NNs see the article Neural Network Formalism.
Daniel Crespin
I’m very enjoy your course.
Thank you Gilbert!
Hats off for professor Gil Strang and MIT for this amazing classes.
One question: I may missing something, the last part, where he says that you can obtain the matrix G from D with this formula, I think that is not correct. You don't have the squared norms of the vectors and professor Strang assumes that you have it on the diagonal of D but diagonal of D is all zeros, am I right? Or am I misunderstanding anything?
Again thank you very much!
You are right, the diagonal enteries d_{i,i} are all zeros. He explains it a little complicated (also in his book!). I think the d_i and d_j are here not supposed as the diagonal enteries of D, but they produce the diagonal enteries of G in the way of formula at 42:53. I hope I gave your answer :)
Can some one please explain to me @ 34:10 is this some sort of identity ? can't understand what does it mean actually
Such a nice prof!
He sounds like the film star jimmy Stewart. It may be the accent
@47:29 But where do we get the d vector in the formula for the G = X^T * X matrix?
vector d consists of xi*xi for 1
You can take any d vector that keeps X.T*X pos. semidef.
Good question.
The detail is given in the page 260 of the reference book. To make it breif, placing the first point x1 at the origin, namely x1 = zero vector, makes xi - x1 = xi. Therefore, the first column of the distance matrix D is the x1^2, x2^2 ..., xn^2, which is the diagonal of XTX and also the vector d.
I recomend you to look up in the original book to fully get the main idea.
Love him. ❤
6:03 A_1, b_1 -> A_l, b_l
A_1, b_1 -> A_k-1, b_k-1
was he always old? :P
Old is gold
Your remark is idiotic.
Young people may be very old. Old people may be very young.
Legends speak of a time which he was but a mere child... But we here do not condone such nonsense
I don't think I understood what is the D matrix in the distance problem: I tried getting a similar term, but I'm not sure it's the same or correct:
consider dᵢⱼ = |xᵢ - xⱼ | ² = xᵢ² + xⱼ² + 2xᵢxⱼ, we want to find an expression for X^T X which is xᵢxⱼ
We can do rigid translation, so we can limit ourselves to xs which are centered, i.e. the average of the position is 0 i.e. =0.
Now, if we average dᵢⱼ on one of the indices, let's pick j, we get
ⱼ= xᵢ² + + 2 xᵢ = xᵢ² + + 0
We can denote the second moment =σ², so ⱼ = xᵢ² + σ².
We can average dᵢⱼ again, this time over the i index, and we get
< ⱼ >ᵢ = < xᵢ² + σ²>ᵢ = 2σ²
We can use this to rewrite the xᵢ² terms using averages over dᵢⱼ
xᵢ² + xⱼ² = ᵢ + ⱼ - ᵢⱼ = xᵢ² + σ² + xⱼ² + σ² - 2σ²
And get
2xᵢxⱼ = ᵢ + ⱼ - ᵢⱼ - dᵢⱼ
I think these averages over i and j correspond to some of the weird column tricks [11111][D], but I'm not sure.