I personally see a similarity between physics and deep learning in the way the world is made of encapsulated layers of realities. For example, as shown in thermodynamics, the macroscopic layer doesn't need to know the position and velocity of every particle. It only needs to know certain computed features like temperature and pressure.
That's an interesting thought!
34:00 more advanced than my kindergarten :D
41:30 Opossum? Laughed out loud that this was the word that was used as an example, and even more amazing that someone from the crowd got it!
the fact that his first neural network diagram is upside down is poetic ... classic Physics space cadet
41:45 fractal nature of English!... Goddd that blows my head..!!!
I think it's a fascinating summary of the tie between the power of neural networks / deep learning and the peculiar physics of our universe. The mystery of why they work so well may be resolved by seeing the resonant homology across the information-accumulating substrate of our universe, from the base simplicity of our physics to the constrained nature of the evolved and grown artifacts all around us. The data in our natural world is the product of a hierarchy of iterative algorithms, and the computational simplification embedded within a deep learning network is also a hierarchy of iteration. Since neural networks are symbolic abstractions of how the human cortex works, perhaps it should not be a surprise that the brain has evolved structures that are computationally tuned to tease apart the complexity of our world.
When he says "efficient deep networks cannot be accurately approximated by shallow ones without efficiency loss," it reminds me of something I wrote in 2006: "Stephen Wolfram’s theory of computational equivalence suggests that simple, formulaic shortcuts for understanding evolution (and neural networks) may never be discovered. We can only run the iterative algorithm forward to see the results, and the various computational steps cannot be skipped. Thus, if we evolve a complex system, it is a black box defined by its interfaces. We cannot easily apply our design intuition to the improvement of its inner workings. We can’t even partition its subsystems without a serious effort at reverse-engineering." - from www.technologyreview.com/s/406033/technology-design-or-evolution/
LOL
Engineer Of Wonders why the laugh?
That is a cool reflection on the sublayers beneath physics and evolution. I still can't see the bigger picture, but it seems there are a lot of good fundamental questions about reality in the area of physics and deep learning.
I am having difficulty understanding step 11 of the paper in which Max goes from the Taylor series expansion form of the activation function to the multiplication approximator. Does anyone know of a more detailed explanation of this?
I think the rationale is quite clear, which is just to provide a concrete example that continuous multiplication can be learned, so I assume that you're asking about the calculation instead.
- You can express each sigmoid term in the multiplication operation as its Taylor series
- Once you've done that, you'll find that the σ_0 and σ_1 terms disappear (that is, their coefficients cancel).
- You'll just be left with the σ_2 term (and error term, of course), with a coefficient of 4uv. Dividing this by the denominator gives uv.
I'm not sure why the error term O(u^2 + v^2) exists, because the lower-order coefficients cancel. I suspect that it's because the term does not vanish when there is a non-zero bias (Taylor series about another point). I'd much appreciate it if anyone could clear this up.
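One way to see where the error term comes from, as a sketch rather than the paper's own derivation: write σ(x) = σ_0 + σ_1 x + σ_2 x^2/2 + σ_3 x^3/6 + σ_4 x^4/24 + ... and keep one more order. The odd terms cancel in σ(x) + σ(-x), the second-order terms leave 4 σ_2 uv, and the fourth-order terms leave (2 σ_4 / 3) uv (u^2 + v^2). After dividing by 4 σ_2 the result is uv [1 + (σ_4 / (6 σ_2)) (u^2 + v^2) + ...], so the relative error is O(u^2 + v^2) and comes from the fourth-order terms, which do not cancel. The Python check below uses softplus as the activation purely as an assumption for illustration (the logistic sigmoid has σ_2 = 0 at the origin, so it would need a non-zero bias); the function names are hypothetical.

```python
import math


def softplus(x):
    """Smooth activation log(1 + e^x); an assumed stand-in for sigma."""
    return math.log1p(math.exp(x))


# Second derivative of softplus at 0: s(0) * (1 - s(0)) = 1/4,
# where s is the logistic sigmoid.
SIGMA_2 = 0.25


def approx_multiply(u, v, sigma=softplus, sigma_2=SIGMA_2):
    """Four-neuron multiplication approximator.

    Combines sigma(u+v), sigma(-u-v), sigma(u-v), sigma(-u+v) so that the
    constant and odd Taylor terms cancel and the second-order term
    contributes 4 * sigma_2 * u * v.
    """
    numerator = sigma(u + v) + sigma(-u - v) - sigma(u - v) - sigma(-u + v)
    return numerator / (4.0 * sigma_2)


if __name__ == "__main__":
    for u, v in [(0.1, 0.2), (0.01, 0.02)]:
        approx = approx_multiply(u, v)
        rel_err = approx / (u * v) - 1.0
        # The relative error should shrink roughly like u^2 + v^2.
        print(f"u={u}, v={v}, approx={approx:.8f}, relative error={rel_err:.2e}")
```

For softplus, σ_4 = -1/8, so the predicted leading relative error is -(u^2 + v^2)/12, which shrinks by a factor of about 100 when u and v are both scaled down by 10.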
Sat through the whole lecture although most of it went above my head!!
The interleaving of linear evolution and non-linear functions is also how quantum mechanics works:
1. The propagation step is perfectly linear, conservative, unitary, non-local and time-reversible. It is a continuous wave with complex amplitude specified by the Schrödinger equation. There is no loss of information. There are no localized particles in this step. There is no space in this step.
2. The interaction step is discrete, non-linear, local and time-irreversible. It is a selection/generation/collapse of alternatives based on the Born Rule. There is a loss of information, as complex values are added and amplitudes squared to give non-negative real probabilities. The result is an interaction, the creation of space-time intervals from the previous interactions, identification of localized entities which might be called particles, and some outgoing waves that are correlated (entangled). Go to 1.
Einstein complained that the non-locality of QM was "Spooky action at a distance", but in the Quantum Gravity upgrade, space is only created by interaction, so it becomes "Spooky distance at an action".
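A toy illustration of the two-step loop described above, assuming a single two-level system (a big simplification: nothing here models space-time or quantum gravity). The names `propagate`, `born_probabilities` and `measure` are hypothetical, and the rotation used for the linear step is just one convenient unitary, not anything specific to the talk.

```python
import math
import random


def propagate(state, theta):
    """Linear, unitary step on a two-level state (a, b).

    Applies the matrix [[cos, -i sin], [-i sin, cos]], which is unitary,
    so the total probability is preserved and the step is reversible.
    """
    a, b = state
    c, s = math.cos(theta), math.sin(theta)
    return (c * a - 1j * s * b, -1j * s * a + c * b)


def born_probabilities(state):
    """Born rule: probability of each outcome is |amplitude|^2."""
    return [abs(amp) ** 2 for amp in state]


def measure(state, rng):
    """Non-linear, irreversible step: sample an outcome and collapse."""
    p0, _ = born_probabilities(state)
    outcome = 0 if rng.random() < p0 else 1
    collapsed = (1 + 0j, 0j) if outcome == 0 else (0j, 1 + 0j)
    return outcome, collapsed


if __name__ == "__main__":
    rng = random.Random(0)
    state = (1 + 0j, 0j)
    for step in range(3):
        state = propagate(state, 0.3)      # step 1: linear evolution
        probs = born_probabilities(state)
        outcome, state = measure(state, rng)  # step 2: collapse, then go to 1
        print(step, [round(p, 4) for p in probs], outcome)
```

The phase information in the amplitudes is discarded by `measure`, which is the loss of information the comment refers to.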
Really cool talk Max. I'm curious who the audience for your talk was. I think they must have a really good foundation of knowledge in mathematics.
Uh-mazing talk!
Max - loved your recent book: "Mathematical Universe" Good to see you have interest in ML!! Keep up the great work!!!
What is the difference between random forests and DNN?
Max is absolutely brilliant, and a scientist of the absolute highest caliber, but his categorization of the different tasks within machine learning is incorrect. Modeling a joint probability, p(x,y), is categorically referred to as generative modeling, not unsupervised modeling, which is a different, though potentially overlapping, concept. Classification, correspondingly, is returning a class label for a given input; in standard notation, this is p(y|x). Prediction, or forecasting, is similarly p(x(t) | x(t-1), ..., x(1)). Unsupervised learning, by contrast, has no conventional notation; it refers to a scheme where a class label y is not fed to the training system. The joint probability that he wrote for unsupervised learning actually says nothing about the presence or absence of supervision, unless the y is a label, in which case the formalism is just plain wrong.
I say this because there are lots of students looking at the work of brilliant scientists like Max, and they owe it to the students to have consistent and correct formalism, given that the students may still be learning.
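To make the notation concrete, here is a small sketch that estimates the joint p(x,y) and the conditional p(y|x) from counts. It uses the convention in this comment (x is the input, y is the class label); the toy data and function names are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: (x, y) pairs, x = observed feature, y = class label.
SAMPLES = [
    ("round", "apple"),
    ("round", "apple"),
    ("round", "orange"),
    ("long", "banana"),
]


def joint(samples):
    """Empirical joint distribution p(x, y): what a generative model targets."""
    counts = Counter(samples)
    total = len(samples)
    return {pair: count / total for pair, count in counts.items()}


def conditional_y_given_x(samples):
    """Empirical conditional p(y | x) = p(x, y) / p(x): the classification target."""
    joint_p = joint(samples)
    marginal_x = Counter()
    for (x, _), p in joint_p.items():
        marginal_x[x] += p
    return {(x, y): p / marginal_x[x] for (x, y), p in joint_p.items()}


if __name__ == "__main__":
    print(joint(SAMPLES))
    print(conditional_y_given_x(SAMPLES))
```

Both functions consume the labels y, so both are supervised in the sense used above; whether a scheme is unsupervised depends on whether y is available during training, not on which distribution is written down.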
So prediction is like Bayesian learning with MCMC and all that?
Thanks for pointing this out. I thought it was me that did not understand this properly. By the way, great stuff by Max Tegmark, inspiring and all, but still at an embryonic stage of development.
This is the caption of the image at 4:40, taken from his paper:
"In this paper, we follow the machine learning convention where y refers to the model parameters and x refers to the data, thus viewing x as a stochastic function of y (please beware that this is the opposite of the common mathematical convention that y is a function of x). Computer scientists usually call x a category when it is discrete and a parameter vector when it is continuous. Neural networks can approximate probability distributions. Given many samples of random vectors y and x, unsupervised learning attempts to approximate the joint probability distribution of y and x without making any assumptions about causality. Classification involves estimating the probability distribution for y given x. The opposite operation, estimating the probability distribution of x given y is often called prediction when y causes x by being earlier data in a time sequence; in other cases where y causes x, for example via a generative model, this operation is sometimes known as probability density estimation. Note that in machine learning, prediction is sometimes defined not as outputting the probability distribution, but as sampling from it."
I hope this clarifies things. For more information have a look at the paper arxiv.org/pdf/1608.08225.pdf
mailoisback thanks, that helped a lot.
awesome
Max you're the man!!
Thank you very much, it was really helpful. Even though I had suggested something like that idea before :)
Just after 7:05, whose talk is he referring to? The name ends with *ski.
Prof. Haim Sompolinsky - I think he is referring to this talk - ocw.mit.edu/resources/res-9-003-brains-minds-and-machines-summer-course-summer-2015/unit-9.-theory-of-intelligence/lecture-9.2-haim-sompolinsky-sensory-representations-in-deep-networks/
Center for Brains, Minds and Machines (CBMM) thanks!
Max shows why AI is a cross-disciplinary science. Really stimulating video.
Max is a bit 'late': the compression-bound nature of neural networks has been known for quite a while:
www.quora.com/How-are-hidden-Markov-models-related-to-deep-neural-networks/answer/Jordan-Bennett-9
That said, we need to subsume larger problems, including Marcus Hutter's temporal difference aligned lemma, via hints from quantum mechanics, deep reinforcement learning (particularly DeepMind-flavoured) and causal learning (i.e. uetorch):
www.academia.edu/25733790/Causal_Neural_Paradox_Thought_Curvature_Quite_the_transient_naive_hypothesis
A code sample that initializes the confluence of the temporal difference regime, around the causal horizon:
github.com/JordanMicahBennett/God
I've seen your work on ResearchGate and visited your website. You're hardcore and weird at the same time. What the hell is wrong with your Quora dp?? lol! It's hilarious! :)
He should definitely learn how to speak without those annoying pauses and weird mouth warps.
I haven't read the paper but the presentation is terrible. The examples are not clearly illustrated.
What kind of mysticism is this... Blocked.