How does this comparison hold up in the case of unbounded activation functions? If I am understanding your initial argument, we can essentially treat each neuron as on or off, but in practice most models use unbounded functions like ReLU, which has "off" and many degrees of "on".
Additionally, it is becoming more common to remove biases from the larger models, which in many cases improves performance (marginally fewer FLOPs and lower loss). This would remove the B term in the probability function, leaving it entirely dependent on J. Although perhaps this imposes a type of symmetry on the network, and that's why it results in lower loss?
This is a great comment.
Regarding ReLU, I haven't tried it yet, but my guess is we need to try higher-spin systems like spin 3/2 rather than spin 1/2.
Regarding bias, yes, we can switch terms on and off. And yes, turning B off means adding a Z2 symmetry (s -> -s will not change any of the results). I think we should first check whether our dataset has this symmetry; if it does, then applying it will improve performance.
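To make the Z2 point concrete, here is a minimal sketch (not code from the video; the coupling J and field B below are random stand-ins for whatever parametrization the model actually learns). With B = 0, the Boltzmann weight exp(0.5 s·J·s + B·s) is unchanged under s -> -s, so the distribution is Z2-symmetric; a nonzero B breaks that.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4                                            # tiny system so we can enumerate every state
J = rng.normal(size=(n, n)); J = (J + J.T) / 2   # stand-in couplings
B = rng.normal(size=n)                           # stand-in external field / bias term

def boltzmann_probs(J, B):
    """Exact probabilities p(s) ~ exp(0.5 * s.J.s + B.s) over all +-1 configurations."""
    states = np.array(list(itertools.product([-1, 1], repeat=J.shape[0])))
    weights = np.exp(0.5 * np.einsum('ki,ij,kj->k', states, J, states) + states @ B)
    return states, weights / weights.sum()

def is_z2_symmetric(J, B):
    """Check whether flipping every spin (s -> -s) leaves the distribution unchanged."""
    states, p = boltzmann_probs(J, B)
    index = {tuple(s): i for i, s in enumerate(states)}
    p_flipped = np.array([p[index[tuple(-s)]] for s in states])
    return np.allclose(p, p_flipped)

print(is_z2_symmetric(J, np.zeros(n)))   # True: no field term, Z2 symmetry holds
print(is_z2_symmetric(J, B))             # False: the B term breaks it
```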
@@CompuFlair I believe the issue with functions like ReLU is that they essentially turn your quantized-state assumption into a continuous state. So looking at a higher-spin system may not help bridge the gap.
Regarding symmetry, most datasets do not exhibit this property, yet removing the bias remains effective. Perhaps the inclusion of multiple hidden layers forces the model to learn a transformation that includes this new symmetry?
Meanwhile, a simple test of whether the symmetry exists would be to consider the embedding layer of one of the newer LLaMA models (since that architecture omits the bias terms). In this case, the embedding layer is equivalent to the "dataset", since the integer token IDs index those values.
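A rough version of that test, as a sketch only: pull the embedding matrix (for example via model.get_input_embeddings().weight in Hugging Face Transformers, assuming you have access to the weights; a random stand-in is used below so the snippet runs on its own) and check whether the empirical distribution of its values looks invariant under x -> -x, which is what the s -> -s symmetry would suggest.

```python
import numpy as np

# emb stands for the embedding matrix of whatever model you load; a random
# stand-in is used here so the snippet is self-contained.
rng = np.random.default_rng(0)
emb = rng.normal(loc=0.02, scale=1.0, size=(32000, 256))

def symmetry_report(emb):
    """Crude checks of whether the value distribution looks invariant under x -> -x."""
    flat = np.asarray(emb, dtype=float).ravel()
    mean_over_std = flat.mean() / flat.std()                 # ~0 if symmetric about zero
    skew_about_zero = (flat ** 3).mean() / flat.std() ** 3   # ~0 if symmetric about zero
    srt = np.sort(flat)
    qq_gap = np.abs(srt + srt[::-1]).mean()                  # ~0 if x and -x match quantile by quantile
    return mean_over_std, skew_about_zero, qq_gap

print(symmetry_report(emb))
```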
The challenge with ReLU is its abrupt change at the activation threshold. I can model that in two ways (though there could be better ways that haven't come to mind yet): a high-spin system has enough states to be approximated as continuous, or there is phase-transition behavior going on.
Regarding the B parameter, it accounts for both the bias and the weights in the input layer, so with the bias turned off, B is still not zero. Why turning the bias off improves performance is something we need to investigate. For sure, something interesting is going on there.
Yes, of course, adding the embedding layer adds more complexity that I plan to get into. But I like your guess; it could be that.
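To illustrate the high-spin idea numerically (a sketch, with the inverse temperature folded into the local field h): a spin-S variable has 2S + 1 levels, and its Boltzmann-averaged value goes from the two-state tanh-like curve at S = 1/2 toward a smoother, nearly continuous response as S grows.

```python
import numpy as np

def mean_spin(S, h):
    """Boltzmann average of a spin-S variable (levels m = -S, ..., S) in a local field h."""
    m = np.arange(-S, S + 1)           # 2S + 1 levels
    w = np.exp(h[:, None] * m)         # unnormalized Boltzmann weights, beta folded into h
    return (w * m).sum(axis=1) / w.sum(axis=1)

h = np.linspace(-3, 3, 7)
for S in (0.5, 1.5, 10):
    print(S, np.round(mean_spin(S, h) / S, 3))   # normalized so the curves are comparable
# S = 0.5 reproduces 0.5 * tanh(h / 2), the two-state case;
# larger S gives more levels and a smoother, more nearly continuous response.
```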
There are also continuous versions of these systems. Ising models, or other Potts-like models, do have a limit to the XY model, etc.
@@hjups ReLU helped solve a key problem in AI backpropagation around continuous, differentiable vs. non-differentiable topology.
While physical neurons do have ReLU-like behaviors, namely the process of long-term potentiation (LTP), the brain also has a quantum nature...
It's got two active points that are correlated to changes faster than light synchronization, which we read with fMRI. This implies that understanding all dimensions of spacetime is essential to the intelligence of humans.
Fortunately we have E-infinity theory by M.S. El Naschie, and some neurologists have confirmed this is the case, and done so with a sophisticated head start to understanding it.
Let me be clear: this has nothing to do with uncertainty. It has everything to do with the fact that entanglement is in fact capable of communication, and the manifold that the active points of the brain's activity patterns form is key to decoding the brain's thought process. The signals of the brain are organized in higher dimensions, and some of their organization is the timing offset from resonance; analyzed statistically over time, we find Brownian motion and a fractal correlation of the standard deviation of brain-signal impulses from resonance impulses.
We will likely not understand this to the degree that we can run an equivalent AI on 3.5 watts, as the human brain does... but there are tons of reasons to study manifolds, artificial neurons, and fractals all in relation to one another.
I think the brain is kind of 'dowsing' its index of tags; then, with a feeling of the right idea, it can trigger a flood of relevant memories. AI is similar enough...
This framework exists as mindset clusters under language. It's the heart of our murmurings.
Excellent!
Glad you liked it!
There are steps in an optimization algorithm that I am studying, which uses DFT and changes parameters little by little so the model fits the experiment, that look suspiciously close to backpropagation, but it uses a pretty slow method that unloads the wavefunction from memory, changes parameters in the input file, and reloads it into memory. I would bet that if physicists adopted ML algorithms, we could do a lot of stuff much faster.
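If I read this right, the loop described above (perturb a parameter in the input file, rerun, keep what improves the fit) is essentially gradient descent with finite-difference gradient estimates, costing extra model evaluations per parameter per step, which is exactly the cost backpropagation avoids. A toy version on a made-up least-squares fit, nothing DFT-specific:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y_obs = 2.0 * x + 0.5 + rng.normal(scale=0.05, size=x.size)   # stand-in "experiment"

def loss(theta):
    """Misfit between the 2-parameter model a*x + b and the observed data."""
    a, b = theta
    return np.mean((a * x + b - y_obs) ** 2)

def finite_diff_grad(theta, eps=1e-5):
    """Two extra loss evaluations per parameter per step: the slow 'perturb and rerun' way."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta); bump[i] = eps
        g[i] = (loss(theta + bump) - loss(theta - bump)) / (2 * eps)
    return g

theta = np.array([0.0, 0.0])
for _ in range(500):
    theta -= 0.5 * finite_diff_grad(theta)    # plain gradient descent on the estimate
print(np.round(theta, 3))                     # close to the true [2.0, 0.5]
```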
Interesting! Have you published your work? If yes, please share a link so we can take a look.
Amazing!
Thanks!
Maybe I'm speculating too much, but since I started to understand gradient descent as some kind of least-action principle, what's happening fundamentally has become even more intuitive. As far as I know, the Ising model combines the two biggest principles in nature, least action and entropy, and the latter, as far as I know, is also quite similar to Shannon's entropy. Maybe this is more epistemology than maths, but it kind of explains fundamentally why those models could be similar. Now, supposing the correspondence is actually functional, would this improve even more by using quantum computers to run new ML models?
Thanks for the comment. Well, quantum computers have some potential to speed up computations. That is all I can say with certainty.
GiGo
Just because you can simulate the workings of a television set, analogue or digital, it does not mean you have access to the Vision that the Tell-A-Vision transmits.
I'm a layman, but this sounds like the correspondence between neural networks and fuzzy logic, which seems like something more directly suited for writing proofs and reasoning about (given that logic is the basis of mathematical proofs).
Are we sure that the Ising model is a better proxy for studying and interpreting NNs?
Thanks for the comment. That is an interesting view.
The Ising model is a simplified model of a spin system and has many applications in statistical mechanics. It is a mathematical object that is easy to analyze and yet captures the essence of complex phenomena. So, it can help, but it shouldn't exclude other proxies.
11:26 : Hmm… is the Ising model even really about spin? I thought it was about, like, the magnetization of small domains in a ferromagnetic material or something like that…
I guess it could be both, as a sort of universality thing…
Still, saying that spin has 2 possible values of up and down, rather than having {|up>,|down>} as a basis…
Well, I suppose this is supposed to be approachable to people who have ML experience but no physics experience?
If we are going to treat spins ~classically, I am unsure where you are going when you mention possibly using higher spins like 3/2.
I will continue watching the video now.
14:53 : Ah, I see, you have a different J term for each pair of connected hidden or output neurons…
Hmm… well, definitely some similarities in such systems, I imagine that the degree of activation is meant to be the probability that a given spin is spin up…
Ah, hm, but if it is just that, then this sort of seems like it would neglect how the correlations between two spins in the same layer would influence their effect on a spin they are both connected to in the next layer.
Also, it seems like it would lack the feed-forward property? Like, it seems like, holding the parameters of the early connections fixed, that changing the parameters of later connections would influence the probability of spins in earlier layers being up or down, which is unlike how things are with neural nets?
20:28 : hm, typically deep neural networks don’t involve random processes as part of their internal behavior? And the activation values aren’t 2-valued, but a continuous range (… well, technically it’s using floating point, but whatever.)
21:45 : “what does it take for the neuron to become active” : for almost all DNNs used today, the neurons all have degrees of activation, not just active/inactive ? Perceptrons had the sharp cut-off thing, but that doesn’t work with backprop, and so they mostly aren’t used?
22:23 : ok, yes, biological neurons are more like this…
But they also like, spike at particular times…
There is work on making artificial neural nets that more closely resemble them, but I don’t think they are used in practice much so far.
23:08 : tanh(x) isn’t a step function, though I suppose tanh(cx) approaches a step function as c goes to infinity…
26:41 : Ah, here is where the assumption of independence is used / where the correlations are neglected.
So, I guess you are using the *expected value* of the spin, as the activation value of the neuron. Ok, fair.
28:13 : ah, hm, maybe you are including correlations between the spins?
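On the "expected value of the spin as the activation" reading: for a single s = ±1 spin in a local field h (with the temperature folded into h), the Boltzmann average is exactly <s> = tanh(h), which is presumably where the tanh activation comes from. A quick numerical check (a sketch, not the video's derivation):

```python
import numpy as np

h = np.linspace(-4, 4, 9)                      # local field: weights . inputs + bias
p_up = np.exp(h) / (np.exp(h) + np.exp(-h))    # Boltzmann probability of s = +1
expected_s = p_up * 1 + (1 - p_up) * (-1)      # average of s over {-1, +1}
print(np.allclose(expected_s, np.tanh(h)))     # True: <s> = tanh(h)
```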
26:57 : hmm, the product of exponential factors have the exponents add, yes, but what about the normalization factors?
Ah, I guess it works out because those are also sums over possible states for a given part of the network, of the same kinds of exponential terms, and products over those are then just sums over combinations of values for the combined set of variables, of the product of the exponentials.
Ok, that seems to work…
Well, at least if the sets of variables being summed over are disjoint.
… huh. Disjoint.
I guess having many identical copies of parts of the network would help to make some things independent, and that would I guess address the other concern I mentioned, at the cost of making the size of the spin network exponential in the depth of the neural network.
Like, if for each neuron in a given hidden layer (or in the output layer), you had a separate copy of the spin system corresponding to the part of the neural network that is run before it, where they only met up again at the actual input to the network.
Seems a bit extravagant, but of course we don’t actually have to implement things that way in order to get a useful correspondence with Ising-model stuff by doing that.
Huh!
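The normalization point can be checked on a tiny example (a sketch with arbitrary random couplings): when the two sub-energies involve disjoint sets of spins, the joint partition function factorizes as Z_total = Z_1 * Z_2, so the product of the separately normalized Boltzmann factors is itself properly normalized.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 2                                              # two disjoint sets of spins
J1 = rng.normal(size=(n1, n1)); J1 = (J1 + J1.T) / 2       # arbitrary couplings, set 1
J2 = rng.normal(size=(n2, n2)); J2 = (J2 + J2.T) / 2       # arbitrary couplings, set 2

def Z(J):
    """Partition function of a small +-1 spin system with weight exp(0.5 * s.J.s)."""
    s = np.array(list(itertools.product([-1, 1], repeat=J.shape[0])))
    return np.exp(0.5 * np.einsum('ki,ij,kj->k', s, J, s)).sum()

def Z_joint(J1, J2):
    """Same, for the combined system whose exponent is the sum of the two parts."""
    s = np.array(list(itertools.product([-1, 1], repeat=n1 + n2)))
    s1, s2 = s[:, :n1], s[:, n1:]
    e = 0.5 * (np.einsum('ki,ij,kj->k', s1, J1, s1) + np.einsum('ki,ij,kj->k', s2, J2, s2))
    return np.exp(e).sum()

print(np.isclose(Z_joint(J1, J2), Z(J1) * Z(J2)))   # True: Z factorizes over disjoint sets
```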
Ok, so specifically, for every path from an output neuron to a hidden neuron, I want an Ising spin,
and the energy for that Ising spin should be:
−(the bias term of the corresponding neuron) × (this spin's value) + the sum, over the spins s′ that correspond to paths extending this path, of s′ × (this spin's value) × (the weight between this path's neuron and the neuron added by the extra step), possibly with an overall factor of −1 on that sum;
or, if the corresponding neuron is in the first hidden layer, use the network inputs in place of spins from “the previous hidden layer”.
This should remove the influence of the correlations I was worried about.
In doing this, all of the spins that correspond to a given neuron have their expected value equal to the activation of that neuron.
In this, I am regarding each spin value as being determined by a Boltzmann distribution from the spins that come “before” it, and so am not considering the possibility of energy considerations making a spin closer to the output having an influence on spins closer to the input.
Uhh…
Not sure whether treating the whole thing as one system, where the energy of the whole system is considered and used in Boltzmann, would result in the same probability distribution. Maybe it does?
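One very stripped-down version of that "each spin is Boltzmann-distributed given the spins before it" picture, as a sketch that ignores the within-layer correlations the path-copying construction is meant to handle: if each layer's activation is taken to be the expected value of independent ±1 spins in the local field set by the previous layer, the forward pass is exactly an ordinary tanh network, since <s> = tanh(field).

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [4, 5, 3, 2]                                    # input, two hidden layers, output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]

def forward_expected_spins(x):
    """Mean-field pass: each layer's 'activation' is the expected value of +-1 spins
    feeling a local field W @ a + b from the (assumed independent) previous layer."""
    a = x
    for W, b in zip(Ws, bs):
        field = W @ a + b
        p_up = np.exp(field) / (np.exp(field) + np.exp(-field))  # Boltzmann p(s = +1)
        a = 2 * p_up - 1                                         # <s> = p(+1) - p(-1)
    return a

def forward_tanh(x):
    """The same computation written as an ordinary tanh feed-forward network."""
    a = x
    for W, b in zip(Ws, bs):
        a = np.tanh(W @ a + b)
    return a

x = rng.normal(size=sizes[0])
print(np.allclose(forward_expected_spins(x), forward_tanh(x)))   # True
```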
Thanks for the comment. Liked your way of thinking.
Computer scientists: I have an idea, I'll make a proof of concept and judge it by how well it performs.
Physicists: I have an idea.
We'll try to change that picture :)
I don't understand everything, but could we use spintronics and some magnetic semiconductor for supervised learning? We fix the entry and exit spin bits of the circuit, which automatically takes on a new equilibrium, reducing the Helmholtz function F = U − TS, and once the learning is finished we just have to read all the spin bits in the circuit and use them as weights for our neural network? It would consume less energy because spin waves use less energy than the electron conduction used in traditional semiconductors?
Well, it might work or it might not. The only way to see whether it works is to derive the details.
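Whether the spintronic hardware would work is a separate question, but the F = U − TS part of the comment above can be made concrete (a sketch on a toy energy function): among all probability distributions over the spin states, the Boltzmann distribution is the one that minimizes F[p] = <E> − T·S, which is why "relaxing to equilibrium" and "reducing the Helmholtz function" are the same statement.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, T = 4, 1.5                                        # toy system size and temperature
J = rng.normal(size=(n, n)); J = (J + J.T) / 2       # arbitrary stand-in couplings
states = np.array(list(itertools.product([-1, 1], repeat=n)))
E = -0.5 * np.einsum('ki,ij,kj->k', states, J, states)   # energy of each configuration

def helmholtz(p):
    """F[p] = <E> - T * S, with S the Gibbs/Shannon entropy of the distribution p."""
    return p @ E + T * np.sum(p * np.log(p))

boltz = np.exp(-E / T); boltz /= boltz.sum()         # equilibrium (Boltzmann) distribution
F_eq = helmholtz(boltz)

# Any other distribution over the same states has a larger F.
for _ in range(5):
    q = boltz * np.exp(0.3 * rng.normal(size=boltz.size)); q /= q.sum()
    print(helmholtz(q) >= F_eq)                      # True every time
```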
Super complicated concepts! Luckily you make the topics easily accessible, and summarise and repeat.
Thanks for the comment
Is there an implementation of this equivalence?
Would you explain which part should be implemented?
I did a paper on this
Nice. Would you like to tell us more about it?
I would be very interested in reading your paper! Would you mind giving us a reference, please?
But what does it give us?
This is a good question. It might open the possibility of building AI models with more control over them
You are a couple of decades late to the party. I wonder why you left out information theory.
What about information theory?
@@CompuFlair The whole rationale about the Ising model depends on information theory; that's why I found it odd you did not mention it at all. Also, there are some details you missed about the Ising model at higher dimensions, and there is a curious fact about it: the original 2D Ising model was wrong. There is also some historical context that I find fascinating, but that could be hours of content on its own.
@@renanmonteirobarbosa8129 Can you provide references? I would be happy to include them in future videos
@@CompuFlair Leonard Susskind touches this topic on his Statistical Mechanics class. th-cam.com/play/PLpGHT1n4-mAsJ123W3fjPzvlDHOvIhHA0.html&si=ll_Ydgn2L-F6Twfj
Stephen Wolfram called this, decades ago.
Reference, please. Did he derive the probability of the neural nets?
@CompuFlair He wrote a book called A New Kind of Science, which has an accompanying website; it's about cellular automata and how they can be used to compute things that would otherwise have to be derived either algebraically or through geometric axioms.
The core concept is that immense complexity can be derived from very simple, minimalistic rule sets.
Stephen Wolfram also created Mathematica, which was the de facto standard for academic mathematics for many years.
He is considered one of the founding figures of artificial intelligence, and he's still around; you can find his work online and on TH-cam, etc.
He continues to write extensively on the topics of neural networks, physics simulations, computational science, and so on.
Stephen Wolfram's original work on cellular automata was on the 1D elementary variants. It has been shown that topologies isomorphic to the 1D case of the Ising model are not able to undergo phase transitions, which is essential for systems to reach critical points and maximize efficiency.
Wolfram and his team are currently focusing on hypergraphs, which is largely unrelated to this subfield.
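For anyone who wants to see the "no phase transition in 1D" point concretely: the exact 1D Ising free energy per site comes from the largest eigenvalue of a 2x2 transfer matrix, f(T) = -T ln(lambda_max), with lambda_max = 2 cosh(J/T) at zero field. It is smooth for all T > 0, so there is no critical point, unlike the 2D model with its Onsager transition. A small sketch (units with k_B = 1):

```python
import numpy as np

def free_energy_1d(T, J=1.0, B=0.0):
    """Exact 1D Ising free energy per site from the largest transfer-matrix eigenvalue."""
    lam_max = np.exp(J / T) * np.cosh(B / T) + np.sqrt(
        np.exp(2 * J / T) * np.sinh(B / T) ** 2 + np.exp(-2 * J / T))
    return -T * np.log(lam_max)

T = np.linspace(0.05, 5.0, 400)
f = free_energy_1d(T)
heat_capacity = -T * np.gradient(np.gradient(f, T), T)   # c = -T d^2 f / dT^2
print(np.isfinite(heat_capacity).all())   # True: smooth everywhere, no divergence, no T_c in 1D
```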
The Nobel prize has become a clout-chasing thing: AI becomes popular, and then suddenly AI is physics.
Well, two videos of this series were published before the Nobel prize was announced.
Also, when Boltzmann worked on stat mech, it wasn't considered physics; the same was true of cosmology when Hubble was working on it (that's why he never won the prize). And when quantum mechanics and relativity were on the horizon, Lord Kelvin claimed "there is nothing new to be discovered in physics", not because he was unaware of what was coming but, as Steven Weinberg put it, because he didn't consider them physics.
And now stat mech, quantum mechanics, and cosmology are all physics.
You're trying to emulate the brain, which is very similar to trying to simulate or emulate a physical system. It was always a close neighbor of physics simulations because the math is similar. Even a number of physics terms and names were used early on.
Because it is? Most modern machine-learning control techniques are based on concepts from physics. These constructs are basically dynamical systems.