Deep Networks Are Kernel Machines (Paper Explained)

  • Published Jan 23, 2025

Comments • 172

  • @YannicKilcher
    @YannicKilcher  4 years ago +28

    ERRATA: I simplify a bit too much when I pit kernel methods against gradient descent. Of course, you can even learn kernel machines using GD, they're not mutually exclusive. And it's also not true that you "don't need a model" in kernel machines, as it usually still contains learned parameters.

  • @florianhonicke5448
    @florianhonicke5448 4 years ago +33

    Thank you so much for all of your videos. I just found some time to finish the last one, and here is the next video already in the pipeline. The impact you have on the AI community is immense! Just think about how many people started in this field just because of your videos, not even counting the multiplier effect of them educating their friends.

  • @23kl104
    @23kl104 4 years ago +14

    yes, would fully appreciate more theoretical papers.
    Keep up the videos man, they are gold

  • @ish9862
    @ish9862 4 years ago +3

    Your way of explaining these difficult concepts in a simple manner is amazing. Thank you so much for your content.

  • @tpflowspecialist
    @tpflowspecialist 4 years ago +4

    Amazing generalization of the concept of a kernel in learning algorithms to neural networks! Thanks for breaking it down for us.

  • @dalia.rodriguez
    @dalia.rodriguez 4 years ago +6

    "A different way of looking at a problem can give rise to new and better algorithms because we understand the problem better" ❤

  • @andreassyren329
    @andreassyren329 4 years ago +6

    I think this paper is wonderful in terms of explaining the Tangent Kernel, and I'm delighted to see them showing that there is a kernel _for the complete model_, such that the model can be interpreted as a kernel machine with some kernel (the path kernel). It ties the whole Neural Tangent Kernel stuff together rather neatly.
    I particularly liked your explanation of training in relationship to the Tangent Kernel, Yannic. Nice 👍.
    I do think their conclusion, that this suggests that ANNs don't learn by feature discovery, is not supported enough.
    What I'm seeing here is that, while the path kernel _given_ the trajectory can describe the full model as a kernel machine, the trajectory it took to get it _depends on the evolution_ of the Tangent Kernel.
    So the Tangent Kernel changing along the trajectory essentially captures the idea of ANNs learning features, that they then use to train in future steps.
    The outcome of K_t+1 depends on K_t, which represents some similarity between data points. But the outcomes of the similarities given by K_t were informed by K_t-1. To me that looks a lot like learning features that drive future learning. With a kind of _prior_ imposed by the architecture, through the initial Tangent Kernel K_0.
    In short. Feature discovery may not be necessary to _represent_ a trained neural network. But it might very well be needed to _find_ that representation (or find the trajectory that got you there). In line with the fact that representability != learnability.
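    (For reference, the two kernels discussed here are, roughly in the paper's notation, the tangent kernel at weights w and the path kernel, its integral along the gradient-descent trajectory c(t):

        K^g_{f,w}(x, x') = \nabla_w f(x, w) \cdot \nabla_w f(x', w),
        \qquad K^p_{f,c}(x, x') = \int_{c(t)} K^g_{f,w(t)}(x, x') \, dt.

    The path kernel is what appears in the theorem; the tangent kernel is the object that evolves during training, which is the evolution this comment is pointing at.)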

  • @al8-.W
    @al8-.W 4 years ago +12

    Thanks for the great content, delivered on a regular basis. I am just getting started as a machine learning engineer and I usually find your curated papers interesting. I am indeed interested in the more theoretical papers like this one so you are welcome to share. It would be a shame if the greatest mysteries of deep learning remained concealed just because the fundamental papers are not shared enough !

    • @dermitdembrot3091
      @dermitdembrot3091 4 years ago +2

      Agree, good that Yannic isn't too shy to look into theory

  • @weysmostafapour_Official
    @weysmostafapour_Official 4 years ago +1

    I love how simply you explained such a complicated paper. Thanks!

  • @nasrulislam1968
    @nasrulislam1968 4 years ago +3

    Oh man! You did a great service to all of us! Thank you! Hope to see more coming!

  • @andrewm4894
    @andrewm4894 4 years ago +3

    Love this, Yannic does all the heavy lifting for me, but I still learn stuff. Great channel.

  • @111dimka111
    @111dimka111 3 years ago +2

    Thanks again, Yannic, for a very interesting review. I'll also give my five cents on this paper.
    I'll start with some criticism. The key step of the paper's proof is to divide and multiply by the path kernel (almost at the end of the proof). This makes the coefficients a_i a function of the input, a_i(x), which, as noted in Remark 1, is very different from a typical kernel formulation. This difference is not minor, and I'll explain why. When you say that some model is a kernel machine and belongs to the RKHS defined by a kernel k(x_1, x_2), you can start exploring that RKHS, look at its properties (mainly its eigendecomposition), and from them deduce various model behaviours (its expressiveness and tendency to overfit). Yet the division/multiplication step above lets us express a NN as a kernel machine for any kernel. Take some other, irrelevant kernel (not the path kernel) and use it in the same way - you will obtain the result that the NN is a kernel machine for this irrelevant kernel. Hence, if we allow a_i to be x-dependent, then any sum of training-point terms is a kernel machine for an arbitrary kernel. Not a very strong statement, in my opinion.
    Now for the good parts: the paper's idea is clear and simple, and it pushes the research field further in the right direction of understanding the theory behind DL. Also, the form that a_i takes (the loss derivative weighted by the kernel and then normalized by the same kernel) may be useful in future work (not sure). But mainly, as someone who worked on these ideas a lot during my PhD, I think papers like this one, which explain DL via tangent/path kernels and their evolution during the learning process, will eventually give us the entire picture of why and how NNs perform so well. Please review more papers like this :)
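    (For reference, the representation obtained by that divide-and-multiply step is, roughly in the paper's notation,

        y = \sum_i a_i(x) \, K^p_{f,c}(x, x_i) + b,
        \quad a_i(x) = -\frac{\int_{c(t)} L'(y_i^*, y_i) \, K^g_{f,w(t)}(x, x_i) \, dt}{\int_{c(t)} K^g_{f,w(t)}(x, x_i) \, dt},
        \quad b = y_0,

    where L' is the derivative of the loss with respect to the model output and y_0 is the initial model's prediction. The x-dependence of a_i enters through the kernel weighting in both numerator and denominator, which is exactly the point of Remark 1 discussed above.)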

  • @minhlong1920
    @minhlong1920 3 years ago

    I'm working on NTK and I came across this video. Truly amazing explanation, it really clears things up for me!

  • @master1588
    @master1588 4 years ago +4

    This follows the author's hypothesis in "The Master Algorithm" that all machine learning algorithms (e.g. NN, Bayes, SVM, rule-based, genetic, ...) approximate a deeper, hidden algo. A Grand Unified Algorithm for Machine Learning.

    • @master1588
      @master1588 4 years ago +1

      For example: lamp.cse.fau.edu/~lkoester2015/Master-Algorithm/

    • @herp_derpingson
      @herp_derpingson 4 years ago +1

      @@master1588 The plot thickens :O

    • @adamrak7560
      @adamrak7560 4 years ago

      Or they are similar in a way because they are all universal.
      Similar to universal Turing machines: they can each simulate each other.
      The underlying algorithms may be connected by the original proof that NNs are universal approximators.

  • @morkovija
    @morkovija 3 years ago +5

    "I hope you can see the connection.." - bold of you to hope for that

  • @mlearnxyz
    @mlearnxyz 4 years ago +9

    Great news. We are back to learning Gram matrices.

    • @kazz811
      @kazz811 4 years ago +1

      Unlikely that this perspective is used for anything.

  • @ハェフィシェフ
    @ハェフィシェフ 4 years ago +1

    I really liked this paper, puts neural networks in a completely different perspective.

  • @OmanshuThapliyal
    @OmanshuThapliyal 4 years ago +2

    Very well explained. The paper itself is written so well that I could read it as a researcher outside of CS.

  • @LouisChiaki
    @LouisChiaki 4 years ago +5

    Very excited to see some real math and machine learning theory here!

  • @kaikyscot6968
    @kaikyscot6968 4 years ago +6

    Thank you so much for your efforts

  • @michelealessandrobucci614
    @michelealessandrobucci614 4 years ago +3

    Check this paper: Input similarity from the neural network perspective. It's exactly the same idea (but older)

  • @mrpocock
    @mrpocock 4 years ago +6

    I can't help thinking of attention mechanisms as neural networks that rate timepoints as support vectors, with enforced sparsity through the unity constraint.

    • @syedhasany1809
      @syedhasany1809 4 years ago +2

      One day I will understand this comment and what a glorious day it will be!

  • @instituteofanalytics2017
    @instituteofanalytics2017 3 years ago

    @yannic kilcher - Your brilliance and simplicity are amazing.

  • @wojciechkulma7748
    @wojciechkulma7748 4 years ago +1

    Great overview, many thanks!

  • @MrDREAMSTRING
    @MrDREAMSTRING 4 years ago

    So basically an NN trained with gradient descent is equivalent to a function that computes the kernel operations across all the training data (and across the entire training path!), and obviously the NN runs much more efficiently. That's pretty good, and a very interesting insight!

  • @herp_derpingson
    @herp_derpingson 4 years ago +5

    24:30 I wonder if this path tracing thingy works not only for neural networks but also for t-SNE. Imagine a bunch of points in the beginning of t-SNE. We have labels for all points except for one. During the t-SNE optimization, all points move. The class of the unknown point is equal to the class of the point to which its average distance was the least during the optimization process.
    .
    41:57 I think it means we can retire the Kernel Machines because Deep Networks are already doing that.
    .
    No broader impact statement? ;)
    .
    Regardless, perhaps one day we can have a proof like: "Since kernel machines cannot solve this problem and neural networks are kernel machines, there cannot exist any deep neural network that can solve this problem." Which might be useful.

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      Very nice ideas. Yes, I believe the statement is valid for anything trained with gradient descent, and with a few modifications you could probably even extend it to any sort of ODE-driven optimization algorithm.

  • @scottmiller2591
    @scottmiller2591 4 years ago +2

    I think what the paper is saying is "neural networks are equivalent to kernel machines, if you confine yourself to using the path kernel." No connection to Mercer or RKHS, so even the theoretical applicability is only to the path kernel - no other kernels allowed, unless they prove that path kernels are universal kernels, which sounds complicated. I'm also not sanguine about their statement about breaking the bottleneck of kernel machines - I'd like to see a big O for their inference method and compare it to modern low O kernel machines.
    Big picture, however, I think this agrees with what most kernel carpenters have always felt intuitively.
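    (For context on the RKHS remark: the standard machinery requires the kernel to be symmetric and positive semidefinite in Mercer's sense, i.e.

        \sum_{i,j} c_i c_j K(x_i, x_j) \ge 0 \quad \text{for all finite } \{x_i\} \text{ and real } c_i.

    The path kernel does satisfy this, since each tangent kernel is a dot product of gradients and an integral of positive-semidefinite kernels stays positive semidefinite; what remains open, as the comment says, is whether it is a universal kernel and how its RKHS compares to standard ones.)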

  • @IoannisNousias
    @IoannisNousias 4 years ago +1

    Thank you for your service sir!

  • @willwombell3045
    @willwombell3045 4 years ago +130

    Wow someone found a way to rewrite "Neural Networks are Universal Function Approximators" again.

    • @neoli8289
      @neoli8289 4 years ago +1

      Exactly!!!

    • @gamerx1133
      @gamerx1133 4 years ago +1

      @chris k Yes

    • @Lee-vs5ez
      @Lee-vs5ez 4 years ago

      😂

    • @olivierphilip1612
      @olivierphilip1612 4 years ago

      @chris k Any continuous function to be precise

    • @drdca8263
      @drdca8263 4 years ago +5

      @chris k tl;dr: If by "all functions" you mean "all L^p functions" (or "all locally L^p functions"?) for some p in [1,infinity), then yes.
      (But technically, this isn't *all* functions from (the domain) to the real numbers, for which the question seems not fully defined, because in that case, what do we mean by "approximated by"?)
      needlessly long version:
      I was going to say "no, because what about the indicator function for a set which isn't measurable", thinking we would use the supremum norm for that (actual supremum, not supremum-except-neglecting-measure-zero-sets), in order to make talking about the convergence well-defined, but then I realized/remembered that under that criterion, you can't even approximate a simple step function using continuous functions (the space of bounded continuous functions is complete under the supremum norm), and therefore using the supremum norm can't be what you meant.
      Therefore, the actual issue here is that "all functions" is either too vague, or, if you really mean *all* functions from the domain to the real numbers, then this isn't compatible with the norm we are presumably using when talking about the approximation.
      If we mean "all L^p functions" for some p in [1,infinity) , then yes, because the continuous functions (or the continuous functions with compact support) are dense in L^p (at least assuming some conditions on the domain of these functions which will basically always be satisfied in the context we are talking about) .
      Original version of this comment: deleted without ever sending because I realized it had things wrong about it and was even longer than the "needlessly long version", which is a rewrite of it, which takes into account from the beginning things I realized only at the end of the original.
      I'm slightly trying to get better about taking the time to make my comments shorter, rather than leaving all the broken thought process on the way to the conclusion in the comment.
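      (The density fact used in this argument: for any p in [1, \infty), continuous functions with compact support are dense in L^p, i.e.

          \forall f \in L^p(\mathbb{R}^n), \ \forall \varepsilon > 0, \ \exists g \in C_c(\mathbb{R}^n) : \ \| f - g \|_p < \varepsilon,

      which is why universal approximation of continuous functions extends to L^p targets but not to literally all functions under the supremum norm.)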

  • @Sal-imm
    @Sal-imm 4 years ago +2

    Very good, pretty much a straightforward linear deduction.

  • @paulcarra8275
    @paulcarra8275 3 years ago

    About your comment in the video that the theorem applies only to the full GD case: in fact it can be extended to SGD as well. You only need to add an indicator (in the sum of gradients over the training data) at each step to mark the points that are sampled at that step (this is explained by the author in the video below). Regards

  • @rnoro
    @rnoro 3 years ago +1

    I think this paper is a translation of the NN formalism into a functional-analysis formalism. Simply speaking, the gradient-descent-on-a-loss-function framework is equivalent to a linear matching problem on a Hilbert space determined by the NN structure. The linearization process is characterized by the gradient scheme. In other words, the loss function on the sample space becomes a linear functional on the Hilbert space. That is all the paper is about, nothing more.

  • @JI77469
    @JI77469 4 years ago +4

    At 39:30 "... The ai's and b depend on x..." So how do we know that our y actually lives in the RKHS, since it's not a linear combination of kernel functions!? If we don't, then don't you lose the entire theory of RKHS?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      true, I guess that's for the theoreticians to figure out :D

    • @drdca8263
      @drdca8263 4 years ago +1

      I didn't know what RKHS stood for. For any other readers of this comment section who also don't : It stands for Reproducing kernel Hilbert space.
      Also, thanks, I didn't know about these and it seems interesting.

  • @marvlnurban5102
    @marvlnurban5102 3 years ago

    The paper reminds me of a paper by Maria Schuld comparing quantum ML models with kernels. Instead of dubbing quantum ML models quantum neural networks, she demonstrates that quantum models are mathematically closer to kernel methods. Her argument is that the dot product of the Hilbert space in which you embed your (quantum) data implies the construction of a kernel method. As far as I understand, the method you use to encode your classical bits into your qubits is effectively your kernel function. Now it seems like kernels connect deep neural networks to "quantum models" by encoding the superposition of the training data points..?
    - Schuld (2021), Quantum machine learning models are kernel methods
    - Schuld (2020), Quantum embedding for machine learning

  • @kimchi_taco
    @kimchi_taco 4 years ago

    * A NeuralNet trained with gradient descent is a special version of a kernel machine, of the form y = sum_i a_i K(x, x_i) + b.
    * It means a NeuralNet works well in the same way an SVM works well. The NeuralNet is even better because it doesn't need to compute the kernel (O(data*data)) explicitly.
    * K(x, x_i) is a similarity score between the new prediction y for x and the training prediction y_i for x_i.
    * The math is cool. I feel this derivation will be useful later.
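    (A minimal numerical sketch of these quantities, using a toy model and plain NumPy; the function names here are just illustrative, not from the paper's code:

        import numpy as np

        # Toy model: f(x, w) = tanh(w . x); its gradient w.r.t. w is (1 - tanh(w . x)^2) * x.
        def f(x, w):
            return np.tanh(w @ x)

        def grad_w(x, w):
            return (1.0 - np.tanh(w @ x) ** 2) * x

        # Tangent kernel at fixed weights: dot product of the two gradients.
        def tangent_kernel(x1, x2, w):
            return grad_w(x1, w) @ grad_w(x2, w)

        # Approximate path kernel: sum the tangent kernel over the weights
        # visited during full-batch gradient descent.
        def path_kernel(x1, x2, w_history):
            return sum(tangent_kernel(x1, x2, w) for w in w_history)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(8, 3))      # 8 training points in 3 dimensions
        y_star = rng.normal(size=8)      # regression targets
        w = rng.normal(size=3)
        w_history = [w.copy()]

        lr = 0.1
        for _ in range(200):             # full-batch gradient descent on 0.5*(f - y*)^2
            grads = np.stack([grad_w(x, w) for x in X])
            residuals = np.array([f(x, w) for x in X]) - y_star   # dL/dy
            w = w - lr * (residuals @ grads) / len(X)
            w_history.append(w.copy())

        x_new = rng.normal(size=3)
        print([path_kernel(x_new, xi, w_history) for xi in X])  # similarity of x_new to each training point

    The printed numbers are the unnormalized path-kernel similarities that the a_i-weighted sum in the theorem would combine.)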

  • @LouisChiaki
    @LouisChiaki 4 years ago +1

    Why is the gradient vector of y w.r.t. w not the tangent vector of the training trajectory of w in the plot? Shouldn't the update of w by gradient descent always be proportional to the gradient vector?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      True, but the two data points shown are not the only ones; the curve follows the average gradient over all training points.
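      (In symbols, the point of the reply: the training trajectory's tangent is the loss gradient averaged over all training points,

          \frac{dw}{dt} = -\eta \, \frac{1}{n} \sum_{i=1}^{n} L'(y_i^*, y_i) \, \nabla_w f(x_i, w),

      so it is a weighted combination of all the per-example gradients \nabla_w f(x_i, w), not parallel to the gradient for any single plotted point.)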

  • @sinaasadiyan
    @sinaasadiyan 1 year ago

    Great Explanation👍

  • @hoaxuan7074
    @hoaxuan7074 4 years ago

    The dot product is an associative memory if you meet certain mathematical requirements, especially relating to the variance equation for linear combinations of random variables. The more things it learns, the greater the angles between the input vectors and the weight vector. If it only learns one association, the angle should actually be zero and the dot product will provide strong error correction.

  • @Sal-imm
    @Sal-imm 4 years ago +2

    Now changing the weights in hypothetical sense means impaction reaction.

  • @shawkielamir9935
    @shawkielamir9935 2 years ago

    Thanks a lot, this is a great explanation and I find it very useful. Great job !

  • @fmdj
    @fmdj 2 years ago

    Damn that was inspiring. I almost got the full demonstration :)

  • @abdessamad31649
    @abdessamad31649 4 years ago +1

    I love your content from Morocco!!! Keep it going.

  • @damienhenaux8359
    @damienhenaux8359 4 years ago +1

    I very much like this kind of video on mathematical papers. I would very much like a video like this one on Stéphane Mallat's paper "Group Invariant Scattering" (2012). And thank you very much for everything.

  • @woowooNeedsFaith
    @woowooNeedsFaith 4 years ago +2

    3:10 - How about giving a link to this conversation in the description box?

    • @Pheenoh
      @Pheenoh 4 years ago +2

      th-cam.com/video/y_RjsDHl5Y4/w-d-xo.html

    • @lirothen
      @lirothen 4 years ago +1

      I work with the Linux kernel, so I too was VERY lost when he referred to a kernel function in this different context. I was just about to ask too.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      I've added it to the description. Thanks for the suggestion.

  • @ashishvishwakarma5362
    @ashishvishwakarma5362 4 years ago +1

    Thanks for the explanation. Could you please also attach a link to the annotated paper in the description of every video? It would be a great help.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      The paper itself is already linked. If you want the annotations that I draw, you'll have to become a supporter on Patreon or SubscribeStar :)

  • @dawidlaszuk
    @dawidlaszuk 4 years ago +1

    It isn't surprising that all function approximators, including NNs and kernel methods, are equivalent, since they all... can approximate functions. However, the nice thing here is showing explicitly the connection between kernel methods and NNs, which allows easier knowledge transfer between the methods' domains.

    • @joelwillis2043
      @joelwillis2043 4 years ago

      All numbers are equivalent since they all... are numbers.

    • @Guztav1337
      @Guztav1337 4 years ago

      @@joelwillis2043 No. 4 ⋦ 5

    • @joelwillis2043
      @joelwillis2043 4 years ago

      @@Guztav1337 Now if only we had a concept of an equivalence relation we could formally use it on other objects instead of saying "equivalent" without much thought.

  • @JTMoustache
    @JTMoustache 4 years ago

    Ooh baby ! That was a good one ☝🏼

  • @hecao634
    @hecao634 4 years ago +2

    Hey Yannic, could you please zoom in and out more gently in future videos? I sometimes felt dizzy, especially when you derive formulas.

  • @anubhabghosh2028
    @anubhabghosh2028 3 years ago +1

    At around the 19-minute mark of the video, the way you describe the similarity between kernels and gradient descent leads me to believe that what this paper is claiming is that neural networks don't really "generalize" to the test data but rather compare similarities with samples they have already seen in the training data. This is perhaps a very bold claim by the authors, or probably I am misunderstanding the problem. What do you think?
    P.S. I really appreciate this kind of slightly theoretical paper review on deep learning in addition to your other content.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      Isn't that exactly what generalization is? You compare a new sample to things you've seen during training?

    • @anubhabghosh2028
      @anubhabghosh2028 3 years ago

      @@YannicKilcher Yes, intuitively I think so too. Like in the case of neural networks: we train a network, learn some weights and then use the trained weights to get predictions for unseen data. I think my confusion is about the explicit way they define this for these kernels, where they compare the sensitivity of the new data vector with that of every single data point in the training set 😅.

  • @Jack-sy6di
    @Jack-sy6di 6 months ago

    Unfortunately this doesn't actually show that gradient descent is equivalent to kernel learning, or anything like that. That is, if we think of gradient descent as a mapping from training sets onto hypothesis functions, this paper doesn't show that this mapping is the same as some other mapping called "kernel learning". It's not the case that there is some fixed kernel (depending on the NN's architecture and on the loss function) such that running GD with that NN is equivalent to running kernel learning with *that* kernel. Instead, the paper just shows that after training, we can retroactively construct a kernel that matches gradient descent's final hypothesis. This almost seems trivial to me. If you take *any* function, you could probably construct some kind of kernel so that that function would belong to the RKHS.
    To me the interest of a result like "deep networks are kernel machines" would be to reduce gradient learning to kernel learning, showing that the former (which is mysterious and hard to reason about) is equivalent to the latter (which is simple and easy to reason about, because of how it just finds a minimum-norm fit for the data). This paper definitely does not do that. Instead it shows that GD is equivalent to kernel learning *provided* we run a different algorithm first, one which finds a good kernel (the path kernel) based on the training data. But that "kernel finding" module seems just as hard to reason about as gradient descent itself, so really all we've done is push back the question of how GD works to the question of how that process works.

  • @jinlaizhang312
    @jinlaizhang312 4 years ago +1

    Can you explain the AAAI best paper 'Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting' ?

  • @moudar981
    @moudar981 4 years ago +1

    Thank you for the very nice explanation. What I did not get is the dL/dy. So L could be 0.5 (y_i - y_i*)^2, whose derivative is (y_i - y_i*). Does that mean that if a training sample x_i is correctly classified (so that the aforementioned term is zero), then it has no contribution to the formula? Isn't that counterintuitive? Thank you so much.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      A bit. But keep in mind that you integrate these dL/dy over the whole of training, so even if a sample is fit correctly at the end, it will still pull a new, similar sample in the direction of its label.
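      (Concretely, for L = 0.5 (y_i - y_i^*)^2 the derivative is L' = y_i - y_i^*, and the coefficient of training point i is, up to normalization, an average of this derivative along the whole path,

          a_i \propto -\int_{c(t)} \big( y_i(t) - y_i^* \big) \, K^g_{f,w(t)}(x, x_i) \, dt,

      so a sample that is fit correctly only at the end of training still contributes through the earlier part of the path, which is the point of the reply above.)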

  • @THEMithrandir09
    @THEMithrandir09 4 years ago +1

    I get that the new formalism is nice for future work, but isn't it intuitive that 'trainedmodel = initialmodel + gradients x learningrates'?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      true, but making the formal connection is sometimes pretty hard

    • @THEMithrandir09
      @THEMithrandir09 4 years ago +1

      @@YannicKilcher oh yes sure, this work is awesome, it reminds me of the REINFORCE paper. I just wondered why that intuition wasn't brought up explicitly. Maybe I missed it though, I didn't fully read the paper yet.
      Great video btw!

  • @pranavsreedhar1402
    @pranavsreedhar1402 4 years ago +1

    Thank you!

  • @arthdh5222
    @arthdh5222 4 years ago +1

    Hey Great video, what do you use for annotating on the pdf, also which software do you use for it? Thanks!

  • @WellPotential
    @WellPotential 4 years ago +1

    Great explanation of this paper! Any thoughts why they didn't include dropout? Seems like if they added a dropout term in the dynamical equation that the result wouldn't reduce down to a kernel machine anymore.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      maybe to keep it really simple without stochasticity

  • @andres_pq
    @andres_pq 3 years ago

    Is there any code demonstration?

  • @paulcurry8383
    @paulcurry8383 4 years ago +1

    I don't understand the claim that models don't learn "new representations". Do they mean that the model must use features of the training data (which I think is trivially true), or that the models store the training data without using any "features", just the effect of the points on the loss over GD? In the latter case it seems that models can be seen as doing this, but it's not well understood how they actually store a potentially infinitely sized Gram matrix. I'm also tangentially interested in how SGD fits into this.

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      Yeah, I also don't agree with the paper's claim on this point. I think it's just a dual view, i.e. extracting useful representations is the same as storing an appropriate superposition of the training data.

  • @G12GilbertProduction
    @G12GilbertProduction 4 years ago +1

    12:57 A hypersurface with an extensive tensor line? That looks so Fresnelian.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      sorry that's too high for me :D

  • @socratic-programmer
    @socratic-programmer 4 years ago

    Makes you wonder what effect something like skip connections or different network topologies has on this interpretation, or even something like the Transformer with its attention layers. Maybe attention allows the network to more easily match against similar things and rapidly delegate tokens to functions that have already learnt to 'solve' that type of token?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      entirely possible

    • @kyuucampanello8446
      @kyuucampanello8446 2 years ago

      Dot-product similarity with softmax is kind of 'equivalent' to a distance. So I guess it's similar to when we calculate the velocity gradient of a particle with the SPH method, applying a kernel function to the distances between neighbouring particles and multiplying by their velocities to build a velocity function of distance.
      In the attention mechanism, it might be a function of the tokens' values corresponding to their similarities.

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 4 years ago

    The same thing can be said about the brain itself: neurons just store a superposition of the training data (the input from the senses when we were infants, weighted against the evolutionary "weights" stored in the DNA). In everyday experience, whenever we see any object or motion, our neurons immediately compare that input to the weights they store, in a complex language of dozens of neurotransmitters and calcium ions, and try to find out what it best matches. The brain is a kernel machine. That doesn't diminish its power, neither the brain's nor the neural networks'. They memorize; that doesn't mean they're not intelligent. Intelligence is not magic, it IS essentially just memorizing input.

  • @faidon-stelioskoutsourelak464
    @faidon-stelioskoutsourelak464 2 years ago

    1) Despite the title, the paper never makes use, in any derivation, of the particulars of an NN and its functional form. Hence the result is not just applicable to NNs but to any differentiable model, e.g. a linear regression.
    2) What is most puzzling to me is the path dependence. I.e., if you run your loss gradient descent twice from two different starting points which nevertheless converge to the same optimum, the path kernels, i.e. the impact of each training data point on the predictions, would in general be different. The integrand in the expression for the path kernel involves dot products of gradients. I suspect that in the initial phase of training, i.e. when the model has not yet fit the data, these gradients would change quite rapidly (and quite randomly), and most probably the dot products would be near zero or cancel out. Probably only close to the optimum will these gradients and dot products stabilize and contribute the most to the path integral. This behavior should be even more pronounced in stochastic gradient descent (intuitively). The higher the dimension of the unknown parameters, the more probable it is that these dot products are zero, even close to the optimum, unless there is some actual underlying structure that is discoverable by the model.

  • @DamianReloaded
    @DamianReloaded 4 years ago

    Learning the most general features and learning to generalize well ought to be the same thing.

  • @veloenoir1507
    @veloenoir1507 3 years ago +1

    If you can make a connection between the path kernel and a resource-efficient, general architecture, this could be quite meaningful, no?

  • @Sal-imm
    @Sal-imm 4 years ago +2

    Mathematically limit definition of a function (for e.g.) and comes out of a new conclusion, that might be heuristic.

  • @guillaumewenzek4210
    @guillaumewenzek4210 4 years ago +1

    I'm not fond of the conclusion. The NN at inference doesn't have access to all the historical weights, and runs very differently from their kernel. For me, 'NN is a kernel machine' would imply that K only depends on the final weights. OTOH I have no issue with a_i being computed from all the historical weights.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      You're correct, of course. This is not practical, but merely a theoretical connection.

    • @guillaumewenzek4210
      @guillaumewenzek4210 4 years ago

      Stupid metaphor: "Oil is a dinosaur". Yes there is a process that converts dinosaur into oil, yet they have very different properties. Can you transfer the properties/intuitions of a Gaussian kernel to this path kernel?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Sure, both are a way to measure distances between data points

  • @eliasallegaert2582
    @eliasallegaert2582 4 years ago

    This paper is from the author of "The Master Algorithm", where the big picture of multiple machine learning techniques is explored. Very interesting! Thanks, Yannic, for the great explanation!

  • @twobob
    @twobob 2 years ago

    Thanks. That was helpful

  • @veedrac
    @veedrac 4 years ago +1

    Are you aware of, or going to cover, Feature Learning in Infinite-Width Neural Networks?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      I'm aware of it, but I'm not an expert. Let's see if there's anything interesting there.

  • @mathematicalninja2756
    @mathematicalninja2756 4 years ago +13

    Next paper: Extracting MNIST data from its trained model

    • @Daniel-ih4zh
      @Daniel-ih4zh 4 years ago +2

      Link?

    • @salim.miloudi
      @salim.miloudi 4 years ago

      Isn't it what visualizing activation maps does

    • @vaseline.555
      @vaseline.555 4 years ago +1

      Possibly Inversion attack? Deep leakage from gradients?

  • @albertwang5974
    @albertwang5974 4 years ago +3

    A paper: one plus one is a merge of one and one!

  • @kazz811
    @kazz811 4 years ago +1

    Nice review. Probably not a useful perspective though. SGD is obviously critical (ignoring variants like momentum and Adam, which incorporate path history), but you could potentially extend this using the path integral formulation (popularized in quantum mechanics, though it applies in many other places) by constructing it as an ensemble over paths for each mini-batch procedure, with the loss function replacing the Lagrangian in physics. The math won't be easy, and it likely needs someone with a higher level of skill than Pedro to figure that out.

    • @diegofcm6201
      @diegofcm6201 4 years ago

      I thought the exact same thing. Even more so when it talks about a "superposition of training data weighted by the path kernel". Reminds me a lot of wave functions.

    • @diegofcm6201
      @diegofcm6201 4 years ago

      It also looks like something from calculus of variations: 2 points (w0 and wf) connected by a curve that’s trying to optimize something

  • @michaelwangCH
    @michaelwangCH 4 years ago +2

    Deep learning is a further step in the evolution of ML. Kernel methods have been known since the early 60s.
    No surprise at all.

    • @michaelwangCH
      @michaelwangCH 4 years ago

      Hi, Yannic.
      Thanks for the explanation, looking forward to your next video.

  • @MachineLearningStreetTalk
    @MachineLearningStreetTalk 4 years ago +3

    Does this mean we need to cancel DNNs?

    • @herp_derpingson
      @herp_derpingson 4 years ago +1

      ...or Kernel Machines. Same thing.

    • @diegofcm6201
      @diegofcm6201 4 years ago +1

      *are approximately

    • @YannicKilcher
      @YannicKilcher  4 years ago +3

      Next up: Transformers are actually just linear regression.

  • @sergiomanuel2206
    @sergiomanuel2206 4 years ago +1

    What happens if we do a training of just one step? It could be started from the last step of an existing training run (the theorem doesn't require random weights at the start). In this case we don't need to store the whole training path 😎

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Yes, but the kernel must still be constructed from the whole gradient-descent path, which in your case includes the path used to obtain the initialization.

    • @sergiomanuel2206
      @sergiomanuel2206 4 years ago

      @@YannicKilcher First: wonderful videos, thank you!!! Second: correct me if I am wrong. The theorem doesn't tell us anything about the initialization weights. I am thinking about one-step training with w0 obtained from a previous training run. If we do one step of gradient descent using the whole dataset, there is just one optimal path in the direction dw = -lr*dL/dw, and this training will lead us to w1. Using w0 and w1 we can build the kernel and evaluate it. I think it is correct because all the information about the training is in the last step (using the NN we can make predictions with just the last weights, w1).

    • @YannicKilcher
      @YannicKilcher  4 years ago

      @@sergiomanuel2206 sounds reasonable. W0 also appears in the Kernel
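      (A discrete way to see the caveat in this thread: in practice the path kernel is approximated by a sum over the steps actually taken,

          K^p(x, x_i) \approx \sum_{t=0}^{T-1} \nabla_w f(x, w_t) \cdot \nabla_w f(x_i, w_t),

      so a single step starting from a pretrained w_0 contributes only one term; the kernel for the overall model would still include the steps that produced w_0, which is why w_0's history matters.)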

  • @hoaxuan7074
    @hoaxuan7074 4 years ago +1

    With really small nets you can hope to more or less fully explore the solution space, say using evolutionary algorithms. There are many examples on YT. In small animals with a few hundred neurons you do see many specialized neurons with specific functions. In larger nets I don't think there is any training algorithm that can actually search the space of solutions to find any good solution. Just not possible, except in a small sub-space of statistical solutions where each neuron responds to the general statistics of the neurons in the prior layer, each neuron being a filter of sorts. I'm not sure why I feel that sub-space is easier to search through.
    An advantage would be good generalization and avoiding many brittle over-fitted solutions that presumably exist in the full solution space. A disadvantage would be the failure to find short, compact, logical solutions that generalize well, should they exist.

  • @Usernotknown21
    @Usernotknown21 4 years ago +2

    When you're broke with no money, the last thing you want to hear is "theoretically". Especially "theoretically" and "paper" in the same sentence.

  • @drdca8263
    @drdca8263 4 years ago +1

    The derivation of this seems nice, but, maybe this is just because I don't have any intuition for kernel machines, but I don't get the interpretation of this?
    I should emphasize that I haven't studied machine learning stuff in any actual depth, have only taken 1 class on it, so I don't know what I'm talking about
    If the point of kernel machines is to have a sum over i of (something that doesn't depend on x, only on i) * (something that depends on x and x_i) ,
    then why should a sum over i of (something that depends on x and i) * (something that depends on x and x_i)
    be considered, all that similar?
    The way it is phrased in remark 2 seems to fit it better, and I don't know why they didn't just give that as the main way of expressing it?
    Maybe I'm being too literal.
    edit: Ok, upon thinking about it more, and reading more, I think I see more of the connection, maybe.
    the "kernel trick" involves mapping some set to some vector space, and then doing inner products there, and for that reason, taking the inner product that naturally appears here, and wanting to relate it to kernel stuff, seems reasonable.
    And so defining the K^p_{f,c}(x,x_i) , seems also probably reasonable.
    (Uh, does this still behave like an inner product? I think it does?
    Yeah, uh, if we send x to the function that takes in t and returns the gradient of f(x,w(t)) with respect to w, (where f is the thing where y = f(x,w) ),
    that is sending x to a vector space, specifically, vector valued functions on the interval of numbers t comes from,
    and if we define the inner product of two such vectors as being the integral over t of the inner product of the value at t,
    This will be bilinear, and symmetric, and anything with itself should have positive values whenever the vector isn't 0, so ok yes it should be an inner product.
    So, it does seem (to me, who, remember, I don't know what I'm talking about when it comes to machine learning) like this gives a sensible thing to potentially use as a kernel.
    How unfortunate then, that it doesn't satisfy the definition they gave at the start!
    Maybe there's some condition in which the K^g_{f,w(t)}(x,x_i) and the L'(y^*_i, y_i) can be shown to be approximately uncorrelated over time, so that the first term in this would approximately not depend on x, and so it would approximately conform to the definition they gave?
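    (Writing out the construction sketched above: map x to the curve of gradients along training,

        \Phi(x) : t \mapsto \nabla_w f(x, w(t)), \qquad
        \langle \Phi(x), \Phi(x') \rangle = \int \nabla_w f(x, w(t)) \cdot \nabla_w f(x', w(t)) \, dt = K^p_{f,c}(x, x'),

    which is bilinear, symmetric and positive semidefinite, so the path kernel is a legitimate kernel. The sticking point, as noted, is that the a_i coefficients in the theorem also depend on x, so the expression does not match the standard kernel-machine definition directly.)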

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      Your thoughts are very good. Maybe you want to check out some of the literature on the neural tangent kernel, because that goes pretty much in this direction!

  • @jeremydy3340
    @jeremydy3340 4 years ago

    Don't kernel methods usually take the form of a weighted average of examples, sum(a_i * y_i * K)? The method given here is quite different, and depends on the labels largely implicitly, via how they change the path c(t) through weight space. It isn't clear to me that y(x') is at all similar to y_i, even if K(x', x_i) is large. And the implicit dependence through c(t) on all examples means (x_i, y_i) may be extremely important even if K(x', x_i) is small.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      it depends on the output, yes, but not on the actual labels. and yes, the kernel is complicated by construction, because it needs to connect gradient descent, so the learning path and the gradient are necessarily in there

  • @vertonical
    @vertonical 4 years ago +2

    Yannic Kilcher is a kernel machine.

  • @djfl58mdlwqlf
    @djfl58mdlwqlf 3 years ago

    1:31 you made me worse by letting me into these kind of topics... your explanation is brilliant..... plz keep me enlightened until I die....

  • @dhanushka5
    @dhanushka5 1 year ago

    Thanks

  • @NeoKailthas
    @NeoKailthas 4 years ago +3

    Now someone prove that humans are kernel machines

  • @frankd1156
    @frankd1156 4 years ago +3

    Wow... my head got a little hot lol

  • @erfantaghvaei3952
    @erfantaghvaei3952 3 years ago +1

    Pedro is a cool guy; it's sad to see the hate on him for opposing the distortions surrounding datasets.

  • @mrjamess5659
    @mrjamess5659 3 years ago

    I'm having multiple enlightenments while watching this video.

  • @paulcarra8275
    @paulcarra8275 3 years ago

    Hi. First, the author gives a presentation here: "th-cam.com/video/m3b0qEQHlUs/w-d-xo.html&lc=UgwjZHYH9cRyuGmD6e14AaABAg". Second (repeating a comment made on that presentation), I was wondering why we should go through the whole learning procedure and not instead start at the penultimate step with the corresponding b and w's; wouldn't that save almost all the computational time? I mean, if the goal is not to learn the DNN but to get an additive representation of it (ignoring the nonlinear transform "g" of the kernel machine). Regards

  • @albertwang5974
    @albertwang5974 4 years ago

    I cannot understand how this kind of topic can be a paper.

  • @raunaquepatra3966
    @raunaquepatra3966 4 years ago +2

    Kernel is all you need 🤨

  • @IoannisNousias
    @IoannisNousias 4 years ago

    What an unfortunate choice of variable names. Every time I heard “ai is the average...” it threw me off. Too meta.

  • @willkrummeck
    @willkrummeck 4 years ago +1

    Why do you have Parler?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      yea it's kinda pointless now isn't it :D

  • @shuangbiaogou437
    @shuangbiaogou437 4 years ago

    I knew this 8 years ago, and I mathematically proved that a perceptron is just the dual form of a linear kernel machine. An MLP is just a linear kernel machine with its input transformed.
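    (The dual form being referred to: a perceptron trained with the classic mistake-driven updates ends up with w = \sum_i \alpha_i y_i x_i, where \alpha_i counts the mistakes made on example i, so its decision function is

        f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i \langle x_i, x \rangle + b \Big),

    i.e. a kernel machine with the linear kernel; viewing an MLP as the same machine applied to a transformed input is the analogy this comment is drawing.)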

  • @az8134
    @az8134 4 years ago +1

    I thought we all knew that...

  • @dontthinkastronaut
    @dontthinkastronaut 4 years ago

    hi there

  • @marouanemaachou7875
    @marouanemaachou7875 4 years ago +1

    First

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 4 years ago +1

    I'm just afraid this will lead us BACK into kernel machines and programming everything by hand, resulting in much more robotic, calculator-esque models, not the AI models that we have. It's better to keep it in the black box. If you look inside you'll jinx it and the magic will die, and we'll just have dumb calculator robots again.

  • @conduit242
    @conduit242 3 years ago

    "You just feed in the training data" blah blah, the great lie of deep learning. The reality is that 'encodings' hide a great deal of sophistication, just like compression ensemble models. Let's see a transformer take a raw binary sequence and at least match zpaq-5 on the Hutter Prize 🤷🏻‍♂️ Choosing periodic encodings, stride models, etc. is all the same. All these methods, including compressors, are theoretically compromised.

  • @sphereron
    @sphereron 3 years ago

    Dude, if you're reviewing Pedro Domingos's paper despite his reputation as a racist and sexist, why not use your platform to raise awareness of Timnit Gebru's work "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" Otherwise your bias is very obvious here.

  • @getowtofheyah3161
    @getowtofheyah3161 4 years ago

    So boring who freakin’ cares