OUTLINE:
0:00 - Intro & Overview
4:25 - Sponsor: Weights & Biases
6:15 - Problem Setup & Contributions
8:50 - Recap: Straight-Through Estimator
13:25 - Encoding the discrete problem as an inner product
19:45 - From algorithm to distribution
23:15 - Substituting the gradient
26:50 - Defining a target distribution
38:30 - Approximating marginals via perturb-and-MAP
45:10 - Entire algorithm recap
56:45 - Github Page & Example
Paper: arxiv.org/abs/2106.01798
Code (TF): github.com/nec-research/tf-imle
Code (Torch): github.com/uclnlp/torch-imle
Our Discord: discord.gg/4H8xxDF
This is the type of paper I would normally gloss over and skip. Thanks for taking the time to explain it!
awesome explanation with multiple careful recaps, thank you!!
I check your channel for new videos every day. Keep it up!
Hahaha so many layers 🤣
Thanks for yet another explaino, Yannic. 🤲
i am confused >
Because of how the problem is encoded as an inner product between z and theta
@YannicKilcher Ah ok, in fact this is very clearly explained, but I had to watch it a second time because I am dumb. Very clever!
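To make the inner-product encoding concrete, here is a minimal NumPy sketch with a hypothetical toy graph of four edges (all numbers made up): the cost of a discrete solution z is just a dot product with the edge weights theta, so the solver can be viewed as an argmin of <z, theta> over feasible paths.

```python
import numpy as np

# Hypothetical toy graph with 4 directed edges; theta holds the edge costs
# predicted by the first network, z is a binary indicator vector marking
# which edges the solver (e.g. Dijkstra) selected for the shortest path.
theta = np.array([1.5, 0.2, 0.7, 2.0])   # edge costs (problem statement)
z     = np.array([0.0, 1.0, 1.0, 0.0])   # chosen edges (discrete solution)

# The cost of the chosen path is just the inner product <z, theta>,
# which is why the solver can be viewed as  argmin_z <z, theta>
# over the feasible set of paths.
path_cost = np.dot(z, theta)
print(path_cost)  # 0.9
```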
Here's the other video about the paper :) th-cam.com/video/hb2b0K2PTxI/w-d-xo.html
I don't fully get the gist here, but if the middle is an explicit algorithm, wouldn't that need a well-defined input? That would mean the output of the first net is a well-defined feature set, so the first net is defined and static and wouldn't need to be part of a full backprop chain. Or is the middle part somehow also a general learning algo compatible with backprop??
That's the point: the middle part is a general algorithm, but you can still use backpropagation
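As a rough illustration of "a general algorithm in the middle, but backprop still flows", here is a minimal PyTorch sketch of the straight-through estimator that the video recaps around 8:50; the one-hot "cheapest edge" solver and the tensor shapes are hypothetical stand-ins, not the paper's actual code.

```python
import torch

def blackbox_solver(theta: torch.Tensor) -> torch.Tensor:
    # Stand-in for a discrete solver (e.g. Dijkstra): pick the single
    # cheapest edge. No gradient is defined for this operation.
    z = torch.zeros_like(theta)
    z[theta.argmin()] = 1.0
    return z

class StraightThroughSolver(torch.autograd.Function):
    """Straight-through baseline: run the discrete solver in the forward
    pass, pretend it was the identity in the backward pass so the gradient
    still reaches the first network."""

    @staticmethod
    def forward(ctx, theta):
        return blackbox_solver(theta)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # pass the incoming gradient through unchanged

theta = torch.tensor([1.5, 0.2, 0.7], requires_grad=True)
z = StraightThroughSolver.apply(theta)
loss = (z - torch.tensor([0.0, 0.0, 1.0])).pow(2).sum()
loss.backward()
print(theta.grad)  # a gradient arrives at theta despite the discrete solver
```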
I could be wrong, but it sounds to me like they use an analytical perturbative method to get around the fact that a distribution divides the two neural networks.
Oh man, what a trip. It's pretty sneaky of them, all just to avoid computing the gradient of the solver.
what happened to the constraints C during MAP?
Makes sense so far.
I know it's an example, but it is unclear to me why you would need a second neural network after the shortest path to get the size of the path. This whole approach seems to depend on the fact that we infer new information from the solution. But without that 2nd network, it is impossible to improve the first graph generation network.
Hm, would it be possible to use that method without the last layers? Without the last layers, MLE( - label ) is the loss, but then you're stuck right after the discrete part.
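On the question of dropping the last layers: if the loss sits directly on the solver output, its gradient with respect to z can be written down by hand, and that gradient is exactly the quantity that gets subtracted from theta in the target-distribution step discussed below. A hedged NumPy sketch with made-up vectors:

```python
import numpy as np

# Hypothetical case: the loss sits directly on the solver output z,
# e.g. a squared / Hamming-style loss against the gold path z_label.
z       = np.array([0.0, 1.0, 1.0, 0.0])  # solver output
z_label = np.array([0.0, 1.0, 0.0, 1.0])  # ground-truth path

loss   = np.sum((z - z_label) ** 2)
grad_z = 2.0 * (z - z_label)  # gradient w.r.t. z, no second network needed
# grad_z is what would be subtracted from theta in the target-distribution trick.
print(loss, grad_z)
```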
Pretty good video :)
Basically the takeaway is that if we formulate a problem as a dot product of the solution and the problem statement, then we can bypass taking gradients of the discrete solver, because we know that the gradient w.r.t. the output of the solver can be subtracted from the input of the solver and it would still converge.
.
So we are perturbing the problem before running it through Dijkstra. Then we do the subtract-the-z-gradient-from-theta trick to get a new problem, which should be solved more easily with Dijkstra. Then we run this new problem through Dijkstra again and look at the difference between the solutions of the two problems. The trick is then to say that the difference (gradient) in the solutions of the problem is actually the same as the difference (gradient) in the problem definitions.
.
My intuition fails me. I think this would only work for a very small subset of problems.
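To make the "subtract the gradient, run the solver again, take the difference of solutions" description concrete, here is a hedged NumPy sketch of that gradient substitution. The one-hot MAP solver, the value of lambda, and all numbers are hypothetical stand-ins for the real combinatorial solver, and this reflects a reading of the comment above rather than the authors' exact code.

```python
import numpy as np

def map_solver(theta):
    # Stand-in for the MAP solver: returns the binary indicator z that
    # maximizes <z, theta> (here: pick the single highest-scoring component).
    # For a shortest-path cost minimization the scores would be negated costs.
    z = np.zeros_like(theta)
    z[np.argmax(theta)] = 1.0
    return z

def gradient_substitution(theta, grad_z, lam=1.0):
    """Approximate dL/dtheta without differentiating the solver:
    1) solve the original problem,
    2) subtract the z-gradient (scaled by lambda) from theta to get a
       perturbed 'target' problem,
    3) solve again and take the difference of the two discrete solutions."""
    z        = map_solver(theta)
    z_target = map_solver(theta - lam * grad_z)  # the subtract-the-gradient trick
    return (z - z_target) / lam                  # used in place of dL/dtheta

theta  = np.array([1.5, 0.2, 0.7])
grad_z = np.array([1.0, 0.0, -0.5])  # pretend gradient coming from the loss
print(gradient_substitution(theta, grad_z))  # pushes theta toward the target solution
```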
36:00 I am also wondering how it is guaranteed that the target distribution created by subtracting the z gradient from theta has, in expectation, a lower loss than the original distribution. This is totally dependent on the black-box algorithm, isn't it? Shortest paths could be totally different with tiny changes in theta. And if it were guaranteed, why not use the z gradient directly on theta and skip all the target-distribution stuff?
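For the perturb-and-MAP marginal approximation mentioned in the outline, here is a hedged sketch of the general idea: add random noise to theta and solve the perturbed MAP problem repeatedly, so the average of the discrete solutions approximates the marginals. Plain Gumbel noise and the toy argmax solver are simplifications chosen here for illustration; the paper's actual noise distribution may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def map_solver(theta):
    # Stand-in MAP solver: one-hot argmax of <z, theta>.
    z = np.zeros_like(theta)
    z[np.argmax(theta)] = 1.0
    return z

def perturb_and_map_marginals(theta, num_samples=1000, scale=1.0):
    """Approximate the marginals of the discrete distribution by adding
    random noise to theta and solving the perturbed MAP problem many times,
    then averaging the resulting one-hot solutions."""
    samples = [map_solver(theta + rng.gumbel(scale=scale, size=theta.shape))
               for _ in range(num_samples)]
    return np.mean(samples, axis=0)

theta = np.array([1.5, 0.2, 0.7])
print(perturb_and_map_marginals(theta))  # higher-theta components appear more often
```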
My back prop blew up around 40:00 😂
Mixed-MLP in the reverse looks like this: 36:53. :D
Hi Yannic, let's have some papers from the video modality.
🖐
I might be totally wrong, but I think generally people can't/won't watch a 40 minute video every 3 days.
I didn't think so either when I started this, but here we are 🤷🏽♀️
@@YannicKilcher Acquired audience
Possibly! But you and other subscribers are not like other people :)
I legit love these though
Such papers are very complicated and, without a deep mathematical understanding, almost impossible to follow; good explanations are therefore necessary.