Hi community, it is only natural that when we encounter a new method, we search for the nearest "known method" or "explanation" for it. Why should we learn something new when we have already learned something simpler? The same happens here: "Is this not simple model distillation?"
Well, while both methods aim to replicate the behavior of an existing (Large Language) model, good old "model distillation" focuses on compressing a known model into a smaller one using supervised learning techniques, leveraging full access to the teacher model's outputs and possibly additional internal information.
In contrast, the method I've presented in this video is about reverse-engineering an unknown model's distribution through interactive queries, employing advanced mathematical techniques to construct a compact approximation without direct access to the model's structure, parameters, training data or internally learned sequences.
I know it is more complex, and if you think "it is about the same" ... that is absolutely okay with me. But please don't post as if you were a professor of mathematics with complete understanding, declaring "it is absolutely the same" ... you might misinform others.
However, if you are asking for a comparison or for further clarification, why not read the arXiv pre-print from MIT for a more detailed understanding? You might discover new ideas ...
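For anyone who finds the distinction easier to see in code, here is a minimal, purely illustrative sketch in Python/PyTorch: classic distillation assumes you can read the teacher's full output distribution, while the black-box setting only lets you query the unknown model for conditional next-token probabilities over prefixes you choose. The function names and the `api_call` stand-in are my own placeholders, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Classic knowledge distillation: we SEE the teacher's full logits."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled as in Hinton et al. (2015)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

def black_box_conditional_query(api_call, prefix_tokens):
    """Black-box setting: we can only ASK the unknown model for its next-token
    distribution given a prefix we chose; no weights, no training data.
    `api_call` is a stand-in for whatever query interface is available."""
    return api_call(prefix_tokens)  # probability vector over the vocabulary

# Toy usage of the distillation case: two random "models" over 10 tokens.
student_logits = torch.randn(4, 10)   # student logits for a batch of 4 positions
teacher_logits = torch.randn(4, 10)   # teacher logits (full access!)
print(distillation_loss(student_logits, teacher_logits).item())
```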
I'm just going to put this out there: this is exactly how we develop a Theory of Mind. Replace "steal" with "understand" and "LLM" with "person", and you get something intimately familiar to any human being.
When you go really hard on predicting the next token, you end up building something that forms a near-perfect model of the representations behind that token.
I'm disappointed that the paper didn't present actual real-world copies. If they have a method, they should have tested it in the real world and shown that it works.
Don't be disappointed. You can build it yourself and even optimize it.
Good training material for my rescue attempt of Wintermute.
Best of luck!
Oh no, our collective "private" data that models were trained on without permission might be revealed... and that would be evidence of an obvious corporate crime. 👩⚖
The problems I have with this paper:
1) There's nothing here to clarify the scaling vs. variance. No indication of how bad "polynomial" can get, nor of how few the queries are relative to the size of a model: polynomial on a low-rank representation of a 1E10-parameter LLM. Suppose rank = 3000, as seen in some LLMs; then we have two 1E5 x 3000 = 3E8-element matrices, or 6E8 elements in total. If the "polynomial" is quadratic, this is perhaps 12E16 queries. While maybe possible, it's not affordable. If linear, it's still an expensive proposition even if the number of queries is "only" 100:1. (See the rough calculation written out after this list.)
2) No experimental results; I expect this is because of item 1) above: it's not practical.
3) No codebase.
It's a nice mathematical paper, but without some indicator of scaling vs. variance it feels like it may be "efficient" relative to other methods yet still utterly impractical. I hope the authors follow this up with clarification, or with experimental results and code. The same comments apply to the earlier paper they draw on, which estimates low-rank HMMs by conditional probes.
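To make the affordability point concrete, here is the rough estimate from point 1) written out as a few lines of Python. The vocabulary size, rank, and the linear/quadratic exponents are the commenter's assumptions (and my reading of them), not bounds taken from the paper.

```python
# Back-of-envelope query-count estimate under the assumptions above.
vocab_size = 100_000                 # ~1E5 tokens
rank = 3_000                         # low-rank dimension assumed for the LLM
elements = 2 * vocab_size * rank     # two vocab-by-rank factor matrices: 6E8 entries

queries_linear = 100 * elements      # "only" 100 queries per element
queries_quadratic = elements ** 2    # if the polynomial is quadratic

print(f"matrix entries to recover: {elements:.1e}")            # 6.0e+08
print(f"queries if linear (100:1): {queries_linear:.1e}")      # 6.0e+10
print(f"queries if quadratic:      {queries_quadratic:.1e}")   # 3.6e+17
```

Either way, the quadratic case lands in the 1E17 range, the same order of impracticality as the "perhaps 12E16" figure above, unless the paper's constants turn out to be far smaller.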
As I mentioned in my video, this method is not for the average AI user with average interests sitting at home with a single consumer GPU. Thank you for raising the classical arguments about scaling and affordability, but as I mentioned in my other video, META is now scaling its Llama models for non-civilian purposes; this method is not really aimed at the "affordability" crowd ....
@code4AI The issue I have is that I'm not even sure Google can afford it: there are no data or bounds.
To obfuscate this would interfere with explainability. The problem is the idea of proprietary foundation / general knowledge models. I'm supportive of specialized SLMs trained on proprietary data and the need for businesses to have data privacy to maintain marketplace competitiveness. But perhaps general knowledge LLMs need to be public domain.
Why not try asking o1 or Claude 3.5 about that last theorem? Given that you can ground the discussion in the paper, you may have that second brain within your reach. It's interesting that they cast it in terms of model stealing when it seems this could work as distillation in general. (Perhaps that was the original goal, and this black-box case fell out as an interesting idea?)
Smile. No chance with current AI. Maybe in 2 to 5 years?
I'm very curious about which prompts you use to simplify the papers while maintaining the formulas.
Exactly. 🧙♀️
You guys do not get the idea here. This gives the various AIs a way to communicate with each other. How long do you think it will take before those various models become one overall general AI that is no longer under human control?
That's never going to happen. These are mathematical models; they don't have a will.
Is this different from teacher-student "distillation"?
Great question. I pinned a reply to your question to the top of the comments, since multiple subscribers asked the same question.
Isn't it the same as knowledge distillation and model pruning?
Great question. I pinned a reply to your question to the top of the comments, since multiple subscribers asked the same question.
Great job. Unfortunately, there is no experimental proof. I believe you can mimic the LLM on a specific task, but stealing all of its abilities will be very difficult.
If there is a mathematical proof that is valid (and that I can understand), I need no experimental proof. It is not about a single specific task; it is about the complete model, since we test the complete mathematical space of all representative sequences.
@code4AI Just because it is theoretically possible doesn't mean modern LLMs are sufficiently low-rank for this "stealing" to be practical. Theoretically, the large integers used in public-key crypto can be factored, but that is not a practical means of attack. Experimental results would provide a calibration of expectations. It's possible that, while theoretically possible, it could cost more to "steal" a model than to train a new one from scratch.
@egoincarnate Do you have any facts that would support your statement that "it would cost more to 'steal' a model than to train a new one from scratch"? Since this idea was just published yesterday, you can't have empirical data.
@code4AI Sorry, that should have been "could", not "would". Corrected.
@code4AI I see, thank you. I have not read the paper; I just scrolled through to see the end. The story was quite exciting and ended like an open question.
This sounds like a distillation technique... It could be used for stripping out isolated or dormant areas of a model... but meh.
Great question. I pinned a reply to your question to the top of the comments, since multiple subscribers asked the same question.