Why do this by masking and unmasking whole tokens or words? Why not pretrain some kind of latent space for each token/word and then do the diffusion in the latent space? Then the diffusion becomes much simpler. Of course you still need to convert from the latent space into the best token/word after that, but that should be relatively straightforward as well.
Useful latent spaces for text have remained an extremely challenging problem for years, starting from the fact that no one ever really got text VAEs or GANs to work. We have nice ways of mapping text to embeddings for tasks like retrieval, but those spaces don't have the kind of regularized structure that would let something like Gaussian diffusion work well. I certainly agree that if it were easy to do this, it would make sense to run standard diffusion.
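For concreteness, here is a minimal sketch (not from the paper or this thread) of what "Gaussian diffusion in a pretrained latent space" would involve; `encode` is a hypothetical sentence encoder, and the hard, unstated part is the final step of mapping the denoised latent back to tokens.

```python
# Minimal sketch of the forward (noising) step Gaussian diffusion in a text
# latent space would rely on. `encode` is a hypothetical pretrained encoder
# mapping a sentence to a fixed-size latent vector; the point of the reply
# above is that real text embedding spaces are not regular enough for the
# reverse (denoising) process to recover coherent text.
import torch

def forward_noise(z0: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """Standard DDPM forward step: z_t = sqrt(a_bar)*z_0 + sqrt(1-a_bar)*eps."""
    eps = torch.randn_like(z0)
    return alpha_bar_t ** 0.5 * z0 + (1 - alpha_bar_t) ** 0.5 * eps

# z0 = encode("some sentence")             # hypothetical pretrained latent
# zt = forward_noise(z0, alpha_bar_t=0.5)  # partially noised latent
# A denoiser would then predict z0 from zt, and a decoder would have to map
# the final latent back to tokens -- the step that turns out to be hard.
```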
Making this process discrete seems very strange to me. Why not noise the token embeddings themselves? E.g., at pure noise a given token embedding is a blend of the embeddings of all tokens, and at zero noise it is a one-hot vector as normal. As you run the diffusion you can update this token-probability space, since you have the logits.
After n inference steps you'll probably end up with positions that don't converge to a single token but instead map to some subset of tokens that are all roughly equivalent in semantic space, so you can just sample from that distribution based on the final logits. Tokens you've already generated will be one-hot; noised tokens will be blended as described.
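A toy sketch of the blending scheme described above, with a made-up vocabulary size, embedding matrix, and a placeholder model call; it only illustrates the interpolation between one-hot distributions and a noise prior, not any published method.

```python
# Each position carries a probability distribution over the vocabulary, the
# input embedding is the expected embedding under that distribution, and
# denoising sharpens the distribution toward one-hot. Names are illustrative.
import torch
import torch.nn.functional as F

vocab_size, dim, seq_len = 1000, 64, 8
embedding = torch.randn(vocab_size, dim)  # token embedding matrix

def blend(probs: torch.Tensor) -> torch.Tensor:
    """Expected embedding under per-position distributions: (seq, V) -> (seq, dim)."""
    return probs @ embedding

def noise_step(probs: torch.Tensor, noise_level: float, prior: torch.Tensor) -> torch.Tensor:
    """Interpolate each position's distribution toward a noise prior."""
    return (1 - noise_level) * probs + noise_level * prior

# Pure noise: every position is pushed to the prior (uniform here; see the
# follow-up comment for using a frequency-weighted prior instead).
prior = torch.full((vocab_size,), 1.0 / vocab_size)
one_hot = F.one_hot(torch.randint(vocab_size, (seq_len,)), vocab_size).float()
probs = noise_step(one_hot, noise_level=1.0, prior=prior)

# One "denoising" update: a model (omitted) maps blended embeddings to logits,
# and the per-position distribution is replaced with softmax(logits).
# logits = model(blend(probs))
# probs = F.softmax(logits, dim=-1)

# Final step: sample a concrete token per position from the last distribution.
# tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
```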
Surprisingly, many people have tried this, with lots of fancy approaches. So far nothing is really close to autoregressive models. Many things go wrong, but the last step in particular, mapping back to specific words, seems to be tricky.
@srush_nlp Hmm... that is very surprising to me. The mapping seems like the most straightforward part. Do you know what this technique is called in academia / if there are any papers published on this idea?
Also, after thinking about it, I'm pretty sure you don't want to uniformly smear across all tokens, but instead make the embedding the average unconditional embedding (e.g. count all tokens in your dataset, and make the pure noise regime equal to the average token).
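A small sketch of that alternative, assuming a hypothetical `token_counts` tensor of corpus frequencies: the fully noised embedding becomes the frequency-weighted (unigram) average of the token embeddings rather than a uniform smear.

```python
# Instead of a uniform noise prior, use corpus token frequencies, so the
# pure-noise embedding is the average unconditional token embedding.
# `token_counts` is a hypothetical (vocab_size,) tensor of corpus counts.
import torch

def unigram_prior(token_counts: torch.Tensor) -> torch.Tensor:
    """Empirical token distribution p(v) from corpus counts."""
    return token_counts.float() / token_counts.sum()

def average_token_embedding(token_counts: torch.Tensor, embedding: torch.Tensor) -> torch.Tensor:
    """Pure-noise embedding = sum_v p(v) * E[v], the average unconditional embedding."""
    return unigram_prior(token_counts) @ embedding
```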
I made a retrieval-based chatbot from scratch (I'm not a professional), and the main component was compressing the vocabulary with synonyms and training the model on the compressed vocabulary so it groks faster. I have a feeling that approach would allow for very small and intelligent models. What do you think about compressing the vocabulary?
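For illustration only, a rough sketch of the vocabulary-compression idea with a hypothetical hand-written synonym table; how the synonym groups are actually built (WordNet, embedding clustering, ...) is the real question.

```python
# Map each word to a canonical representative of its synonym group before
# training, shrinking the effective vocabulary. `synonym_groups` is a
# hypothetical, hand-written table used purely for illustration.
synonym_groups = {
    "large": ["big", "huge", "large"],
    "small": ["little", "tiny", "small"],
}
canonical = {word: head for head, words in synonym_groups.items() for word in words}

def compress(tokens: list[str]) -> list[str]:
    """Replace every token with its canonical synonym, if it has one."""
    return [canonical.get(t, t) for t in tokens]

# compress(["a", "huge", "tiny", "dog"]) -> ["a", "large", "small", "dog"]
```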
Really cool stuff! It’s a shame it’s not quite at the level of autoregressive models (especially for DNA), but I’m excited about future work in the field. Love the explanation; it made reading the paper much more digestible.
Did you guys just recently read the original BERT paper, add random masking and a few repeats, and call it done?
Yeah! Just read it over the summer. Good paper.
Really good video. I have to improve on my math, but I get the general idea. Will try to implement it.
@srush_nlp Great explanation! How do you think discrete diffusion models should be modified to enable long context sequence generation comparable to LLMs?
See 4.2 in the paper! It talks about how to use MDLM for autoregressive modeling, which results in text of arbitrary length
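As a rough illustration of how block-wise generation can yield arbitrary-length text (not the paper's exact algorithm or API), with `sample_block` standing in for the masked-diffusion sampling loop:

```python
# Repeatedly run the masked-diffusion sampler on a new block of masked
# positions, conditioned on everything generated so far. `sample_block` is a
# placeholder for the model's denoising loop, not the paper's interface.
MASK = -1    # placeholder mask token id
BLOCK = 128  # tokens generated per block

def generate(sample_block, prompt: list[int], num_blocks: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(num_blocks):
        block = [MASK] * BLOCK
        # sample_block unmasks the new block over several diffusion steps,
        # conditioned on the already-generated context, and returns token ids.
        tokens += sample_block(context=tokens, block=block)
    return tokens
```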
Really liked it! This could work better on ARC.