Cool stuff! Need to see if I can get some lower-compute variation up and running. BTW, would appreciate a longer video on the paper. Thanks!
Thanks, sure I'll try to do that in between my other videos :)
I'd LOVE a more detailed video!
Let's see how many viewers vote for this :)
@analyticsCamp Me too. I vote for this. When will an LCM come out???
@@dandushi9872 Soon indeed! I'm working on it :)
What capabilities will an LCM have over an LLM? I understand that it can understand whole sentences but what are the benefits?
The authors of the paper claim that their trained LCM can:
-- Process information in sentences/concepts rather than individual words
-- 'reason' at an abstract level (independent of words)
-- Be language- and modality-independent at the input/output layers (see the quick sketch below), which makes them scalable in an unbiased way
-- Handle longer inputs more easily and generate longer outputs more coherently
-- Better zero-shot performance
I'm trying to compile a more detailed video on this so stay tuned and take care :)
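If it helps, here's a quick sketch of what the first and third bullets look like in practice: each sentence is collapsed into one fixed-size vector, and a multilingual encoder places translations of the same idea close together. This is not the authors' code; I'm using sentence-transformers as a stand-in for SONAR, so the model name, dimensions, and scores are just illustrative assumptions.

```python
# Illustrative only: sentence-transformers stands in for SONAR here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "The cat sat on the mat.",                 # English
    "Le chat s'est assis sur le tapis.",       # French, same concept
    "The stock market fell sharply today.",    # unrelated concept
]

# One fixed-size "concept" vector per sentence, regardless of length or language.
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)                        # (3, 384) for this particular model

print(util.cos_sim(embeddings[0], embeddings[1]).item())  # high: same concept, different language
print(util.cos_sim(embeddings[0], embeddings[2]).item())  # low: different concept
```

The LCM itself then works only on vectors like these and never sees the surface tokens, which is where the language/modality-independence claim comes from.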
But what is the difference from an LLM in terms of learning something new? If I sent a huge article to GPT, it would still understand it, help summarise it, etc. Or is it just about the difficult concepts that GPTs struggle with, like math, physics and so on, which LCMs are better at learning and understanding?
That's right. Essentially, better conceptual embeddings >> better understanding of longer context windows >> better reasoning capabilities >> less hallucination.
Wow, this is so cool, AI now understands concepts!
Let's hope so. It's the initial stage of LCMs and we don't yet have enough data to know the full potential of this type of model, but I'm very positive :)
I really expect this to be the path forward for quality and reliability. I've been working (only in written notes) in the concept space for knowledge accumulation for decades (as a sparse interest). I think of it as having fewer overfitted spaghetti strands in your model, compared to LLMs: paths that give a specific answer to a specific question without being a generally useful result.
Thanks Erik for sharing your insights about the concept space. Have you also worked with the embedding space for concept mapping? I believe this is what LCMs are designed for. I also agree that LLMs are too restrictive for models that are supposed to 'mimic' human behaviour (writing, reasoning, problem solving). Thanks for watching :)
@analyticsCamp My "work" is only theoretical, meaning I've been thinking in this direction and taking some handwritten notes. It seems that to get into concept space you must disambiguate all the input text (in English the same word can have many different meanings). For LLMs, this tends to just sort itself out given enough samples. But concepts should be more concise from the start, so it is important to understand the word sense (which definition of each word). I really like your videos and you have a great voice.
Thank you for your encouraging words. Your research sounds really cool.
Yes, as you mentioned, at the semantic level there are substantially more possible next sentences/concepts, so the task becomes more challenging than simply predicting the next token as in LLMs. Even with a long context window as input, the ambiguity you referred to becomes unavoidable. But, as the authors assume, diffusion-based LCMs should (theoretically) be able to learn a probability distribution over an output embedding space, so we'll probably have to wait for someone to find a practical solution (e.g., beam search over concepts?).
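To give a feel for what 'learning a distribution over the output embedding space' could look like, here's a toy sketch of diffusion-style sampling of the next concept vector. It's my own simplified illustration, not the paper's actual architecture: the `Denoiser` network, the noise schedule, and the dimensions are all made up for illustration.

```python
import torch
import torch.nn as nn

CONCEPT_DIM = 1024   # assumed size of a sentence/concept embedding
STEPS = 10           # number of denoising steps (arbitrary for this toy)

class Denoiser(nn.Module):
    """Toy network: given a noisy concept vector and the context embedding,
    predict a cleaner guess of the next concept vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CONCEPT_DIM * 2, 2048),
            nn.GELU(),
            nn.Linear(2048, CONCEPT_DIM),
        )

    def forward(self, noisy_concept, context):
        return self.net(torch.cat([noisy_concept, context], dim=-1))

@torch.no_grad()
def sample_next_concept(denoiser, context):
    # Start from pure noise in concept space and iteratively move toward the
    # denoiser's prediction; a real diffusion model uses a proper noise schedule.
    x = torch.randn_like(context)
    for t in range(1, STEPS + 1):
        mix = t / STEPS                       # goes from ~0 to 1 over the steps
        x = (1 - mix) * x + mix * denoiser(x, context)
    return x                                  # a point in concept space, to be decoded back to text

if __name__ == "__main__":
    context = torch.randn(1, CONCEPT_DIM)     # stands in for the encoded preceding sentences
    print(sample_next_concept(Denoiser(), context).shape)  # torch.Size([1, 1024])
```

Because sampling starts from fresh noise each time, repeated calls give different next concepts for the same context, which is the sense in which the model represents a distribution rather than a single deterministic continuation.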
This could be quite an advance. Working on concepts rather than words or word fragments makes a lot of sense.
That's right. I also believe we just needed a model that can 'mimic' human behaviour such as conceptual understanding rather than just regurgitate the training data.
The things described in this video as being fundamental differences between LLMs and LCMs are not so. LLMs, too, operate at the level of concepts, abstractions, and hierarchies (reached through the stacking of many transformer levels) and can deal with multiple languages simultaneously. So, what is the biggest difference between LCMs and LLMs? I am not sure. Maybe it’s a more modular approach, with a somewhat more clear attempt to separate the language-specific parts from the conceptual parts? Or is it the introduction of separate processing/training procedures dedicated only to the conceptual (post-SONAR) part?
Thanks for your comment. I agree with some of your statements and disagree with the rest! There ARE fundamental differences between LLMs and LCMs. I'm not picking particular instances of LLMs and LCMs here, but generally speaking, LLMs, after token segmentation, continue to process tokens (even with the Transformer architecture; note that the 'attention' mechanism still weights the importance of tokens, not concepts), while LCMs convert the text to concepts in an embedding space (concepts are represented by sentences, with one main concept retained for each). This makes LCMs basically language- and modality-agnostic at the two ends.
While LLMs can handle multiple languages (e.g., for translation tasks), they process one main language at the input and carry it through the embeddings (positional encoding or not). This also makes LLMs highly dependent on the training data for each separate language involved.
Another big difference is that LLMs (for natural language generation) rely on next-token prediction, while LCMs rely on next sentence/concept prediction, which is significantly more challenging because of the semantic disambiguation involved.
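To make that contrast concrete, here's a minimal sketch of the two output heads. It's my own illustration with assumed dimensions, not the paper's code: the LLM-style head scores a finite vocabulary, while the LCM-style head has to produce a point in a continuous concept space with no finite list of candidates to softmax over.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 32_000   # assumed token vocabulary for the LLM-style head
HIDDEN_DIM = 512      # assumed hidden size of the backbone
CONCEPT_DIM = 1024    # assumed size of a sentence/concept embedding

class NextTokenHead(nn.Module):
    """LLM-style: score every token in a finite vocabulary."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, hidden):                 # hidden: (batch, HIDDEN_DIM)
        return torch.softmax(self.proj(hidden), dim=-1)   # distribution over ~32k tokens

class NextConceptHead(nn.Module):
    """LCM-style: regress the embedding of the next sentence/concept.
    There is no finite candidate list, which is what makes the prediction
    (and the disambiguation) so much harder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN_DIM, CONCEPT_DIM)

    def forward(self, hidden):                 # hidden: (batch, HIDDEN_DIM)
        return self.proj(hidden)               # a point in concept space

if __name__ == "__main__":
    h = torch.randn(1, HIDDEN_DIM)
    print(NextTokenHead()(h).shape)    # torch.Size([1, 32000])
    print(NextConceptHead()(h).shape)  # torch.Size([1, 1024])
```

In the actual paper, the predicted concept vector is then decoded back into a sentence by the SONAR decoder.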
There are, of course, other main differences that I'll get into in future videos. Thanks for watching :)
Such concepts will require a lot more parameters than just tokens...
Yes and no. Their experiments were performed with 1.6B-parameter models and then scaled to 7B, so it's in a comparable camp with many LLMs. At the same time, the LCM outperformed Llama on summarization tasks (full explanation in their paper).
But if we're talking about the probability distribution, then yes: next sentence/concept prediction is substantially more challenging than next-token prediction. In theory, diffusion-based LCMs should be able to learn a probability distribution over an output embedding space, but this needs more research.
Thanks for watching :)
@analyticsCamp Look, I had a similar idea with "concepts" as morphisms in categories. And that's really cool. But in neural networks this probably won't work in the end. Some concepts require very dynamic changes in the model and can also need a very big window size to be handled performantly.
Still, we'll see, of course.
@@ТимофейТимощенко-н9ю That's right, but the authors have already mentioned these points in the study limitations. They also mentioned that SONAR is based on bitext training data with relatively short sentences, and that longer sequences can be problematic (the window size you also thought of). But it is fairly early to draw conclusions, I assume. If you happen to come across a study with long-sequence training sets, it would be great if you dropped a comment for me too.
@ There is only something like "adaptive fractal analysis" in time series, where the window size can change dynamically (and that's a very rare topic in analysis, with little information on it), but a similar technique for LCMs would be very complex to develop and implement. There is also the problem that some concepts can be non-local.
It would be cooler to see concepts in Symbolic NLP. There would be no problems with window size, locality, or dynamics.
Thanks for the input on 'adaptive fractal analysis'. As for the Symbolic NLP approach, I doubt any deterministic algorithm would help with concepts; they are far too complicated to teach via rules. I came across an article about CLUE (Conceptual Language Understanding Engine), which basically tried to capture concepts via rule-based analysers, but I haven't seen it go very far. To be honest, the deep-learning approach of 'learning' through updating weights seems more practical for now than sitting there hand-coding conceptual understanding.
Thanks for the interesting discussion, take care :)