Tunadorable

Videos

Effect of Warm Restarts on Stochastic Gradient Descent
591 views · 9 hours ago
arxiv.org/abs/1608.03983 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
Exponentially Faster Language Modeling
2.8K views · 1 day ago
Fast Feedforward Networks arxiv.org/abs/2308.14711 Exponentially Faster Language Modeling arxiv.org/abs/2311.10770 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunado...
Evolutionary Optimization of Model Merging Recipes
782 views · 2 days ago
Evolutionary Optimization of Model Merging Recipes arxiv.org/abs/2403.13187 Support my learning journey either by clicking the Join button above or becoming a Patreon member! patreon.com/Tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
MoE-Level Performance Without The Added Computation
1.9K views · 3 days ago
arxiv.org/abs/2405.03133 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
Hidden Pitfalls of Cosine Similarity Loss
1.2K views · 6 days ago
The Hidden Pitfalls of the Cosine Similarity Loss arxiv.org/abs/2406.16468v1 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
Open-Endedness is Essential for Artificial Superhuman Intelligence
1.6K views · 7 days ago
arxiv.org/abs/2406.04268 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
Sigma-GPTs: A New Approach to Autoregressive Models
2.6K views · 8 days ago
arxiv.org/abs/2404.09562 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
Information over-squashing in language tasks
2.1K views · 9 days ago
arxiv.org/abs/2406.04267 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
parallel processes in multi-hop LLM reasoning
1.7K views · 10 days ago
Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning arxiv.org/abs/2406.13858 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
The Structured Task Hypothesis
1.7K views · 14 days ago
arxiv.org/abs/2406.04216 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
SpaceByte: Deleting Tokenization from Large Language Modeling
17K views · 15 days ago
arxiv.org/abs/2404.14408 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
Cultural Accumulation in Reinforcement Learning
1.2K views · 16 days ago
Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning arxiv.org/abs/2406.00392 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
Transformers Represent Belief State Geometry in their Residual Stream
6K views · 17 days ago
arxiv.org/abs/2405.15943 Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss this stuff with other Tunadorks on Discord discord.gg/64fWcSDGsJ All my other links linktr.ee/tunadorable
Brand New AI Papers This Week - July 12, 2024
967 views · 20 days ago
Read the Substack podcast/newsletter: open.substack.com/pub/evintunador/p/this-weeks-new-ai-papers-july-12 The scripts I use to automate the paper finding process: github.com/evintunador/arxiv-summaries-workflow Support me either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo! patreon.com/Tunadorable account.venmo.com/u/tunadorable Discuss with other Tunadorks...
the dangers of centralized #ai
541 views · 20 days ago

Comments

  • @jeremywalsh5177 · 1 day ago

    Please cover Mixture of a Million Experts

    • @Tunadorable · 20 hours ago

      way ahead of u

  • @novantha1 · 1 day ago

    I'm not sure how to feel about FFFs, to be honest. They do have nice properties in inference, but it's easy to forget that, depending on the paradigm, some MoE setups have favorable training dynamics (taking fewer total operations to train the model), and I find it a lot harder to visualize how to actually implement FFFs off the top of my head because there isn't really any educational content on the low-level functionality of the technique (see the sketch after this thread).

    On the other hand, as noted, they are faster on CPU than MoE implementations, and you could imagine coming out ahead with FFFs even if you were training on CPU, for some reason, doing a style of training that is hybrid inference/training (i.e. Let's Verify Step by Step, scaling laws with board games, etc.), where you need to perform a lot of inference per training step because you're generating synthetic data for the model as part of the training process.

    I really do think that, in the end, what they need most to get adoption is educational content / articles / independent implementations, so people can really get a feel for the things they can be used for.

    • @Tunadorable · 1 day ago

      before educational content we need more verification that they actually work well, but yes I think you're spot-on
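
Since the low-level educational content novantha1 asks for is scarce, here is a rough sketch of the FFF inference path as described in arxiv.org/abs/2308.14711: a learned binary tree over small feedforward blocks, where each input descends the tree and only the chosen leaf block runs. The class and parameter names (depth, leaf_width) are illustrative rather than the authors' code, and the differentiable soft traversal used during training is omitted.

    import torch
    import torch.nn as nn

    class FFFSketch(nn.Module):
        """Hard (inference-time) fast-feedforward routing: one leaf block per input."""
        def __init__(self, dim: int, depth: int, leaf_width: int):
            super().__init__()
            self.depth = depth
            n_nodes = 2 ** depth - 1   # internal decision nodes, heap layout
            n_leaves = 2 ** depth      # small feedforward blocks at the leaves
            # scores every node at once for simplicity; real code would score lazily
            self.node = nn.Linear(dim, n_nodes)
            self.w_in = nn.Parameter(torch.randn(n_leaves, leaf_width, dim) * dim ** -0.5)
            self.w_out = nn.Parameter(torch.randn(n_leaves, dim, leaf_width) * leaf_width ** -0.5)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, dim); each row walks the tree to one leaf index
            logits = self.node(x)                                    # (batch, n_nodes)
            leaf = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)
            node = torch.zeros_like(leaf)
            for _ in range(self.depth):
                right = (logits.gather(1, node[:, None]).squeeze(1) > 0).long()
                leaf = leaf * 2 + right
                node = node * 2 + 1 + right                          # heap child indexing
            # only the selected leaf's (leaf_width x dim) block does any work
            h = torch.relu(torch.einsum('bd,bhd->bh', x, self.w_in[leaf]))
            return torch.einsum('bh,bdh->bd', h, self.w_out[leaf])

Per input this touches depth routing decisions plus one narrow leaf block instead of the full layer width, which is where the "exponentially faster" inference claim comes from.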

  • @OpenSourceAnarchist · 1 day ago

    I look forward to your videos so much, coughs and all. I love the high level overviews and your thoughts connecting various papers and concepts together. Feel better and thank you! <3

  • @luke2642 · 1 day ago

    Interesting. Strangely the fast feed forward paper has only 4 citations. Good GitHub project implementation though.

  • @technokicksyourass · 1 day ago

    Could be useful where latency is more important than accuracy. For example in real time object detection on edge devices.

  • @wanfuse · 1 day ago

    this is very good for letting remote LLMs run locally: less data to move and less latency than running the models remotely

    • @thomasmitchell2514 · 1 day ago

      Also great in enterprise settings where inference cost, data/compute governance, and interoperability are more important than the few % differences in performance. Most domain specific uses are pretty specific anyway. But I think in enterprise you’d also see this combined with lots of domain specific small models into a large MoE and then apply something like AlphaCode2 or Q* to large enterprise-wide problems. Very cool! Happy he covered these papers!

  • @zralok · 1 day ago

    This has to create more hallucinations / easier adversarial examples, not nice.

    • @Tunadorable · 1 day ago

      interesting i’d love to hear your intuition as to why that’d be the case. i did see a linkedin post a few weeks ago claiming some of the results here weren’t replicable but i can’t seem to find said post and this video was pre-recorded a while ago

  • @WCKEDGOOD · 1 day ago

    Sounds like there needs to be a model trained on models to generate new models. Those models then compete, and the best ones go back to retrain the generative model.

  • @phobosmoon4643 · 1 day ago

    how are you so prolific what the hell man

    • @Tunadorable · 1 day ago

      hahaha my goal starting the second week of august is 6-7 videos per week. easy to record one when it’s basically just a livestream, no video editing or preparation beforehand (other than reading & highlighting the paper ofc)

  • @be1tube · 1 day ago

    25:30 CoLA = Corpus of Linguistic Acceptability - grammar checking with a binary output: acceptable or unacceptable

  • @lexer_ · 1 day ago

    Maybe I am just projecting my own hopes and dreams onto the first paper, but it sounds like a viable strategy for training an architecture that can decide how much it needs to think about an answer. It feels like this concept is almost built into the architecture already, just not used this way: the process of narrowing down which weights to actually use for inference could stop at varying depths. In my mind this sounds almost like a kind of tree of nodes where you essentially have different parts of the underlying model at the leaf nodes. Is that actually kind of what is happening here?

    It might not be possible to figure out a good way for a model like that to reliably learn the right selection of weights depending on how hard the problem is, though. It seems strange that this possibility wasn't mentioned anywhere. I guess I will have to go through the paper, and probably also the code, to see if this is actually possible, because I can imagine a whole lot of practical implementation limitations, and maybe the architecture can't even be bent into this kind of shape to begin with.

    • @gui1236100 · 1 day ago

      "Like, the process of narrowing down which weights to actually use for inference could stop at varying depths" only if this would have predictible impact on output quality. some times, stopping before could introduce noise instead of expanding the model ability to output quality thing. For varying amount of power used to answer a query, it seems to me like tranformers have a limit where they only feed forward. Maybe we would need some part that has reccuring behavior in the network. so the model could try solutions ones after the other and test 20 tokens later does the first token chosen still makes sense considering the x branches tested internally. With only feed forward, you would need to train the model to handle any situation in one pass, since it only feeds forward. Maybe I am saying bullshit, I don't really know what i am talking about

    • @lexer_ · 1 day ago

      ​@@gui1236100 Not at all bs. I agree with the description of the fundamental limitation in terms of pure feed-forward. What I imagined wouldn't really be a fix to overcome this introspection problem that seems to be inherently recurrent. What I described would only ever be a stop-gap and not really a solution towards agi or anything. I only had deployment performance in mind where the ability to vary the amount of compute on a per-token and more generally per-task basis might allow for large cost savings.

  • @thenoblerot · 1 day ago

    Oy

  • @surajsamal4161 · 1 day ago

    bro love your channel

  • @tuckerhart510 · 1 day ago

    Omg I built one of these once! Didn’t use a binary tree, so I’ll have to try that!

  • @TheProgressEngine · 1 day ago

    Since the FFF Tree divides the input space into regions, and each layer is the input to the next layer, how would FFF interact with Grokking? We've seen grokking work by also forming regions within the parameter space. So could FFF regions help guide grokking and speed it up? Or am I confusing the input space regions and parameter space regions?

    • @Tunadorable · 1 day ago

      yeah splicing a weight matrix is different from forming implicit boundaries between data in representation space

  • @user-qw1rx1dq6n · 1 day ago

    Built something similar with LSTMs a while ago, where I thought to myself: since attention heads act only on a subspace of the feature vectors, I could save parameters by breaking up the feature vector. I saw no improvement. (That's my own fault though)

  • @GNARGNARHEAD · 1 day ago

    coolbeans

    • @Tunadorable · 1 day ago

      coolcoolcool

  • @Morereality · 1 day ago

    YEW

    • @Tunadorable · 1 day ago

      YYYYYYYYYYYYYYYEW

  • @GNARGNARHEAD · 2 days ago

    oi that's pretty cool

    • @GNARGNARHEAD · 2 days ago

      I wonder how long until we can actually appreciate the feature space to the point we can tell how well a model will perform on a specific task based on its weights.. and, scale 🤔

  • @dinhero21 · 2 days ago

    6:12 I think the down arrow is signifying a newline character, so that would be predicting: " mbf[" which basically means "a value of the array mbf, with proper indentation" which isn't *that* weird of a token

  • @RickeyBowers · 2 days ago

    No need to bash the brain - the LLM is just a slice of some feature of the human brain.

  • @BooleanDisorder · 2 days ago

    MoE will also always end up with one modality per expert, even if you try (unlike now) to make the embeddings themselves truly multimodal. That's why you need a coherent single network without experts to get the most out of multimodality (true multimodal embeddings). RAG is the way forward, I think, for agentic behavior, where it also learns to process certain info more or less, as a thought process I guess. A diffusion type of effect where it constructs the generation one part at a time and then gives a final generation when it's finished.

    • @Tunadorable · 2 days ago

      One of us must be confused, could you clarify? Experts are chosen using a simple linear layer router at each token without any separation by modality, and with an extra loss term that encourages roughly even activation. Meaning that a given expert can and likely will be called some % of the time even on tokens from different modalities, since they have no architectural reason to separate by modality; they separate by tokens. Are you referring to some research I've not heard of showing that experts naturally do tend to specialize towards tokens of different modalities? If so I'd love to see it. But even then I'm not sure how what you're saying would apply to a "multi-modal token", which I assume would refer to a single vector that simultaneously references data from different modalities.

    • @BooleanDisorder · 2 days ago

      @Tunadorable Sorry, I will try to explain better. When I refer to “truly multimodal embeddings,” I'm thinking of a system where the representation of information from different modalities is deeply integrated, rather than simply concatenated or processed in parallel.

      While you're correct that experts are chosen without explicit separation by modality, in practice they often tend to specialize. This specialization can occur due to the inherent differences in the statistical properties of different modalities, even if not explicitly architected that way. This is not something I know a paper about, but something I have come to learn from somewhere. So no, I can't prove it, but it seems to have stuck in my mind. Make of that what you will.

      The router may indeed work at the token level, but the challenge lies in creating higher-level representations that genuinely blend information across modalities. Individual tokens, even if they contain information from multiple modalities, may not capture the complex interactions between modalities that we're after. So you'd want high-level representations to be “routed” rather than tokens. My point about a “coherent single network” is that it might be better suited to develop these truly integrated multimodal representations: without the separation into experts, the network might be forced to learn more generalized, cross-modal features.

      I believe RAG could be a promising approach for more flexible, context-aware processing. It could allow the system to dynamically adjust its focus on different types of information, similar to how humans shift attention between sensory inputs, and would give it a general way to learn new abilities in context, I think.

      The idea of a diffusion-like process is that generation could be more iterative, refining the output by considering multiple modalities over several steps, rather than making a single pass through separated experts. Maybe even have each iteration done by separate but similar networks that have been trained for each step, rather than aiming for a perfect score at once. One could stop at a certain network when a satisfactory answer has been reached. Maybe the network itself can be taught to judge that, with self-supervised autoregressed embeddings each iteration, like a grade for each iteration where it stops when it has reached “good enough”?

      Note that my idea of multiple networks is more akin to layers than experts, but where each "layer" here is its own network, albeit with somewhat different depths and so on. Edit: my inspiration behind the several networks in parallel is the neocortical columns in the brain.

    • @Tunadorable · 2 days ago

      ah that makes more sense very cool
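
To make the router description in this thread concrete, here is a minimal sketch of per-token MoE routing: a single linear layer scores experts for each token, the top-scoring expert runs, and an auxiliary loss term nudges usage toward uniform. This is loosely Switch-Transformer-style top-1 routing (arxiv.org/abs/2101.03961); the names and the k=1 choice are illustrative assumptions, not any specific paper's code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top1MoE(nn.Module):
        def __init__(self, dim: int, n_experts: int, hidden: int):
            super().__init__()
            self.router = nn.Linear(dim, n_experts)   # the "simple linear layer router"
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
                for _ in range(n_experts)
            )

        def forward(self, x: torch.Tensor):            # x: (tokens, dim)
            probs = F.softmax(self.router(x), dim=-1)  # per-token expert scores
            gate, idx = probs.max(dim=-1)              # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = idx == e                        # run each expert only on its tokens
                if mask.any():
                    out[mask] = gate[mask, None] * expert(x[mask])
            # auxiliary load-balancing loss: fraction of tokens routed to each expert
            # times its mean router probability, minimized when both are uniform
            frac = F.one_hot(idx, len(self.experts)).float().mean(0)
            aux_loss = len(self.experts) * (frac * probs.mean(0)).sum()
            return out, aux_loss

Note that nothing in the router sees modality labels: experts separate by whatever token statistics the routing logits pick up, which is the point made above.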

  • @BloodlinesNewTimes · 2 days ago

    Oooooh, I get it, you're a 👾

  • @BloodlinesNewTimes · 2 days ago

    Let's use emojis instead for emotional clarity 😂😂😂

  • @girlmoment669 · 2 days ago

    #1 yapper

  • @immortalityIMT · 2 days ago

    You need to show us your LLM training rig sometime.

  • @lexer_ · 2 days ago

    I always had the impression that most of the model merging going on is kind of like brute force on the benchmarks. You create a new model with similar performance, but the way you merged them happened to adjust the weights in a way that nudges the model to give more correct answers on benchmarks, not because of actual performance improvements but just by chance and because of prompt sensitivity. This of course only works sometimes, which is why there are lots of terrible merges and some decent ones. Furthermore, if you actually use these merged models, they almost always perform way worse in more open-ended situations that aren't just a regular benchmark.

    But even though model merging started out as a kind of benchmark gaming, this paper seems to imply that there might actually be something more here. If you can use this to pick the best parts of different models in a relatively cheap way, then the idea of multiple diverging online-training models might actually be a viable way towards more RL-based training goals that interact with the real world.

    I really hope these techniques actually work out and are not just an evolutionary algorithm learning how to overfit on benchmarks, because I wouldn't be surprised if that is still all that is happening even in this paper. We are currently far too willing to take papers and their results at face value and look at them in a generally positive light. That is not how the scientific process is supposed to work. Science is about being less wrong, so the focus should be more on figuring out which of our ideas might be wrong, and less on which might be right.
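
As a concrete anchor for what "merging" means here: the cheapest building block is per-tensor interpolation of checkpoints that share an architecture. The evolutionary part of arxiv.org/abs/2403.13187 searches over recipe parameters (such as per-layer coefficients) against a fitness score; the sketch below is only the merge step under that framing, with illustrative names, not the paper's method.

    import torch

    def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
        """Per-tensor linear interpolation: alpha * A + (1 - alpha) * B."""
        assert sd_a.keys() == sd_b.keys(), "models must share an architecture"
        return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

    # usage sketch: an evolutionary search would propose many alphas (possibly one
    # per layer), score each merged model on a benchmark, and keep the fittest recipe
    # merged = merge_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha=0.3)
    # model_c.load_state_dict(merged)

The benchmark-gaming worry above maps directly onto the fitness function: if fitness is a benchmark score, the search optimizes the benchmark, not general capability.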

  • @dadsonworldwide3238 · 2 days ago

    We dig out complexity of yesterday's axioms that gave it room for order error; we put it in our world tech & material sciences. If we don't reallocate affinities in the quest for equal measure, it stagnates, like all our grand theory in all fields has. My ancestors were ready to math-map the leftover fingerprint code of life, DNA, long before the Swedish scientist first identified it. General knowledge falls out in the laps of eccentric movements of thought, then it has to fight the greater community, who says stuff like "it's just a book" or "it's just a rock, no lost civilization is under it in Turkey". This is how we are. Astronomical energy density measured in biology per scale is very impressive. It only took us reforming English, digging up the past, founding a nation, and industrializing the world to build computation. Had the Civil War, WW1 & 2, plus infighting not occurred, leaving the eccentric fundamentalist Christians alone, the transistor age would've occurred before Darwin. We would've had the USA as the shining capital on the hill long, long ago. Whatever best helps simulate fine-tuned atoms' lattice structure and body along with longitude/latitude really hasn't taken off in hardware yet; I would imagine this is years out before form & shape products and services are realized & built. Math is obviously a necessity in setup, more so than after-the-fact architecture.

  • @pawsjaws · 2 days ago

    Thanks for doing all of this man. I'm using this along with some advanced distilling prompts to document all these.

  • @macchiato_1881 · 2 days ago

    0 views in 1 minute? Tunadorable fell off 😔😔

  • @metaprotium · 3 days ago

    I like your idea at 9:00, that would be useful for training large numbers of experts very sparsely

  • @hjups · 3 days ago

    The problem with MoE isn't necessarily RAM, but memory bandwidth. You can think of the problem in terms of arithmetic intensity (FLOPs/byte). In the case of batched inference you have a higher arithmetic intensity using dense models, since the weights are shared across each token/batch. However, with MoE you can't guarantee this behavior: in the worst case, each token needs a different set of weights (i.e. minimal arithmetic intensity). And with modern hardware, the bottleneck is memory bandwidth, not compute (i.e. why flash attention is effective). Merging segments like this is only going to exacerbate the bandwidth issue, especially if each token in the batch uses a different weighted combination of experts.

    One solution to this problem was proposed by the Switch Transformer (and is what GPT-4 is likely doing), where the experts live on different GPUs and tokens are routed over NVLink rather than via data-dependent memory reads, but that also won't work for merging the FFN. This work is still interesting though, and has stronger applicability to summarization than to generation (e.g. if you trained the experts without a fixed fallback dropout, maybe you could use a common FFN for generation but use MoE for building the KV cache?).
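
A back-of-envelope version of the arithmetic-intensity argument above, for a single fp16 weight matrix. The numbers are illustrative, and activation traffic is ignored:

    # FLOPs per byte of weights moved, for one (batch x d) @ (d x d) matmul
    d, batch = 4096, 64
    flops = 2 * batch * d * d            # one multiply-accumulate = 2 FLOPs
    dense_bytes = 2 * d * d              # fp16 weights read once, reused by the whole batch
    moe_worst_bytes = 2 * d * d * batch  # worst case: a different expert's weights per token
    print(flops / dense_bytes)           # 64 FLOPs/byte: reuse grows with batch size
    print(flops / moe_worst_bytes)       # 1 FLOP/byte: memory-bandwidth bound

On hardware that can sustain hundreds of FLOPs per byte of memory bandwidth, the second case leaves the compute units mostly idle, which is the point being made here.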

  • @alfinal5787 · 3 days ago

    Loving the more assertive tone.

    • @Tunadorable · 3 days ago

      i wish i knew what you meant but im pretty tone deaf

    • @alfinal5787 · 1 day ago

      @Tunadorable lol clever comeback

  • @farrael004 · 3 days ago

    Where's my "oi" in the beginning of the video? 😢

    • @Tunadorable · 3 days ago

      haha sometimes i forget

    • @SinanAkkoyun · 3 days ago

      "lory" ~= "ooi"

  • @Alice_Fumo · 3 days ago

    This is interesting. I also find the segment length of 256 surprising; it seems very long. It didn't seem to me like a fixed segment length is required by the approach, so it would make sense to have a maximum segment length and additionally split segments in the text, for example at each paragraph end, sentence end, or newline, with a minimum segment length of something like 16 tokens. The way it is done in this paper, the segment boundaries do not care whether there was a semantic shift in the middle of the segment, which I would assume is not good for expert specialization.
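
A small sketch of the boundary-aware segmentation proposed here: cut at a semantic boundary once a segment has at least min_len tokens, and force a cut at max_len. The token granularity and the boundary test are illustrative assumptions, not anything from the paper.

    def segment(tokens, is_boundary, min_len=16, max_len=256):
        """Greedy split: prefer semantic boundaries, never exceed max_len."""
        segments, cur = [], []
        for tok in tokens:
            cur.append(tok)
            if len(cur) >= max_len or (len(cur) >= min_len and is_boundary(tok)):
                segments.append(cur)
                cur = []
        if cur:
            segments.append(cur)  # keep any trailing partial segment
        return segments

    # e.g. with whitespace "tokens" and sentence ends as boundaries:
    # segment(text.split(), lambda t: t.endswith(('.', '!', '?')))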

  • @mrpocock · 3 days ago

    Can we think of MoE models as being like dense networks trained with an extremely strong sparse/dropout process, but where this network ablation has very strong correlations rather than being sampled from white noise?

    • @Tunadorable · 3 days ago

      interesting hmmm. so the dropout process is dependent upon the input data (the correlations you mentioned) and because of that i don’t think it’d be possible to make any actual rigorous strictly mathematically definable connection between the two concepts. that being said the sparsity part of the connection is there. at some point (mid-late august?) i’ll be releasing a video on a paper called something like “a million experts” that i think you’d be interested in as it’s a bit closer to your description than regular MoE setups are

    • @mrpocock · 3 days ago

      @Tunadorable Your observation about the dropout being data-dependent is valid. So we could probably formalise this as a function from some choosing layer L_c to a covariance matrix over per-weight dropout probabilities. That lets us get rid of the additive/sigmoid layer that re-integrates the individual agents entirely. There are probably tricks that can be played in how that covariance matrix is calculated from L_c that allow us to incrementally add new agents using "dead" weights from an otherwise dense layer or stack of layers. Sounds like a job for a first year PhD student to test out...
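
A toy rendering of the framing in this thread, purely to make the analogy concrete: a "choosing layer" produces a data-dependent, block-correlated mask over a dense FFN's hidden units, so the surviving block plays the role of the selected expert. This illustrates the correspondence only; it is not an implementation from any paper, and the covariance-matrix formalization sketched above is not attempted here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CorrelatedDropoutFF(nn.Module):
        """Dense FFN whose 'dropout' mask is data-dependent and block-correlated."""
        def __init__(self, dim: int = 64, hidden: int = 256, n_blocks: int = 4):
            super().__init__()
            self.fc1 = nn.Linear(dim, hidden)
            self.fc2 = nn.Linear(hidden, dim)
            self.chooser = nn.Linear(dim, n_blocks)   # the "choosing layer" L_c
            self.n_blocks = n_blocks

        def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, dim)
            block = self.chooser(x).argmax(dim=-1)            # surviving block, per input
            mask = F.one_hot(block, self.n_blocks).float()    # (batch, n_blocks)
            mask = mask.repeat_interleave(self.fc1.out_features // self.n_blocks, dim=-1)
            # units outside the chosen block are "dropped" together: maximal correlation,
            # the opposite of white-noise dropout (argmax routing is not differentiable;
            # learning the chooser is the hard part, as the thread notes)
            return self.fc2(torch.relu(self.fc1(x)) * mask)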

  • @tornyu · 3 days ago

    Did they mention whether their technique reduces information squashing, like you covered in that other paper this week?

    • @Tunadorable · 3 days ago

      they did not, and i believe i recorded this video before that one. my memory is a bit lacking but i think that whenever you ask this one to do regular auto-regressive decoding it would have the same issue. however i seem to remember this one being able to do a more diffusion style decoding in which case i don’t think it would have the same over-squashing issue. again i read/recorded this paper a while ago so i could be wrong

  • @daves.software · 4 days ago

    Hmm... splitting on spaces.... in a non-LLM programming context, that's called... (wait for it)... tokenization.

  • @Lolleka · 5 days ago

    Fascinating. So using the combo of norm layer and gradient descent is sub-optimal in terms of computational efficiency? The more you know.

  • @revimfadli4666 · 5 days ago

    Reminds me of the transformer/attention permutation invariance that David Ha used for reinforcement learning