Scalable MatMul-free Language Modeling (Paper Explained)

  • Published 1 Aug 2024
  • Matrix multiplications (MatMuls) are pervasive throughout modern machine learning architectures. However, they are also very resource intensive and require special accelerators (GPUs). This paper explores architectures that do away with MatMuls and use quantization and recurrence to keep performance up.
    OUTLINE:
    0:00 - Intro
    2:30 - MatMul is everywhere
    5:55 - Ternary accumulation as a substitute for matrix multiplication
    16:35 - Replacing attention layers with recurrent layers
    32:40 - Replacing dense layers with ternary channel mixing
    38:30 - Language modelling results & scaling laws
    45:00 - Other experimental results
    48:20 - Conclusion
    Paper: arxiv.org/abs/2406.02528
    Code: github.com/ridgerchu/matmulfr...
    Abstract:
    Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at this https URL.
    Authors: Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

Comments • 112

  • @user-vw5pg5vr3g
    @user-vw5pg5vr3g 24 days ago +41

    Loved that the references for BitNet are 10 and 11

  • @eoghanf
    @eoghanf 24 days ago +19

    Your point about estimating whether non-straight lines cross based on three datapoints is a very good one. HOWEVER, the reason for giving them the benefit of the doubt on the training dynamics side is that the *inference* time power efficiency gain (which you don't spend any time on!) is massive. From the abstract "We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency". That's pretty amazing.

  • @ttul
    @ttul 25 days ago +26

    The FPGA angle is what's interesting about this research. The paper proposes replacing all feed-forward operations in large language models with more computationally efficient operations, mostly by using ternary weights (i.e. -1, 0, and 1 are the only allowed values). A ternary weight is basically a simple logic gate with only three permitted operations:
    a) Change the sign of the input (i.e. flip the sign bit and copy the rest)
    b) Output zero
    c) Copy the input to the output
    If your goal is to make a neural network scream on hardware, having only three simple operations to choose from means you can use simple logic gates. The researchers tried this out on FPGAs, and this is a promising area of research. From FPGAs it's not a big leap to ASICs, which net the most power-efficient computation theoretically possible. So if ternary gate networks can be made to scale, everyone should be excited.
    Caveats:
    1. The attention mechanism is replaced with a parallelizable form of recurrent neural network because applying ternary operations to attention does not train.
    2. A linearized Gated Recurrent Unit (GRU) architecture allows for parallel computation; this is a neat trick.
    3. The channel mixer (a feed-forward equivalent) uses dense layers with ternary accumulation operators.
    Results show performance comparable to traditional Transformers, with better scaling properties at larger model sizes.
    Yannic expresses some skepticism about the projected crossover point where this architecture would outperform traditional Transformers.
    But I think the really interesting thing about this is the FPGA/ASIC aspect. (A small code sketch of the ternary accumulation idea follows after this thread.)

    • @robmacl7
      @robmacl7 25 days ago +1

      You could also reduce some work by pre-processing the weights to just drop the zero entries, but this would be somewhat of a nuisance for a hardware realization, because the work needed would vary by output element.

    • @hjups
      @hjups 24 days ago +1

      @@robmacl7 Why would variable work be an issue? You replace a deterministic sequence with signal barriers that only occur at synchronization points in the compute graph.
      The bigger issue with dropping zero entries would be the extra step needed for decompression into a dense operation (e.g. stored as RLE or a Sparse format), and then aligning fetches to DRAM bursts.
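
      A minimal NumPy sketch of the ternary accumulation described at the top of this thread (illustrative only; the function name and toy shapes are made up, and this is not the paper's fused kernel). Every weight either adds the input, subtracts it, or skips it, so no multiplications are needed:
      ```python
      import numpy as np

      def ternary_matmul(x, w_ternary):
          """Accumulate x against a {-1, 0, +1} weight matrix without multiplies:
          each weight either adds the input (+1), subtracts it (-1), or skips it (0)."""
          out = np.zeros((x.shape[0], w_ternary.shape[1]), dtype=x.dtype)
          for j in range(w_ternary.shape[1]):
              col = w_ternary[:, j]
              out[:, j] = x[:, col == 1].sum(axis=1) - x[:, col == -1].sum(axis=1)
          return out

      rng = np.random.default_rng(0)
      x = rng.standard_normal((2, 8)).astype(np.float32)
      w = rng.integers(-1, 2, size=(8, 4)).astype(np.float32)     # ternary weights in {-1, 0, +1}
      assert np.allclose(ternary_matmul(x, w), x @ w, atol=1e-5)  # matches a real MatMul
      ```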

  • @philiptren2792
    @philiptren2792 24 days ago +7

    19:15 I think the model will learn to use the extra capacity efficiently. We can increase the length of the vector, and the model will learn to use higher effective precision for the important values and lower precision where it doesn't matter as much, saving unnecessary precision. It's like quantizing each and every weight of the model independently, by exactly the right amount.
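
    A toy illustration of that intuition (the duplication scheme below is hypothetical, just to show that repeating a feature lets purely ternary weights emulate a small integer weight on it):
    ```python
    import numpy as np

    x = np.array([0.7, -1.2, 0.3])   # original features
    w_int = np.array([3, -2, 0])     # "precise" integer weights we would like to have

    # Copy feature i |w_int[i]| times; each copy gets the ternary weight sign(w_int[i]).
    x_rep = np.repeat(x, np.abs(w_int))
    w_ter = np.repeat(np.sign(w_int), np.abs(w_int))

    assert np.isclose(x_rep @ w_ter, x @ w_int)   # same dot product, ternary weights only
    ```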

  • @clray123
    @clray123 24 days ago +1

    What I missed in the video and in the paper is an interpretation of replacing the weights with -1, 0, 1. And that would be: the matrix multiplication xW is just the calculation of n vector dot products, one dot product between x and each column of W. A dot product of two vectors is maximal when the vectors point in the same direction, minimal when they point in opposite directions, and 0 if they are orthogonal. So it's basically deciding "let's glue all the KQV vectors, whose direction we compare with x, to the base axes (of the coordinate system), rather than allow them to point in any direction". I think that's what they call "privileged bases" in interpretability research. But given that you can only fit so many orthogonal vectors in n dimensions (and a lot more "almost" orthogonal vectors), it feels like it should impact the ability of the model to uniquely represent inputs.
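
    A small NumPy illustration of that geometric reading (values made up): a full-precision column can point in any direction, while a ternary column can only produce a signed subset-sum of x's coordinates, i.e. a comparison against an axis-aligned direction.
    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(6)

    w_dense = rng.standard_normal(6)                        # arbitrary direction
    w_ternary = np.array([1, 0, -1, 0, 1, 0], dtype=float)  # signed sum of basis vectors

    print(x @ w_dense)    # dot product with an arbitrary direction
    print(x @ w_ternary)  # = x[0] - x[2] + x[4], a signed subset-sum of coordinates
    assert np.isclose(x @ w_ternary, x[0] - x[2] + x[4])
    ```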

  • @unvergebeneid
    @unvergebeneid 17 days ago +1

    Anything that uses balanced ternary is already a superior method in my book :D

  • @HansKonrad-ln1cg
    @HansKonrad-ln1cg 24 days ago +1

    I have heard that after training you can basically throw away 90% of a network without changing the behaviour too much. That is because most of the weights are near zero, which basically means a non-existent connection between the neurons. So if you omit the calculation right away, by treating those weights as exactly zero with the ternary values, you save a lot of time that would otherwise have been spent multiplying by zero for no reason.

  • @RPG_Guy-fx8ns
    @RPG_Guy-fx8ns 24 days ago +1

    If you have a layer of 64 neurons with 64 inputs each, the ternary weights would be 16 bytes per neuron. You can use a lookup table with 256 entries instead of summing the binary digits. That way, most of the math turns into jumps into that table, finding two sums to subtract: 16 boolean AND operations to compare the previous layer's output with this neuron's weights, 16 table lookups, adding the results up as two totals, then subtracting one total from the other. That would be extremely fast compared to other neural networks, but I wonder if it can match the quality of other solutions.
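
    A sketch of one way to read that scheme (it assumes the previous layer's outputs are binarized to 0/1, which the comment implies but does not state; the helper names are made up). The 64 ternary weights are stored as two 64-bit masks, 16 bytes in total, and each dot product becomes ANDs, byte-table lookups, and one subtraction:
    ```python
    import random

    POPCOUNT = [bin(i).count("1") for i in range(256)]   # the 256-entry lookup table

    def ternary_neuron(x_bits: int, plus_mask: int, minus_mask: int) -> int:
        """Dot product of a 64-bit binary activation vector with ternary weights."""
        pos, neg = x_bits & plus_mask, x_bits & minus_mask
        total_pos = sum(POPCOUNT[(pos >> (8 * i)) & 0xFF] for i in range(8))  # 8 byte lookups
        total_neg = sum(POPCOUNT[(neg >> (8 * i)) & 0xFF] for i in range(8))  # 8 more
        return total_pos - total_neg

    # Tiny check against the obvious reference implementation.
    random.seed(0)
    x = [random.randint(0, 1) for _ in range(64)]
    w = [random.choice([-1, 0, 1]) for _ in range(64)]
    x_bits = sum(b << i for i, b in enumerate(x))
    plus_mask = sum((wi == 1) << i for i, wi in enumerate(w))
    minus_mask = sum((wi == -1) << i for i, wi in enumerate(w))
    assert ternary_neuron(x_bits, plus_mask, minus_mask) == sum(xi * wi for xi, wi in zip(x, w))
    ```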

  • @Mordenor
    @Mordenor 20 days ago +2

    Thank you Mr Yannic for explaining MatMul free Language Modelling to your viewers!

  • @eoghanf
    @eoghanf 24 days ago +2

    I would really be interested in knowing more about how the Straight-Through Estimator allows these things to train. That's the big mystery to me.
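
    Not the paper's exact recipe, but a minimal PyTorch sketch of how a straight-through estimator is typically wired up for BitNet-style ternary weights (the per-tensor mean-absolute scale is one common choice and an assumption here): the forward pass sees quantized weights, while the backward pass pretends the rounding never happened.
    ```python
    import torch

    def ste_ternary(w: torch.Tensor) -> torch.Tensor:
        """Forward: ternary-quantized weights. Backward: gradient flows to w unchanged."""
        scale = w.abs().mean().clamp(min=1e-5)            # per-tensor scale (one common choice)
        w_q = (w / scale).round().clamp(-1, 1) * scale    # values in {-scale, 0, +scale}
        return w + (w_q - w).detach()                     # straight-through trick

    w = torch.randn(4, 4, requires_grad=True)
    ste_ternary(w).sum().backward()
    print(w.grad)   # all ones: the rounding step is invisible to the gradient
    ```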

  • @josehugoelsas8699
    @josehugoelsas8699 16 days ago +1

    One important thing to notice is that this approach trades very regular, high-arithmetic-intensity matmuls for very sparse, memory-irregular filtering operations to implement the ternary selections.
    To me it is not clear whether this will yield any improvement on present GPUs or other accelerator architectures.
    Also, it relies heavily on quantization, which can be fragile depending on the situation. It is not much of a problem for inference, but can be a problem for training.
    Multiplying floats, especially dense matrices, is cheap; what is expensive is moving data, and I don't see how this paper improves on that front.

  • @wolpumba4099
    @wolpumba4099 25 days ago +51

    *Summary*
    *Problem:*
    * *(**2:30**)* Matrix multiplications (MatMuls) are the core of modern machine learning, but they are resource-intensive and require specialized hardware like GPUs.
    *Proposed Solution:*
    * *(**0:00**)* This paper proposes eliminating MatMuls entirely from large language models (LLMs) while maintaining competitive performance.
    * *(**16:35**)* The architecture replaces:
    * *(**16:35**)* *Attention layers* with parallelizable recurrent layers inspired by GRUs.
    * *(**5:55**)* *Dense layers* with "ternary accumulation," using quantized weights limited to -1, 0, and 1. This replaces multiplication with simpler selection and addition operations.
    *Key Findings:*
    * *(**38:30**)* *Performance:* The MatMul-free models perform on par with state-of-the-art Transformers at scales up to 2.7 billion parameters.
    * *(**38:30**)* *Scaling Laws:* The performance gap between MatMul-free models and traditional Transformers seems to decrease with increasing model size, suggesting a potential crossover point where MatMul-free models become more efficient. However, the video author expresses skepticism about this extrapolation.
    * *(**45:00**)* *Hardware Efficiency:* The proposed architecture significantly reduces memory usage and latency. Implementing it on custom hardware like FPGAs, optimized for ternary operations, could lead to even greater efficiency gains.
    *Author's Opinion (Yannic Kilcher):*
    * *(**48:20**)* The research is exciting and promising for edge computing and energy-efficient AI.
    * *(**48:20**)* He remains skeptical about:
    * Whether MatMul-free models can truly surpass traditional Transformers in performance, especially for complex tasks.
    * The validity of extrapolating scaling laws based on limited data points.
    * The simplification trade-offs (like removing state-dependent hidden state updates) might limit the architecture's ultimate capabilities.
    *Overall:*
    The paper offers a compelling alternative to traditional MatMul-heavy LLMs, with potential for improved hardware efficiency. While challenges and open questions remain, it presents a promising direction for future research and development.
    I used Gemini 1.5 Pro to summarize the transcript.

    • @interstellarsurfer
      @interstellarsurfer 24 days ago +5

      I guess Gemini isn't completely useless. 🤷‍♂️

    • @theupsider
      @theupsider 17 days ago

      That's what LLMs are for. Thanks!

  • @pauldruhg2992
    @pauldruhg2992 24 days ago +6

    Why stop at ternary? Go for powers of two and bit shifting. Speed and precision, win-win.

    • @WalterSamuels
      @WalterSamuels 24 days ago

      Can you elaborate?

    • @danielg3857
      @danielg3857 22 days ago +1

      @@WalterSamuels He means replacing ternary logic gates, with their three possible outputs (1, 0, -1), with just binary logic gates/functions, to benefit from even better math hacks, so to speak; you can do neat tricks with binary numbers/functions. Haven't even watched most of the video, mind you, just reading the abstract and comments so far.

    • @pauldruhg2992
      @pauldruhg2992 18 days ago

      @@WalterSamuels multiplication and division by powers of two can be replaced with bit-shifting, which is faster
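
      The concrete trick being referenced, in a couple of lines of Python (toy numbers):
      ```python
      x = 37
      assert x << 3 == x * 2**3   # multiply by 8 via a left shift
      assert x >> 2 == x // 2**2  # floor-divide by 4 via a right shift
      ```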

  • @eruiluvatar236
    @eruiluvatar236 24 days ago +1

    I believe that you could still implement a fast "ternary multiplication" on a current GPU by using logic gates operating on multiple weights per register. MatMuls are crazy fast on GPUs, but by squeezing multiple weights together in a single register it might end up being faster.

  • @jmirodg7094
    @jmirodg7094 24 days ago +1

    It is only a first attempt; I'm keen to see the follow-up papers...

  • @VladMysla
    @VladMysla 19 days ago

    30:26 The hidden state update actually does depend on the previous state to select what to forget.
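
    A simplified PyTorch sketch of the distinction being discussed (not the paper's exact MLGRU equations, which also involve ternary weights and an output gate): the MatMul-free layer's forget gate is computed from the current input only, so it can be precomputed for the whole sequence, while the state update itself still depends linearly on the previous hidden state, which is the commenter's point.
    ```python
    import torch

    d = 4
    x_t, h_prev = torch.randn(d), torch.randn(d)
    W_f, U_f = torch.randn(d, d), torch.randn(d, d)
    c_t = torch.tanh(x_t)                            # candidate value (simplified)

    # Classic GRU-style forget gate: a function of the previous hidden state,
    # so step t cannot start before step t-1 has finished (no parallel scan).
    f_gru = torch.sigmoid(x_t @ W_f + h_prev @ U_f)
    h_gru = f_gru * h_prev + (1 - f_gru) * c_t

    # Input-only gate: every f_t can be computed for the whole sequence up front;
    # h_t = f_t * h_prev + (1 - f_t) * c_t is then a linear recurrence in h,
    # which is exactly what a parallel scan can evaluate.
    f_lin = torch.sigmoid(x_t @ W_f)
    h_lin = f_lin * h_prev + (1 - f_lin) * c_t
    ```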

  • @FryGuy1013
    @FryGuy1013 24 days ago +2

    As someone who has written CUDA code, this is relatively straightforward to do on GPUs. So your concern that it will be basically the same performance as full floating-point multiplication seems kind of unfounded.

    • @Noxeus1996
      @Noxeus1996 23 days ago +6

      As someone who has written most of the llama.cpp CUDA code, matrix multiplications on GPUs are only so fast due to specialized hardware, i.e. tensor cores. Without specialized instructions for Bitnet or whatever I doubt that the performance will be (much) better than just doing dense 16 bit matrix multiplications unless you also quantize the activations to 4/8 bits.

  • @VincentKun
    @VincentKun 25 days ago +2

    About data dependency: did you see the "Illusion of State in State-Space Models" paper?
    Every time they try to get to something recurrent they lose parallelization, and state dependency is one of those cases.

  • @hjups
    @hjups 24 days ago +1

    With usefulness, there's still an underlying assumption that 1) the comparable performance will hold with increased scale / specialized models, and 2) properties required for improved reliability in transformers also translate to this architecture.
    My guess is that (1) depends on the task / benchmark, and (2) is unlikely to occur (SSMs are missing some of these properties), which will set an upper bound on the model size and usability. That said, this approach is probably applicable for more classical NLP tasks which are easier than generative AI, and maybe some sort of low-effort HCI (e.g. take this JSON packet and convert it into a human understandable response).

  • @pavalep
    @pavalep 19 days ago

    thanks for the informative vid :)

  • @ronhightower6549
    @ronhightower6549 25 days ago +73

    Hopefully the research community gets these fundamental improvements figured out before Sam Altman spends a trillion dollars on data centers running Nvidia MatMul devices.

    • @danielmewes
      @danielmewes 24 days ago +2

      Might still need it for training?

    • @TheNerd484
      @TheNerd484 24 days ago +7

      It would be funny if this happens like a month after he buys them. It would also mean we get a lot of cheap compute cards

    • @eadweard.
      @eadweard. 24 days ago +5

      @@TheNerd484 Resentment-powered compute.

    • @clray123
      @clray123 24 days ago +1

      Too late. Also, Anthropic spends substantial resources on interpretability of transformer-based models. As far as I'm aware, these interpretability gains do not translate easily into other architectures.

    • @jswew12
      @jswew12 24 days ago +1

      @@danielmewes Correct me if I am wrong, but isn't training also possible on the FPGA they introduce? It's been a couple weeks since I read the paper and I haven't finished this video, but I could have sworn that all the operations they need for training are programmed into the FPGA and are shown to be better than GPU equivalents. Could be a problem of scale, maybe?

  • @sentinelav
    @sentinelav 24 days ago +2

    40:25 "More bang for your flop" 💀

  • @adamrak7560
    @adamrak7560 24 days ago +4

    Dot-product in-memory architectures would be extremely fast and efficient for inference. Less so for training.
    So _if_ we change the architecture, there are relatively simple ways we could add a few orders of magnitude to the inference performance.

    • @Balorng
      @Balorng 24 days ago

      Inference speed equals model performance because, currently, algorithms like "Graph of Thoughts", extensive multi-agentic systems, "smart RAG" and, most importantly, metacognition in general are extremely inference-heavy (you can generate orders of magnitude more "subconscious" tokens than are shown to the user), as is generating oodles of very high-quality training data to create "leaner" yet more performant models that use much less data by eliminating junk. I particularly liked the idea of creating multiple "interlocking" variants of data designed to combat the LLM flaw of A = B, B =/= A (the "reversal curse") and otherwise their inability to truly generalize.
      My pet "internal model of LLM attention" is actually DNA sequencing. A huge pattern is broken apart into small chunks and then pieced together into new patterns by having them mesh with each other using semantic-distance similarity - that explains both the strong and weak points of LLMs. While I think that using graph RAG and symbolic-logic metacognitive systems is still a must to make LLMs truly useful, simply having more patterns that are "rotated/translated" this way and that should create a better "illusion of general intelligence" at the very least...

    • @hjups
      @hjups 24 days ago

      "Extremely fast and efficient" is relative. Samsung and SK Hynix already do that with their HBM-PIM, but are only able to get a 2x-3x improvement. That's at most 2 orders of magnitude (in base 2). That 2x is still valuable, but it's limited by communication depth (sum trees can't be faster than log2 N), and the technology nodes used by DRAM are relatively slow compared to CMOS.

    • @adamrak7560
      @adamrak7560 23 days ago

      @@hjups HBM-PIM is a generic processor near each pair of DRAM banks, with a quite underpowered FPU. It is not a highly parallel, specific dot-product engine, so for AI inferencing it is unsurprisingly very weak. For AI inferencing we only need a dot-product engine and very little control circuitry or registers.

    • @hjups
      @hjups 23 days ago

      @@adamrak7560 That's incorrect. The HBM-PIM implementations are a special-function SIMD ALU near each bank (they have an ISA of 16 instructions or something small like that), one of which has a dot-product sum tree (I can't recall which one it was).
      And you do need more than just a dot-product engine for efficient inference. You also need the ability to perform element-wise addition, multiplication, and some movement operations for transpose.

  • @jimbo8853
    @jimbo8853 25 days ago +17

    Devs learning linear algebra to upskill for AI in shambles

    • @Decocoa
      @Decocoa 25 days ago +2

      Joking aside, mate, why would devs need linear algebra for AI? Surely the basics from high school should be sufficient? You abstract away the layers and optimisers with TF?

    • @jamescunningham8092
      @jamescunningham8092 24 days ago +23

      @@DecocoaTo be truly effective in an environment where the state of the art changes all the time, you need at least a little understanding of how things work. Without any understanding of linear algebra you’d be at a big disadvantage.

    • @coversine479
      @coversine479 24 days ago +4

      @@Decocoa if you don't know LA and Calculus you can't understand AI papers. Period. But if you are just an application developer using someone else's AI API obviously you don't need to know how it works internally to use it

  • @alan2here
    @alan2here 20 days ago +1

    Evolution, the models are the species, we cause mutation and are also the environment, speciation is common.

  • @WalterSamuels
    @WalterSamuels 24 days ago

    Look into VSA (hyperdimensional computing), and balanced ternary notation.

  • @serhanciftlikci3651
    @serhanciftlikci3651 24 days ago

    I think it all boils down to the classical idea of the bias-variance tradeoff. Using ternary weights results in a biased model (hence the big loss gap compared to the transformer at the start). They can add more weights, but that would remove all the gains at inference. If they can also find a component to increase the variance of the system, it may be the new way to train LLMs in the future.

  • @adeeelh
    @adeeelh 18 days ago

    +100 to the rant at 25:32 about researchers relying on tricks instead of the main idea of the paper. It's my biggest pet peeve with deep learning papers.

  • @abdulshabazz8597
    @abdulshabazz8597 7 days ago

    This algorithm can be further adapted to arbitrary, non-binary bit-arrays to further improve their performance by first factoring the RHS matrices into primes, which are essentially then viewed as unary values, and summing each tensor of primes and their products in parallel...

  • @hasko_not_the_pirate
    @hasko_not_the_pirate 23 days ago

    19:20 Isn't the essential trade-off that they encode learned models in a ~1.6-bit "ternary" data type rather than an 8-bit, 16-bit, or 32-bit float data type for the weight matrix? It seems likely that you would need roughly 20 times as many weights to encode the same information as a float32 weight matrix, which would then increase compute complexity accordingly.
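
    The "roughly 20 times" figure is just the information content of a trit versus a float32; whether a model actually needs that many extra weights in practice is a separate question.
    ```python
    import math

    bits_per_trit = math.log2(3)   # ~1.585 bits of information per ternary weight
    print(32 / bits_per_trit)      # ~20.2, the "roughly 20 times" above
    ```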

  • @JoeTaber
    @JoeTaber 24 days ago

    I wonder if a Tenstorrent device would be able to process these operations efficiently.

  • @JBoy340a
    @JBoy340a 8 days ago

    The FPGA is interesting. It would be interesting to see what this means for portable real-time devices.

  • @AleksandrUmnov
    @AleksandrUmnov 24 days ago +1

    6:24 the pigeon moment

  • @clray123
    @clray123 24 days ago

    I have a nagging suspicion that the attention complication they do after the ternary quantizing of the QKV weights is there to recover (as in "store elsewhere") the same weights that they claim to have dropped...

  • @ssssssstssssssss
    @ssssssstssssssss 24 days ago +1

    I saw this the other day and really liked how they claim not to be doing matrix multiplication while still doing matrix multiplication. It's just an efficient implementation of a special case. It makes me feel a bit disappointed despite the contribution of the paper looking to be quite solid.

  • @FredericoKlein
    @FredericoKlein 23 days ago

    A multiplication by 2 is just a bit shift in binary (in floating point, it's just adding 1 to the exponent, isn't it?)
    So they could have done 2, 4, 8, ... and -2, -4, -8, ... couldn't they?
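
    Yes, roughly: scaling by a power of two only touches the exponent field (for normal, positive floats, and ignoring overflow). A small Python check, poking the IEEE-754 bits directly just to illustrate:
    ```python
    import math
    import struct

    x = 3.14159
    assert math.ldexp(x, 3) == x * 8          # ldexp scales by a power of two exactly

    # The same multiply-by-2 by bumping the exponent field of a 64-bit float:
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits += 1 << 52                           # exponent field starts at bit 52
    assert struct.unpack("<d", struct.pack("<Q", bits))[0] == x * 2
    ```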

  • @alan2here
    @alan2here 23 days ago +1

    PCs today can pack a ternary value into 2 bits, utilising 75% of the space, and compute with it fairly efficiently. Maybe not so practical to compute with, but 3 ternary values also fit into 5 bits, giving 84%, and 10 ternary values fit into 16 bits (2 bytes), utilising 90%. 😮 (A quick check of these figures follows after this thread.)

    • @alan2here
      @alan2here 23 days ago

      Unfortunately 2^m never equals 3^n for any integers other than m = n = 0, so the packing is never perfectly efficient.
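
      A quick check of the packing figures from the parent comment, where "utilisation" is the fraction of bit patterns that actually encode a trit string, i.e. 3^k / 2^b:
      ```python
      for k, b in [(1, 2), (3, 5), (10, 16)]:
          print(f"{k} trit(s) in {b} bits: {3**k / 2**b:.0%} of bit patterns used")
      # -> 75%, 84%, 90%, matching the figures above
      ```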

  • @evilby
    @evilby 23 days ago +1

    TTT on the way?

  • @norlesh
    @norlesh 24 days ago +1

    How does this affect the GPU poor such as myself (humble RTX 2080)? I'm wondering how this would perform implemented as something like llama.cpp, tailored to run on CPU and system RAM with the GPU just for icing when available.

    • @JBoy340a
      @JBoy340a 8 days ago

      Yes. As a fellow 2080 owner I often run into issues with resources. It would be nice to see these sorts of issues go away.

  • @KevinHorecka
    @KevinHorecka 24 days ago +15

    "stay hydrated" was a shockingly helpful reminder that I haven't drank any water today. Thanks!

  • @erickmarin6147
    @erickmarin6147 24 days ago

    Been trying to Verilog something like that myself for a while.

  • @fiNitEarth
    @fiNitEarth 23 days ago

    Well, didn't they compare their model to Transformer++, which also quantizes its weights to ternary?

  • @khaledbouzaiene3959
    @khaledbouzaiene3959 25 days ago

    I wish you had explained the FPGA or ASIC part: how this is done using addition or element-wise operations instead of matrix multiplication.

    • @hjups
      @hjups 24 days ago +1

      The authors don't go into detail nor is the RTL code in their repo. From their description and diagram, it's a stand-alone DMA unit, which takes in the address of the ternary matrix, the address of the activation matrix (most likely), and the address of the destination matrix (most likely). Then it fetches a column of the transposed ternary matrix to store in a local buffer, and streams the rows of the activation matrix into an accumulator, which then gets written back to the destination address.

  • @clray123
    @clray123 24 days ago

    I don't understand putting this linearized architecture in the same basket as state-space models at 30:22. The (selective) "accumulation of the past" in state-space models (specifically Mamba) makes the next state data-dependent (namely, on all the selectively accumulated past data), not just on the next token. Or are you saying that, because of the selectivity, newer tokens may have no chance of using information from older tokens that have been rejected by selection (but that is kind of the tradeoff for not having to maintain a KV cache of indefinite length)?

  • @aitarun
    @aitarun 23 days ago

    The 1-bit and 1.58-bit LLM papers came out a while back. I wonder why these models are not available yet. There are quantized models, but no model is available that was trained at 1 or 1.58 bits. Seems like accuracy-related issues keep them from being worth it compared to their full-precision counterparts.

  • @ekstrapolatoraproksymujacy412
    @ekstrapolatoraproksymujacy412 24 days ago

    The attention layer is needed for in-context learning, and in-context learning capability is strongly correlated with intelligence; architectures like RWKV struggle with this. Looking at the loss and most of the current benchmarks is very misleading regarding actual performance: those things mostly measure how much the model remembered, not how well it generalizes. That's why nobody really uses those "modern RNN" thingies; they only look good on paper, not in practice.

  • @rockapedra1130
    @rockapedra1130 24 days ago +1

    18:16 I like the duplication hack. I wonder if brains use that. Synapses would be +1 = excitatory synapse, -1 = inhibitory synapse, 0 = no synapse, other numbers = multiple synapses. Maybe. Who knows. LOL

    • @LuizFernando-hv1td
      @LuizFernando-hv1td 21 days ago

      I think you would be interested in looking into SNNs! From what I understand, when you include the time dimension, something like this happens in the form of spike frequency.

    • @rockapedra1130
      @rockapedra1130 21 days ago

      @@LuizFernando-hv1td Hey, that's pretty cool! If we add spiking frequency and an "integration window" to the mix, then it works even better! Then we can do: spike freq * int window * (num exc synapses - num inh synapses) = value! That allows arbitrary precision with ternary synapses. If I were a brain engineer, I'd do that! Probably everybody does already ... Lol.

  • @adityashukla9840
    @adityashukla9840 24 days ago

    Can you please make a video on DUCK-Net?

  • @eaglefacts990
    @eaglefacts990 13 days ago

    What PDF editor do you use?

  • @TheNerd484
    @TheNerd484 24 days ago

    IMO, if any architecture will yield actually intelligent AIs, it would look very similar to this. I think training would be the main hard part.
    I'm of the opinion that if this model were trained such that it does not have to output a token on every iteration, you would see significant performance improvement basically for free.

  • @albinoameise
    @albinoameise 22 days ago

    But your idea of simply repeating the input tokens for attention does not necessarily result in too many tokens, because you could apply that np.where operation once in a step beforehand to thin out the input tokens with a ternary thinning matrix, and then replicate and 'attend' to only those with values > 0.
    So I find your idea at least worth trying!

  • @TheTruthOfAI
    @TheTruthOfAI 22 days ago

    This paper is wild as hell... even coming out with an FPGA solution. To be honest, it's one of those papers that I don't fully, entirely, 101% grasp. I did try some of this ternary multilateration approach; according to the "book", its numerical floating-point precision, for example with 13 operators, reaches 100% of float16 precision. Truth is, on the battlefield it doesn't perform well in my experiments.

  • @MrBioloidboy
    @MrBioloidboy 18 days ago

    Sentient AI is here! Can I try brain-tech data science integrations now?

  • @aneeshprasobhan
    @aneeshprasobhan 25 days ago +10

    NVIDIA's shares rely on this paper not getting too much attention xD

    • @tarumath319
      @tarumath319 24 days ago +1

      They would just need to add ternary accelerators and maybe more int8 ones.

    • @eadweard.
      @eadweard. 24 days ago +4

      Is that a pun?

    • @aneeshprasobhan
      @aneeshprasobhan 24 days ago

      @@eadweard. i tried xD

    • @aneeshprasobhan
      @aneeshprasobhan 8 days ago

      @@eadweard. i tried xD

    • @kazedcat
      @kazedcat 7 days ago

      Nvidia could just add a ternary operation to their GPUs. It is super simple hardware: "copy if 1, zero out if 0, and negate if -1". They would only need to add a single new instruction, VTerAcc ("Vector Ternary Accumulate").

  • @mrpocock
    @mrpocock 24 days ago +2

    Is this not an opinionated ReLU?

  • @charstringetje
    @charstringetje 24 days ago

    Am I the first to see that Q=K=V, and that we can reduce all MatMul to ⅓ the current operations without introducing other operations? 🙃 3:44

    • @charstringetje
      @charstringetje 24 days ago

      Oh, I spoke too soon... Handwaving follows.

    • @clray123
      @clray123 24 days ago

      The weight matrices are "obviously" supposed to be different, but in some cases the same K and V submatrices are reused for subsets of Q (or for all Q), indeed leading to memory savings (although not to 1/3). See papers on multi-query attention (MQA -> all Qs share same KV) and grouped-query attention (GQA -> some Qs share same KV).

  • @bjarke7886
    @bjarke7886 24 days ago

    ESM3 ESM3 ESM3 ESM3 ESM3 ESM3

  • @kop-lg7lo
    @kop-lg7lo 25 days ago

    Kinda cool, but surely we're not ready for this type of architecture.

  • @tarumath319
    @tarumath319 24 days ago

    A lot of people talk about BitNet and this improvement over it, but the big guys in AI like OpenAI don't seem to care about it.

    • @clray123
      @clray123 24 days ago +1

      Sunk cost fallacy. The hardware they've already paid for needs to be amortized first. It's very difficult to admit to investors they've burnt so much money by committing to an unripe architecture.

  • @hermannschmidt9788
    @hermannschmidt9788 24 days ago

    Bitcoin mining used to be run on GPUs first. Then came the FPGAs, followed by ASICs. I wonder if this progression will apply to transformer networks as well. This would put Nvidia out of business. Calculating a hash value is a simpler task, however.

    • @clray123
      @clray123 24 days ago

      Why do you think Nvidia would be incapable of manufacturing (and foremost patenting) these other circuits?

    • @hermannschmidt9788
      @hermannschmidt9788 24 days ago

      @@clray123 I just followed the mining analogy. They stayed with the GPUs, which is their core competence, and gave away this business.

  • @cherubin7th
    @cherubin7th 23 days ago +5

    Nvidia is cooked

  • @g_glop
    @g_glop 23 days ago +1

    MatMul? I'm allergic.

  • @christospapadopoulos7894
    @christospapadopoulos7894 18 days ago +1

    Eight authors for a scientific paper is absurd; at this point, who even is the main one?

  • @seanreynoldscs
    @seanreynoldscs 24 days ago +2

    I'm calling BS. They are approximating the floating-point weights by using overly large weight matrices. This paper could also be called "having a smaller network sometimes outperforms a larger network on small datasets".

  • @Navhkrin
    @Navhkrin 24 days ago +2

    Big doubt this approach scales. It gives me vibes of the kind of research that works for that one specific, tailor-engineered scenario and sucks for everything else. Otherwise we would have seen a significantly higher number of experiments in various settings.

    • @clray123
      @clray123 24 days ago

      That is one stupid argument to make, with that approach you can disqualify any new idea ("the idea must obviously be bad otherwise we would have seen it before").

    • @deltamico
      @deltamico 24 days ago +2

      It's more like "the idea must be bad, because otherwise the authors would be willing to explore its capabilities in different settings", which is not always true but absolutely has grounds.

    • @clray123
      @clray123 24 days ago +1

      @@deltamico But this whole "but does it scale" argument assumes the researchers have infinite money to burn on hardware. They obviously don't, that's why they explore new ideas with smaller models.

    • @Jononor
      @Jononor 23 days ago +1

      Integer quantization is standard practice in edge/mobile/TinyML. Sub-byte quantization and even binary networks have seen considerable research in the last decade. Most of that research has been on CNNs; Transformers and LLMs have not seen as much yet, but it is coming. No one knows whether ternary or MatMul-free will be the best representation, though...

  • @naromsky
    @naromsky 24 days ago

    That's one boring paper.