Thank you for sharing this insightful video. In the introduction of Mamba, it says "parallelizable training"; can you explain how parallel training is possible in an autoregressive model?
I think you might be looking for the "selective scan" part of Mamba. In section 3.3.2 of the paper arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf, they say "To avoid the sequential recurrence, we observe that despite not being linear it can still be parallelized with a work-efficient parallel scan algorithm (Blelloch 1990; Martin and Cundy 2018; Smith, Warrington, and Linderman 2023)". In short, they use a well-known parallel-algorithm trick for computing prefix sums. See en.wikipedia.org/wiki/Prefix_sum#Parallel_algorithms and you'll notice the similarity. Hope this helps!
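To make the trick concrete, here is a minimal sketch (my own toy code, not the paper's implementation) for a scalar recurrence h[t] = a[t]*h[t-1] + b[t]. The key observation is that composing two affine maps h -> a*h + b is associative, which is exactly what lets a parallel scan replace the sequential loop:

# Toy illustration of the prefix-scan idea behind the "selective scan".
# Assumes a scalar recurrence h[t] = a[t]*h[t-1] + b[t]; the real model
# uses matrices and vectors, but the associativity argument is the same.
import math

def combine(left, right):
    # Compose two affine maps h -> a*h + b (apply left first, then right).
    # This operator is associative, so a scan can evaluate it in parallel.
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def parallel_scan(elems):
    # Simple O(n log n) doubling (Hillis-Steele) inclusive scan; each of
    # the log2(n) rounds could run in parallel across all positions t.
    elems = list(elems)
    n = len(elems)
    step = 1
    while step < n:
        new = list(elems)
        for t in range(step, n):
            new[t] = combine(elems[t - step], elems[t])
        elems = new
        step *= 2
    return elems

# Example: h[t] = 0.5*h[t-1] + x[t] with h[-1] = 0.
xs = [1.0, 2.0, 3.0, 4.0]
scanned = parallel_scan([(0.5, x) for x in xs])
hs = [b for (_, b) in scanned]  # with h[-1] = 0, h[t] is the b-component

# Check against the plain sequential recurrence.
h, ref = 0.0, []
for x in xs:
    h = 0.5 * h + x
    ref.append(h)
assert all(math.isclose(p, q) for p, q in zip(hs, ref))

The actual implementation uses a work-efficient Blelloch-style scan on GPU rather than this doubling variant, but the reason parallelization is possible at all is the same associativity.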
Can you share the slides?
Teacher forcing: during training, the ground-truth tokens themselves are fed as inputs, so the model's prediction at every position can be computed in one parallel forward pass instead of being generated step by step.
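Here is a tiny self-contained sketch of what that means in practice (a toy numpy "model"; all names are made up for illustration). The inputs are the ground-truth tokens shifted by one position, so the loss at every position is computed at once, with no sequential generation during training:

# Toy teacher-forcing example: train on all positions simultaneously.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, T = 10, 16, 8
tokens = rng.integers(0, vocab, size=T)     # one training sequence

E = rng.normal(size=(vocab, dim))           # toy embedding table
W = rng.normal(size=(dim, vocab))           # toy output projection

inputs, targets = tokens[:-1], tokens[1:]   # shift by one: predict next token
logits = E[inputs] @ W                      # all T-1 positions in one pass

# Cross-entropy over every position at once; nothing here is sequential.
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
loss = -np.log(probs[np.arange(T - 1), targets]).mean()
print(loss)

Note that inference is still sequential; the parallelism from teacher forcing is a training-time property.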
Follow this video and you'll get a hands-on understanding of why an AR model can be trained in parallel: th-cam.com/video/kCc8FmEb1nY/w-d-xo.html
... some kind of revelations from an ML junior