Great paper overall! Benchmarks on NMT or MLM tasks would be appreciated :)
Hello! Thank you for the interesting paper. On the last equation of the "Linear Attention" slide, why do you not cancel out the Φ(Qi)^T in the equation since it's in both the numerator and the denominator?
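For reference, my reading of that equation (reconstructed from the paper, so the notation may differ slightly from the slide) is:

```latex
V'_i = \frac{\phi(Q_i)^T \sum_{j=1}^{N} \phi(K_j) V_j^T}{\phi(Q_i)^T \sum_{j=1}^{N} \phi(K_j)}
```

If that reconstruction is right, the numerator contracts Φ(Q_i) with a d × d_v matrix while the denominator contracts it with a d-dimensional vector, so the two occurrences multiply different quantities, which I assume is why they cannot be cancelled like a common scalar factor. Is that the intended reading?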
Dear Angelos, it's a great method, but I did not find practical examples of using these transformer builders in your repo. Could you please add some examples for an NLP task such as an intent recognizer? Maybe it would be quick to take a ready-made tokenizer like the ones in Hugging Face and combine it with your method? It would also be great to share some routines for training and testing such modules. In the video you mentioned two experiments with MNIST and CIFAR-10, but no such train/test examples exist in the repo.
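To make the request concrete, here is a rough sketch of what I have in mind. The TransformerEncoderBuilder keyword arguments are only my guess from the repo's README and may not match it exactly; the Hugging Face tokenizer, the IntentClassifier wrapper, and the dummy labels are my own assumptions, not code from the repo:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer                      # Hugging Face tokenizer
from fast_transformers.builders import TransformerEncoderBuilder

NUM_INTENTS = 7          # hypothetical number of intent classes
D_MODEL = 256            # must equal n_heads * query_dimensions below

class IntentClassifier(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D_MODEL)
        # Linear-attention encoder; kwargs assumed from the README
        self.encoder = TransformerEncoderBuilder.from_kwargs(
            n_layers=2,
            n_heads=4,
            query_dimensions=64,
            value_dimensions=64,
            feed_forward_dimensions=512,
            attention_type="linear",
        ).get()
        self.head = nn.Linear(D_MODEL, NUM_INTENTS)

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq_len, D_MODEL)
        x = self.encoder(x)              # same shape out
        return self.head(x.mean(dim=1))  # mean-pool over tokens, then classify

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = IntentClassifier(tokenizer.vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(texts, labels):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(batch["input_ids"])
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy usage with made-up intents
loss = train_step(["turn on the lights", "what's the weather"], torch.tensor([0, 1]))
```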
Ingenious idea! Admirable.
Hi, are there pretrained weights we can download, or do we have to train from scratch? Thanks!
I guess you could finetune from a normal softmax-based attention model, continuing from the same qkv weights.
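Something along these lines, assuming the q/k/v projection layers keep the same parameter names and shapes in both models (which I have not verified against the repo):

```python
import torch

def copy_matching_weights(src_model, dst_model):
    """Copy every parameter whose name and shape match between the two models."""
    src_state = src_model.state_dict()
    dst_state = dst_model.state_dict()
    transferred = {
        name: tensor
        for name, tensor in src_state.items()
        if name in dst_state and dst_state[name].shape == tensor.shape
    }
    # strict=False leaves parameters that exist only in the destination untouched
    dst_model.load_state_dict(transferred, strict=False)
    return sorted(transferred)  # names that were actually copied
```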
You are using a custom CUDA kernel and comparing it to non-custom CUDA implementations of the transformer and LSH? That is misleading.
Thanks for the interest in the paper. I would disagree that it is misleading. Our kernel is nowhere near as optimized as the default cuDNN GEMM implementations. We were also as transparent as possible in the paper, and we see the most important contribution as the formulation rather than the CUDA implementation. Finally, the CUDA kernel is only used for training, not for inference, which is implemented with the provided PyTorch operations.
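For anyone curious, here is a minimal sketch (my own, not the repo's code) of how causal linear attention can be written with plain PyTorch ops, using the φ(x) = elu(x) + 1 feature map from the paper. It materializes the per-step key-value sums explicitly, which is the memory cost the optimized kernel is meant to avoid, so treat it as an illustration only:

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(Q, K, V, eps=1e-6):
    # Q, K: (batch, heads, seq_len, d_k), V: (batch, heads, seq_len, d_v)
    phi_Q = F.elu(Q) + 1   # feature map phi(x) = elu(x) + 1
    phi_K = F.elu(K) + 1

    # S_i = sum_{j<=i} phi(K_j) V_j^T, built as a prefix sum of outer products
    # (stores a (seq_len, d_k, d_v) tensor per head, unlike the optimized kernel)
    KV = torch.einsum("bhld,bhlm->bhldm", phi_K, V)
    S = KV.cumsum(dim=2)

    # Z_i = sum_{j<=i} phi(K_j), the normalizer
    Z = phi_K.cumsum(dim=2)

    num = torch.einsum("bhld,bhldm->bhlm", phi_Q, S)              # phi(Q_i)^T S_i
    den = torch.einsum("bhld,bhld->bhl", phi_Q, Z).unsqueeze(-1)  # phi(Q_i)^T Z_i
    return num / (den + eps)
```

At inference time the same running sums can be updated one token at a time, S_i = S_{i-1} + φ(K_i) V_i^T, which is the recurrent view described in the paper.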