#236
- Published on Feb 4, 2025
- Pre-trained language models are increasingly important components across multiple information retrieval (IR) paradigms. Late interaction, introduced with the ColBERT model and recently refined in ColBERTv2, is a popular paradigm that holds state-of-the-art status across many benchmarks. The Performance-optimized Late Interaction Driver (PLAID) dramatically speeds up the search latency of late interaction. Without impacting quality, PLAID swiftly eliminates low-scoring passages using a novel centroid interaction mechanism that treats every passage as a lightweight bag of centroids. PLAID uses centroid interaction as well as centroid pruning, a mechanism for sparsifying the bag of centroids, within a highly optimized engine to reduce late interaction search latency by up to 7× on a GPU and 45× on a CPU against vanilla ColBERTv2, while continuing to deliver state-of-the-art retrieval quality. This allows the PLAID engine with ColBERTv2 to achieve latency of tens of milliseconds on a GPU and tens to a few hundred milliseconds on a CPU, even at a scale of 140M passages.
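The centroid interaction and centroid pruning ideas can be illustrated with a short sketch. This is only a minimal illustration under simplifying assumptions, not the paper's implementation: the function name, the `prune_threshold` value, and the plain-NumPy data layout are hypothetical, and the real PLAID engine operates over compressed residual codes with optimized kernels.

```python
import numpy as np

def centroid_interaction_scores(query_emb, centroids, passage_centroid_ids,
                                prune_threshold=0.45):
    """Illustrative sketch of centroid interaction with centroid pruning.

    query_emb:            (num_query_tokens, dim) query token embeddings
    centroids:            (num_centroids, dim) k-means centroids from indexing
    passage_centroid_ids: list of 1-D int arrays; each passage is a "bag of
                          centroid IDs" (one ID per passage token)
    prune_threshold:      illustrative cutoff for centroid pruning (hypothetical)
    """
    # Score every centroid against every query token once, up front.
    # shape: (num_query_tokens, num_centroids)
    q_c_sims = query_emb @ centroids.T

    # Centroid pruning: centroids whose best query-token similarity is low
    # contribute little to MaxSim, so they are dropped from every bag.
    keep = q_c_sims.max(axis=0) >= prune_threshold

    scores = []
    for ids in passage_centroid_ids:
        ids = ids[keep[ids]]  # sparsify this passage's bag of centroids
        if ids.size == 0:
            scores.append(0.0)
            continue
        # Late-interaction MaxSim, but over centroid similarities instead of
        # full token embeddings: max over the passage's centroids per query
        # token, summed over query tokens.
        scores.append(q_c_sims[:, ids].max(axis=1).sum())
    return np.asarray(scores)
```

Because the query-to-centroid similarities are computed once per query, scoring each candidate passage reduces to a gather plus a max over a small bag of centroid IDs, which is what makes this stage cheap enough to prune low-scoring passages before any full late-interaction scoring.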
In this video, I talk about the following:
- How does ColBERTv2 work?
- How does PLAID (Performance-optimized Late Interaction Driver) differ from ColBERTv2?
- How does PLAID work?
- How does PLAID perform?
For more details, please see arxiv.org/pdf/...
Santhanam, Keshav, Omar Khattab, Christopher Potts, and Matei Zaharia. "PLAID: An Efficient Engine for Late Interaction Retrieval." In Proceedings of the 31st ACM International Conference on Information and Knowledge Management, pp. 1747-1756. 2022.