Billion Scale Deduplication using Approximate Nearest Neighbours | Idan Richman Goshen, Sr Ds@Lusha

PyData

5,127 views

Published on 12 Nov 2024
At Lusha we deal with contact profiles, lots of contact profiles. This data is messy by nature, and a single entity can have several representations in it. Beyond the time and money spent moving messy data through the various pipelines, such data is difficult to search, and valuable information is lost in the process. Ideally we would merge all records of the same entity, even if they differ slightly (“Alagra Jones”, “Alagra Smith-Jones”). Comparing all pairs of records is feasible at small scale, but impractical when dealing with billions of records.
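To make the scale argument concrete, here is a small sketch of how the number of pairwise comparisons grows. The function name `pair_count` is made up for this illustration and does not come from the talk.

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Growth is quadratic: 1,000x more records means roughly 1,000,000x more pairs.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> {pair_count(n):,} pairs")
```

At one billion records this is about 5 x 10^17 comparisons, which is why an exhaustive all-pairs pass is not a realistic option.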
A family of algorithms known as approximate nearest neighbours (ANN) is becoming more popular for solving such challenges, making it possible to use text embeddings and clustering at large scale.
This talk will offer a brief overview of ANN algorithms and demonstrate how we can apply them to get a reasonably sized subset of candidates, which we can then pass to a classifier for a match/no-match outcome. I’ll demonstrate how we handle such a task at scale, how we evaluate the two steps, and the tools we use.
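The two-step shape described above (candidate generation, then a match/no-match decision) can be sketched as follows. This is a toy illustration, not the pipeline used at Lusha: character trigram counts stand in for text embeddings, a brute-force neighbour search stands in for an ANN index, and a similarity threshold stands in for a trained classifier. All function names, the sample records, and the threshold value are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def embed(name, n=3):
    """Character n-gram counts: a toy stand-in for a text embedding."""
    padded = f" {name.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def candidates(records, k=2):
    """Step 1: keep each record's k most similar neighbours as candidate pairs.

    Brute force for clarity; at scale an ANN index would replace this
    all-pairs loop. That substitution is the point of the approach.
    """
    vectors = [embed(r) for r in records]
    pairs = set()
    for i, v in enumerate(vectors):
        scored = sorted(
            ((cosine(v, w), j) for j, w in enumerate(vectors) if j != i),
            reverse=True,
        )
        for _, j in scored[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def is_match(a, b, threshold=0.5):
    """Step 2: placeholder for a classifier; here just a similarity threshold."""
    return cosine(embed(a), embed(b)) >= threshold


records = ["Alagra Jones", "Alagra Smith-Jones", "Bob Miller", "Robert Miller", "Carla Diaz"]
matches = sorted(
    (records[i], records[j])
    for i, j in candidates(records)
    if is_match(records[i], records[j])
)
print(matches)
```

Only the candidate pairs reach the second step, so the expensive match decision runs on a small fraction of all possible pairs. Evaluating the two steps separately (how many true duplicates survive step 1, how accurate step 2 is on the survivors) follows naturally from this split.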
Science & Technology

Comments • 1

@kevon217 6 months ago
Fascinating approach.
