How to Create a BM25 Index in Python with Rank BM25 (Search Engine)

Python Tutorials for Digital Humanities

5,667 views

Published on 20 Oct 2024
Join this channel to get access to perks:
/ @python-programming
If you enjoy this video, please subscribe.
✅Be my Patron: / wjbmattingly
✅PayPal: www.paypal.com...
Repo: github.com/wjb...
Rank BM25 Repo: github.com/dor...
If there's a specific video or tutorial series you would like to see, let me know in the comments and I will try to make it.
If you liked this video, check out www.PythonHumanities.com, where I have coding exercises, lessons, on-site Python shells where you can experiment with code, and a text version of the material discussed here.
You can follow me at:
/ wjb_mattingly

Comments • 20

@jesusmtz29 2 years ago ⁺⁴
I love how you take the time to show how it can produce incorrect results. It's very helpful.
@jesusmtz29 2 years ago ⁺¹
Is there a nice way to combine this library with spaCy?
@python-programming 2 years ago
Thanks for that comment! It is good to know that others find that approach helpful. Good question about spaCy. There would be. I am thinking through how to do it now: I think you would use the tokens of the Doc container as the tokenized text, but where you put it in the spaCy pipeline would depend on what you want it to do, and you would need to wrap it in a custom component. If you wanted it to sit outside of spaCy, you could save your Doc containers as an index, use BM25 to search, and then populate the results by looking up the matching Doc containers by their position in that index.
@karndeepsingh 1 year ago
How can we extract the trained weights from a trained BM25 model?
@kenchang3456 6 months ago
Hi. Did you ever get around to making a video on storing metadata in a dictionary that accompanies a tokenized index? Thanks for sharing.
@SOUFTVOFFICIEL 1 year ago
How can we use an inverted index with BM25? Or do we not need an inverted index if we use the BM25 model?
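BM25 is a scoring formula and an inverted index is a lookup structure, so the two are complementary rather than alternatives: the index finds the documents that contain a query term, and BM25 ranks them. The sketch below is plain Python and does not use rank_bm25. The names `postings` and `bm25_search` are hypothetical, and the IDF uses the always-positive variant `log((N - n + 0.5) / (n + 0.5) + 1)`, which is not the same formula as rank_bm25's `BM25Okapi`, so scores will not match that library.

```python
import math
from collections import Counter, defaultdict

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized = [doc.lower().split() for doc in corpus]

# Inverted index: term -> {doc_id: term frequency in that document}
postings = defaultdict(dict)
for doc_id, tokens in enumerate(tokenized):
    for term, tf in Counter(tokens).items():
        postings[term][doc_id] = tf

doc_len = [len(tokens) for tokens in tokenized]
avgdl = sum(doc_len) / len(doc_len)
N = len(tokenized)


def bm25_search(query, k1=1.5, b=0.75):
    """Score only the documents that contain at least one query term."""
    scores = defaultdict(float)
    for term in query.lower().split():
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue
        n = len(docs_with_term)
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
        for doc_id, tf in docs_with_term.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * doc_len[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


results = bm25_search("windy london")
print(results)
```

On a small corpus the index makes no practical difference; it pays off when most documents share no terms with the query and can be skipped entirely.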
@venkatesanr9455 2 years ago ⁺¹
Thanks for your valuable videos. I have one question. After semantic search I get many documents, some of which have the same content but slightly different filenames, because they were saved and backed up at different times. Can you suggest a way to keep only one document from each set with the same content, since the other copies are not needed? Would cosine similarity help to choose one document from each set of duplicates?
@python-programming 2 years ago
Thanks for the comment and question. Would you mind rephrasing this a bit? I just want to make sure I understand the core part of your question.
@venkatesanr9455 2 years ago
@python-programming I handled this by loading the PDF content for the different filenames into a pandas DataFrame and dropping duplicates, keeping the last one. I think semantic search (symmetric/asymmetric) can be done using a bi_encoder/cross_encoder. Can you discuss this, please?
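The pandas approach described in the comment above might look roughly like this. The column names, filenames, and texts are invented for illustration, and the whitespace and case normalization step is an added assumption so that trivially different copies still compare equal.

```python
import pandas as pd

# Hypothetical data: two backups of the same PDF text plus one unrelated file.
df = pd.DataFrame(
    {
        "filename": ["report_2021.pdf", "report_2022_backup.pdf", "notes.pdf"],
        "content": [
            "Quarterly results for the team",
            "Quarterly  results for the Team",
            "Meeting notes",
        ],
    }
)

# Normalize case and whitespace before comparing (an assumption, not from the comment).
df["normalized"] = df["content"].str.lower().str.split().str.join(" ")

# Keep only the last row of each group of identical normalized texts.
deduped = df.drop_duplicates(subset="normalized", keep="last")

print(deduped["filename"].tolist())
```

This only catches copies whose text is identical after normalization. Near-duplicates with real edits would need a similarity threshold instead, which is where cosine similarity could come in.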
@lukasmarteleur9318 1 year ago ⁺¹
Does this library work with text in languages other than English?
@python-programming 1 year ago ⁺¹
I have used it with Latin and it worked fine for me. So it should work with most Western languages.
@wakam229 2 years ago ⁺¹
I want my query to be all my corpus sentences. Is that possible? Like, instead of "windy london", the queries would be "hello there good man!", "it is quite windy at london"...
@python-programming 2 years ago
Yes, absolutely. You would just adjust the index accordingly.
@whoami6821 1 year ago ⁺¹
How can we use BM25L with this package?
@python-programming 1 year ago ⁺¹
Great question! You simply call the BM25L class instead; see line 137: github.com/dorianbrown/rank_bm25/blob/master/rank_bm25.py
@whoami6821 1 year ago ⁺¹
@python-programming Thank you! Also, I'm wondering if you know how to combine sentence transformers with BM25 for better search results?
@python-programming 1 year ago ⁺¹
@whoami6821 No problem! In this scenario, I would recommend using a sentence transformer to vectorize your documents and then using Annoy as the search algorithm. I don't have a video on doing this with texts, but I do have one using a CLIP model (images and text).
@superfreiheit1 8 months ago
Awesome video quality.
