7 mins to understand TF-IDF, you're my saviour
Found this after stumbling around for a good overview of BM25 & SBERT. This is a fantastic initial introduction - enough detail and introduces the right concepts that people can double down on for further learning. Thank you James!
28:50....both B and G SHARE this phrase and its several words, so THAT's why they share a high similarity score...
Really helped clear up BM25 for me! Huge thank you for sharing this!
Not many views yet, but please don't stop making content. This is the best video I have found in a week of searching.
haha happy to hear, I've committed to making videos so I'll be here for a long time 😅
check out the similarity search playlist if you're interested in these things, just finished it!
Thanks, this was very helpful! I've recently started using SQL Server's full text search capabilities to drive course searches on our college website, but it was all a bit of a "black box" thing. No idea how it worked, I just trusted that it *did* work! Until I got a query from someone who wanted to know how to alter their search results to change the order that we display them on the website. I'm no stranger to complicated mathematical formulae, but I took one look at the BM25 formula on wikipedia and cried! Your explanation made it so much easier to understand what was going on.
Now comes the hard part. Explaining how the staff member in question can alter their data to boost their results... 😬
Great work! You are a great teacher! Although I already know these concepts, I enjoyed watching it a lot.
I am into document similarity ranking and I love your videos! Thank you so much :)
Great to hear! I made a full (and free) course on semantic search if you're interested :) www.pinecone.io/learn/nlp
Great work man!
This is super helpful! Thank you for this video.
extremely simple explanation!!!!!!!!
Really good, simple explanations. Also really liked your Udemy course.
hey Thomas, yes I remember you left a review on the course? Great to see you here too and thanks!
@@jamesbriggs Yup ;) I'm getting into NLP so your videos have been super useful. Just finished my first project that uses both sparse and dense embeddings: share.streamlit.io/tomwalczak/pubmed-abstract-analyzer
And as you say in the video, dense embeddings and complex models don't always work better, at least not out-of-the-box. Looking forward to more vids :)
@@tomwalczak4992 That's a very cool project, first one too? I'm impressed!
Awesome to see you're getting into it though, looking forward to seeing you around!
point taken and understood about eg..the similar meaning, but using different words.....however in practice just a straight-forward word-to-word with frequency stats works pretty good because: words have usage frequencies, so anybody MEANING to say is gonna say not .....like 100-to-1 odds.....and ? well that's extremely rare..... is gonna be 100s of times more frequent in this context than .....
Then further, ACROSS languages (e.g. English, Japanese) the word frequencies don't necessarily translate... sometimes frequent English words are infrequent in Japanese and vice versa.
that Bert outcome is certainly cool!!!!! you made my day man!! awesome! how can we support you? (besides likes etc.)
comments like this! Really happy it helped :)
Masterful !! Thx for this and all your other stuff !!
Glad you're enjoying them!
I'm doing a uni project on this topic and your explanation was on point! Thank you
great explanations! thanks!
Excellent video thank you!
Great explanation!
Thanks this was helpful
Nice explanation
thank you so muchhhhhhh
Great work!
bro super good explanations
Hi James, fantastic video!!! A question: when using BERT to extract dense representations from the hidden_state or last_hidden_state layers, we compute masked_embeddings = embeddings * mask (where mask is the attention_mask) to zero out the padding tokens. Maybe we also need to consider the special tokens [CLS] and [SEP]? I mean, the attention mask for these special tokens is 1. So when using some hidden layer from BERT, do we need to perform a slice masked_embeddings = masked_embeddings[:, 1:sep_token_pos], where sep_token_pos is the [SEP] position in the sequence: [[CLS], tokens of the sequence, [SEP], [PAD], [PAD], ...]?
hey Paulo, good question. I believe the other sentence transformer models that build these embeddings keep both, but I have never seen them explicitly state that they do (or why) in any papers, so I can't say for sure sorry!
Nonetheless, my understanding is that the CLS and SEP tokens are included within the embeddings as they still contain useful information about the input data. The CLS token itself can actually be used in building sentence embeddings (although it is ineffective compared to mean pooling afaik). The significance of that is that the CLS token contains enough information about the sequence to be (somewhat) effectively used as a single vector representation of the whole sequence, so it contains quite useful information about the sequence that would be lost if removed.
As for the SEP token, I don't believe it is as important as the CLS, but I can't say I know how relevant it is.
I'd be curious to see a comparison of embedding performance with/without the CLS/SEP tokens though. I'm sure it has been tested but I've never seen something like that mentioned
@@jamesbriggs Hi James, I did a test with MNR loss. During tokenization I set the tokenizer parameter add_special_tokens=False and I got 0.83 against 0.81 with the default value (True). I still need to test removing only the [SEP] token to make the results more robust. Thanks for the reply :)
@@pfinardii oh so it's better? Wow I'll have to try it too - that's awesome :)
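For anyone following this thread, here is a minimal sketch of the masked mean pooling being discussed, with an optional switch to leave [CLS] and [SEP] out of the average. It is plain Python on toy vectors rather than real BERT outputs, and the function name `mean_pool` and the `skip_special` flag are made up for illustration. It assumes a single sequence laid out as [CLS], tokens, [SEP], padding, so the first and last positions with mask 1 are the special tokens. Masking them out this way avoids the fixed slice, which would need a different sep_token_pos for every sequence in a batch.

```python
def mean_pool(token_embeddings, attention_mask, skip_special=False):
    """Average the token vectors of one sequence where the mask is 1.

    token_embeddings: list of token vectors (lists of floats).
    attention_mask:   list of 0/1 ints, one per token (0 = padding).
    skip_special:     hypothetical flag; if True, also ignore the first and
                      last unmasked positions, assumed to be [CLS] and [SEP].
    """
    mask = list(attention_mask)
    if skip_special:
        real = [i for i, m in enumerate(mask) if m == 1]
        # Only drop the special tokens if something is left to average.
        if len(real) > 2:
            mask[real[0]] = 0
            mask[real[-1]] = 0

    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, mask):
        if m:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [x / count for x in total]


# Toy sequence: [CLS], two word tokens, [SEP], one [PAD].
tokens = [[1.0, 1.0], [2.0, 4.0], [4.0, 6.0], [9.0, 9.0], [7.0, 7.0]]
mask = [1, 1, 1, 1, 0]

with_special = mean_pool(tokens, mask)                        # [4.0, 5.0]
without_special = mean_pool(tokens, mask, skip_special=True)  # [3.0, 5.0]
```

Whether dropping the special tokens helps is an empirical question, as the 0.83 vs 0.81 result above suggests; the sketch only shows the mechanics.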
I think the bert-base-nli-tokens are deprecated now according to the Hugging Face website. Which Sentence Transformer model should we now use for SBERT?
I like mpnet models the most for generic sentence vectors: huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base
I cover a load of models, training methods, etc in this playlist:
th-cam.com/play/PLIUOU7oqGTLgz-BI8bNMVGwQxIMuQddJO.html
Hope it helps :)
I'm a bit confused. Is SBERT just the embedding layer, which is fed to a ML Model, or is it also the model itself to do e.g. text classification?
Can we classify tabular data where each row is one dataset
It's a great video!! I need your opinion, sir James. In this video you are using cosine similarity to calculate the distance. What do you think about combining these methods with ANN (Approximate Nearest Neighbor) using angular distance? Is that better than using cosine similarity?
Hey Iven, thanks! I think you should absolutely use ANN - definitely if you have lots of vectors. As for cosine similarity vs angular similarity, angular similarity can distinguish better between already very similar vectors, but I'm not sure if it is too important in most use-cases. Most applications from pretty smart people tend to use cosine similarity, so that is (for me) evidence that cosine similarity is 'good enough'
If you're interested in ANN and more of this, I have a big playlist on it here th-cam.com/play/PLIUOU7oqGTLhlWpTz4NnuT3FekouIVlqc.html
Hope it helps :)
@@jamesbriggs Thank you for your opinion and the playlist is quite amazing..! It helps me a lot.. thank you !
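To make the cosine vs angular point concrete, here is a small sketch in plain Python. The angular similarity used here is 1 - angle/pi, which is one common convention and an assumption on my part; libraries may define their "angular" metric differently. The two function names are illustrative only, and both assume non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angular_similarity(a, b):
    """1 - angle/pi, so 1.0 means same direction and 0.0 means opposite."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return 1.0 - math.acos(cos) / math.pi


# Two vectors that are both very close to the query direction.
query = [1.0, 0.0]
near = [100.0, 1.0]
nearer_than_far = [100.0, 2.0]

cos_gap = cosine_similarity(query, near) - cosine_similarity(query, nearer_than_far)
ang_gap = angular_similarity(query, near) - angular_similarity(query, nearer_than_far)
print(f"cosine gap:  {cos_gap:.6f}")
print(f"angular gap: {ang_gap:.6f}")
```

The angular gap comes out much larger than the cosine gap for these near-identical vectors, because cosine flattens out near 1. Since arccos is monotonic, both measures still rank neighbours in the same order, which fits the "good enough" view of cosine similarity above.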
Is there a way to "reverse" TF-IDF to see if Google uses it in its algorithm?
I've not heard of a way but it could be possible - Google's algorithms use a lot of different things though (BERT included), so I'm not sure if it would be possible to identify specific parts of it like TF-IDF
How do you train SBERT for a specific domain?
hey I have a few articles+videos on this, what does your training data look like? If you have sentence pairs + scores you can use MSE loss which I cover at the end of:
www.pinecone.io/learn/gpl/
If you don't have training data and just text data you can use unsupervised methods like GPL (above), GenQ, or TSDAE (all found here):
www.pinecone.io/learn/nlp/
If you have sentence pairs *without* labels you can use softmax or preferably MNR loss:
www.pinecone.io/learn/fine-tune-sentence-transformers-mnr/
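As a rough illustration of what MNR (multiple negatives ranking) loss does with unlabeled sentence pairs, here is a plain-Python sketch on toy vectors. It is not the sentence-transformers implementation; the function name `mnr_loss` is made up, and the scale of 20.0 is an assumption chosen for the example. The idea is that each anchor should score its own positive higher than every other positive in the batch, and the loss is a softmax cross-entropy over those scaled cosine scores.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    All other positives in the batch act as negatives for anchor i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]    # each anchor matches its own positive
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the wrong positive

print(mnr_loss(anchors, aligned))   # close to 0
print(mnr_loss(anchors, shuffled))  # close to 20
```

In real fine-tuning the vectors come from the model being trained, so lowering this loss pulls paired sentences together and pushes the rest of the batch apart. That is also why larger batches tend to give the loss more negatives to work with.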
Excellent video as always :) kinda makes me wonder why I bother spending my grant money on training courses when your whole channel is simply better. I had a question about using S-BERT for similarities between documents, rather than sentences within a document. Could I just average the embedding for the sentences within each document and calculate cosine similarity between these? Or is there a better way? Thanks!
You can do this but it's not that effective. Another option would be to compare all paragraphs and take an average score, or create some sort of threshold like "if 5 paragraphs have similarity > 0.8" etc. It's hard to do. I have a free 'course' on semantic search here, hopefully you can save some more of your grant money:
www.pinecone.io/learn/nlp/
@@jamesbriggs Thanks a lot :) I've been working through your pinecone course and am really liking it so far!
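The two document-level options from this thread can be sketched as follows, using toy vectors in place of real S-BERT embeddings. The function names and the `min_matches` parameter are hypothetical; the 0.8 threshold and the count of 5 are just the numbers from the reply above, not tuned values. Both functions assume non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def doc_similarity_mean(doc_a, doc_b):
    """Option 1: average each document's embeddings, then compare once."""
    return cosine(mean_vector(doc_a), mean_vector(doc_b))


def doc_similarity_pairwise(doc_a, doc_b, threshold=0.8, min_matches=5):
    """Option 2: compare every paragraph pair.

    Returns the average pairwise score and whether at least `min_matches`
    pairs exceed `threshold`.
    """
    scores = [cosine(p, q) for p in doc_a for q in doc_b]
    average = sum(scores) / len(scores)
    matches = sum(1 for s in scores if s > threshold)
    return average, matches >= min_matches


# Each inner list stands in for one sentence or paragraph embedding.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [1.0, 0.0]]

print(doc_similarity_mean(doc_a, doc_b))
print(doc_similarity_pairwise(doc_a, doc_b, min_matches=2))
```

Averaging blurs documents that cover several topics into one vector, which is one reason it tends to be less effective. The pairwise version keeps that detail but costs len(doc_a) * len(doc_b) comparisons per document pair.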
Would like some scripts or subtitles for your video, thank you!
Hey is there any way we can get in contact with you?
Yes on the 'About' page of my YT channel you'll be able to find my email
@@jamesbriggs DMed you on Instagram
@@jamesbriggs DMed you
@@edgar23vargas53 got it
@@jamesbriggs shot you an email
You need to tighten up your math notation. Writing f(t, D) for the "total number of terms in the document" is really confusing and makes no sense. What is t in this function? You either need to sum over all t in D, which you haven't written, or you should just get rid of the t and use some function g(D) to denote the total number of terms in the document. When you get onto BM25 it's even worse, I'm not sure your explanation of your notation is even correct. It should be f(q, D) on the denominator, the same q that is in the numerator, not f(t, D), whatever that means.
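For reference, one consistent way to write the two formulas along the lines this comment suggests is below. This follows the commonly cited Okapi BM25 form and is not necessarily the notation used in the video. Here f(t, D) is the number of times term t occurs in document D, |D| is the document length in terms, avgdl is the average document length in the corpus, N is the number of documents, n(q_i) is the number of documents containing query term q_i, and k_1 and b are free parameters.

```latex
% Term frequency: count of t in D, normalised by the total term count of D
\mathrm{tf}(t, D) = \frac{f(t, D)}{\sum_{t' \in D} f(t', D)}

% BM25: the same f(q_i, D) appears in the numerator and the denominator
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

% A common smoothed IDF
\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```

Written this way, the denominator of tf sums over every term in D, and the BM25 fraction uses f(q_i, D) in both places, which addresses both points raised above.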
You don’t know what b and k are in BM25, do you? 😏
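For readers wondering the same thing: k (usually written k1) controls how quickly repeated occurrences of a term stop adding to the score, and b controls how strongly the score is normalised by document length. Below is a minimal sketch in plain Python, assuming pre-tokenised documents and the commonly cited Okapi BM25 form with a smoothed IDF. The function name `bm25_score` is illustrative, and the defaults k1=1.2 and b=0.75 are typical choices rather than values taken from the video.

```python
import math


def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenised document against a query.

    query_terms: list of query tokens.
    doc:         list of tokens for the document being scored.
    corpus:      list of tokenised documents, used for IDF and average length.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        freq = doc.count(term)
        if freq == 0:
            continue  # a term that is absent from the document adds nothing
        containing = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - containing + 0.5) / (containing + 0.5) + 1)
        length_norm = 1 - b + b * len(doc) / avgdl
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score


short_doc = ["bm25", "search"]
long_doc = ["bm25", "search", "with", "many", "extra", "words"]
corpus = [short_doc, long_doc, ["dense", "vectors"]]

# b > 0 penalises the longer document; b = 0 switches length normalisation off.
print(bm25_score(["bm25"], short_doc, corpus), bm25_score(["bm25"], long_doc, corpus))
print(bm25_score(["bm25"], short_doc, corpus, b=0.0), bm25_score(["bm25"], long_doc, corpus, b=0.0))
```

Setting k1 to 0 makes the term frequency irrelevant (any document containing the term gets just the IDF), while larger k1 values let repeated terms keep adding to the score for longer.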