Your videos and articles are a breeze to follow, James! They've truly made my learning journey smoother and more enjoyable. Thanks for all the hard work!
How do we decide the mask value?
Great explanation, James! I want to ask: what are the parameters of the feed-forward neural network? What is the size of its weight matrix so that the output can be a probability distribution over ~30,000 classes?
Really intuitive and easy to understand. Thank you very much, bro!
welcome!
Very nice video. Thanks. We are also waiting for a solution video with TensorFlow.
Thanks! Great video. Can you make a video on MLM using T5-1? It would be very helpful; I couldn't find much on that.
Excellent, James!
Really interesting stuff. But what if you want to use BERT in a different language? All the videos I saw were based on English. A video on creating a BERT model from scratch in a different language, with some simple corpus of text, would be nice. It would also be helpful if you could explain in a side note what you have to do to transform your English example into another language...
Hey Henk, yes, I've had a lot of questions on this; I'll be releasing something on it soon.
@@jamesbriggs thanks, looking forward👍
Hello. How are you, brother?
Hey James! Thanks a lot for the clear explanation of how MLM works for BERT. I have a question though: we're using only the 'encoder' part of the transformer during MLM to encode the sentence, right? So how does the 'decoder' of BERT get trained?
Thanks for the informative video. Enjoyed it. When will you upload the video on training the model through MLM?
Here you go th-cam.com/video/R6hcxMMOrPE/w-d-xo.html :)
Another great piece. I also have a doubt: while calculating the weights in the encoder (attention layer), what will the initial value of the masked token be? There should be a numerical value in order to calculate a probability and find a loss value.
@@tomcruise794 Each token has a vector representation in each encoder; for BERT-base this is a vector of 768 values, and there are 512 such vectors in each encoder (one for each token).
The final vector is passed to a feed-forward NN which outputs another vector of ~30K values (the number of tokens in BERT's vocabulary), and we then apply softmax to this.
The loss can then be calculated as the difference between this softmax probability distribution (our prediction) and a one-hot encoded vector of the real token.
That's pretty long, sorry! Does it make sense?
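For anyone who wants to see that flow in code, here is a minimal PyTorch sketch assuming bert-base-uncased's sizes (768 hidden units, ~30K vocabulary) and a single hypothetical masked position. Note that BERT's real MLM head also has an extra transform layer (dense + GELU + LayerNorm) before the vocabulary projection, which is omitted here:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, seq_len = 768, 30522, 512  # bert-base-uncased sizes

# hypothetical encoder output: one 768-value vector per token position
encoder_output = torch.randn(seq_len, hidden_size)

# feed-forward layer projecting each token vector onto the vocabulary
mlm_head = nn.Linear(hidden_size, vocab_size)
logits = mlm_head(encoder_output)        # shape: (512, 30522)
probs = logits.softmax(dim=-1)           # a probability distribution per token

# loss at one hypothetical masked position (index 4) whose true token id is 2026;
# CrossEntropyLoss applies the softmax and one-hot comparison internally
true_token_id = torch.tensor([2026])
loss = nn.CrossEntropyLoss()(logits[4:5], true_token_id)
print(loss)
```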
@james Thanks for the detailed explanation. But my question is: if a word is replaced by the mask token, what will its initial value/vector representation be? Will it be zero? Because there should be some initial value for the mask token in order to calculate a probability.
@@tomcruise794 I'm not sure I fully understand! Maybe you are referring to the initial vector in BERT's embedding array, where the mask token (103) would be replaced by a specific vector which would then be fed into the first encoder block?
In that case the initial vector representation wouldn't be zero; it would look like any other word's (as far as I'm aware), and before BERT was pretrained these values would have been initialized with random values (before being optimized to create more representative initial vectors).
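A quick way to check this with the Hugging Face transformers library (assuming bert-base-uncased): the [MASK] token id is 103 and, like every other token id, it maps to its own learned row of the embedding matrix rather than to zeros.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

print(tokenizer.mask_token_id)  # 103

# the embedding row for [MASK] - a learned 768-value vector, not zeros
mask_vector = model.embeddings.word_embeddings.weight[tokenizer.mask_token_id]
print(mask_vector.shape)                 # torch.Size([768])
print(bool((mask_vector == 0).all()))    # False
```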
Hello, this video is beneficial. But I am getting an error when passing the inputs as a double argument to the model. It says it got an unexpected argument "label". Can you please tell me what I am doing wrong?
Thank you for the video! What's the best way to get on track with deep learning? Any hints? Thank you!
Thank YOUUU very much, the missing piece in my NLP journey.
Haha awesome, happy it helps!
Thanks for this wonderful explanation, Sir. I want to build my own voice dataset to train a medical-terms model for automatic speech recognition. Please help me:
I don't know how to start. What is the structure of the dataset?
Very friendly and intuitive introduction. Many thanks for this nice video.
More than welcome, thanks!
Thank you James for this video. You've explained everything so well.
Great to hear! Thanks for watching :)
Very helpful bruh, thank you!
Thanks, man. Could you also visualize the encoded vectors in a video?
Happy you enjoyed it - I find it helps to visualize these things.
Hello!
How do I use BERT to predict the word standing in place of [MASK]?
You can use the Hugging Face 'fill-mask' pipeline: huggingface.co/transformers/main_classes/pipelines.html#fillmaskpipeline
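For reference, a minimal usage sketch of that pipeline, assuming bert-base-uncased; it returns the top-scoring candidates for the [MASK] position:

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')

# each result is a dict with the predicted token and its softmax score
for pred in unmasker("The capital of France is [MASK]."):
    print(pred['token_str'], round(pred['score'], 4))
```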
Good video. I have a question: the tokenizer results in shorter sequences than raw characters, and the probability distribution over tokens is more even than the distribution over characters. How important is tokenization to BERT's performance?
Thanks! The model must be able to represent relationships between tokens and embed some meaning into each token. If we make 1 character == 1 token, that leaves us with (in English) just 26 tokens into which the model must encode the "meaning" of language, so it is limited. If we use sub-word tokens as with BERT, we have 30K+ tokens to spread that "meaning of language" across - I hope that makes sense!
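A rough illustration of that trade-off, assuming bert-base-uncased: its WordPiece vocabulary has ~30K entries (versus 26 letters for a pure character-level scheme), and less common words get split into shared sub-word pieces.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print(tokenizer.vocab_size)                # 30522
print(tokenizer.tokenize("tokenization"))  # sub-word split, e.g. ['token', '##ization']
```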
@@jamesbriggs I get that 30K tokens means higher-level semantics. BERT still must learn the relationships between these tokens to perform. So what is the degradation of BERT at 30K tokens vs 20K vs 10K ... vs 26 tokens? I can't find any mention of this.
How do I use BERT for Roman Urdu?
So this series is basically how to pre-train BERT for any language or text from scratch, right?
The way I've used it so far is for fine-tuning: you can use the same methods as pretraining to fine-tune BERT on more specific language (improving performance on specific use-cases). But it's pretty open-ended, and I'm planning to do some videos on training from scratch on a different language.
I'll upload a series intro soon too :)
@@jamesbriggs Yes, training from scratch on a new language would be soooo helpful!!! I'll be waiting for those videos :D Thanks a lot, your channel is a gem!
@james Really impressive explanation... Is it possible, based on this MLM approach, to also build a text classification model?
Yes, MLM is used to train the 'core' BERT models; things like text classification, Q&A, etc. are handled by additional 'heads' (extra layers) added to the end of the transformer model. So you'd train with MLM, then follow that up by training on a text classification task.
This video will take you through the training for classification:
th-cam.com/video/pjtnkCGElcE/w-d-xo.html
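A hedged sketch of that flow with the transformers library: the MLM-trained weights are loaded under a freshly initialized classification head, which is then fine-tuned on labelled data ('my-mlm-checkpoint' is a hypothetical path to a BERT model further trained with MLM).

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'my-mlm-checkpoint',  # hypothetical: BERT weights after MLM training
    num_labels=2,         # the classification head itself starts untrained
)
# from here, fine-tune `model` on the labelled classification dataset as usual
```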
neat explanations! thanks!
This is very helpful! Thanks!
Thanks a lot for this masterpiece. I do have something unusual going on: it tells me that BertTokenizer is not callable when it should be. I checked and realised that the __call__ method was introduced from transformers v3.0.0 onwards, so I updated the package. Still it throws the same error. Any help here?
After updating you should be able to call it; I'd double-check that your code is actually using the updated version.
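One quick way to verify which version your code is actually picking up (tokenizers became callable in transformers v3.0.0, as noted above):

```python
import transformers
from transformers import BertTokenizer

print(transformers.__version__)  # should be >= 3.0.0 for callable tokenizers

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer("hello world"))  # works once the updated version is in use
```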
Hi... If we mask tokens after tokenization of the text sequence, wouldn't that lead to masking subwords instead of actual words? Any thoughts on the consequences of this?
Yep, that's as intended, because BERT learns the relationships between words and subwords. So BERT learns that the word 'live' (or 'liv', '##e') is a different tense but the same meaning as the word 'living' (or 'liv', '##ing'). In a sense the way we understand words can be viewed as 'subword': I can read 'liv' and associate it with the action 'to live', then read the suffix '-ing' and understand the action 'to live in the present' - hope that makes sense!
In more practical terms it also reduces the vocab size: rather than having the words ['live', 'living', 'lived', 'be', 'being', 'give', 'giving'], we have ['liv', 'be', 'giv', '-ing', '-ed'].
@@jamesbriggs thanks for the clarification !
Hi, I followed the Hugging Face tutorial for MLM, but it does not seem to work with emojis - any idea on how to do this? For example, I have a dataset containing tweets, with each tweet containing one emoji - and I want to use MLM to predict the emoji for a tweet. Thanks.
Hi Anand, I haven't used BERT with emojis, but it should be similar to training a new model from scratch. Hugging Face have a good tutorial here:
huggingface.co/blog/how-to-train
That should be able to help. In particular, this tutorial uses byte-level encodings, which should work well with emojis.
I'm working on a video covering training BERT from scratch, hopefully that will help too :)
Hope you manage to figure it out!
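For reference, a rough sketch in the spirit of that tutorial, using the `tokenizers` library's byte-level BPE (it operates on raw bytes, so emojis are handled like any other characters); 'tweets.txt' and the output directory are hypothetical names.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# train a byte-level BPE vocabulary on the (hypothetical) tweet corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=['tweets.txt'], vocab_size=30_000, min_frequency=2)

# save vocab.json and merges.txt for use when training the model from scratch
os.makedirs('./emoji-tokenizer', exist_ok=True)
tokenizer.save_model('./emoji-tokenizer')
```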
@@jamesbriggs Hi James - thanks for the reply. I will take a look at that tutorial - will it work with my own dataset? Also, keep up the great content!
@@AnandP2812 I believe so, yes. I haven't worked through it myself yet, but I see no reason why not. And I will do, thanks!
Thank you