Timestamps:
00:00 Tokenization
00:12 Text as unstructured data
00:39 What is tokenization?
01:09 The challenges of tokenization
03:09 DEMO: tokenizing text with spaCy
07:55 Preprocessing as a pipeline
This guy posted a mind-blowing series and then left. Thank you, you're a legend!
I'm from France and I just came across this superb playlist, which for me is the most complete one on YouTube! Thank you, a huge thank you! It's hard to find courses of such quality.
Good and straight to the point. Also well narrated.
Thanks for posting this series buddy!!
Thank you so much for offering such high quality content 🎉
Hope it helps!
Nicely done, thanks!
Great lectures, thumbs up.
Great to know more about NLP concepts. Some of these concepts aren't mentioned in the Hugging Face tutorials. I guess they may be a little outdated in the era of transformers.
For specific use cases like understanding mathematical expressions or physics/chemistry notation, should we build our own tokenizer or extend the existing libraries?
My policy is to always use what's out there first, test it to see if it meets my needs, and decide from there.
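For example, before building anything custom, spaCy's tokenizer can be extended with special-case rules. A minimal sketch, assuming spaCy 3 and the small English model ("Mg2+" is just a made-up example of domain notation):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
# By default the suffix rules would split "Mg2+" into "Mg2" and "+";
# a special case tells the tokenizer to keep it as one token.
nlp.tokenizer.add_special_case("Mg2+", [{ORTH: "Mg2+"}])
print([t.text for t in nlp("Dissolve the Mg2+ ions in water.")])

If the extension hooks aren't enough, you can swap in your own tokenizer component, but I'd exhaust the built-in options first.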
Hello. Thank you for such a detailed course. I have a question about using pre-trained language models. My language (Azerbaijani) is not yet available in the library. Do you cover this topic later, or is it not worth spending time learning without such a model?
These are very helpful videos, thank you! There are still a few concepts that are unclear. You mentioned that documents are segmented into a list of sentences, and each sentence is segmented into a list of tokens. This implies that the list of tokens is empty to begin with, and after tokenization we end up with a list of tokens (a token vocabulary?) specific to the corpus we provide. But later, when you start tokenizing with spaCy, you load some database. What is that doing? Shouldn't spaCy just be a program/tool with some "advanced rules" that tokenizes a document we provide and builds a new token vocabulary from scratch, rather than using its own db/list built from some unknown corpus as a starting point?

And finally, why tokenize one sentence at a time? Because a document can be large? Could it have read in a fixed number of words at a time, say 100 words, and then tokenized them? A "sentence" should have no meaning for the tokenizer, is that right? Actually, how does a tokenizer even "know" when a sentence starts or ends?! Thanks for any clarifications!
The db you are referring to is the statistical model, which was trained on some annotated data (I forget the name). That is the thing that tokenizes the given document or sentences. spaCy is just a module that helps us tokenize our data according to that statistical model. ... At least, I think so. Just a beginner...
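To make that concrete, here's a minimal sketch of what the demo is doing (the example sentence is my own):

import spacy

# spacy.load loads a pretrained pipeline (the "db"), installed separately,
# e.g. the small English model en_core_web_sm.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Smith moved to the U.S. She works on NLP.")

# Tokenization itself is rule-based, but sentence boundaries come from a
# trained component, which is how spaCy "knows" that the periods in
# "Dr." and "U.S." don't end sentences here.
for sent in doc.sents:
    print([token.text for token in sent])

So the pretrained model isn't a token vocabulary for your corpus; it's the statistical machinery (tagging, parsing, sentence boundaries) that rules alone can't provide. The token list for your corpus is still built from your own documents.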
Hi, fantastic course! Wondering if by any chance there are solutions available for the exercises in the notebooks? I checked the GitHub and Colab but was unable to find solutions for the exercises.
Thank you!
I didn't publish solutions for the exercises but if you're stuck, email me and I'll help you out.
You have a radio voice.
Since Hugging Face and OpenAI provide APIs for this, could we skip spaCy and NLTK, these relatively old libraries?
spaCy uses transformers under the hood.
That being said, I would use the HF libraries if you're looking to do more fine-grained work beyond calling out to an LLM.
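As a rough sketch of the fine-grained route (the model name is just an arbitrary example):

from transformers import AutoTokenizer

# Subword tokenization with a pretrained Hugging Face tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Prints WordPiece subword pieces, roughly: ['token', '##ization', 'isn', "'", 't', 'trivial', '.']
print(tokenizer.tokenize("Tokenization isn't trivial."))

With HF you control the tokenizer, the model, and the training loop; with an LLM API you mostly control the prompt.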
@@futuremojo Thank you so much. I learned a lot of NLP concepts from your great lectures.
Is it still possible to connect to a local runtime? I can't see an obvious connect button. May delete this if I solve it, thanks for any help!
Hey Olu: yes, it's possible. These instructions worked for me:
research.google.com/colaboratory/local-runtimes.html
@@futuremojo Thanks so much; very fast reply also!
Do you happen to know if colab has any quirks with zsh shell? Googling turns up nothing but first pip install in the notebook returns: `zsh:1: no matches found: spacy==3.* `
Edit: seems to work without the ==3! Now I'm trying to work out why it doesn't recognise it as a module...
@@oluOnline I just tried on zsh and got the same error.
Googled it and found this:
stackoverflow.com/questions/30539798/zsh-no-matches-found-requestssecurity
When I use quotes like this:
`pip install -U 'spacy==3.*'`
It works!
@@futuremojo Final question: import says module not found? Sorry for all these setup questions; I'm unsure what's zsh, what's Colab, and what's Python!
(It would be nice to have an extra page before lesson 0 with an intro to the tools used, e.g. Jupyter notebooks, Colab, and I assume a load of other stuff.)
@@oluOnline Is the problem happening when you import spaCy? If so, I'm not getting that.
Here's a video I shot of me starting in an empty pipenv shell and installing spacy.
www.loom.com/share/252f86aab1394b3580840ea2f55cba54
My guess is that there's an environment issue where pip is installing it in one Python environment, but you're trying to import it in *another* Python environment. Are you using a tool like virtualenv to isolate environments?
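A quick sanity check you can paste into the notebook (just a diagnostic sketch):

import importlib.util
import sys

# Which interpreter is running this code?
print(sys.executable)

# Can that interpreter see spaCy, and where from?
spec = importlib.util.find_spec("spacy")
print(spec.origin if spec else "spacy is not visible to this interpreter")

If the interpreter path doesn't match the environment where pip installed the package, that's the mismatch.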
Interesting to see AI developers reword phraseology concepts and language morphemes into corporate keywords like "token".
English majors and language doctorates are laughing 😆🤣 and asking why 🤔?