Nice to see you (all) again - more videos! Thanks!
Thanks, great video. In which add-on can I find the t-SNE module?
Great tutorial. How can I find sentences that match a selected cluster? Concordance only does one query at a time, and Corpus Viewer fetches the documents, not the sentences.
Good question. It cannot be done easily, mostly because of the underlying data structure. Technically, you could tokenize on sentences separately and use the search option to look for specific words in sentences (but again, one query at a time). We will think about how to handle this. Thank you for the hint.
Hello and thank you for your great work!! I have a problem with the "Document Embedding" widget, which keeps giving me an error when I run it after Corpus or Preprocess Text. I tried the grimm_tale dataset and many other datasets, and the error always appears. Could you advise, please?
Possibly an issue with your internet connection? Are you behind a firewall or on a proxy? Alternatively, post the error on our GitHub page and we will try to help.
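Outside Orange, the sentence-level idea from the reply above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sentence splitter is a naive regular expression, the names `split_sentences` and `find_sentences` are hypothetical, the sample documents are made up, and nothing here uses Orange's own API. Unlike Concordance, it matches several words in one pass.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def find_sentences(documents, words):
    """Return (document index, sentence) pairs that contain any of the words."""
    if not words:
        return []
    # One alternation pattern covers all query words at once.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for index, doc in enumerate(documents):
        for sentence in split_sentences(doc):
            if pattern.search(sentence):
                hits.append((index, sentence))
    return hits


# Made-up example documents and cluster words (assumptions for illustration).
documents = [
    "The wolf knocked on the door. The goats stayed quiet.",
    "A king had three sons. The youngest met a wolf in the forest!",
]
cluster_words = ["wolf", "forest"]

for doc_index, sentence in find_sentences(documents, cluster_words):
    print(doc_index, sentence)
```

The word list would come from whatever characterizes the selected cluster; here it is typed in by hand.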
Hi, great video, thank you a lot!
I have a question: I did a search on Scopus and exported the .csv file with, e.g., 700 entries. I chose to include Author, Indexed Keywords, Title and Abstract of the article in my columns.
Then I chose the following widgets (in sequence):
Preprocess text
Document Embedding
t-SNE
However, if I choose Distances and then Hierarchical Clustering, the clusters do NOT SHOW the "words" but the type of document (Article, Review, etc.) or other fields, such as "Title" or "Abstract". The entire field is shown rather than the tokenized words, e.g. "Feedback control of water supply in an NFT growing system" or "Light And CO2 interaction on peanut grown in nutrient film technique", not single words.
I hope it's clear what I mean.
Thanks for everything.
Document Embedding does not work on individual words. Instead, it returns a fixed-size embedding vector for each document, and the vector's dimensions are not interpretable. For interpretable word features, you need Bag of Words. You can cluster on embeddings and then use BoW features only for explanation, but the two representations might not coincide. To explain individual clusters, you can use Box Plot or Word Enrichment. Alternatively, you can use Annotated Document Map after t-SNE, which will show significant words for each cluster.
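To make the "cluster on embeddings, explain with Bag of Words" idea concrete, here is a minimal plain-Python sketch. It assumes the cluster labels already exist (for example, from clustering document embeddings) and ranks words by how much more often they occur in one cluster's documents than in the rest. The function names, sample titles, and labels are hypothetical, and the score is a crude stand-in, not the computation used by Orange's Word Enrichment widget.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a single document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def characteristic_words(documents, labels, cluster, top=3):
    """Rank words by (share of cluster docs with the word)
    minus (share of all other docs with the word)."""
    inside = [set(bag_of_words(d)) for d, l in zip(documents, labels) if l == cluster]
    outside = [set(bag_of_words(d)) for d, l in zip(documents, labels) if l != cluster]
    if not inside:
        return []

    def share(word, group):
        return sum(word in doc for doc in group) / len(group) if group else 0.0

    vocabulary = set().union(*inside)
    scored = [(share(w, inside) - share(w, outside), w) for w in vocabulary]
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:top]]


# Made-up titles; the labels are assumed to come from clustering embeddings.
documents = [
    "Feedback control of water supply in an NFT growing system",
    "Water supply control for NFT lettuce",
    "Light and CO2 interaction on peanut grown in nutrient film technique",
    "CO2 and light effects on peanut yield",
]
labels = [0, 0, 1, 1]

print(characteristic_words(documents, labels, cluster=0, top=4))
```

Function words such as "and" or "on" can rank highly in a sketch like this, which is one reason to filter stopwords in Preprocess Text before building the Bag of Words.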
@OrangeDataMining, thanks for your answer.
So, if I have understood correctly, I cannot apply the procedures from this video to the .csv file from Scopus, but I can do so by putting the entire documents (.pdf, .docx, etc.) in a folder.
Conversely, on the .csv file I can apply BoW after Preprocess Text, and then Box Plot, etc.?
I didn't find Annotated Document Map, but I did find Annotated Corpus Map, and I applied it after BoW.
@@angelo.signore No, you can certainly apply the same procedure to .csv. Just use the Corpus widget and define your text field under "Text features".
Sorry, I meant Annotated Corpus Map, you are right.
@@OrangeDataMining Yes, I did the procedure on the .csv file.
I chose File -> Corpus and, under "Text features", Keyword, Title and Abstract.
Then Document Embedding -> Preprocess Text -> Distances, Hierarchical Clustering -> Annotated Corpus Map
The problem is that in Hierarchical Clustering I cannot find the term "Words" in the "Annotations" drop-down menu.
@@angelo.signore No, Annotated Corpus Map follows t-SNE, not Hierarchical Clustering (HC). After HC, you should use Box Plot or Word Enrichment (but these require Bag of Words beforehand).