Semantic Word Maps and Clusters

  • Published on 29 Jun 2022
  • Text mining with Orange. Learn about Orange components that we have designed for analysis of text. The series includes tutorials on loading documents, construction of word clouds, keyword extraction, visualisation of document and word maps, text-based classification, and many other topics.
    In the series, we present and analyse a repository of proposals to the Slovene government that are available at file.biolab.si/text-semantics/.... Some of the original documents were written in the Slovene language, and we machine-translated them into English. An example of such translations is a sample of proposals at file.biolab.si/text-semantics/....
    In this video, we show how to:
    Import data from a repository into Orange (Import Documents widget)
    Preprocess text (Preprocess Text widget)
    Convert text to numeric vectors and reduce vector dimensions of the embedded documents (Document Embedding and t-SNE widgets)
    Find clusters of related documents in a t-SNE map (t-SNE widget)
    Explore document content with the help of the Word Cloud and Corpus Viewer widgets
    Download Orange from:
    orangedatamining.com/download/
    License: GNU GPL + CC
    Music by: Damjan Jović - Dravlje Rec
    Website: orangedatamining.com

Comments • 13

  • @lemndemiel5263 2 years ago +2

    Nice to see you (all) again - more videos! Thanks!

  • @SpiritTracker7 1 year ago

    Thanks, great video. In which add-on can I find the t-SNE module?

  • @waeladel 9 months ago

    Great tutorial. How can I find the sentences that match a selected cluster? Concordance only does one query at a time, and Corpus Viewer fetches the documents, not the sentences.

    • @OrangeDataMining 9 months ago +1

      Good question. It cannot be done easily, mostly because of the underlying data structure. Technically, you could tokenize on sentences separately and use the search option to look for specific words in sentences (but still one query at a time). We will think about how to handle this. Thank you for the hint.
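
The workaround described in this reply (split documents into sentences, then search them for specific words) can also be sketched outside Orange. The following is a minimal plain-Python illustration, not an Orange feature: the function names `split_sentences` and `sentences_matching` are hypothetical, the sentence splitter is deliberately naive, and the two sample documents are made up. Unlike one-query-at-a-time search, it checks several words in one pass.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentences_matching(documents, words):
    """Return (document index, sentence) pairs for sentences containing any word."""
    # One case-insensitive pattern that matches any of the words as a whole word.
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    hits = []
    for index, doc in enumerate(documents):
        for sentence in split_sentences(doc):
            if pattern.search(sentence):
                hits.append((index, sentence))
    return hits


# Made-up sample documents, for illustration only.
docs = [
    "The city should build more bike lanes. Parking fees are too high.",
    "Schools need better funding. Bike sharing would help students.",
]
matches = sentences_matching(docs, ["bike", "funding"])
for doc_index, sentence in matches:
    print(doc_index, sentence)
```

The words passed in would typically be the characteristic words of the selected cluster, for example those shown in a Word Cloud.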

  • @no8888one 1 year ago

    Hello and thank you for your great work! I have a problem with the "Document Embedding" widget, which keeps giving me an error when I run it after Corpus or Preprocess Text. I tried the grimm_tale dataset and many other datasets, and the error always appears. Could you advise, please?

    • @OrangeDataMining 9 months ago

      Possibly an issue with your internet connection? Are you behind a firewall or on a proxy? Alternatively, post the error on our Github page and we will try to help.

  • @angelo.signore 5 months ago

    Hi, great video, thank you a lot!
    I have a question: I ran a search on Scopus and exported the .csv file with, e.g., 700 entries. I chose to include the columns Author, Indexed Keywords, Title and Abstract of the article.
    Then I chose the following widgets (in sequence):
    Preprocess Text
    Document Embedding
    t-SNE
    However, if I then add Distances and Hierarchical Clustering, the clusters do NOT SHOW the "words". Instead, the annotation shows the type of document (Article, Review, etc.) or another field, such as "Title" or "Abstract", and the entire field is shown rather than the tokenized words, e.g. "Feedback control of water supply in an NFT growing system" or "Light And CO2 interaction on peanut grown in nutrient film technique", not single words.
    I hope it's clear what I mean.
    Thanks for everything.

    • @OrangeDataMining 5 months ago

      Document Embedding does not work on individual words. Instead, it returns one fixed-size embedding vector per document, and the vector's components are not interpretable. For interpretable word features, you need Bag of Words. You can cluster on embeddings and then use the BoW features only for explanation, but the two might not coincide. To explain individual clusters, you can use Box Plot or Word Enrichment. Alternatively, you can use Annotated Document Map after t-SNE, which will provide significant cluster words.
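
To illustrate the idea in this reply (cluster on one representation, then explain the clusters with bag-of-words features), here is a minimal plain-Python sketch. It is an assumption-laden stand-in, not Orange's Word Enrichment or Box Plot: the function names are hypothetical, the cluster labels are assumed to come from some earlier clustering step, the documents are made up, and the score is simply the difference between the share of documents containing a word inside and outside the cluster.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens in one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def characteristic_words(documents, labels, cluster, top_n=3):
    """Rank words by document frequency inside the cluster minus outside it."""
    inside, outside = Counter(), Counter()
    n_in = n_out = 0
    for doc, label in zip(documents, labels):
        words = set(bag_of_words(doc))  # count each word once per document
        if label == cluster:
            inside.update(words)
            n_in += 1
        else:
            outside.update(words)
            n_out += 1
    scores = {w: inside[w] / n_in - outside[w] / max(n_out, 1) for w in inside}
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Made-up documents; the labels are assumed to come from a prior clustering step.
docs = [
    "nutrient film technique for lettuce",
    "lettuce growth in nutrient solution",
    "traffic noise in the city",
    "city traffic and parking",
]
labels = [0, 0, 1, 1]
top_cluster_0 = characteristic_words(docs, labels, cluster=0, top_n=2)
top_cluster_1 = characteristic_words(docs, labels, cluster=1, top_n=2)
print(top_cluster_0, top_cluster_1)
```

Because the clusters and the word counts come from different representations, the top words describe a cluster only approximately, which is the caveat the reply raises.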

    • @angelo.signore 5 months ago

      @OrangeDataMining, thanks for your answer.
      So, if I have understood correctly, I cannot apply the procedures from this video to the .csv file from Scopus, but I can apply them by putting the entire documents (.pdf, .docx, etc.) in a folder.
      On the other hand, on the .csv file I can apply BoW after Preprocess Text, and then Box Plot, etc.?
      I didn't find Annotated Document Map, but I did find Annotated Corpus Map, and I applied it after BoW.

    • @OrangeDataMining 5 months ago

      @angelo.signore No, you can certainly apply the same procedure to a .csv file. Just use the Corpus widget and define your text fields under "Text features".
      Sorry, I meant Annotated Corpus Map, you are right.
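
As a rough illustration of what choosing "Text features" amounts to, the sketch below reads a CSV and joins only the selected columns into the text of each document, leaving other columns (such as the document type) out. It uses plain Python rather than Orange; the function name `load_texts`, the column headers, and the placeholder abstracts are assumptions, and only the two titles are taken from the comment above.

```python
import csv
import io

# Illustrative CSV; the column names and the abstracts are placeholders.
SAMPLE_CSV = '''Title,Abstract,Indexed Keywords,Document Type
"Feedback control of water supply in an NFT growing system","Placeholder abstract one.","hydroponics; control",Article
"Light And CO2 interaction on peanut grown in nutrient film technique","Placeholder abstract two.","peanut; light",Review
'''


def load_texts(csv_file, text_fields):
    """Build one text per row by joining the chosen columns in order."""
    texts = []
    for row in csv.DictReader(csv_file):
        # Skip columns that are missing or empty in this row.
        parts = [row[field].strip() for field in text_fields if row.get(field)]
        texts.append(" ".join(parts))
    return texts


# With a real export, pass open("export.csv", newline="", encoding="utf-8") instead.
texts = load_texts(io.StringIO(SAMPLE_CSV), ["Title", "Abstract"])
for text in texts:
    print(text)
```

Columns that are not selected, here "Indexed Keywords" and "Document Type", stay available as metadata but do not contribute to the document text.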

    • @angelo.signore 5 months ago

      @OrangeDataMining Yes, I did the procedure on the .csv file.
      I chose File->Corpus and, under "Text features", Keyword, Title and Abstract.
      Then Document Embedding->Preprocess Text->Distances, Hierarchical Clustering->Annotated Corpus Map
      The problem is that in Hierarchical Clustering I cannot find the term "Words" in the "Annotations" drop-down menu.

    • @OrangeDataMining 5 months ago

      @angelo.signore No, Annotated Corpus Map follows t-SNE, not HC. After HC, you should use Box Plot or Word Enrichment (but this requires BoW beforehand).