Text analysis in R. Part 1: Preprocessing

  • Published on 4 Jan 2025

Comments • 16

  • @asterixklang7213 3 years ago +3

    This is so well explained. Thank you very much for sharing this!

  • @haraldurkarlsson1147 3 years ago

    Very nice coverage of text analysis and the main concepts.

  • @murielmoyahabo6078 2 years ago

    I really love this. I would love to see your documents before you convert them into a corpus. I need to see the structure and what you have there.

    • @kasperwelbers 2 years ago

      Hi Muriel. Could you clarify which corpus you mean? In general, I think the easiest way to make a corpus is by using a data.frame as input, as also described here: tutorials.quanteda.io/basic-operations/corpus/corpus/
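As a minimal sketch of the data.frame route mentioned in this reply: the data, the column names, and the `source` column below are made up for illustration, and the sketch assumes a recent quanteda version in which `corpus()` accepts `docid_field` and `text_field` for data.frame input.

```r
library(quanteda)

# Hypothetical example data: one row per document
docs <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("This is the first document.", "And this is the second one."),
  source = c("blog", "news"),
  stringsAsFactors = FALSE
)

# Build the corpus; columns other than the id and text columns
# are kept as document variables (docvars)
corp <- corpus(docs, docid_field = "doc_id", text_field = "text")

summary(corp)   # overview of the documents
docvars(corp)   # the remaining columns, here only `source`
```

Printing `docs` before the `corpus()` call shows the structure the documents have prior to conversion: one row per document, one column holding the text.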

  • @larszijm5882 3 years ago

    You saved me man, thanks a lot!!

  • @kobeoncount 1 year ago

    Dear Kasper, thank you very much for your videos. I am just getting into text analytics and I have a quick question. I am planning to work on the Turkish language, and I don't know how to handle the stopword removal and stemming steps. There are compatible files for TR that should work with quanteda, but I don't know how to actually make them work. Could you please give some hints about that as well? :)

    • @kasperwelbers 1 year ago +1

      Good question. I'm not an expert on Turkish, so I don't know how well these bag-of-words style approaches work for it, but there does seem to be some support for it in quanteda.
      Regarding stopwords, quanteda uses the stopwords package under the hood. That package has the function stopwords_getlanguages to see which languages are supported. Importantly, you also need to set the 'source' that stopwords uses. The default (snowball) doesn't support Turkish (which I assume is what you mean by TR), but it seems nltk does:
      library(stopwords)
      stopwords_getsources()
      stopwords_getlanguages(source = 'nltk')
      stopwords('tr', source = 'nltk')
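      Similarly, for stemming quanteda uses SnowballC. Same kind of process:
      library(quanteda)
      library(SnowballC)
      getStemLanguages()   # lists the languages SnowballC can stem
      char_wordstem("aslında", language = 'turkish')   # char_wordstem comes from quanteda
      # (the same should work for dfm_wordstem)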
      So, not sure how well this works, but it does seem to be supported!
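The two steps above could be combined into a single quanteda pipeline roughly as follows. This is only a sketch: the Turkish sentence is made up, and the choice to lowercase before removing stopwords is an assumption, not something stated in the reply. How well the result suits Turkish is a separate question, as noted above.

```r
library(quanteda)

# Made-up Turkish sentence, for illustration only
txt <- "Bu aslında çok güzel bir kitap ve ben onu okudum."

# Tokenize, drop punctuation, and lowercase
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_tolower(toks)
n_before <- ntoken(toks)

# Remove Turkish stopwords, taken from the nltk source
toks <- tokens_remove(toks, stopwords("tr", source = "nltk"))

# Stem the remaining tokens with the Snowball Turkish stemmer
toks <- tokens_wordstem(toks, language = "turkish")

# Build the document-feature matrix
dfmat <- dfm(toks)
dfmat
```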

    • @kobeoncount 1 year ago

      @kasperwelbers This is so helpful, thank you!!

  • @rubenurbizagastegui36 3 years ago

    How do you remove accents in different languages? Could you please give us some examples?

    • @kasperwelbers 3 years ago +2

      Hi Ruben, I think you're looking for transliteration. Simply put, we can translate text into the ascii encoding, which doesn't have accents. This is available in base R (the iconv function), but I prefer using the stringi package:
      library(stringi)
      your_text = 'Der größte soufflé'
      stri_trans_general(your_text, "any-ascii")
      This is vectorized, so your_text can also be a vector with many texts. Note that this might fail, because depending on your system and how you imported/input the text you might need to specify the encoding. The transliteration from 'any' into 'ascii' is a bit rough, but surprisingly it often just works.
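For Spanish text specifically, a narrower variant of the same idea might look like the sketch below. It swaps in the ICU transform ID "Latin-ASCII" for the "any-ascii" ID used in the reply, and the example words are made up; the \u escapes are only there so the script does not depend on the file's encoding.

```r
library(stringi)

# Made-up Spanish examples, written with \u escapes
spanish <- c("canci\u00f3n", "ni\u00f1o", "est\u00e1 aqu\u00ed")

# "Latin-ASCII" replaces accented Latin letters with their plain ASCII counterparts
clean <- stri_trans_general(spanish, "Latin-ASCII")
clean
```

The cleaned vector can then be passed to quanteda in place of the original texts. Removing accents can merge words that differ only by a diacritic, so it is worth checking whether that matters for the analysis.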

    • @rubenurbizagastegui36 3 years ago

      Hi Welbers, no, I am not looking for transliteration. I am looking for a way to deal with Spanish accents when doing text analysis with quanteda. It looks like quanteda does not recognize accents. How do I deal with Spanish accents using quanteda?

    • @kasperwelbers 3 years ago +1

      @rubenurbizagastegui36 But how do you then want to 'deal with Spanish accents'? Your question was how to remove accents (which is often a good solution), and that is exactly what you'd use transliteration for. Did you check the example code in my previous comment?

  • @gabrielbriziou1602 2 years ago

    I love you

  • @frojet0815 2 years ago

    Great video, but I hope subtitles can be added to help non-English-speaking viewers follow along more easily.