Text analysis in R. Part 1: Preprocessing

  • Published: Jan 20, 2025

Comments • 17

  • @steffffanos
    @steffffanos 6 days ago +1

    You helped me immensely with web scraping, and my next project is based on text modeling. I'm glad I found this; you are a godsend.

  • @asterixklang7213
    @asterixklang7213 3 years ago +3

    This is so well explained. Thank you very much for sharing this!

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 3 years ago

    Very nice coverage of text analysis and the main concepts.

  • @murielmoyahabo6078
    @murielmoyahabo6078 2 years ago

    I really love this. I would love to see your documents before you convert them into a corpus. I need to see the structure and what you have there.

    • @kasperwelbers
      @kasperwelbers  2 years ago

      Hi Muriel. Could you clarify which corpus you mean? In general, I think the easiest way to make a corpus is by using a data.frame as input, as also described here: tutorials.quanteda.io/basic-operations/corpus/corpus/
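
A minimal sketch of that data.frame route, showing the structure before conversion. The column names and example rows here are made up for illustration. In quanteda, corpus() takes the text from the column named in text_field and, by default, the document names from a doc_id column; remaining columns become document variables.

```r
library(quanteda)

# Hypothetical input: one row per document, one column holding the text
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("First document text.", "Second document, slightly longer."),
  source = c("a", "b"),
  stringsAsFactors = FALSE
)

# Inspect the structure before converting
str(docs)

# Convert to a corpus; 'source' is kept as a document variable
corp <- corpus(docs, text_field = "text")
summary(corp)
docvars(corp)
```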

  • @kobeoncount
    @kobeoncount 1 year ago

    Dear Kasper, thank you very much for your videos. I am just getting into text analytics and I have a quick question. I am planning to work with Turkish text, and I don't know how to handle stopword removal and stemming. There are compatible files for TR that work with quanteda, but I don't know how to actually use them. Could you please give some hints about that as well?

    • @kasperwelbers
      @kasperwelbers  1 year ago +1

      Good question. I'm not an expert on Turkish, so I don't know how well these bag-of-words style approaches work for it, but there does seem to be some support for it in quanteda.
      Regarding stopwords, quanteda uses the stopwords package under the hood. That package has the function stopwords_getlanguages to see which languages a given source supports. Importantly, you also need to set the 'source' that stopwords uses. The default (snowball) doesn't support Turkish (which I assume is TR), but it seems nltk does:
      library(stopwords)
      stopwords_getsources()
      stopwords_getlanguages(source = 'nltk')
      stopwords('tr', source = 'nltk')
      Similarly, for stemming quanteda uses SnowballC. Same kind of process:
      library(SnowballC)
      library(quanteda)
      getStemLanguages()                              # languages SnowballC can stem
      char_wordstem("aslında", language = 'turkish')  # char_wordstem comes from quanteda
      # (same should work for dfm_wordstem)
      So, not sure how well this works, but it does seem to be supported!
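
To put the two pieces together, here is a rough sketch of how the stopword list and the stemmer could be plugged into a quanteda pipeline. The example sentence is made up, and whether the result is linguistically sensible for Turkish is not something this sketch checks.

```r
library(quanteda)
library(stopwords)

# Hypothetical example text
texts <- c(d1 = "Bu bir deneme cümlesidir ve aslında çok kısa.")

toks <- tokens(texts, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Remove Turkish stopwords, using the nltk source instead of the default snowball
toks <- tokens_remove(toks, stopwords("tr", source = "nltk"))

# Stem with the SnowballC Turkish stemmer
toks <- tokens_wordstem(toks, language = "turkish")

dtm <- dfm(toks)
print(featnames(dtm))
```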

    • @kobeoncount
      @kobeoncount 1 year ago

      @kasperwelbers This is so helpful, thank you!!

  • @larszijm5882
    @larszijm5882 3 years ago

    You saved me man, thanks a lot!!

  • @rubenurbizagastegui36
    @rubenurbizagastegui36 3 years ago

    How do you remove accents in different languages? Could you please give us some examples?

    • @kasperwelbers
      @kasperwelbers  3 years ago +2

      Hi Ruben, I think you're looking for transliteration. Simply put, we can translate text into the ascii encoding, which doesn't have accents. This is available in base R (the iconv function), but I prefer using the stringi package:
      library(stringi)
      your_text = 'Der größte soufflé'
      stri_trans_general(your_text, "any-ascii")
      This is vectorized, so your_text can also be a vector with many texts. Note that this might fail, because depending on your system and how you imported/input the text you might need to specify the encoding. The transliteration from 'any' into 'ascii' is a bit rough, but surprisingly it often just works.

    • @rubenurbizagastegui36
      @rubenurbizagastegui36 3 years ago

      Hi Welbers, no, I am not looking for transliteration. I am looking for a way to deal with Spanish accents when doing text analysis with quanteda. It looks like quanteda does not recognize accents. How do I deal with Spanish accents using quanteda?

    • @kasperwelbers
      @kasperwelbers  3 years ago +1

      @rubenurbizagastegui36 But how do you then want to 'deal with Spanish accents'? Your question was how to remove accents (which is often a good solution), and that is what you'd use transliteration for. Did you check the example code in my previous comment?
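
For the Spanish case specifically, a small sketch of removing accents before handing the text to quanteda. It uses the "Latin-ASCII" transform from stringi, which is swapped in here for the "any-ascii" id in the earlier comment; the example sentences are made up. Note that stripping accents can merge distinct Spanish words, so this is a trade-off rather than a required step.

```r
library(stringi)

# Hypothetical example texts
spanish <- c("El niño comió rápidamente", "Dónde está la canción")

# Map accented Latin characters to their closest ASCII equivalents
plain <- stri_trans_general(spanish, "Latin-ASCII")
print(plain)
```

If accented characters already look garbled before this step, the encoding used when the text was imported is a likely cause, so that is worth checking first.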

  • @gabrielbriziou1602
    @gabrielbriziou1602 2 years ago

    I love you

  • @frojet0815
    @frojet0815 2 years ago

    Great video, but I hope subtitles can be added to make it easier for non-English-speaking viewers to watch.