Text analysis in R. Part 1: Preprocessing

Kasper Welbers

15K views

Published: 20 Jan 2025

Comments • 17

@steffffanos 6 days ago ⁺¹
You helped me immensely with web scraping, and my next project is based on text modeling. I'm glad I found this, you are a godsend
@asterixklang7213 3 years ago ⁺³
This is so well explained. Thank you very much for sharing this!
@kasperwelbers 3 years ago
Thanks!
@haraldurkarlsson1147 3 years ago
Very nice coverage of text analysis and the main concepts.
@obakeng4287 2 years ago
Legendary 👌
@murielmoyahabo6078 2 years ago
I really love this. I would love to see your documents before you convert them into a corpus. I need to see the structure and what you have there
@kasperwelbers 2 years ago
Hi Muriel. Could you clarify which corpus you mean? In general, I think the easiest way to make a corpus is by using a data.frame as input, as also described here: tutorials.quanteda.io/basic-operations/corpus/corpus/
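A minimal sketch of the data.frame route mentioned in the reply above. The data frame `docs`, its column names, and its contents are made up for illustration and are not the documents used in the video. It assumes the quanteda package is installed.

```r
library(quanteda)

# Hypothetical input: one row per document, one column holding the text,
# and any other columns carrying document-level metadata.
docs <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("This is the first document.", "And this is the second one."),
  source = c("blog", "news"),
  stringsAsFactors = FALSE
)

# Look at the structure before converting it.
str(docs)

# docid_field supplies the document names, text_field the text itself;
# the remaining columns (here: source) are kept as document variables.
corp <- corpus(docs, docid_field = "doc_id", text_field = "text")

ndoc(corp)     # number of documents in the corpus
docvars(corp)  # metadata attached to each document
```

Printing `docs` or calling `str(docs)` before `corpus()` is the simplest way to see what goes in: a plain table with a text column plus metadata.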
@kobeoncount 1 year ago
Dear Kasper, thank you very much for your videos. I am just getting into text analytics and I have a quick question. I am planning to work on the Turkish language, and I don't know how to handle stopword removal and stemming. There are compatible files for TR to work through quanteda, but I don't know how to actually make them work. Could you please give some hints about that as well? )
@kasperwelbers 1 year ago ⁺¹
Good question. I'm not an expert on Turkish, so I don't know how well these bag-of-words style approaches work for it, but there does seem to be some support for it in quanteda.
Regarding stopwords, quanteda uses the stopwords package under the hood. That package has the function stopwords_getlanguages() to see which languages are supported. Importantly, you also need to set the 'source' that stopwords uses. The default (snowball) doesn't support Turkish (which I assume is TR), but it seems nltk does:
library(stopwords)
stopwords_getsources()
stopwords_getlanguages(source = 'nltk')
stopwords('tr', source = 'nltk')
Similarly, for stemming it uses SnowballC. Same kind of process:
library(SnowballC)
library(quanteda)  # char_wordstem() and dfm_wordstem() are quanteda functions
getStemLanguages()
char_wordstem("aslında", language = 'turkish')
# (same should work for dfm_wordstem)
So, not sure how well this works, but it does seem to be supported!
@kobeoncount 1 year ago
@kasperwelbers This is so helpful, thank you!!
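The two steps from this thread can be combined into one preprocessing pipeline. The following is only a sketch: the example sentence is invented, and how well stopword removal and Snowball stemming suit Turkish is not evaluated here. It assumes the quanteda, stopwords, and SnowballC packages are installed.

```r
library(quanteda)
library(stopwords)

# Hypothetical example text; replace with your own documents.
txt <- c(d1 = "Bu aslında çok güzel bir kitap ve ben bu kitabı çok sevdim.")

# Tokenize and drop punctuation.
toks <- tokens(txt, remove_punct = TRUE)

# Turkish stopwords from the nltk source (the default snowball source has none).
tr_stop <- stopwords("tr", source = "nltk")
toks <- tokens_remove(toks, tr_stop)

# Stem with the Snowball stemmer for Turkish.
toks <- tokens_wordstem(toks, language = "turkish")

# Build the document-feature matrix from the preprocessed tokens.
dfmat <- dfm(toks)
featnames(dfmat)
```

Inspecting `featnames(dfmat)` on a small sample is a quick way to judge whether the stems and the stopword list look sensible for your material before running the full corpus.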
@larszijm5882 3 years ago
You saved me man, thanks a lot!!
@rubenurbizagastegui36 3 years ago
How do you remove accents in different languages? Could you please give us some examples?
@kasperwelbers 3 years ago ⁺²
Hi Ruben, I think you're looking for transliteration. Simply put, we can convert the text to ASCII, which doesn't have accented characters. This is available in base R (the iconv function), but I prefer using the stringi package:
library(stringi)
your_text = 'Der größte soufflé'
stri_trans_general(your_text, "any-ascii")
This is vectorized, so your_text can also be a vector with many texts. Note that this might fail, because depending on your system and how you imported/input the text you might need to specify the encoding. The transliteration from 'any' into 'ascii' is a bit rough, but surprisingly it often just works.
@rubenurbizagastegui36 3 years ago
Hi Welbers, no, I am not looking for transliteration. I am looking for a way to deal with Spanish accents when doing text analysis with Quanteda. It looks like Quanteda does not recognize accents. How do I deal with Spanish accents using Quanteda?
@kasperwelbers 3 years ago ⁺¹
@rubenurbizagastegui36 But how do you then want to 'deal with Spanish accents'? Your question was how to remove accents (which is often a good solution), but that is what you'd use transliteration for. Did you check the example code in my previous comment?
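To make the transliteration suggestion concrete for Spanish, here is a small sketch that strips accents before tokenizing with quanteda. The example sentences are invented, and the transform ID "Latin-ASCII" is used here in place of the "any-ascii" ID from the earlier comment; it assumes the input strings are valid UTF-8 and that stringi and quanteda are installed.

```r
library(stringi)
library(quanteda)

# Hypothetical Spanish example texts.
txt <- c("El niño comió en el café", "La canción es fantástica")

# Replace accented Latin characters with their closest ASCII equivalents.
txt_ascii <- stri_trans_general(txt, "Latin-ASCII")

# Tokenize both versions to compare the resulting features.
toks_accented <- tokens(txt)
toks_plain    <- tokens(txt_ascii)

featnames(dfm(toks_accented))
featnames(dfm(toks_plain))
```

Removing accents is a trade-off: it merges spelling variants of the same word, but it can also merge words that differ only by an accent (for example "sí" and "si"), so it is worth checking the feature list afterwards.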
@gabrielbriziou1602 2 years ago
I love you
@frojet0815 2 years ago
Great, but I hope subtitles can be added to help non-English-speaking viewers follow along more easily
