Kasper Welbers
  • Videos: 11
  • Views: 192,355
Webscraping in R
!! This video was recorded a while ago, and some of the examples no longer work. For the first example (on wikipedia), please check the updated code in this RMarkdown document:
github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/rvest.md
And yeah I know, the video is pretty long! It's actually 2 parts (in hindsight). Up till 40:00 it's mainly introducing how this works, and after 40:00 it's walking through 2 demos. If you're the type of person that first wants to see something in action, you can skip straight to 40:00, and then see whether you want to spend time on learning to understand what's happening there (for which you can either use the video or the RMarkdown document)...
Views: 17,351

Videos

LDA Topic modeling in R
21K views, 4 years ago
RMarkdown tutorial: github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/r_text_lda.md Video series about topic modeling: ruclips.net/video/ELct2RRENQM/видео.html More tutorial stuff: github.com/ccs-amsterdam/r-course-material Good article on preprocessing for unsupervised ml: pdfs.semanticscholar.org/95e0/c468a19afc6173053234c7fe660033363ffb.pdf
Multilevel models in R
19K views, 4 years ago
This video is the second part of a tutorial video on GLM and Multilevel in R. It gives a general handwaving introduction, with the main goal of showing the R code. For a proper introduction into Multilevel modeling as a technique, we recommend this free manuscript Chapter from a great book on the topic: multilevel-analysis.sites.uu.nl/wp-content/uploads/sites/27/2018/02/02Ch2-Basic3449.pdf
GLM in R
60K views, 4 years ago
In this video we walk through a tutorial for Generalized Linear Models in R. The main goal is to show how to use this type of model, focusing on logistic regression, and talk a bit about why it's a good tool to know. The tutorial discusses both GLM and multilevel models, but the video has been split into two parts. github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/advanced_modeli...
Basic statistics in R
2.4K views, 4 years ago
An introduction to basic statistics in R, based on the following tutorial: github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/simple_modeling.md
Understanding the glm family argument (in R)
21K views, 4 years ago
The goal of this video is to help you better understand the 'error distribution' and 'link function' in Generalized Linear Models. For a deeper understanding of GLMs, I'd recommend the book "Generalized Linear Models" by McCullagh and Nelder. This is a book well worth buying, but I also (somehow) found an online version: www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf
Text analysis in R. Demo 2: Sentiment dictionaries
5K views, 4 years ago
This demo is part of a short series of videos on text analysis in R, developed mainly for R introduction workshops. A more detailed tutorial for the code discussed here can be found on our R course material Github page: github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/sentiment_analysis.md Vignette for how to use corpustools: cran.r-project.org/web/packages/corpustools/vignettes...
Text analysis in R. Demo 1: Corpus statistics
21K views, 4 years ago
This demo is part of a short series of videos on text analysis in R, developed mainly for R introduction workshops. A more detailed tutorial for the code discussed here can be found on our R course material Github page: github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/R_text_3_quanteda.md
Text analysis in R. Part 2: Analysis approaches
6K views, 4 years ago
This is a short series of videos on the basics of computational text analysis in R. It is loosely inspired by our Text analysis in R paper (vanatteveldt.com/p/welbers-text-r.pdf), closely related to our R course material Github page (github.com/ccs-amsterdam/r-course-material), and 42% love letter to quanteda.
Text analysis in R. Part 1b: Advanced preprocessing
4.8K views, 4 years ago
This is a short series of videos on the basics of computational text analysis in R. It is loosely inspired by our Text analysis in R paper (vanatteveldt.com/p/welbers-text-r.pdf), closely related to our R course material Github page (github.com/ccs-amsterdam/r-course-material), and 42% love letter to quanteda. This specific video just adds some stuff about more advanced tools for preprocessing....
Text analysis in R. Part 1: Preprocessing
15K views, 4 years ago
This is a short series of videos on the basics of computational text analysis in R. It is loosely inspired by our Text analysis in R paper (vanatteveldt.com/p/welbers-text-r.pdf), closely related to our R course material Github page (github.com/ccs-amsterdam/r-course-material), and 42% love letter to quanteda. Useful links # Low-level string processing: A good place to start is by learning how ...

Comments

  • @steqhanos826
    @steqhanos826 4 days ago

    You helped me immensely in web scraping and my next project is based on text modeling. I'm glad I found this, you are a godsend

  • @sakifzaman
    @sakifzaman 20 days ago

    Hi, sorry for leaving a comment on an old post. I don't know whether you will be able to reply, but I have a problem. I did it exactly the way you did, but at the ggplot code I'm getting this error: Warning messages: All aesthetics have length 1, but the data has 144 rows. ℹ Please consider using `annotate()` or provide this layer with data containing a single row. 2: In geom_smooth(method = "lm") : All aesthetics have length 1, but the data has 144 rows. ℹ Please consider using `annotate()` or provide this layer with data containing a single row.

    • @kasperwelbers
      @kasperwelbers 16 days ago

      Hi @sakifzaman. I suspect that you might have typed the aesthetics (in the aes part) with "quotes" instead of `backticks`. The difference is easy to miss (backticks are like single quotes pointing in the other direction). In R, backticks are used for names that have spaces in them. So `GDP per capita` is the name of the column. If you use quotes (single or double) instead, ggplot interprets it as a value. So this is why it would say that you provide aesthetics of length 1 (just the value "GDP per capita"), even though the data has 144 rows.
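
A minimal sketch of the distinction described in the reply above. The data values and the `Life expectancy` column are made up for illustration and are not taken from the video; only the `GDP per capita` column name comes from the thread. The base R part shows the mechanism, and the ggplot2 part only runs if that package is installed.

```r
# Toy data with column names that contain spaces (hypothetical values).
d <- data.frame(
  `GDP per capita` = c(1000, 5000, 20000, 40000),
  `Life expectancy` = c(55, 65, 75, 80),
  check.names = FALSE  # keep the spaces in the column names
)

# Backticks refer to the column: one value per row of the data.
as_name <- with(d, `GDP per capita`)
length(as_name)

# Quotes create a string constant: a single value, whatever the data size.
as_string <- with(d, "GDP per capita")
length(as_string)

# The same distinction inside ggplot2 (skipped if the package is missing).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(x = `GDP per capita`, y = `Life expectancy`)) +
    geom_point() +
    geom_smooth(method = "lm")
}
```

With quotes inside `aes()`, every layer would receive the single string instead of the column, which is what the "aesthetics have length 1" warning is pointing at.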

  • @Abdulaziz-yj1ns
    @Abdulaziz-yj1ns 25 days ago

    Thank you so much very informative

  • @emmanuelgk4663
    @emmanuelgk4663 1 month ago

    me finding this useful 4 years later

  • @OriginalJoseyWales
    @OriginalJoseyWales 1 month ago

    reaction time should increase with sleep deprivation, no?

  • @WalterEunice-e1s
    @WalterEunice-e1s 3 months ago

    Doyle Plaza

  • @gotnolove923
    @gotnolove923 3 months ago

    Tabmodel doesn't work😮

    • @Whycantijustdeletethis
      @Whycantijustdeletethis 3 months ago

      Surely we can make it work. What error do you get?

    • @kasperwelbers
      @kasperwelbers 3 months ago

      @gotnolove923 ah haha, that was me on another account that I was trying to delete.

  • @AndersonDouglas-v5c
    @AndersonDouglas-v5c 3 months ago

    Weissnat Shores

  • @HarlanEngdahl-e3l
    @HarlanEngdahl-e3l 3 months ago

    Hilll Streets

  • @KatrineBasil-c5n
    @KatrineBasil-c5n 4 months ago

    Tanner Rest

  • @StracheyAnnabelle-w8c
    @StracheyAnnabelle-w8c 4 months ago

    Garcia Paul Wilson William Young Karen

  • @Mojiborkhan-i1s
    @Mojiborkhan-i1s 4 months ago

    Thomas Paul Wilson Eric Hernandez Melissa

  • @DiamondScheiber-j9w
    @DiamondScheiber-j9w 4 months ago

    Kailey Islands

  • @gergerger53
    @gergerger53 4 months ago

    Very well put together. I think there should be some recognition of the fact some of the symbols are mixed up in the presentation. The systematic component should always be mu and mu goes into the link function to give eta and eta is the value that goes into the random component distribution. Otherwise the slides don't make sense. To take a random example, the probit regression slide, mu is not defined anywhere. But changing systematic component to mu and then changing binomial parameter to eta then fixes everything.

    • @kasperwelbers
      @kasperwelbers 4 months ago

      Hi Murphyalex. Thanks for your comment! The notation used here is based on the book in the description. I was also initially confused about using eta as the systematic component, and then defining mu as the input of the link function rather than its output, but that's how the link function is defined, and when you read their run-through of the generalization it makes sense (just looked it up again; page 42, highly recommended). Note that mu is still defined, but as the inverse of the link function applied to eta. So for example, the mean function for the Poisson distribution is defined as mu = exp(eta), which is identical to eta = log(mu). Or am I missing something else that you're referring to?
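
The relation mu = exp(eta) from the reply above can be checked numerically. This is a small sketch on simulated data (the coefficients 0.5 and 0.3 are arbitrary assumptions), using only base R: `predict(..., type = "link")` returns eta and `type = "response"` returns mu.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))  # simulated counts

# Poisson GLM with the log link: eta = log(mu)
m <- glm(y ~ x, family = poisson(link = "log"))

eta <- predict(m, type = "link")      # systematic component (linear predictor)
mu  <- predict(m, type = "response")  # mean of the Poisson distribution

# mu is the inverse link applied to eta, and eta is the link applied to mu
all.equal(mu, exp(eta))
all.equal(eta, log(mu))

# The family object exposes both directions explicitly
fam <- poisson(link = "log")
all.equal(unname(fam$linkinv(eta)), unname(mu))
all.equal(unname(fam$linkfun(mu)), unname(eta))
```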

  • @mindandresearch
    @mindandresearch 5 months ago

    You should make more and more videos. You explained this on point! Like on R and everything on it you will surely be the best no doubt!

  • @gauravsutar9890
    @gauravsutar9890 6 months ago

    Hello it was good to learn LDA from this video, but can you arrange any videos for Structural topic modelling full explanation ?

    • @kasperwelbers
      @kasperwelbers 6 months ago

      Hi @guaravsutar9080, I'm afraid I haven't planned anything of the sort. It's been a while since I used topic modeling (simply because my research took me elsewhere), so I'm not fully up to speed on the current state of the field.

    • @gauravsutar9890
      @gauravsutar9890 6 months ago

      @kasperwelbers oh yes, thank you so much. Actually I'm going through it, but some of the code I'm not able to interpret in R

  • @mollymurphey4526
    @mollymurphey4526 7 months ago

    How do I add my own csv file as the corpus?

  • @EurekaRaven
    @EurekaRaven 7 months ago

    Many thanks for great work! What software/tools do you use to record these videos if you don't mind me asking.

    • @kasperwelbers
      @kasperwelbers 7 months ago

      Thanks! I mostly used OBS, which is an open source tool for recording and streaming. I found it quite intuitive (with some tutorials), and as someone without any editing experience was able to set up a good simple system for switching and layering windows. (Though to be honest, this was amid early pandemic despair over how to manage online teaching, so I probably did spend quite some time on it). For the weather-person effect of talking in front of a screen, I bought a pull-up greenscreen, though since then I think automatic background filtering has come a long way, so a greenscreen might no longer be needed. I also used Kdenlive for editing. In my case I only used this for cutting and pasting pieces of recordings, which didn't really take long to figure out, but I think that tool also supports more advanced editing.

    • @EurekaRaven
      @EurekaRaven 7 months ago

      @kasperwelbers thank you so much!

  • @juliantorelli4540
    @juliantorelli4540 7 months ago

    Kasper, how would this work for a correlation topic model heat map with topic rows/topic columns?

    • @kasperwelbers
      @kasperwelbers 7 months ago

      If I recall correctly, the correlated topic model mostly differs in that it takes the correlations between topics into account in fitting the model. It probably adds a covariance matrix, but there should still be posterior distributions for document-topic and word-document, and so you should still just be able to visualize the correlations of topics and documents (or topics with topics) in a heatmap. Though depending on what package you use to compute them, extracting the posteriors might work differently.

    • @juliantorelli4540
      @juliantorelli4540 7 months ago

      @kasperwelbers Thank you! I tried this code, and it seems to have worked for basic LDA:

        library(tidytext)  # tidy() method for LDA models
        library(dplyr)
        library(tidyr)

        # x is the fitted LDA model
        beta_x <- tidy(x, matrix = "beta")
        beta_wider <- function(x) {
          pivot_wider(x, values_from = beta, names_from = topic) %>%
            arrange(term) %>%
            select(-term) %>%
            rename_all(~paste0("topic", .))
        }
        beta_w <- beta_wider(beta_x)
        cor1 <- cor(beta_w)

      I then plotted a correlation matrix.
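
The final plotting step is not shown in the thread. As a rough sketch of one way to do it, the following uses a random matrix as a hypothetical stand-in for the term-by-topic beta matrix (not output from a real topic model) and draws the topic-by-topic correlations with base R's `heatmap()`.

```r
set.seed(42)
n_terms  <- 500
n_topics <- 5

# Hypothetical stand-in for a term-by-topic matrix of word probabilities
beta <- matrix(rgamma(n_terms * n_topics, shape = 0.1), nrow = n_terms)
beta <- sweep(beta, 2, colSums(beta), "/")  # each topic column sums to 1
colnames(beta) <- paste0("topic", seq_len(n_topics))

# Topic-by-topic correlations over the word probabilities
topic_cor <- cor(beta)

# Heatmap with topics on both rows and columns; no clustering or rescaling
heatmap(topic_cor, Rowv = NA, Colv = NA, scale = "none", symm = TRUE,
        main = "Topic correlations")
```

With a real model, `beta` would be replaced by the wide matrix extracted from the fitted object; how to extract it depends on the package used.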

  • @randomdude4411
    @randomdude4411 7 months ago

    This is a brilliant tutorial on GLM in R with a very good breakdown of all the information in step by step fashion that is understandable for a beginner

  • @paphiopedilum1202
    @paphiopedilum1202 8 months ago

    thank you french accent man

  • @marcosechevarria6237
    @marcosechevarria6237 8 months ago

    The dfm function is defunct unfortunately :(

  • @moviezone8130
    @moviezone8130 8 months ago

    Kasper, I found it very helpful, it was a great video and you set the bar high. Very very informative filled with concepts.

  • @MK-fp6tg
    @MK-fp6tg 8 months ago

    This is a great tutorial. I have a quick question. My current data set is in an Excel file; which file type do I have to convert it to?

  • @yifeigao8655
    @yifeigao8655 8 months ago

    Thanks for sharing! The best tutorials I've watched. No fancy slides, but very very useful code line by line.

  • @Aguaires
    @Aguaires 8 months ago

    Thank you!

  • @Roy-xr2wq
    @Roy-xr2wq 9 months ago

    Best Explanation, the visuals bring the whole idea into life. Thanks

  • @pieracelis6862
    @pieracelis6862 9 months ago

    Really good tutorial, thanks a lot!! :)

  • @rubyanneolbinado95
    @rubyanneolbinado95 9 months ago

    Hi, why is R studio producing different results even though I am using the same call and data.

    • @kasperwelbers
      @kasperwelbers 9 months ago

      Hi! Do you mean vastly different results, or very small differences? I do think some of the multilevel stuff could potentially differ due to random processes in converging the model, but if so it should be really minor.

  • @davidgao9046
    @davidgao9046 10 months ago

    very clear layout and superb explanation for the intuition. Thanks!

  • @gma7205
    @gma7205 10 months ago

    Amazingly well-explained, thanks! Please, make more videos. Nonlinear models, Bayesian... some extra content would be nice!

  • @michellelaurendina
    @michellelaurendina 10 months ago

    THANK. YOU.

  • @genesuis
    @genesuis 11 months ago

    What a legend! You have no idea how much your videos have helped me. Thanks for making it clear and easy to understand:)

  • @zafarnasim9267
    @zafarnasim9267 1 year ago

    Great video, nicely explained

  • @DavidKoleckar
    @DavidKoleckar 1 year ago

    nice audio bro. you record in bathroom?

    • @kasperwelbers
      @kasperwelbers 1 year ago

      Ahaha, not sure whether that's a question or a burn 😅. This is just a Blue Yeti mic in the home office I set up during the COVID lockdowns. The room itself has pretty nice acoustic treatment, but I was still figuring out in a rush how to make recordings for lectures/workshops and it was hard to get clear audio without keystrokes hitting through.

  • @mariuskombou6729
    @mariuskombou6729 1 year ago

    In order to plot with textplot_wordcloud, you first need to load the "quanteda.textplots" library. I guess a few things have changed after 3 years. Otherwise it is not going to work. Thanks for the video, dear Kasper.

  • @roxyioana
    @roxyioana 1 year ago

    I cannot use dfmat_inaug <- dfm(toks_inaug, remove = stopwords("en")); it is outdated. What can I do instead?

    • @kasperwelbers
      @kasperwelbers 1 year ago

      Hi @roxyioana, please check the link to the tutorial in the description. We keep that repository up-to-date with changes. (and at some point I hopefully find the time to re-record some videos)
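
For the specific call in the question, here is a sketch of the tokens-first workflow, assuming quanteda version 3 or later, where stopwords are removed from the tokens (or the dfm) rather than through the `remove` argument of `dfm()`. The variable names follow the question; the use of the built-in `data_corpus_inaugural` corpus is an assumption based on those names. The code is skipped if quanteda is not installed, and the linked tutorial remains the authoritative version.

```r
if (requireNamespace("quanteda", quietly = TRUE)) {
  library(quanteda)

  # 1. Tokenize first (assumed corpus: the inaugural speeches shipped with quanteda)
  toks_inaug <- tokens(data_corpus_inaugural, remove_punct = TRUE)

  # 2. Remove stopwords from the tokens instead of inside dfm()
  toks_inaug <- tokens_remove(toks_inaug, stopwords("en"))

  # 3. Build the document-feature matrix from the cleaned tokens
  dfmat_inaug <- dfm(toks_inaug)
  print(topfeatures(dfmat_inaug, 10))

  # Plotting functions now live in a separate package
  if (requireNamespace("quanteda.textplots", quietly = TRUE)) {
    quanteda.textplots::textplot_wordcloud(dfmat_inaug, max_words = 50)
  }
}
```

An alternative with the same result is to build the dfm first and then call `dfm_remove(dfmat_inaug, stopwords("en"))`.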

  • @bignatesbookreviews
    @bignatesbookreviews 1 year ago

    god bless you

  • @bobmany5051
    @bobmany5051 1 year ago

    Hello Kasper, I appreciate your great video. I have a question. Regarding your example data, what if there are two or more data points for each day for each person? Let's assume that you measure reaction time 4 times each day across participants. Do you need to average those data points and make one data point for each day? or do you use all data points?

    • @kasperwelbers
      @kasperwelbers 1 year ago

      Interesting question. We can actually add more groups to the model instead of aggregating, but it depends on your question. In the example, we used days as a continuous variable, because we wanted to test if there was a linear effect on reaction time. If you also want to consider the time of the day as a continuous variable, then it indeed becomes awkward how to combine them. However, maybe your reason for the four measurements is just to get more data points, so you think of them as factors rather than continuous. While aggregating might be viable, you could also consider adding another level to your model, for whether the measurement was in the (1) morning, (2) afternoon, (3) evening, or (4) night. You could then have a random intercept, for instance to take into account that people might on average have lower reaction times in the evening due to their after-dinner dip (though note that with just 4 groups you might rather want to use fixed effects with dummy variables). Perhaps more generally, what you're interested in is multilevel models with more than one group level. This is possible and very common/powerful. Groups can then be either crossed or nested (for instance, people living in cities).
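
The two options mentioned in the reply above can be sketched with lme4 on simulated data. Everything here is an assumption for illustration: the variable names, the effect sizes, and the design of 20 subjects measured 4 times a day for 10 days. The model fits are skipped if lme4 is not installed.

```r
set.seed(123)

# Simulated design: 20 subjects x 10 days x 4 measurements per day
d <- expand.grid(
  subject     = factor(1:20),
  days        = 0:9,
  time_of_day = factor(c("morning", "afternoon", "evening", "night"))
)

subj_int   <- rnorm(20, 0, 25)  # subject-specific intercepts
subj_slope <- rnorm(20, 0, 5)   # subject-specific slopes for days
tod_eff    <- c(morning = 0, afternoon = 5, evening = 15, night = 25)

s <- as.integer(d$subject)
d$reaction <- 250 + subj_int[s] + (10 + subj_slope[s]) * d$days +
  unname(tod_eff[as.character(d$time_of_day)]) + rnorm(nrow(d), 0, 20)

if (requireNamespace("lme4", quietly = TRUE)) {
  library(lme4)

  # Option 1: time of day as fixed effects (dummy variables),
  # which the reply suggests when there are only 4 groups
  m_fixed <- lmer(reaction ~ days + time_of_day + (1 + days | subject), data = d)

  # Option 2: time of day as an additional random intercept
  m_random <- lmer(reaction ~ days + (1 + days | subject) + (1 | time_of_day),
                   data = d)

  print(fixef(m_fixed))
}
```

Both models use all data points rather than a daily average per person.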

  • @DeborahNicoletti
    @DeborahNicoletti 1 year ago

    what about importing text from multiple pdf/docx?

    • @kasperwelbers
      @kasperwelbers 1 year ago

      I think the easiest way would be to use the readtext package. This allows you to read an entire folder ("my_doc_files/") or use wildcards ("my_doc_files/article*_.txt"). cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html#microsoft-word-files-.doc-.docx
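
A small sketch of the wildcard approach described above. To keep it self-contained it writes two throwaway .txt files to a temporary folder (the folder and file names are made up); for Word or PDF files the same `readtext()` call would be pointed at a pattern such as "*.docx" or "*.pdf". It is skipped if readtext is not installed.

```r
if (requireNamespace("readtext", quietly = TRUE)) {
  library(readtext)

  # Create a temporary folder with two example files (hypothetical content)
  doc_dir <- file.path(tempdir(), "my_doc_files")
  dir.create(doc_dir, showWarnings = FALSE)
  writeLines("First example document.",  file.path(doc_dir, "article1.txt"))
  writeLines("Second example document.", file.path(doc_dir, "article2.txt"))

  # Read every file that matches the wildcard pattern
  docs <- readtext(file.path(doc_dir, "article*.txt"))

  # The result is a data frame with one row per file
  print(docs$doc_id)

  # If quanteda is installed, the result can be turned into a corpus directly
  if (requireNamespace("quanteda", quietly = TRUE)) {
    corp <- quanteda::corpus(docs)
  }
}
```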

  • @audreyq.nkamngangk.7062
    @audreyq.nkamngangk.7062 1 year ago

    Thank you for the tutorial. Is it possible to create a glm model with a variable to explain which has 3 modalities

    • @kasperwelbers
      @kasperwelbers 1 year ago

      If I understand you correctly, I think it's indeed possible to model a dependent variable with a tri-modal distribution with glm. Actually, you might not even need glm for that. Whether a distribution is multimodal is a separate matter from the distribution family. A tri-modal distribution might be a mixture of three normal distributions, three binomial distributions, etc. Take the following simulation as an example. Here we create a y variable that is affected by a continuous variable x, and a factor with three groups. Since there is a strong effect of the group on y, this results in y being tri-modal.

        ## simulate 3-modal data
        n = 1000
        x = rnorm(n)
        group = sample(1:3, n, replace=T)
        group_means = c(5,10,15)
        y = group_means[group] + x*0.4 + rnorm(n)
        hist(y, breaks=50)

        m1 = lm(y ~ x)
        m2 = lm(y ~ as.factor(group) + x)
        summary(m1)  ## bad estimate of x (should be around 0.4)
        plot(m1, 2)  ## error is non-normal
        summary(m2)  ## good estimate after controlling for group
        plot(m2, 2)  ## error is normal after including group

  • @kobeoncount
    @kobeoncount 1 year ago

    Dear Kasper, thank you very much for your videos. I am just getting into text analytics and I have a quick question. I am planning to work on Turkish language, and I don't know how to handle the stopwords and stemming processes. There are compatible files for TR to work through quanteda, but I don't know how to actually make them work. Could you please give some hints about that also? )

    • @kasperwelbers
      @kasperwelbers 1 year ago

      Good question. I'm not an expert on Turkish, so I don't know how well these bag-of-words style approaches work for it, but there does seem to be some support for it in quanteda. Regarding stopwords, quanteda uses the stopwords package under the hood. That package has the function stopwords_getlanguages to see which languages are supported. Importantly, you also need to set a 'source' that stopwords uses. The default (snowball) doesn't support Turkish (which I assume is TR), but it seems nltk does:

        library(stopwords)
        stopwords_getsources()
        stopwords_getlanguages(source = 'nltk')
        stopwords('tr', source = 'nltk')

      Similarly, for stemming it uses SnowballC. Same kind of process:

        library(SnowballC)
        getStemLanguages()
        library(quanteda)  # char_wordstem() and dfm_wordstem() are quanteda functions
        char_wordstem("aslında", language='turkish')  # (same should work for dfm_wordstem)

      So, not sure how well this works, but it does seem to be supported!

    • @kobeoncount
      @kobeoncount 1 year ago

      @kasperwelbers This is so helpful, thank you!!

  • @R0bbie4141
    @R0bbie4141 1 year ago

    Hey Kasper. Thanks for your free YouTube Premium in an Airbnb in Berlin last week 😅. I logged you out when I went home. 👍🏻

  • @ethanjudah8420
    @ethanjudah8420 1 year ago

    Hi, I'm trying to do this on reddit data but the files I have are too large (100gb+) for only 3 months of data. That's in .zst. Do you have any suggestions on how to deal with this and apply these techniques on this data set in R?

    • @kasperwelbers
      @kasperwelbers 1 year ago

      If your file is too large to keep in memory, the only option is to work through it in batches or streaming. So the first thing to look into would be whether there is a package in R for importing ZST files that allows you to stream it in or select specific rows/items (so that you can get it in batches). But perhaps the bigger issue here would be that with this much data you really need to focus on fast preprocessing, so that you'll be able to finish your work in the current decade. So first make a plan what type of analysis you want to do, and then figure out which techniques you definitely need for this. Also, consider whether it's possible to run the analysis in multiple steps. Maybe you could first just process the data to filter it on some keywords, or to store it in a searchable database. Then you could do the more heavy NLP lifting only for the documents that require it.
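
The batch idea from the reply above can be sketched in base R by reading a fixed number of lines at a time from an open connection and keeping only what passes a cheap keyword filter. The example uses a small generated text file so that it runs anywhere; the file contents, the keyword, and the batch size are all assumptions. For a .zst file, one possible route (untested here, and assuming the `zstd` command line tool is installed) would be to swap the connection for `pipe("zstd -dc yourfile.zst", open = "r")`.

```r
# Generate a small stand-in file: 1000 lines, every 4th one mentions "politics"
path <- tempfile(fileext = ".txt")
topics <- rep(c("cats", "politics", "sports", "cooking"), 250)
writeLines(sprintf("comment %d about %s", 1:1000, topics), path)

con <- file(path, open = "r")  # for .zst input, see the pipe() note above
batch_size <- 100
kept <- character(0)

repeat {
  batch <- readLines(con, n = batch_size)  # only this batch is held in memory
  if (length(batch) == 0) break            # end of file reached

  # Cheap first-pass filter; heavier NLP would run only on what is kept
  kept <- c(kept, batch[grepl("politics", batch, fixed = TRUE)])
}
close(con)

length(kept)  # 250 of the 1000 lines match the keyword
```

In a real pipeline the kept lines would more likely be appended to a file or database per batch than accumulated in memory.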

  • @PaulYoung-r8g
    @PaulYoung-r8g 1 year ago

    great thanks

  • @PaulYoung-r8g
    @PaulYoung-r8g 1 year ago

    This is amazing. Thank you

  • @67lobe
    @67lobe 1 year ago

    Hello, I can't find the moment where you speak about Word documents. I have my Word documents and want to create a corpus from them.

    • @kasperwelbers
      @kasperwelbers 1 year ago

      Hi @67lobe, I don't think I discuss word files in this tutorial. But I think the best ways are to use the 'readtext' package, or 'antiword'. The readtext package is probably the best to learn, because it provides a unified interface for various file types, like word, pdf and csv.

  • @m9017t
    @m9017t 1 year ago

    Very well explained, thank you!

  • @MrJegerjeg
    @MrJegerjeg 1 year ago

    What if you have combinations of two different groups. For example, you measure blood pressure from volunteers after drinking a certain number of units of alcohol. You do that in two different locations. So you want to fit a line per individual, but you also want to control for the location effect. Right?

    • @kasperwelbers
      @kasperwelbers 1 year ago

      You can certainly have multiple groups. First, you could have groups nested in groups. If you perform the same experiment in many countries across the world, your units would be observations nested in people (group 1) nested in countries (group 2). Second, you could have cross-nested (or cross-classified) groups. For example, say we want to study if the effect of more alcoholic beverages on blood pressure differs depending on the type of alcoholic beverage (beer, wine, etc.). In that case, each person could have observations for multiple beverages, and each beverage could have observations for multiple people.
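
In lme4's formula syntax, the nested and crossed structures from the reply above could look roughly as follows. The data are simulated, and all names and effect sizes (60 people in 6 countries, 3 beverages, a true slope of 2 per unit of alcohol) are assumptions for illustration. With only 3 beverage groups the beverage variance is poorly estimated and lme4 may report a singular fit; the point here is the syntax. The fit is skipped if lme4 is not installed.

```r
set.seed(7)

# Simulated design: each person tries each beverage at 0 to 4 units
d <- expand.grid(
  person   = factor(1:60),
  beverage = factor(c("beer", "wine", "spirits")),
  units    = 0:4
)
# People are nested in countries: persons 1-10 in country 1, 11-20 in country 2, ...
d$country <- factor((as.integer(d$person) - 1) %/% 10 + 1)

country_int  <- rnorm(6, 0, 4)
person_int   <- rnorm(60, 0, 6)
beverage_int <- rnorm(3, 0, 2)

d$bp <- 120 + country_int[as.integer(d$country)] +
  person_int[as.integer(d$person)] +
  beverage_int[as.integer(d$beverage)] +
  2 * d$units + rnorm(nrow(d), 0, 5)

if (requireNamespace("lme4", quietly = TRUE)) {
  library(lme4)

  # (1 | country/person): people nested in countries
  # (1 | beverage):       beverages crossed with people
  m <- lmer(bp ~ units + (1 | country/person) + (1 | beverage), data = d)
  print(fixef(m))
}
```

To fit a separate line per individual, as in the original question, the person term could be extended with a random slope, for example `(1 + units | person)`.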

    • @MrJegerjeg
      @MrJegerjeg 1 year ago

      @kasperwelbers I see, thanks. I can imagine that having all these nested and cross-nested groups can complicate the model and its interpretation quite a lot.

  • @learning.data.science
    @learning.data.science 1 year ago

    Thank you for the informative text analysis videos. I am just a beginner in text analysis and R, and I started with your videos. I have a question at 12:13: kwic() needs tokens(), so I applied toks <- tokens(corp) k = kwic(toks, 'freedom', window = 5). Is that correct?

    • @kasperwelbers
      @kasperwelbers 1 year ago

      Yes, you're correct. The quanteda API has seen some changes since this video was recorded. You can still pass a corpus directly to kwic, but it will now throw a warning that this is 'deprecated'. This means that at the moment it still works, but at some point in the (near) future it will be mandatory to tokenize a corpus before using kwic.