Thank you so much! You'll never know how much your video saved my life!! So thank you for your clear instructions!!! Have a nice day~
Awesome video!! I learned so much from this. Truly grateful for the awesome resource. Thanks a lot.
Good day! Thanks a lot for uploading this video it was really helpful to learn topic modeling using R! All the best to you!!! 🙋🏻♂️
You are welcome, all the best
Hi Opal, thanks for such a clear demonstration of the process. Now to implement it in my own LDA models.
You are welcome Amanda
Excellent Tutorial!
Thanks Dayne
Awesome to make it so simple and easy for beginners as well
Thanks so much 😊
Hi! Thank you for the video! I've tried it out and it finally worked! If you have time, I would really love a tutorial on how to conduct topic modeling using STM!!
Your videos are so great!
Thank you so much, your videos helped me a lot!!!
Thank you for this video. It helped a lot.
Lovely tutorial 💕
Thank you Kenya Bolt 😊
Hi Opal, thank you for the great video. I was able to follow along and produce the bar charts; however, some columns have dashes instead of words. Can you help me filter out those dashes? Thanks
This was great. Thank you. Can you please do a video on the Quanteda package?
Thanks for the tutorial, great video. When creating the topic model charts, is there a way to get topic labels added rather than just numbering them?
Dear, could you assist with choosing the optimal number of topics (k)? Thanks
There are several schools of thought on selecting k. I do not have a tutorial on this at the moment; however, see the link below for different recommendations on how to approach it.
stackoverflow.com/questions/17421887/how-to-determine-the-number-of-topics-for-lda
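For reference, one data-driven way to compare candidate values of k (not covered in the video) is the ldatuning package, which fits a model for each k and plots several goodness-of-fit metrics. A minimal sketch, assuming DTM is the Document Term Matrix built earlier in the tutorial:

library(ldatuning)

# fit LDA models for k = 2..20 and score each with four common metrics
result <- FindTopicsNumber(
  DTM,
  topics  = seq(2, 20, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 123),
  verbose = TRUE
)

# look for the k where the metrics level off or reach their optimum
FindTopicsNumber_plot(result)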
What would I do if I have my text data as lines in a csv file?
Hello, respected Ms. Quite an informative video. I had a question: if we were to run LDA on a corpus of abstracts accessed in .csv format from academic databases such as Scopus or WOS, how can we run it? Could you please demonstrate it?
Hello, your video is very useful. I'd like to use topic modeling to understand the importance of the concept "climate justice" in the social sections of company documents. If "climate justice" is a term, what should the topics be in this study?
Hi Stevy Dibala, thank you for your feedback. If you are using topic modeling, the topics will emerge from the analysis of the documents. If you watch this video to the end, I discuss the topics that emerge from the documents.
@@DataCentricInc Wow! Thanks for the speedy feedback. I'll watch to the end. Thanks.
For the documents to analyse, I'm planning to get my documents about the use of AI from « Factiva ». Since I'll have a lot of documents, is the best way to load into R only the URLs of all the documents found, or each single document?
I would suggest you load the documents similarly to how I have loaded them in R. You can visit my webpage for the code; it is available in the description of the video.
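For reference, a minimal sketch of loading a folder of PDFs into R, assuming the pdftools package and a folder named documents (both names are assumptions, not taken from the video):

library(pdftools)

# list every PDF in the folder and extract its text, one element per file
files <- list.files("documents", pattern = "\\.pdf$", full.names = TRUE)
texts <- lapply(files, pdf_text)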
Did anyone have a problem applying the LDA model? I did, and I am still trying to figure out how to fix it...
Hi again! I just noticed there is a new package available on CRAN, 'fastTopics'. If you can manage some time, will you please make a video on how to do topic modeling using fastTopics? Thanks again for your kind efforts! 🤝
Hi~ thank you for your video. Could you tell me why the topic number is k = 4?
I chose how many topics to look for based on the number of PDF documents I had.
Hello! I have a question: is there a way to implement LDA in other languages? I'm trying to apply it to Italian reviews from the web.
Hi, thanks for your video! I'm trying to implement LDA for customer reviews. Would that be the same process in R?
Great tutorial, Opal. Is there a scientific way of properly determining k, the number of topics?
There are several schools of thought on selecting k. I do not have a tutorial on this at the moment; however, see the link below for different recommendations on how to approach it.
stackoverflow.com/questions/17421887/how-to-determine-the-number-of-topics-for-lda
Thanks Opal
Hi Opal, thank you for the nice work you are doing. Your code is simple and easy to work with.
When plotting the chart for beta, I got this error: Error in `geom_col()`: ! `mapping` must be created by `aes()`. Please help. Thank you.
Hi Audu, you can access the full code from the link in my video description.
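For reference, this error usually means the mapping was not created with aes() (for example, aes() was omitted or passed in the wrong argument). A minimal sketch of a beta bar chart in the tidytext style; beta_topics and the exact styling are assumptions, not the blog's exact code:

library(dplyr)
library(ggplot2)

# beta_topics is assumed to hold tidy(lda_model, matrix = "beta")
beta_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(term, beta), y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +   # the mapping above comes from aes()
  facet_wrap(~topic, scales = "free") +
  coord_flip()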
Hi Opal, thank you for creating this content. I'm learning how to implement LDA on my dataset which consists of social media posts. Should I remove hashtags in the cleaning process or would those hashtags actually contribute to the resulting topics?
I would include the hashtags as they contribute to the overall pattern of the topics.
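For reference, the standard removePunctuation step would strip the # symbol, so keeping hashtags needs a custom cleaning step. A minimal sketch with tm (my own suggestion, not from the video):

library(tm)

# remove punctuation except '#', so hashtags survive the cleaning
keep_hashtags <- content_transformer(function(x) {
  gsub("[^[:alnum:]#[:space:]]", " ", x)
})
corpus <- tm_map(corpus, keep_hashtags)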
Dear, I was trying LDA with my PDF documents, and the beta result is coming out with numbers in the term column instead of words. For information, the documents had previously been cleaned. The output columns are:
topic   term   beta
Hi Stevy, did you remove numbers when you were cleaning the documents?
I used CSV Twitter data and followed the tutorial; however, when I run "beta_topics" I don't see terms in the "term" column, mine appear as numbers. Kindly assist me with this.
Hi Banny, did you remove numbers when cleaning the corpus? Refer to the sketch below to remove numbers.
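The original code snippet was cut off in the comment, so here is a hedged reconstruction with tm, assuming the corpus object is named document:

library(tm)

# strip all digits from the corpus before building the Document Term Matrix
document <- tm_map(document, removeNumbers)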
@@DataCentricInc I did remove numbers and followed the code.
How did you set the topic number to 4? Is it arbitrary?
Hi again, I also have one question: how do I add Slovenian stopwords in R? Do you maybe know? Thank you so much.
You would have to specify the language for the stopwords. Once it is supported, you can add Slovenian stopwords. You can check out this link to one of my videos that speaks about the different languages as well. ruclips.net/video/oKTG5ulP3wQ/видео.html
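For reference, a minimal sketch using the stopwords package, whose stopwords-iso source includes a Slovenian list (this package is my suggestion, not covered in the video), combined with tm's removeWords:

library(tm)
library(stopwords)

# fetch the Slovenian stop word list and strip those words from the corpus
sl_stopwords <- stopwords::stopwords("sl", source = "stopwords-iso")
corpus <- tm_map(corpus, removeWords, sl_stopwords)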
Hi Opal, great tutorials, thank you for sharing. Newbie here; this will enhance my knowledge of topic modeling. Just a quick question, and it's quite technical. It's about the DocumentTermMatrix functions: the term column in my results does not show any words, only numerical data in beta_topics. Is it from the document matrix? I used two different functions to create the DTM (1. TermDocumentMatrix and 2. DocumentTermMatrix) and tried both, with the same output. :(
Hi Wendy Imam, you can try reducing the sparsity of the Document Term Matrix to show only terms that appear a significant number of times. For example, in the sketch below I reduce the sparsity of the Document Term Matrix to include only terms that appear at least 100 times. You can reduce that number. This should get rid of the numbers in your analysis. Hope this helps.
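The original snippet was cut off in the comment; a hedged reconstruction using tm's findFreqTerms, assuming the matrix is named DTM:

library(tm)

# keep only terms whose total count across the corpus is at least 100
frequent_terms <- findFreqTerms(DTM, lowfreq = 100)
DTM <- DTM[, frequent_terms]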
Thank you so much for this video! This is a HUGE help since I need to do topic modeling for my dissertation. I followed the code directly from the blog and was able to replicate both plots with my own data.
I have a question about term frequency (tf-idf values) and removal of sparse terms: are these already incorporated in the model? Sorry if this is such an obvious question. I'm still trying to wrap my head around all the R text mining tutorials and papers. I find your tutorials the simplest to follow, so I figured I might go ahead and ask. Thank you again!
You are welcome Cep Garcia. No, I did not remove sparse terms in this tutorial; however, you can achieve this by including the following after you create the Document Term Matrix. The first lines reduce the matrix to the top 100 most frequent terms. You can play around with that number based on what you are trying to achieve. This will reduce the sparsity in the text. Please note that each time you change the number you must recreate the DTM and then reduce it.
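The code snippet was cut off here as well; a hedged reconstruction that keeps the 100 most frequent terms, assuming the matrix is named DTM:

library(slam)

# total frequency of each term across all documents
term_totals <- slam::col_sums(DTM)

# keep only the columns of the 100 highest-frequency terms
top_terms <- head(names(sort(term_totals, decreasing = TRUE)), 100)
DTM <- DTM[, top_terms]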
@@DataCentricInc Thank you so much Dr. Opal, it did help! I like how you said to "play around" with it. It is definitely challenging, but also fun!
Hi Opal, wonderful video. I'm trying to determine whether an abstract or topic could be more common at one university than another. I have 90 abstracts, titles and keywords from 4 universities. I want to know whether the abstracts have some relationship among them, or based on the university where they were written. For example, topic modelling could group these abstracts into four groups: university 1, university 2... Sorry for my English, I'm not a native speaker.
Hi Opal, I have this trouble. When I want to run > DTM
I have never observed that error, so I am not certain of the resolution; however, here is a link to a page that discusses the Rcpp issue: support.rstudio.com/hc/en-us/articles/4415936301335-Resolving-Rcpp-precious-remove-error
Hi, first up, thank you for such a wonderful and extremely useful video.
Now my query: I have 146 PDF files of research papers, which I have merged together into a single file. Please help me to clean the file and to get a bigram analysis of it. Your help will be much appreciated, thank you.
Hi Moonis, I will be working on content to show how to execute this, so look out for it.
@@DataCentricInc thank you
Hi, I liked your video. It is very useful, thank you so much.
Could you help me with one more small thing?
What code can I use to delete additional text from the corpus after I have applied stopwords("en")?
Hi Sahil, thanks for your support. You can also watch this video that shows how to create a custom list of stop words. The description of the video will also take you to the code on my webpage. ruclips.net/video/oKTG5ulP3wQ/видео.html
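For reference, a minimal sketch of removing a custom word list with tm's removeWords; the words shown are placeholders, not from the video:

library(tm)

# drop extra project-specific words after the standard English stop words
custom_stopwords <- c("placeholder", "example", "anotherword")
corpus <- tm_map(corpus, removeWords, custom_stopwords)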
@@DataCentricInc Thank you so much !!
Most beautiful
Thank you! Cheers!
I am finding your videos so instructive and helpful, but am still having errors when I load documents.
Hi itumeleng, what error are you getting and what exactly are you trying to do?
@@DataCentricInc Thanks for replying. I am trying to follow your videos on text analytics on PDFs. The code is not loading them; the lapply function is giving an error.
I have an urgent academic paper I am writing on texts about Land in the Bible and in South Africa. I have created a folder of PDF texts but cannot load it using the code you suggested.
@@itumelengmosala5335 Did you store the PDF files in the same folder you are working in for R? I would really need to see a screenshot of your RStudio and the code. If you want, you can email me at DataCentricInc@gmail.com
Hi, I am conducting text mining on academic articles and approved PDF docs from a government website, and I found that some PDFs are protected. How can I perform text mining? Can we have a meeting on Zoom to discuss this in more detail?
Thx
By protected, do you mean that they have passwords?
@@DataCentricInc yes they have passwords
Hi ma'am, will you create a video on PDF text analytics for specific keywords, for example social responsibility in company annual reports?
Thanks
Will try
Thank you so much, hope to see the video soon.
Could you please identify the error?

doc_gamma.df
doc_gamma.df$chapter
ggplot(data = doc_gamma.df, aes(x = chapter, y = gamma,
       group = factor(topic), color = factor(topic))) +
  geom_line() + facet_wrap(~factor(topic), ncol = 1)

Error in FUN(X[[i]], ...) : object 'chapter' not found
It means that the variable chapter does not exist. Ensure that the line that creates the variable was executed before plotting; see the sketch below.
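For reference, a hedged sketch of the missing step and the full plot, assuming a topicmodels fit named lda_model and the tidytext gamma format (both names are assumptions):

library(topicmodels)
library(tidytext)
library(ggplot2)

# one row per document-topic pair, with the gamma proportion for each
doc_gamma.df <- tidy(lda_model, matrix = "gamma")

# the error means a step like this was skipped: derive chapter from the
# document identifier before using it as the x-axis
doc_gamma.df$chapter <- as.integer(factor(doc_gamma.df$document))

ggplot(doc_gamma.df, aes(x = chapter, y = gamma,
                         group = factor(topic), color = factor(topic))) +
  geom_line() +
  facet_wrap(~factor(topic), ncol = 1)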
Hi, I find this video very useful. I have one question. In my research, I am analysing comments from social media, but I organised the data as follows and would be happy if you could help. In the Excel document, I have the author of each comment in one column and the content of the comment in the other column. So I have approx. 4,000 rows, and each row has two columns: one for the author and one for the comment. I had all of these comments "separate" in my document, but I wanted to combine them. I obtained each group of comments from individual web portals (e.g. Facebook posts, comments under articles, Reddit debates, ...) and combined all these documents of comments into the two columns. So now all the comments are written in one column; that's my corpus (it is all combined now, one comment per row). Can I use LDA in R on this data set? Or do the comment groups need to be separated into individual documents for the LDA method? I hope my question is clear, thank you so much.
I believe you should be able to proceed with everything combined.
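For reference, a minimal sketch of turning such a spreadsheet into a corpus where each comment row becomes its own document; the file and column names are assumptions:

library(tm)

# read the exported spreadsheet; assumed columns: author, comment
comments <- read.csv("comments.csv", stringsAsFactors = FALSE)

# one document per comment row, ready for the usual cleaning and DTM steps
corpus <- VCorpus(VectorSource(comments$comment))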