TF-IDF | Introduction to Text Analytics with R Part 5

  • Published: 3 Oct 2024

Comments • 51

  • @rafaelfonseca7942
    @rafaelfonseca7942 10 months ago

    Good teaching so far, thanks sir!

  • @avishekkumar9341
    @avishekkumar9341 7 years ago +2

    Thanks Dave. Really appreciate your style of teaching. Very helpful.

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Avishek Kumar - Thank you for the compliment, greatly appreciated!

  • @juanmauricioarrietalopez2395
    @juanmauricioarrietalopez2395 4 years ago +2

    All of these videos, this whole playlist, have helped me with my thesis. Thanks a lot!

  • @163ii
    @163ii 3 years ago

    Thanks for your contribution to expanding data science knowledge!

  • @slimoueslati9479
    @slimoueslati9479 6 years ago +1

    Or you could simply use weightTfIdf from the tm package!
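
    For reference, a minimal sketch of that approach, assuming a tm corpus object named corpus (the variable name is illustrative):

    library(tm)

    # Build a document-term matrix with TF-IDF weighting instead of raw counts.
    dtm <- DocumentTermMatrix(corpus,
                              control = list(weighting = weightTfIdf))
    inspect(dtm)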

  • @deepaksingh9318
    @deepaksingh9318 7 years ago +3

    Hi,
    One question on the calculation of TF-IDF here:
    as per the coded formula, you are taking only documents which have > 0 occurrences.
    But it should be all the documents, right?
    Is that correct?

    • @Datasciencedojo
      @Datasciencedojo  7 years ago +1

      @deepak singh - If I understand your question correctly, the IDF calculation uses the following:
      corpus.size == the total number of documents
      doc.count == the number of documents containing the term
      HTH,
      Dave
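
      A sketch of helper functions along these lines, matching the definitions above (the exact code in the video may differ slightly; the base-10 log only rescales the scores):

      # Term frequency: a term's count divided by the document's total count.
      term.frequency <- function(row) {
        row / sum(row)
      }

      # Inverse document frequency, using the quantities defined above.
      inverse.doc.freq <- function(col) {
        corpus.size <- length(col)           # total number of documents
        doc.count <- length(which(col > 0))  # documents containing the term
        log10(corpus.size / doc.count)
      }

      # TF-IDF is the product of the two.
      tf.idf <- function(tf, idf) {
        tf * idf
      }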

  • @cikatheresia
    @cikatheresia 5 years ago +1

    Hi Dave, thanks for your video! Learned a lot today.
    Unfortunately, I'm not blessed with the number of documents you have; in fact, I only have one document to analyze! Therefore, using your log(N/count(t)) would not be rational, because a lot of terms would hit a score of 0. What do I do? Bet on domain knowledge and manually flag certain terms as stopwords?

  • @bhupen008
    @bhupen008 7 years ago +3

    You are a great teacher! I will follow each and every one of your videos from now on. Thank you so much. More power to you. Regards

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Bhupendra Kumar - You are too kind. Glad you found the videos useful!

  • @maryamsi5090
    @maryamsi5090 3 years ago +1

    what does "Text" represent?

    • @Datasciencedojo
      @Datasciencedojo  3 years ago

      Hello Maryam, the text represents 'term frequency-inverse document frequency'.

  • @MasoudPaydar
    @MasoudPaydar 7 years ago +3

    Thanks a lot. Learned a lot today.

  • @Fealivren
    @Fealivren 5 years ago +1

    Hi Dave,
    First of all, I'd like to say your videos are great, simple to digest and follow. So thank you for that!
    I'm encountering an odd error, stating: 'Error: cannot allocate vector of size 45 Kb' while running the TF function.
    Would you have any insight into solving the problem? I'd be very grateful! Regards.

    • @byoder999
      @byoder999 5 years ago

      I got the same problem. Have you solved it?

  • @dr.javidiqbal265
    @dr.javidiqbal265 4 years ago

    Hello. I am looking to calculate TF-IDF, and your video looks great. However, could you help me with this: I want to find dictionary-specific terms in a corpus of 500 annual reports and then calculate TF-IDF. Is it possible to do that? In addition, I want to keep the identity of each annual report rather than mixing them all into one document.

  • @vaz.felipe
    @vaz.felipe 4 years ago

    Hi guys from Data Science Dojo. In my case I have just one document, so do I even need to implement the TF-IDF method? Or should I treat the documents as the individual respondents in my CSV file? Which approach should I follow?

  • @akrsrivastava
    @akrsrivastava 6 years ago +1

    Thanks. A good explanatory video on text mining with R was missing on YouTube. I have followed your other videos on YouTube and found them helpful.
    One query: Quanteda has built-in functions for calculating TF-IDF. Any specific reason why you wrote the code yourself?

    • @juanmauricioarrietalopez2395
      @juanmauricioarrietalopez2395 4 years ago

      He wrote it out in the video because it is easier for viewers to understand how TF-IDF works.
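
      For reference, Quanteda's built-in equivalent is dfm_tfidf(); a minimal sketch, assuming a document-feature matrix named train.tokens.dfm as in the series:

      library(quanteda)

      # Apply TF-IDF weighting to an existing document-feature matrix.
      train.tokens.tfidf <- dfm_tfidf(train.tokens.dfm)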

  • @Jayl__
    @Jayl__ 7 years ago +1

    Should I be concerned with these results when I run my model:
    1: closing unused connection 7 (

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Jayl - That is output indicating that your doSNOW cluster had connections that were not properly closed. It's nothing to worry about in particular, but it is best practice to ensure stopCluster() is called every time you spin up a doSNOW cluster.
      HTH,
      Dave
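
      A minimal sketch of that pattern (the cluster size here is illustrative):

      library(doSNOW)

      # Spin up a cluster and register it as the parallel backend.
      cl <- makeCluster(3, type = "SOCK")
      registerDoSNOW(cl)

      # ... parallel work, e.g., a foreach() loop ...

      # Always shut the cluster down so no connections are left open.
      stopCluster(cl)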

  • @sidhantamaharana3813
    @sidhantamaharana3813 7 years ago +1

    This is probably my first YouTube video comment, because I couldn't resist expressing my appreciation for this wonderful video. Thank you, and kudos to you for such a wonderful explanation. :D

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @sidhanta maharana - Glad you liked the video!
      Dave

  • @sebastianhuppmann2017
    @sebastianhuppmann2017 7 years ago +1

    Thanks for your videos! Learned a lot!

  • @danielencinasz
    @danielencinasz 4 years ago

    Hi Dave, I am experiencing this problem and I cannot create the matrix needed: Error in asMethod(object) :
    Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

  • @sureshnaik2943
    @sureshnaik2943 4 years ago

    I have a bunch of text files in my folder, and I want answers from all those text files. This should happen while I am coding in R: when I type some question in R, for example "what is earth", I need answers extracted from all those files as sentences.

  • @prateek2191
    @prateek2191 7 years ago +2

    Hi, I am following your tutorial from the start and trying to build a model for my problem, which includes more than 2 class labels. I tried the bag-of-words model with random forest, as in your fourth tutorial, and it is giving a maximum of about 40% accuracy. Any advice on how I can improve my model's accuracy?

    • @Datasciencedojo
      @Datasciencedojo  7 years ago +1

      @Prateek Agarwal - Without knowing more about your scenario, data, etc., I can only provide advice in the most general sense. Here's what I can suggest:
      1 - Try out various types and combinations of n-grams (a sketch follows below).
      2 - Try out using both TF-IDF and not TF-IDF.
      3 - Try both with and without SVD/LSA.
      4 - Try various numbers of singular vectors for SVD/LSA.
      5 - Try out various algorithms (e.g., SVMs).
      NOTE - The above assumes continuing with the bag-of-words model. However, it is possible that this alone might not be enough for your data. For example, you may require more advanced techniques like chunking to create more powerful features.
      HTH,
      Dave
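
      A minimal sketch of point 1, assuming quanteda tokens named train.tokens as in the series:

      library(quanteda)

      # Combine unigrams and bigrams into a single feature set.
      train.tokens.ngrams <- tokens_ngrams(train.tokens, n = 1:2)
      train.tokens.dfm <- dfm(train.tokens.ngrams)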

    • @prateek2191
      @prateek2191 7 years ago +1

      Thanks Dave. Basically I am trying to classify tickets based on users' textual complaints. I tried different algorithms; random forest is giving the maximum accuracy of 40%.

    • @Datasciencedojo
      @Datasciencedojo  7 years ago +1

      @Prateek Agarwal - Always keep in mind that you may not have enough signal in your data to build an effective predictive model. I'm not saying that this is true in your particular case, but it is something to keep in mind. For example, the textual complaints might not be enough. You may also need additional data from the tickets (e.g., customer-related data) as well.
      HTH,
      Dave

  • @xixicy
    @xixicy 6 years ago +1

    Dave, love your lecturing!

  • @williamjohnson928
    @williamjohnson928 5 years ago

    I am having memory issues with 8 GB of memory. Is that to be expected? (It is on a work machine... unsure if I can request more memory) Is there a file or two that I could eliminate to continue the work?

  • @phamngocminh4907
    @phamngocminh4907 6 years ago

    Dear Dave, I really love your tutorials. Thank you very much for all your effort in making these videos and sharing your insights and experience in data science. Besides the Titanic and these text analytics tutorials, do you have any others? I want to check them all out :). Also, I want to ask whether the TF (term frequency) you mentioned in this tutorial is the count of a term over the total number of terms in ONE document, or in ALL documents in the matrix/data frame. Finally, why don't we use 'scale' to normalize our data?

  • @mainestouen
    @mainestouen 7 years ago +1

    Hi Dave. I'm trying to use this approach but instead of training the model to predict text that results in ham or spam, I'm trying to predict which text leads to better email open rates. I have split my corpus into quartiles ranked by open rate and am training the model against these quartiles. Is this a valid adaptation of your approach outlined in this video series?

    • @mainestouen
      @mainestouen 7 years ago +1

      Great, thanks for your help. That's exactly how I've set it up; now let's see if it's accurate enough!

    • @bhupen008
      @bhupen008 7 years ago

      Hi Gareth, I am pretty new to this world. It would be very kind of you if you could share your work, so that I can compare and work my way through it.

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Gareth Lloyd - If I understand your scenario correctly, then yes! The main change will be that instead of a binary classification problem you would have a multi-class problem (i.e., 4 classes with one class for each quartile). The mighty random forest can handle multi-class problems like this just fine. Happy data sleuthing!
      HTH,
      Dave

  • @PCI208
    @PCI208 6 years ago

    Hey, can you help me?
    When I run this code:
    train.tokens.tfidf

  • @juwlsss
    @juwlsss 6 years ago

    Dear Dave,
    First of all, thank you for the great introductory course!
    I've worked through the first 5 videos of this course on the spam dataset, and in parallel I am running the code on my own dataset, which is slightly larger. In particular, I have about 1.6 million observations, each one containing just about the same amount of text as the spam observations.
    When I created my train.tokens.dfm, I got a large dfm object of 200 billion elements, 420 MB.
    Then I proceeded to the next step to create a matrix, which is when I encounter memory problems.
    I've tried (as per your instruction)
    train.tokens.matrix = as.matrix(train.tokens.dfm)
    or
    train.tokens.df = cbind(Label = train$Label, as.data.frame(train.tokens.dfm))
    both of these calls return the following error:
    Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
    I can imagine what size matrix these calls must create, but I really do want to use tokenization in my approach.
    Is there any way around it?
    Thank you in advance!
    P.S. I'm running a MacBook Pro, Core i7, 16 GB RAM.
    P.P.S. Your caret xgboost tutorial is awesome!

    • @juwlsss
      @juwlsss 6 years ago

      Update: after checking the dimensions of this dfm, here's what I've got:
      > dim(train.tokens.dfm)
      [1] 1482535  138945
      where 1,482,535 is the number of observations and 138,945 is the number of features.
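
      A possible workaround is to stay sparse and trim rare features before any dense conversion; a sketch assuming a recent quanteda version (the thresholds are illustrative):

      library(quanteda)

      # Drop very rare features to shrink the matrix before densifying.
      train.tokens.dfm <- dfm_trim(train.tokens.dfm,
                                   min_termfreq = 5, min_docfreq = 5)

      # Convert directly from the sparse representation to a data frame.
      train.tokens.df <- cbind(Label = train$Label,
                               convert(train.tokens.dfm, to = "data.frame"))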

  • @hussainhatimi4858
    @hussainhatimi4858 6 years ago

    Really helpful videos!
    But when I run this line of code it gives me an error:
    > train.tokens.df

    • @Fealivren
      @Fealivren 5 years ago

      Hello Hussain. I've just got the exact same error. Do you remember if you solved this issue?

    • @lauravanberkel7426
      @lauravanberkel7426 5 years ago +2

      @@Fealivren The code to increase storage capacity (found here: stackoverflow.com/questions/10917532/memory-allocation-error-cannot-allocate-vector-of-size-75-1-mb) solved this error for me.
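
      For reference, the fix described there is along these lines (Windows-only, and removed in recent versions of R; the size value is illustrative):

      # Check the current memory limit in MB (Windows only).
      memory.limit()

      # Raise the limit, e.g., to ~16 GB, before building large objects.
      memory.limit(size = 16000)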

  • @kanika1
    @kanika1 2 years ago

    Loved all your videos. I highly recommend them to any beginner.

    • @Datasciencedojo
      @Datasciencedojo  2 years ago

      Glad you liked it; stay tuned for more tutorials.

  • @TheBebwa
    @TheBebwa 6 years ago

    This is awesome.... thanks

  • @93jackjoe
    @93jackjoe 6 years ago

    Thanks yet again!