Basically, a very cool video, Keith! But for NLP I recommend using spacy.io. Also waiting for deeper videos about ML. Can you make any about PyTorch or TensorFlow in the future, please?
You are so good, explaining the hardest things in common language and makes it easy to understand to even my grandma.... Thanks so much for making this simple!
phew.. that was heavy, not sure if you were rushing in with a lot of things, but then maybe there's so much to cover there anyways. You did well in spite of all anyways, but yeah there's a lot to pick up from this video.. good work
I am not getting any recommendation from GridSearchCV.
My code:
from sklearn.model_selection import GridSearchCV
parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)
The response:
GridSearchCV(cv=5, estimator=SVC(),
param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})
Nothing else!! Did I miss something, or has this stopped working? Thanks
Hey Keith. It looks like the relatively low score (~80%) is caused by imperfect training data. I'm talking about the conversion of the star rating into one of three classes: NEGATIVE, POSITIVE, NEUTRAL. The problem is that stars are assigned by customers, not by an Amazon AI engine, and people treat, say, a 3-star rating in very different ways. Even a customer who is not really happy with the product and gives fairly negative feedback can still give 3 stars. So, while 5-4 stars work well for POSITIVE and 2-1 stars for NEGATIVE, there is some uncertainty with the 3-star rating. I think (5-4-3 stars for POSITIVE and 2-1 stars for NEGATIVE) or (5 stars for PERFECT, 4-3 for POSITIVE and 2-1 for NEGATIVE) logic should give us a 90-95% score. Thoughts?
Yeah you're very right with your thoughts. The meaning of the 3-star classification is pretty ambiguous and we can't reliably count on the data rated this way. Ultimately though the models that were being scored with ~80% were only classifying between NEGATIVE (1-2 star) & POSITIVE (4-5 star) so our model had more issues than just how we categorized the data. If we want to get that score up higher we will want to apply some additional processing to our text. Some ideas would be removing stop words (words like "the", "this", "that", etc), lemmatizing/stemming (converting words to a base form), and utilizing bigrams (pairs of words instead of single words). Another reason for a relatively low score is that our data is not perfect. Even some of the 5 star reviews probably have no meaningful information that conveys positive sentiment in their review text. Same goes for 1 star reviews. Potentially doing some manual review of our training data would be another way to improve the score. Hope this information is helpful! BTW, I'm a huge hockey fan and after noticing the hat in your profile picture I have to quickly say.... Go Bruins!!! ;)
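One of the ideas in the reply above, removing stop words, can be tried directly on the vectorizer. This is only a minimal sketch: the two reviews are invented for illustration, and it assumes scikit-learn's built-in English stop-word list is good enough for this data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews, invented for illustration (not taken from the Amazon dataset).
reviews = [
    "This is the best book that I have read",
    "The plot of this book was boring",
]

# stop_words="english" drops common function words ("the", "this", ...)
# using scikit-learn's built-in list before counting.
vectorizer = CountVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(reviews)

# The learned dictionary no longer contains the stop words.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Whether this helps the score on the real reviews is an open question; it shrinks the dictionary, but words like "not" are also on the built-in list, which can hurt sentiment tasks.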
I have the impression that the transformation from text to vector has to be done before the "even distribution" of the training sample, because otherwise we would have a different word-to-vector transformation. Does that make sense?
Any idea why linear regression doesn't work?
from sklearn.linear_model import LinearRegression
clf_log = LinearRegression()
clf_log.fit(train_x_vectors, train_y)
clf_log.predict(test_x_vectors[0])
Error --> ValueError: could not convert string to float: 'POSITIVE'
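The error above comes from LinearRegression being a regressor: it expects numeric targets, so the string 'POSITIVE' cannot be converted. The classifier listed in the video outline is Logistic Regression, which accepts string labels. A minimal sketch with invented toy data (the variable names mirror the comment, the sentences are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration.
train_x = [
    "great book loved it",
    "terrible waste of money",
    "really great read",
    "terrible and boring",
]
train_y = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

# LogisticRegression is a classifier, so string labels are fine.
clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

test_x_vectors = vectorizer.transform(["great read"])
prediction = clf_log.predict(test_x_vectors[0])
print(prediction)
```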
Thanks, Keith, for the tutorial. I tried to do something more with the imbalanced data (well, at least I thought of it this way :) ). I tried SMOTEENN, SMOTE + Tomek, SMOTE, and combined over- and under-sampling; none of them showed better performance than yours, done by manually distributing. I tried all of these on data loaded as a pandas DataFrame. In addition, I tried almost exactly the same things on the DataFrame that you did on the JSON, and the F1 scores were 2 times lower. Now I wonder, is there any difference between JSON and a pandas DataFrame? Could you clarify? Perhaps you also tried loading it as a DataFrame. Keen to know your opinion. Thanks
Great video. Thank you very much. Is there a way of importing the contents of the variable file_name into a dataframe? So instead of print(line), you direct the output into a data frame?
You're very welcome, glad you enjoyed. Check out this link: stackoverflow.com/questions/20037430/reading-multiple-json-records-into-a-pandas-dataframe It should answer your question! :)
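For a file with one JSON record per line, which is the layout the review file appears to have, pandas can read it in one call. A minimal sketch, using an in-memory buffer with two invented records in place of the real file:

```python
import io

import pandas as pd

# Two invented records in a one-JSON-object-per-line layout.
raw = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

# lines=True tells pandas to parse each line as a separate JSON record.
df = pd.read_json(raw, lines=True)
print(df.shape)
```

With the real data, passing the path instead of the buffer should work the same way, e.g. pd.read_json(file_name, lines=True).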
I am at the 44:00 mark, and I keep on getting this ValueError:
ValueError Traceback (most recent call last)
in
3 clf_svm = svm.SVC(kernel='linear')
4
----> 5 clf_svm.fit(train_x_vectors,train_y)
~/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py in fit(self, X, y, sample_weight)
162 accept_large_sparse=False)
163
--> 164 y = self._validate_targets(y)
165
166 sample_weight = np.asarray([]
~/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py in _validate_targets(self, y)
547 classes=cls, y=y_)
548 if len(cls) < 2:
--> 549 raise ValueError(
550 "The number of classes has to be greater than one; got %d"
551 " class" % len(cls))
ValueError: The number of classes has to be greater than one; got 1 class
Do you mind helping me with this please?
Hey! How are you doing? I don't know if you are going to see this, but when I run the F1 score with the 10,000-review file, Jupyter says:
MemoryError: Unable to allocate 1.33 GiB for an array with shape (6700, 26615) and data type int64
I googled it but I couldn't find an answer... Can you help me please? Greetings from Argentina!
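The numbers in that error add up to a fully dense matrix: 6700 x 26615 int64 values at 8 bytes each is about 1.33 GiB. That suggests the sparse vectors were converted with .toarray() or .todense() somewhere, though this is an assumption since the code isn't shown. A possible way around it is to keep the data sparse and use a model that accepts sparse input. The sketch below swaps in MultinomialNB, which is not the classifier used in the video, on invented toy data:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration.
train_x = [
    "great book loved it",
    "terrible waste of money",
    "really great read",
    "terrible and boring",
]
train_y = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # SciPy sparse matrix

# MultinomialNB works on sparse input directly, so no dense copy is made.
clf_nb = MultinomialNB()
clf_nb.fit(train_x_vectors, train_y)

prediction = clf_nb.predict(vectorizer.transform(["great read"]))
print(prediction)
```

The linear SVM and logistic regression from the video also take sparse input, so only a dense-only model such as GaussianNB needs the conversion.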
You kept appearing on my thumbnail.. I didn't care at first.. Later for once i opened the data science video.. Man.. It was so useful. The application videos of machine learning, data science were awesome. Thanks Keith ❤️.
@Keith Galli, really awesome tutorial to watch & try in parallel. however, can you please further clarify how to fit Gaussian Naive Bayes as per this video ?
from sklearn.naive_bayes import GaussianNB
clf_gnb = GaussianNB()
train_x_vectors_array = train_x_vectors.toarray()
test_x_vectors_array = test_x_vectors.toarray()
clf_gnb.fit(train_x_vectors_array, train_y)
clf_gnb.predict(test_x_vectors_array)
This works, but I'm not sure the fit is correct.
OK, I've solved it. In my case the problem was with the Enum class: I was putting a comma after the sentiment string, and that created a list of labels with commas.
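The trailing-comma problem described above is easy to reproduce. A minimal sketch, assuming the labels live in a plain class used as an enum-like container (the class name and values here are illustrative):

```python
class Sentiment:
    NEGATIVE = "NEGATIVE",  # trailing comma: this is the tuple ("NEGATIVE",)
    POSITIVE = "POSITIVE"   # no comma: a plain string


# The two attributes end up with different types.
print(type(Sentiment.NEGATIVE).__name__)
print(type(Sentiment.POSITIVE).__name__)
```

Labels that are tuples instead of strings will not match the values the rest of the code compares against, so checking type(label) on a few training labels is a quick sanity test.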
At the classifiers step, while using Naive Bayes, the lines you show are:
from sklearn.naive_bayes import GaussianNB
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors, train_y)
clf_gnb.predict(test_x_vectors[0])
The third line, .fit(---), is not working; it throws a sparse matrix error. Can you please check it, @Keith Galli? Thank you
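I think the Prep Data step is much clearer this way:
x = [t.text for t in reviews]
y = [z.sentiment for z in reviews]
# train_test_split returns the splits in this order: x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)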
Since the model is trained on equal numbers of POSITIVE and NEGATIVE labels, it shouldn't be biased towards either of them! (But we see there is a meaningful difference in F1 score.) Also, the number of test examples shouldn't affect the model's performance, as our model is already built and will not change with the number of test cases. Am I right?
I get blanks ( DecisionTreeClassifier() ) w/out the classifier information in my code; any ideas? Also thanks these are great videos! As a fellow Sigma Chia; In Hoc!
I think you haven't defined test_x while working on classification. I got an error message, then I saw it in the ReviewContainer section. Did I miss something?
Wouldn't it be easier if we just imported a resampling library like imblearn and oversampled negative reviews / undersampled positive reviews? That said, I learned some OOP and love this tutorial so much. I somehow got an offer as a data scientist that works on NLP, so this actually gives me some confidence lol
Best channel ever to learn any Python library! 1:05 i wonder what the outcome will be for sarcasm, something like: 'beautiful restaurant that made me puke, raccomand'
hey keith, can you build an app for PC that can text any cellphone, can receive replies, maybe store a contacts list, I know that plenty of websites can do this, but an app for PC that doesn't require the cell phone that you're texting to download said app?
Can anyone tell me if there is a way to predict more than one value? An example of this would be to have a studentId and be able to predict not only if he will pass, but what grades he would probably get as well. There would obviously be more data from this specific student to learn from
Absolutely lovely tutorials! I follow all your data science projects. keep doing this :) I have encountered an issue while solving this, have posted my error and code on StackOverflow Link : stackoverflow.com/questions/62347528/trouble-fitting-my-model-on-sklearn-from-svm anyone who can figure out, please comment on on the solution
Video outline!
0:20 - What we will be doing!
3:40 - Sci-Kit Learn Overview
6:38 - How do we find training data?
9:33 - Download data
11:45 - Load our data into Jupyter Notebook
16:38 - Cleaning our code a bit (building data class)
20:13 - Using Enums
22:50 - Converting text to numerical vectors, bag of words (BOW) explanation
25:45 - Training/Test Split (make sure to "pip install scikit-learn" !)
33:45 - Bag of words in sklearn (CountVectorizer)
40:05 - fit_transform, fit, transform methods
42:05 - Model Selection (SVM, Decision Tree, Naive Bayes, Logistic Regression) & Classification
47:50 - predict method
53:35 - Analysis & Evaluation (using clf.score() method)
56:58 - F1 score
1:01:01 - Improving our model (evenly distributing positive & negative examples and loading in more data)
1:20:36 - Let's see our model in action! (qualitative testing)
1:22:24 - Tfidf Vectorizer
1:25:40 - GridSearchCV to automatically find the best parameters
1:31:30 - Further NLP improvement opportunities
1:32:50 - Saving our model (Pickle) and reloading it later
1:36:37 - Category Classifier
1:39:14 - Confusion Matrix
Thank you for watching! Make sure to like & subscribe if you enjoyed :)
thanks so much
please make videos on Django python full tutorial using visual studio
Thanks man
Is there any way I could import another random dataset into my trained model and see if it can predict the category from the other dataset (the one I used to train my model)?
can you help out with my error in the comments
you're the reason that I've got an internship in a great company :) well.. I'm broke now :D but when I earn tons of money( I hope we all do :D ) I'll donate you Keith !
How are you doing now, man?
Any updates?
Hey man how you doing now
Doing well now finally! :). Will be back on youtube very soon
A quick one for those into machine learning: on a scale of 1-10, how sufficiently does this tutorial cover machine learning? I am developing certain skills in data analytics and wanted to add machine learning into the mix, but I don't want to dive too deep into it. Just the necessary things I will need for day-to-day machine learning job requirements.
keith ,you are like an elder brother teaching us how to do sums.thanksssssssssssssssssssssssssss a lottttttttttttttttttttttttttt bruhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
In the first exercise, if any of you feel like laughing a bit, do this:
if float(review['overall']) < 2:
    print(review['reviewText'] + '\n')
Also, great video! Didn't know I could enjoy Data Science as much as I am.
He not only teaches the good stuff but also teaches how to google things and get the job done.
Keep going brother!. You are Awesome.
My goal is for you guys to be able to do this type of stuff on your own! Thanks for the support man, I appreciate it :)
Yes, I agree with you 100%. He is the only person I know on youtube that actually teaches the material so well! I hope to see this channel grow to millions of subscribers.
yess exactly.... I was confused how to use stackoverflow...but after watching his real world problem tutorial.. I learnt this skill too
Please keep uploading you're one of the best tutorial channels.
Thank you!! Will do my best
This video is so underrated. It should have at least 500K views.
I like it when you showed us how you would use online resources, all the Googling and documentation stuff, so that we are not afraid to actually go online ourselves and explore more new functions :) Thanks Keith!! Stay healthy! :)
practical and nicely done. thanks! please do more videos on sklearn, maybe regression & clustering...
from sklearn.naive_bayes import GaussianNB
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors, train_y)
clf_gnb.predict(test_x_vectors[0])
what is the fault
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
having the same issue
fixed it. clf_nb.fit(train_x_vectors.todense(),train_y)
clf_nb.predict(test_x_vectors.todense()[0])
@@WeAreSWAGent Thanks!
Hello if your Gaussian naive Bayes keeps coming up with an error, try this:
from sklearn.naive_bayes import GaussianNB
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.todense(), train_y)
clf_gnb.predict(test_x_vectors[0].todense())
Resh Thank you for the solution
Thanks for the cool tutorial! Just a quick correction: when you're classifying using Naive Bayes, you used the Decision Tree Classifier, copied from the previous case. It's not super critical, but when I tried to use what I believed to be the corrected version, I found an unexpected error, and to use a working Naive Bayes, had to convert the train vector to a dense matrix using ".todense()".
I'm not sure if this is correct though, if you have any input on this, it would be greatly appreciated! Thanks again :)
How I tried to do it:
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.todense(), train_y)
clf_gnb.predict(test_x_vectors[0].todense())
thanks dude
Not working for me, I used .toarray() instead.
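A possible explanation for why .todense() fails for some people: it returns a numpy.matrix, which newer scikit-learn releases reject, while .toarray() returns a plain ndarray. The exact version where this changed is not something this thread confirms, so treat it as an assumption. A minimal sketch of the .toarray() variant on invented toy data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy data, invented for illustration.
train_x = [
    "great book loved it",
    "terrible waste of money",
    "really great read",
    "terrible and boring",
]
train_y = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(["great read", "boring waste"])

clf_gnb = GaussianNB()
# GaussianNB needs dense input, so convert the sparse matrix to an ndarray.
clf_gnb.fit(train_x_vectors.toarray(), train_y)

# Slice the sparse matrix first, then densify only that one row.
first_row = test_x_vectors[0].toarray()
prediction = clf_gnb.predict(first_row)
print(prediction)
```

The conversion has to happen in both places: in fit for the training vectors and in predict for whatever is being classified.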
I also had a problem with this.
Where should we insert the fix?
this: clf_gnb.predict(test_x_vectors[0].todense()) .. seems too late
Anyone manage to get it working?
Thanks
Thank you!
Thanks! :)
Great stuff Keith. Really good. Keep doing your bit for all of us. Thanks a lot.
Awesome. Are you planning on making more of these machine learning videos? It would be great if you could include more about the preprocessing part, maybe trying to get data from a source where it is not ordered and has a lot of outliers.
50 y.o. software developer here.
This is the first hands-on video I've watched on the subject of ML.
As a first step into the subject, I'm very satisfied with the time I spent with you.
You covered the basics, from data prep to model save and load.
Surely a good starting point for further personal explorations.
Also enjoying your Pandas related content
Keep up the good work, and maybe use Jupyter's tab-completion, sometimes ;)
I was waiting for this! You sir, are a legend
@wise guy I think discrete math would help you grasp this
I have been doing a lot of courses for ML in scikit, I found this last week, and learnt it. And to be honest, I mastered things, which they couldn't cover in the so-called "mega" courses. You're awesome and also really helpful!
This guy is like the human version of W3school, his content is simple, succinct and well thought out
Big thank for you, Keith...!!!
May be just requesting for pytorch tutorial.
i really like the way u explain
It was a really good video. But I have one question. Why didn't we change the sentiment into numbers (i.e. NEGATIVE=0, POSITIVE=1) and then try to fit the model? How can the model understand the string values?
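On the question of string labels: as far as I understand, scikit-learn classifiers accept string targets and keep track of the distinct labels themselves, exposing them as classes_. The manual mapping the comment describes would look roughly like the LabelEncoder part of this sketch (toy data invented for illustration):

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

labels = ["NEGATIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]

# Explicit mapping: labels are sorted, so NEGATIVE -> 0 and POSITIVE -> 1.
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(encoded)
print(encoder.inverse_transform(encoded))

# A classifier can also be given the strings directly.
texts = ["terrible and boring", "great read", "loved it", "waste of money"]
vectors = CountVectorizer().fit_transform(texts)
clf = svm.SVC(kernel="linear")
clf.fit(vectors, labels)
print(clf.classes_)
```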
big fan of what you are doing keep it up (y)
Keith Galli: "I'm going insane!" ahahah
that made me chuckle too lol
Please upload a real world predictive model project
I suppose that the Bag of Words representation doesn't consider the word order in the sentence. Is that correct?
While it generates distinct vectors for the following examples:
"The book is good, but ..."
"The book is bad, but ..."
It fails to distinguish between the following two examples:
"The book is not good, but ..."
"The book is good, but not ..."
Is there an alternative representation that considers the actual word order in the sentence?
If not, I would suggest preprocessing the input data to form pairs (and/or triples) of adjacent words by concatenating them (thus treating them as a single word). This approach would capture common linguistic associations. Do you believe this could enhance the model, or might it introduce other issues?
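The pairing idea described above matches what CountVectorizer does with its ngram_range parameter. A minimal sketch using the two example sentences from this comment, checking that single words cannot tell them apart while word pairs can:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The book is not good, but",
    "The book is good, but not",
]

# Single words only: both sentences contain the same words.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit_transform(sentences).toarray()

# Single words plus pairs of adjacent words.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigrams = bigram_vectorizer.fit_transform(sentences).toarray()

same_with_unigrams = bool((unigrams[0] == unigrams[1]).all())
same_with_bigrams = bool((bigrams[0] == bigrams[1]).all())
print(same_with_unigrams, same_with_bigrams)
```

The trade-off is a much larger dictionary, so on the full review set it may need more data or a cap such as max_features to stay manageable.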
Keith has got it DOWN! Very instructive, thank you.
Great video, but I tested a few other algorithms on the data-set and they seemed to work even better on the data. The algorithms were: Nearest Centroid Classifier and Stochastic Gradient Descent. Thanks for the video though, really helped me.
Phew, finally finished watching this one:) A lot to take in, but super helpful and interesting! Thanks, Keith! :) Gonna start your real-world task with Pandas tomorrow!
Improving accuracy by balancing test data does not mean your model is robust enough. I expected your model would have worked well even for unbalanced test data. Do you have any suggestions on how to do that?
i always am being directed back and stay at Keith's video... just awesome...
Loved this video! I followed along writing my own code and it helped me put what I've learned into practice. Thank you so much for the practical advice, I can't wait to start on my own projects. Liked and subscribed! :)
p.s. Did anyone have issues getting the output from the GridSearch portion? That was the only part that messed up for me.
My output:
GridSearchCV(cv=5, estimator=SVC(),
param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})
I was also having this problem, there was no result from GridSearchCV, hope you have got the solution for it.
I don't like seeing all that positive data go to waste. Is it beneficial/worthwhile to create synthetic data (?) based off of the existing negative reviews so that no positive data goes to waste? Or what about altering the training parameters to put 10x more weight into incorrect positives, a ratio that more closely resembles our data?
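The weighting idea in this comment exists in scikit-learn as the class_weight parameter. A minimal sketch on an invented, imbalanced toy set; "balanced" weights each class inversely to its frequency, which is one option and not necessarily the best ratio for the real data:

```python
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.class_weight import compute_class_weight

# Toy data with four positives and one negative, invented for illustration.
train_x = [
    "great book loved it",
    "really great read",
    "loved every page",
    "great story",
    "terrible and boring",
]
train_y = np.array(["POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE"])

# Show the weights that "balanced" would assign to each class.
classes = np.array(["NEGATIVE", "POSITIVE"])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=train_y)
print(dict(zip(classes, weights)))

# Pass the same setting to the classifier so errors on the rare class cost more.
train_x_vectors = CountVectorizer().fit_transform(train_x)
clf = svm.SVC(kernel="linear", class_weight="balanced")
clf.fit(train_x_vectors, train_y)
```

This keeps every positive review in the training set instead of discarding most of them.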
This was really helpful. I have been watching your videos for the last few days. They are really awesome. Subscribed. Can you please make a video on face recognition using CNNs, or suggest a link to watch?
@keithgalli even I want to learn face recognition using cnn plz make a video for that
Yes Please make a video on Face Recognition.
Thanks keith..I have issues with my Jupyter notebook can't recall the modules have installed in the terminal..Help Asap
Hi keith, what a great video.
I think I know why the result of SVC is still the same: the best parameters are C=1 and kernel='linear', not rbf. You can check the best parameters of the GridSearchCV by running clf.best_params_
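This also explains the "no output" reports in the thread: the GridSearchCV(...) text a notebook shows after fit is just the fitted object being echoed, and the results sit on its attributes. A minimal sketch on invented toy data, with cv=2 only because the toy set is tiny:

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy training data, invented for illustration.
train_x = [
    "great book loved it", "really great read",
    "loved every page", "great story",
    "terrible waste of money", "terrible and boring",
    "boring story", "waste of time",
]
train_y = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4

train_x_vectors = CountVectorizer().fit_transform(train_x)

parameters = {"kernel": ("linear", "rbf"), "C": (1, 4, 8, 16, 32)}
clf = GridSearchCV(svm.SVC(), parameters, cv=2)
clf.fit(train_x_vectors, train_y)  # a notebook echoes the fitted object here

# The search results live on the fitted object.
print(clf.best_params_)  # best parameter combination found
print(clf.best_score_)   # mean cross-validated score for that combination
```

After fitting, clf.predict uses the best combination automatically, since the search refits on the full training data by default.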
Correct me if I am wrong, but I think there is a problem. The vectorized versions of the text are different between test and training because they come from different dictionaries. You can check it: if you get the length of any vectorized version of the train text and compare it with the vectorized version of the test text, they are different. I think you should get the vectorized version before the split.
Make sure you didn't "fit" or "fit_transform" to the test vectors. We want to only fit the vectorizer to the training data, then simply "transform" the test data. In the real world, our training data is the only thing we get to see. We create a dictionary on that and then use that dictionary on any incoming test vector. Some words during test won't be in the dictionary and that's life, we won't be able to handle those properly. Our hope is that our training set was big enough and included enough words that during test time we can use that same vectorizer and get good results. Hopefully this makes sense.
@@KeithGalli ooooooh man! OK ok hahah I find my error. Thank you Keith I was fitting two times and getting two different dictionaries haha. Thank you!!
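The fit/transform distinction from Keith's reply can be sketched as follows. This is a minimal example on made-up sentences, assuming the `CountVectorizer` used in the video. Fitting once on the training text and only transforming the test text keeps both matrices in the same dictionary, so they have the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences for illustration only.
train_x = ["I love the book", "this is a great book", "the fit is great"]
test_x = ["I love the story"]  # "story" never appears in the training text

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # builds the dictionary
test_x_vectors = vectorizer.transform(test_x)        # reuses it, no refit

# Both matrices share the same column count (the training vocabulary size).
print(train_x_vectors.shape, test_x_vectors.shape)

# Words unseen during training are silently ignored at test time.
print("story" in vectorizer.vocabulary_)
```

Calling `fit_transform` a second time on the test text would build a second, different dictionary, which is the mismatch described in the question.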
This one is just one heck of a tutorial. Thanks a ton Keith. I am a Java architect with 17 years of extensive experience, looking to shift to ML/Data Science. It took me 3 hours to cover this video. I must say the first hour was really easy to follow, but you probably covered a lot of things in the last 40 minutes.
great tutorial keith. you are incredible !!
Anyway, do you have any book recommendations for studying? I'm still new to machine learning, so it would be nice to read a few books first before starting to study machine learning in practice. Thanks in advance!!
Could you make a short video explaining what are the differences between deep learning, machine learning and AI from your point of view. Thank you and good luck
learn to google man
This is by far the most useful tutorial that I have ever seen. You are an amazing teacher.
Thank you! This was extremely helpful. (POSITIVE)
Just watched the video in one sitting. It was great! I learned so much, and I loved you showed the entire process from data to evaluation of model. Keep up the good work :)
Thank you! Glad it was helpful :)
Keith man. This is an awesome video. Please make some more videos just like your "Solving real world data science task" video.
This is great! Looking forward to more ML content like regression, decision trees, SVM.
Basically, very cool video, Keith!
But for NLP I recommend using spacy.io
Also waiting for more in-depth videos about ML.
Can you make any about PyTorch or TensorFlow in the future, please?
Wow Keith, you're an absolute legend! I can't wait to get through your other videos and see your future work :D
You are so good, explaining the hardest things in common language and makes it easy to understand to even my grandma.... Thanks so much for making this simple!
phew.. that was heavy, not sure if you were rushing in with a lot of things, but then maybe there's so much to cover there anyways. You did well in spite of all anyways, but yeah there's a lot to pick up from this video.. good work
Is a previous knowledge of machine learning required for this video to be helpful ???
It might help, but it's not required!
@@KeithGalli Ok keith, Tnx 😊
I am not getting any recommendation from GridSearchCV
My code:
from sklearn import svm
from sklearn.model_selection import GridSearchCV
parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
svc = svm.SVC()
clf=GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)
The response:
GridSearchCV(cv=5, estimator=SVC(),
param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})
Nothing else !! Did I miss something, or has this stopped working?
Thanks
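On the question above: the echoed `GridSearchCV(...)` text is just the notebook printing the fitted object, so the search has most likely run. The results live in attributes such as `best_params_` and `best_score_`. A minimal sketch on made-up data follows; the toy reviews and `cv=2` are assumptions chosen to keep the example small, and the comment's `cv=5` works the same way on a larger training set.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Made-up training data for illustration only.
train_x = [
    "great book", "loved this story", "great characters", "loved the ending",
    "awful book", "hated this story", "awful characters", "hated the ending",
]
train_y = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4
train_x_vectors = CountVectorizer().fit_transform(train_x)

parameters = {"kernel": ("linear", "rbf"), "C": (1, 4, 8, 16, 32)}
clf = GridSearchCV(svm.SVC(), parameters, cv=2)
clf.fit(train_x_vectors, train_y)  # a notebook echoes the estimator repr here

# The search results are stored on the fitted object.
print(clf.best_params_)  # winning combination of kernel and C
print(clf.best_score_)   # mean cross-validated accuracy of that combination
```

After fitting, `clf.predict(...)` uses the best estimator found, refit on the full training data.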
Hey Keith. It looks like the relatively low score (~80%) is caused by imperfect training data. I'm talking about the conversion of star ratings into one of three classes: NEGATIVE, POSITIVE, NEUTRAL. The problem is that stars are assigned by customers, not by an Amazon AI engine, and people treat, say, a 3-star rating in very different ways. Even a customer who is not really happy with the product and gives fairly negative feedback can still give 3 stars. So while 5-4 stars works well for POSITIVE and 2-1 stars for NEGATIVE, there is some uncertainty around 3 stars. I think either (5-4-3 stars for POSITIVE and 2-1 stars for NEGATIVE) or (5 stars for PERFECT, 4-3 for POSITIVE and 2-1 for NEGATIVE) should give us a 90-95% score. Thoughts?
Yeah you're very right with your thoughts. The meaning of the 3-star classification is pretty ambiguous and we can't reliably count on the data rated this way. Ultimately though the models that were being scored with ~80% were only classifying between NEGATIVE (1-2 star) & POSITIVE (4-5 star) so our model had more issues than just how we categorized the data. If we want to get that score up higher we will want to apply some additional processing to our text. Some ideas would be removing stop words (words like "the", "this", "that", etc), lemmatizing/stemming (converting words to a base form), and utilizing bigrams (pairs of words instead of single words). Another reason for a relatively low score is that our data is not perfect. Even some of the 5 star reviews probably have no meaningful information that conveys positive sentiment in their review text. Same goes for 1 star reviews. Potentially doing some manual review of our training data would be another way to improve the score. Hope this information is helpful!
BTW, I'm a huge hockey fan and after noticing the hat in your profile picture I have to quickly say.... Go Bruins!!! ;)
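Two of the preprocessing ideas from the reply above, stop-word removal and bigrams, can be tried directly through vectorizer options. This is a minimal sketch assuming the `TfidfVectorizer` from later in the video; the two sentences are made up. Lemmatizing and stemming need a separate NLP library and are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews for illustration only.
reviews = ["the plot was painfully slow", "the plot was really fun"]

# stop_words="english" drops common words using scikit-learn's built-in list;
# ngram_range=(1, 2) keeps single words and adds pairs of adjacent words.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
vectors = vectorizer.fit_transform(reviews)

print(sorted(vectorizer.vocabulary_))
```

One caveat worth checking on real data: the built-in English stop list also contains negation words such as "not", which can matter for sentiment.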
I have the impression that the transformation from text to vector has to be done before the "even distribution" of the training sample, because we would otherwise have a different word-to-vector transformation. Does that make sense?
Any idea why linear regression doesn't work?
from sklearn.linear_model import LinearRegression
clf_log = LinearRegression()
clf_log.fit(train_x_vectors, train_y)
clf_log.predict(test_x_vectors[0])
error -->ValueError: could not convert string to float: 'POSITIVE'
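On the error above: `LinearRegression` is a regressor and expects numeric targets, which is why the string label 'POSITIVE' cannot be converted to a float. The video outline lists Logistic Regression among the classifiers, so `LogisticRegression` is presumably what was intended. A minimal sketch on made-up data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data for illustration only.
train_x = ["great book", "loved it", "awful book", "hated it"]
train_y = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(["loved this great book"])

# LogisticRegression is a classifier, so string labels are accepted as-is.
clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

print(clf_log.predict(test_x_vectors[0]))
```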
43:00
He's referring to Patrick Winston. By sheer chance I was watching one of his lectures on YT early this morning.
Amazing video. One won't find such tutorial on Python and Machine learning modules. It's the very video helped to complete my project.
Glad you liked it!
Hi, sorry, I have a doubt. Why did you use DecisionTreeClassifier() for Naive Bayes? Why are we not using GaussianNB() instead? See the In [37] cell.
Thanks Keith for the tutorial. I tried to do something more with the imbalanced data (well, at least I thought of it this way :) ). I tried SMOTEENN, SMOTE + Tomek, SMOTE, and combined over- and under-sampling; none of them showed better performance than yours, where you distributed the data manually. I tried all of these on data loaded as a pandas DataFrame.
In addition, I tried almost exactly the same things on the DataFrame that you did on the JSON, and the F1 scores were 2 times lower. Now I wonder, is there any difference between JSON and a pandas DataFrame? Could you clarify? Perhaps you also tried loading it as a DataFrame. Keen to know your opinion. Thanks
Great video. Thank you very much. Is there a way of importing the contents of the variable file_name into a dataframe. So instead on print(line) you direct output into a data frame?
You're very welcome, glad you enjoyed. Check out this link: stackoverflow.com/questions/20037430/reading-multiple-json-records-into-a-pandas-dataframe
It should answer your question! :)
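For a file with one JSON record per line, which is how the data is read line by line in the video, pandas can load everything in one call with `lines=True`. This is a minimal sketch; the in-memory text stands in for the file, and the field names `reviewText` and `overall` are assumptions for illustration.

```python
import io

import pandas as pd

# Stand-in for the file contents: one JSON object per line.
# The field names here are assumed for illustration.
json_lines = io.StringIO(
    '{"reviewText": "Great book", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

# lines=True tells pandas to parse each line as a separate record.
df = pd.read_json(json_lines, lines=True)

print(df.shape)
print(df.columns.tolist())
```

With a real file, a path can be passed in place of the `StringIO` object, for example `pd.read_json(file_name, lines=True)`.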
I am at the 44:00 mark.
and I keep on getting that ValueError Traceback (most recent call last)
in
3 clf_svm = svm.SVC(kernel='linear')
4
----> 5 clf_svm.fit(train_x_vectors,train_y)
~/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py in fit(self, X, y, sample_weight)
162 accept_large_sparse=False)
163
--> 164 y = self._validate_targets(y)
165
166 sample_weight = np.asarray([]
~/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py in _validate_targets(self, y)
547 classes=cls, y=y_)
548 if len(cls) < 2:
--> 549 raise ValueError(
550 "The number of classes has to be greater than one; got %d"
551 " class" % len(cls))
ValueError: The number of classes has to be greater than one; got 1 class
Do you mind helping me with this please?
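The message "got 1 class" means every label passed to `fit` is the same value. A quick way to confirm that before fitting is to count the labels. The sketch below uses a made-up `train_y`; in the notebook the real label list would be checked instead.

```python
from collections import Counter

# Hypothetical label list that reproduces the problem.
train_y = ["POSITIVE", "POSITIVE", "POSITIVE"]

label_counts = Counter(train_y)
print(label_counts)

# A classifier needs at least two distinct labels to learn from.
if len(label_counts) < 2:
    print("Only one class present; check the labelling logic and the split.")
```

If only one label shows up, likely places to look are the function that maps star scores to sentiments and the step that builds the training list.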
thank you mate! that's amazing!
Hey! How u doing? I dont know if u are going to see this, but when i run the f1 score with the 10.000 file jupyter says: MemoryError: Unable to allocate 1.33 GiB for an array with shape (6700, 26615) and data type int64
I googled it but i couldnt find an answer... can u help me please? Greetings from Argentina!
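The size in that error matches a dense int64 array of shape (6700, 26615), so something in the notebook is probably converting the sparse vectors to a dense array, for example a `.toarray()` call. This is a guess from the message alone. The sketch below compares the two representations on a made-up corpus and repeats the arithmetic from the error.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only.
texts = ["great book", "awful book", "great story", "awful ending"]
vectors = CountVectorizer().fit_transform(texts)  # SciPy sparse matrix

rows, cols = vectors.shape
# Memory a dense copy would need: one int64 per cell, zeros included.
dense_bytes = rows * cols * vectors.dtype.itemsize
# Memory the sparse matrix uses: only the non-zero entries plus their indices.
sparse_bytes = vectors.data.nbytes + vectors.indices.nbytes + vectors.indptr.nbytes
print(dense_bytes, sparse_bytes)

# Same arithmetic for the shape in the error message, in GiB.
error_gib = 6700 * 26615 * 8 / 2**30
print(round(error_gib, 2))
```

Keeping the vectors sparse (most scikit-learn classifiers accept them) or capping the vocabulary with `CountVectorizer(max_features=...)` are two ways to reduce memory use.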
You kept appearing on my thumbnail.. I didn't care at first.. Later for once i opened the data science video.. Man.. It was so useful. The application videos of machine learning, data science were awesome. Thanks Keith ❤️.
Well I'm happy that you ended up clicking on a video :). Also glad that you have found the videos useful. I appreciate the support!
Wow, that is one comprehensive tutorial. Thanks for the time and effort.
Hi there, any advice on using sklearn to predict using multiple csv's and manually input data (i.e. I type in some data)
Sooo POSITIVE. You really saved me. Thanks a lot!
great tutorial! thanks 😊
Another great video. Really appreciate minimal slides paired with the 'live' coding feel.
I am like machines. I am always learning... Not watched but I believe you made your best.
Edit: I just finished this tutorial and I still support my first comment. NOICE. You are real deal!
@Keith Galli this is really dope. Totally love how how you teach the tutorial. Amazing stuff here.
Really helpful, made me realise new possibilities for how to go about text data - thanks 🙂
Very good and cool Tutorial Keith! Thanks a Ton! Loved it!
Started getting invalid syntax at 18. the .text kept coming back as not being listed as a tuple
The only reason I can't learn anything from your videos is because you are too cute and hot to not watch your face rather than screen!
@Keith Galli, really awesome tutorial to watch & try in parallel. however, can you please further clarify how to fit Gaussian Naive Bayes as per this video ?
from sklearn.naive_bayes import GaussianNB
clf_gnb = GaussianNB()
train_x_vectors_array = train_x_vectors.toarray()
test_x_vectors_array=test_x_vectors.toarray()
clf_gnb.fit(train_x_vectors_array, train_y)
clf_gnb.predict(test_x_vectors_array)
This works, but not sure the fit is correct
Keith, this is incredibly helpful. Your teaching style is to be commended. I look forward to more like this for ML.
Your videos are superb. I can see your videos and just get started applying it to my project. Thank you👍.
That's awesome! Glad you have enjoyed :)
Hi Keith, could you please help me load the JSON file in Python? I am unable to do it.
Dude you are an excellent educator, thank you so much for this well structured, well explained video!!
I can't import the books small data. Can anyone help me?
Excellent tutorial to learn the fundamentals of SCI-Kit
Why won't you use stratify to distribute evenly data
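On the stratify suggestion: `train_test_split(..., stratify=y)` keeps the class proportions the same in the train and test sets, but it does not balance the classes. An 80/20 dataset stays 80/20 on both sides, so it is a complement to the manual even distribution rather than a replacement. A minimal sketch on made-up labels:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 16 positive and 4 negative examples.
x = [f"review {i}" for i in range(20)]
y = ["POSITIVE"] * 16 + ["NEGATIVE"] * 4

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y
)

# Both splits keep the original 4:1 ratio; neither is balanced.
print(Counter(train_y))
print(Counter(test_y))
```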
hey bro am getting a TypeError: '
TypeError Traceback (most recent call last)
in
----> 1 clf_svm.fit(train_x_vector,train_y)
C:\Anaconda\lib\site-packages\sklearn\svm\base.py in fit(self, X, y, sample_weight)
145 order='C', accept_sparse='csr',
146 accept_large_sparse=False)
--> 147 y = self._validate_targets(y)
148
149 sample_weight = np.asarray([]
C:\Anaconda\lib\site-packages\sklearn\svm\base.py in _validate_targets(self, y)
513 def _validate_targets(self, y):
514 y_ = column_or_1d(y, warn=True)
--> 515 check_classification_targets(y)
516 cls, y = np.unique(y_, return_inverse=True)
517 self.class_weight_ = compute_class_weight(self.class_weight, cls, y_)
C:\Anaconda\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
164 y : array-like
165 """
--> 166 y_type = type_of_target(y)
167 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
168 'multilabel-indicator', 'multilabel-sequences']:
C:\Anaconda\lib\site-packages\sklearn\utils\multiclass.py in type_of_target(y)
285 return 'continuous' + suffix
286
--> 287 if (len(np.unique(y)) > 2) or (y.ndim >= 2 and len(y[0]) > 1):
288 return 'multiclass' + suffix # [1, 2, 3] or [[1., 2., 3]] or [[1, 2]]
289 else:
in unique(*args, **kwargs)
C:\Anaconda\lib\site-packages\numpy\lib\arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
260 ar = np.asanyarray(ar)
261 if axis is None:
--> 262 ret = _unique1d(ar, return_index, return_inverse, return_counts)
263 return _unpack_tuple(ret)
264
C:\Anaconda\lib\site-packages\numpy\lib\arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
308 aux = ar[perm]
309 else:
--> 310 ar.sort()
311 aux = ar
312 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '
Me too dude, have you solved?
Ok i've solved, in my case the problem was with the Enum class, i was putting a comma after the sentiment string, and that created a list of labels with commas
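The trailing-comma problem described in the fix above is easy to reproduce. In Python, a comma after a value turns it into a one-element tuple, so some labels end up as tuples and others as strings, and sorting them fails. The `Sentiment` class below is a hypothetical reconstruction for illustration, not the exact code from the video.

```python
class Sentiment:
    NEGATIVE = "NEGATIVE",  # trailing comma: this is the tuple ("NEGATIVE",)
    POSITIVE = "POSITIVE"   # no comma: a plain string


print(type(Sentiment.NEGATIVE))
print(type(Sentiment.POSITIVE))

# Mixed label types cannot be ordered, which breaks sorting-based helpers
# such as numpy.unique inside scikit-learn.
labels = [Sentiment.NEGATIVE, Sentiment.POSITIVE]
try:
    sorted(labels)
except TypeError as err:
    print(err)
```

Removing the comma makes every label a string again.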
Can you please make videos on pytorch or tensorflow.
Why does evenly distributing the testing data make the F1 score better?
Relevant and super helpful in 2024 too!
At the step of classifiers,while using Naive Bayes,the line where you tell
from sklearn.naive_bayes import GaussianNB
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors,train_y)
clf_gnb.predict(test_x_vectors[0])
The third line .fit(---) is not working,there is a sparsing error it is throwing,
Can you please check it @Keith Galli
Thank you
naive bayes gives error for .fit method how to solve it ?
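On the Naive Bayes `.fit` errors above: `GaussianNB` does not accept sparse matrices, and `CountVectorizer` returns one, so the vectors have to be converted with `.toarray()` first. `MultinomialNB` is an alternative that takes the sparse vectors directly and is a common choice for word counts; using it here is a suggestion, not what the video does. A minimal sketch on made-up data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up training data for illustration only.
train_x = ["great book", "loved it", "awful book", "hated it"]
train_y = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(["loved this great book"])

# Option 1: GaussianNB needs dense input, so convert both train and test.
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)
print(clf_gnb.predict(test_x_vectors.toarray()))

# Option 2: MultinomialNB works on the sparse count vectors as they are.
clf_mnb = MultinomialNB()
clf_mnb.fit(train_x_vectors, train_y)
print(clf_mnb.predict(test_x_vectors))
```

The dense conversion can use a lot of memory on large vocabularies, which is one reason to prefer the sparse-friendly option.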
I think this way it's much clearer in the Prep Data step:
from sklearn.model_selection import train_test_split
x = [t.text for t in reviews]
y = [z.sentiment for z in reviews]
# train_test_split returns the splits in this order: x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
Since the model is trained on equal numbers of POSITIVE and NEGATIVE labels, it shouldn't be biased toward either of them (but we see a meaningful difference in F1 score). Also, the amount of test data shouldn't affect model performance, as our model is already built and will not change with the number of test cases. Am I right?
I get blanks ( DecisionTreeClassifier()
) w/out the classifier information in my code; any ideas?
Also thanks these are great videos!
As a fellow Sigma Chia; In Hoc!
I think you haven't defined test_x while working on classification. I got an error message, then I saw it in the ReviewContainer section. Did I miss something?
How to use it in games like snake, Mario,etc . Pls tell
hey bro
make a video on sql for data science ...PLZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
Wouldn't it be easier if we just imported a resampling library like imblearn and oversampled negative reviews / undersampled positive reviews?
That said, I learn some oop and love this tutorial so much. I somehow got an offer as a data scientist that works on NLP, so this actually gives me some confidence lol
Very good video! New subscriber and added to my “ Perfect videos” list. Thanks for sharing your knowledge.
Best channel ever to learn any Python library!
1:05 i wonder what the outcome will be for sarcasm, something like: 'beautiful restaurant that made me puke, raccomand'
Your videos are changing my life.
hey keith, can you build an app for PC that can text any cellphone, can receive replies, maybe store a contacts list, I know that plenty of websites can do this, but an app for PC that doesn't require the cell phone that you're texting to download said app?
Can anyone tell me if there is a way to predict more than one value? An example of this would be to have a studentId and be able to predict not only if he will pass, but what grades he would probably get as well. There would obviously be more data from this specific student to learn from
Absolutely lovely tutorials! I follow all your data science projects. keep doing this :)
I have encountered an issue while solving this, have posted my error and code on StackOverflow
Link : stackoverflow.com/questions/62347528/trouble-fitting-my-model-on-sklearn-from-svm
anyone who can figure out, please comment on on the solution
Almost exact 2 years later, a crappy student is watching ur video at 3am, trying to finish his assignment