For more captivating community talks featuring renowned speakers, check out this playlist: ruclips.net/p/PL8eNk_zTBST-EBv2LDSW9Wx_V4Gy5OPFT
What is your actual test? What do you want to explain? Model fit: where is it? Coding is 'impressive', but you must get some real results too.
Sure Hannu! We will work on explaining some real results in the future. Thank you for your suggestion.
I do appreciate Dave's approach. I think it's important to stress that there is a lot more to being a data scientist than simply understanding concepts of ML, AI, etc., or taking a few online courses for a certificate. I believe it takes graduate coursework and years of being a practitioner, understanding and implementing a long list of techniques. Engineers typically vector into data analytics completely differently than I did with my MS in data analytics. It is a good illustration of just how complex and broad the science of data is in these infant stages.
You are a nice American chap
This was really great Dave. I've done a bunch of your tutorials online including the intro to data science videos you did using the Titanic Kaggle competition about 4 years ago. What I enjoyed the most about this video was seeing how much more confident and impassioned you have become as a data scientist since those prior videos. You can tell that it really excites you and that is infectious in a teaching environment. I too have become somewhat hooked on data science and I was one of those students that avoided statistics at all costs at every level of education. I'm looking into coming to one of the data science bootcamps at the data science dojo and really looking forward to learning from people that are equally passionate about data science and hopefully making up some lost ground. Keep up the great work.
Great!
This is the single best ML video on the internet. Dave for President 2020.
Extremely helpful video. I'm not familiar with the concept of grid search; what does it do? Can you explain to me in simple terms how it works and how it helps in tuning the model? Thank you.
Excellent class!
Hi Dave, very instructive video, congratulations. Please let me ask you a question:
I know caret does not impute factors. But what do you do in practice when you need to impute categorical/factor variables (setting aside the mode)?
In the example in your video, the "imputed.data" dataset has two dummy columns for Sex. If you hypothetically imputed missing values for them, how would you take them back to the original dataset, in which there is only one column for Sex?
Thanks for the video! Quick question - why do you have to split the data into a training/test set of 70/30 when you are going to do 10-fold cross-validation (90/10 split?) anyway later on? Are these two different things?
I am running into a similar problem. I built a very simple model on the complete cases of this data. I didn't do a train-test split, as CV was supposed to give me an out-of-sample metric. I got an ROC score of 0.8, but when I uploaded the model, Kaggle gave an accuracy of 0.52. Now I am confused about what purpose CV served.
Great video, easy to understand!
But I have a doubt: how are the resampling results across tuning parameters selected?
@Nikhitha Rajashekar - As I mention in the video, while caret can perform stratified cross validation, the video does not demonstrate this. As coded, the video illustrates using cross validation with simple random sampling to create each of the 10 folds for each of the repeats (i.e., 30 total folds each created with random sampling).
HTH,
Dave
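For reference, the repeated cross-validation Dave describes can be configured in caret along these lines (a sketch, not the exact code from the video):

```r
library(caret)

# 10-fold cross-validation repeated 3 times: 30 folds in total, each
# created with simple random sampling (not stratified).
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3)
```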
Great video, caret is amazing. One question though: if we are doing stratified sampling, do we still have to balance the data? Because if we don't balance the data the outcome will be biased, but if we do balance it, that feels like manipulation.
The result of stratified sampling with respect to the outcome (survived/non-survived) is a balanced training and test set with respect to the outcome (survived/non-survived). Best, Gregory
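In caret, a stratified 70/30 split like this can be sketched as follows (assuming a data frame `titanic` with a factor column `Survived`; the names are illustrative):

```r
library(caret)

# createDataPartition() samples within each level of the outcome, so
# the survived/perished proportions are preserved in both subsets.
set.seed(54321)
indexes <- createDataPartition(titanic$Survived,
                               times = 1, p = 0.7, list = FALSE)
titanic.train <- titanic[indexes, ]
titanic.test  <- titanic[-indexes, ]
```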
Thanks a lot! Doing my first steps into R and Machine Learning. This talk is exactly what I needed
Hi Dave, first of all thanks for the video. AWESOME stuff! One question: are the hyperparameters for the xgboost algorithm universal, or are they tuned specifically to this training set? Also, could I get the reference for the hyperparameters? It was cut off on the code editor screen. Thanks again.
@Overlooking the Obvious - This is an important question. While the list of hyperparameters for any algorithm will always be the same, the values of each individual hyperparameter are tuned in the context of a particular data set. For example, you may find some values that are optimal for your training data set. You then perform feature engineering and add a new feature. There is no assurance that the previous hyperparameter values are still optimal, hence it is common practice to tune later in the project cycle when you arrive at a stable list of features. Here's a link to a great reference on xgboost hyperparameter tuning:
www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1
HTH,
Dave
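For readers following along, a caret tuning grid for xgboost looks roughly like this (the candidate values below are illustrative, not necessarily the ones used in the video):

```r
# expand.grid() builds every combination of the candidate values;
# caret fits one model per row of this grid for each CV fold and
# keeps the combination with the best resampled performance.
tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
                         nrounds = c(50, 75, 100),
                         max_depth = 6:8,
                         min_child_weight = c(2.0, 2.25, 2.5),
                         colsample_bytree = c(0.3, 0.4, 0.5),
                         gamma = 0,
                         subsample = 1)
```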
Oh. My. God. THIS... THIS!!!!! This literally changes everything.
This was excellent; I've learnt quite a lot and have a few new books for the reading list. Many thanks!
Keep following us for more tutorials.
@@Datasciencedojo will do!
Great video! Only one question. When you say that set.seed(54321) is not random, what do you mean? I thought whatever we put in set.seed could be anything, e.g., set.seed(321). What is the meaning behind your 54321? You sort of glossed over that part and I'd love to dive a little deeper into it.
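For anyone else wondering the same thing: set.seed() fixes the state of R's pseudo-random number generator, so the same seed always reproduces the same "random" draws; the specific number is arbitrary. A quick illustration:

```r
# 54321 has no special meaning; any integer gives reproducibility.
set.seed(54321)
a <- runif(3)
set.seed(54321)
b <- runif(3)
identical(a, b)  # TRUE: same seed, same sequence of draws
```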
I noticed that the other columns with a large number of NAs were removed, and when imputing the Age variable none of the other factors had NAs. What should I do if the variables that are critical for imputing Age also have NAs? I'm a noob, so please correct me if there is a flaw in my reasoning.
Great video! In regards to preProcess(..., method = "bagImpute") what's your definition of SMALL DATA? Would 5000 rows with 10 columns be small?
@Paul Victor - Glad you liked the video. Your question is apt as "big data" vs. "small data" is a subjective measure that depends on the situation. In this particular case, caret will create N bagged decision tree models where N is the number of predictors in your data frame. A 5000x10 matrix would be fine, but you certainly wouldn't want to use this functionality on other problems like text analytics where you could have tens of thousands of rows and tens of thousands of columns.
HTH,
Dave
Isn't there some sort of data leakage? You're imputing the missing ages using the entire data set. So the training set "knows" something about the test set. That's not good. I think you should split first and then use two pipelines for training and testing. Is there support for pipelines in caret?
Thank you for this. However, I tried implementing the code as written in IntroToMachineLearning.R and I get an error at line 159. I have tried it several times, and the error message I get is:
Loading required package: plyr
Error in train.default(x, y, weights = w, ...) :
The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight
This is all confusing, given that all of the specified columns are included in the code. Could this be the result of a bug? I'd appreciate a prompt answer. Thanks
@Dave, I understood how you imputed Age. However, if we had, say, 200 missing values for the Embarked variable, would the same imputation method work? I mean, isn't it possible that in some cases both the Q and S dummies end up with values close to 1 for the same row? What do we do in that case?
Wow, it's amazing for tuning parameters on xgboost, because we know xgboost always takes a long time to train.
Great video; I have watched it several times at this point to get a better understanding of the caret package, and it helped me out a lot. However, I have one question. Why do you split into train and test sets and then use cross-validation on the training set? I always thought that cross-validation was a repeated train-test split. That way you avoid evaluating your model on only one split, which by chance might be easy (or super hard) to predict, e.g., because the test subset contains more extreme values or the training subset contains more of the imputed instances. By repeating the process of splitting the data into train and test several times and averaging the performance metrics over all these splits, you get a better view of the real performance of the model. So why split into train and test subsets and then use cross-validation on the training set? As I understand it now, it looks like you are reintroducing the problem cross-validation is trying to solve. Would it not be better to skip the train-test split and just use k-fold cross-validation (which is basically a repeated split into train and test)? Thanks!
Great 👍🏻
If you get "subscript out of bounds" in the train() function, change the parallelization engine over to the future engine as it is better at exporting environments:
library(parallel)
library(future)
library(doFuture)
plan("multisession") #if you're seeing this error, you're likely on a Windows machine anyway
registerDoFuture()
And also comment out the makeCluster, registerDoSNOW, and stopCluster lines.
Are you hiring? I am the same as you: I spent over 20 years doing system development, programmer/analyst, data analyst, and data scientist work.
Great video. Do you feel it is necessary to use dummyVars before doing the imputation? Isn't it sufficient to do the imputation within the call to the train() function as part of the preProcess argument? That is, is the conversion to one-hot encoding outside of the call to train() strictly necessary?
Thanks for the tutorial. Talking about model based imputation, let us say we have 3 numeric variables to impute.
How will the imputation work when we impute the first variable? Will caret take a complete-case approach to the rest of the data? If so, how will it impute the first variable if it happens that, for some record, the second or third variable is also missing?
What is the procedure here?
Thanks.
I am reproducing your example but I'm stuck at training the model; it returns an error: "Error: The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample". What could I be doing wrong?
Here is the part which returns the error:
caret.cv
Dear David, great talk, thank you very much. I have a short question:
how do I know which factors are included in the "best" model? Thus, which factors are most predictive in separating survivors from non-survivors?
Thank you in advance!
Best,
Bob
Thanks Dave!
@Robert Daihatsu - If you mean individual factor levels, then that can be difficult to get from the models. Finding feature importance, however, is far easier. For example, the following code can be added to the end of the Meetup code file to get the feature importance:
xgb.importance(feature_names = colnames(titanic.train),
model = caret.cv$finalModel)
HTH,
Dave
Thank you for sharing ! Amazing Video and Instructions.
@Juliano Nascimento - Glad you liked the video!
Hi David, thanks for this session!!
One question: is it always good to go with imputing using caret (e.g., bagged decision trees for imputing Age), or should we do some EDA, such as finding a pattern in Age using Pclass/Sex aggregation, and then impute Age with that value?
@Raj kamal Srivastav - I tend to shy away from terms like "always" and "never" when it comes to data science. The only universal answer I've found is "it depends". :-)
To answer your specific question: I always strongly suggest doing exploratory data analysis - in fact, we spend a good chunk of day 1 in our bootcamp discussing EDA. However, it is often the case that even after EDA you might need an ML model to most accurately impute ages due to the underlying complexities in the patterns in the data.
HTH,
Dave
Simply excellent. I could not hold myself back from commenting even though a few minutes are still left. You are a genius at making things so interesting.
Meetup Starts at: 2:57
Thanks for the amazing video, Dave; it's going to help in the future too. But I don't know the grid search concept; what does it do? Can you explain to me in simple terms how it works and how it helps in tuning the model?
Excellent presentation, you are a great teacher. Thank you
Keep following us for more crash courses!
Very nice. I just used this package for an assignment; it got me enthusiastic to learn more.
Glad to help you, Shaun.
Hi, I am a js expert wanting to get into DS. What tools do you advise me to learn?
@Quantum Information: learn what you need for the job you want. Different jobs require different tools for different tasks. Figure out what you want to do, then figure out what tools will get you there.
Thanks very much. I've got somewhere to start and will see it through to the end. Great!!
Thank you so much for sharing this!
Great video @Dave. Super helpful; I love the step-by-step Q&A.
Just curious: is it good practice to include the test set when imputing data? Shouldn't it be done on the training set only?
Brilliant and great advert for your bootcamps!
Thank you very much, Dave & team. I really enjoyed the whole presentation and learned a lot!
Glad you liked it, Tama. Keep following us for more content.
Thanks! This was very helpful. Where can I find the rest of the videos on machine learning?
You can watch more of our Machine Learning tutorials here: tutorials.datasciencedojo.com/azure-machine-learning-tutorial-part-1/
You can also find our other meetups here as well: tutorials.datasciencedojo.com/categories/community-talks/
Great guide, I was really struggling with a ML assignment and didn't realise what an absolute unit 'caret' is!
By far the best video out there for ML in R
Amazing video!
Thank you so much!
So, how do we apply this model to a new dataset?
Using the predict() function on the trained caret model object. Best, Gregory
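As a sketch (assuming `caret.cv` is the trained object from the video, and `new.data` is a data frame that has been run through the same preprocessing, i.e., dummy columns and imputation, as the training data):

```r
# predict() dispatches on the caret train object and applies the
# final tuned model to the new observations.
preds <- predict(caret.cv, newdata = new.data)
```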
Great video
@Statsvenu Manneni - Glad you liked the video!
thank you.
Great video, but I didn't see the use of train.dummy. You worked on the train dataset, which has the imputed Age but not the dummy columns. Please clarify.
I get that the dummy variables were calculated in order to do the imputation. In this case you did the dummy-variable step to teach, because there were no missing values in any column other than Age, so in reality we could skip the dummy part; correct me if I am wrong.
Also, I actually thought that the dummy variables were created for the training part too. A question arises from that:
How does the machine learning handle the categorical variables? Is it converting them into numerical values automatically, like one-hot encoding, or processing them directly, as in a simple decision tree?
Thanks :)
@Junaid Effendi - If I understand your question correctly the following lines of code use train.dummy to generate a new matrix with all missing Age values imputed:
pre.process
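For context, caret's model-based imputation with bagged trees follows this pattern (a sketch under the assumption that `train.dummy` is the dummy-coded numeric matrix discussed in this thread, not a verbatim quote of the video's code):

```r
library(caret)

# Fit one bagged-tree imputation model per predictor on the
# dummy-coded data, then apply them to fill in missing Age values.
pre.process <- preProcess(train.dummy, method = "bagImpute")
imputed.data <- predict(pre.process, newdata = train.dummy)
```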
@Junaid Effendi - If you use caret's imputation feature via the preProcess() function, then you need to convert to dummy variables. As I mention in the video, the preProcess() function does not work with factor variables. You would not want to skip this step, as you would be losing potential features that the bagged decision trees could use to build more accurate imputation models.
To answer your second question, "it depends". For example, in the case of the mighty Random Forest factors can be used directly so caret will do nothing. However, xgboost does not support factors by default. In this case caret is transforming the factors behind the scenes for you.
HTH,
Dave
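The dummy-variable conversion Dave mentions can be sketched with caret's dummyVars() (assuming a data frame `titanic` of predictor columns; the names are illustrative):

```r
library(caret)

# dummyVars() expands each factor into numeric indicator columns so
# that preProcess() can impute over a fully numeric matrix.
dummy.vars <- dummyVars(~ ., data = titanic)
train.dummy <- predict(dummy.vars, newdata = titanic)
```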
Great video, but waiting ~ 5 mins to be recognized as having a question is troubling (between ~55:00 - 59:00). #WomeninDataScience
Troubling? Half the time she does not even put her hand up, and furthermore she keeps taking it down. It seems a bit of a stretch to make this a gender issue. I know there is probably a bias against women in this field, but not every situation should be used as a call to arms.
@@Ivansnooze Isn't that what men ALWAYS do, minimize the gender concerns of women? I'm a part-time chemistry professor, I don't need my students to keep their hands up for the entire lecture to recognize their questions. And when I see a student put their hand down after having raised it earlier, I'll double back to ask them if they still have a question. That's what good lecturers do.
@@yishengkim9081 He did speak to her; her question wasn't missed. Are you sure the cause isn't her being at the back of the room?