Machine Learning with R | Machine Learning with caret

  • Published: 2 Oct 2024

Comments • 81

  • @Datasciencedojo
    @Datasciencedojo  1 year ago

    For more captivating community talks featuring renowned speakers, check out this playlist: ruclips.net/p/PL8eNk_zTBST-EBv2LDSW9Wx_V4Gy5OPFT

  • @hannukoistinen5329
    @hannukoistinen5329 1 year ago +1

    What is your actual test? What do you want to explain? Model fit: where is it? Coding is 'impressive', but you must get some real results too.

    • @Datasciencedojo
      @Datasciencedojo  1 year ago

      Sure Hannu! We will work on explaining some real results in the future. Thank you for your suggestion.

  • @KarriemPerry
    @KarriemPerry 5 years ago

    I do appreciate Dave's approach. I think it's important to stress that there is a lot more to being a data scientist than simply understanding concepts of ML, AI, etc., or taking a few online courses for a certificate. I believe it takes graduate coursework and years as a practitioner understanding and implementing a list of techniques. Engineers typically vector into data analytics completely differently than I do, having an MS in data analytics. It is a good illustration of just how complex and broad the science of data is in these infant stages.

  • @djangoworldwide7925
    @djangoworldwide7925 2 years ago +1

    You are a nice American chap

  • @ghexer
    @ghexer 5 years ago +5

    This was really great Dave. I've done a bunch of your tutorials online including the intro to data science videos you did using the Titanic Kaggle competition about 4 years ago. What I enjoyed the most about this video was seeing how much more confident and impassioned you have become as a data scientist since those prior videos. You can tell that it really excites you and that is infectious in a teaching environment. I too have become somewhat hooked on data science and I was one of those students that avoided statistics at all costs at every level of education. I'm looking into coming to one of the data science bootcamps at the data science dojo and really looking forward to learning from people that are equally passionate about data science and hopefully making up some lost ground. Keep up the great work.

  • @QuickFlicksx
    @QuickFlicksx 1 year ago +1

    Great!

  • @24brophy
    @24brophy 4 years ago +1

    This is the single best ML video on the internet. Dave for President 2020.

  • @flamboyantperson5936
    @flamboyantperson5936 6 years ago

    Extremely helpful video. I don't know the concept of grid search: what does it do? Can you explain to me in simple terms how it works and how it helps in tuning the model? Thank you.

  • @LuthieriadeBanheiro
    @LuthieriadeBanheiro 3 years ago +1

    Excellent class!

  • @sebastianvarela2190
    @sebastianvarela2190 5 years ago +1

    Hi Dave, very instructive video, congratulations. Please let me ask you a question:
    I know caret does not impute factors. But what do you do in practice when you need to impute categorical/factor variables (discarding the mode)?
    In your video's dataset "imputed.data" you have two dummy columns for Sex. If you hypothetically imputed missing values for them, how do you take them back to the original dataset, in which there is only one column for Sex?

  • @erinklark
    @erinklark 6 years ago +1

    Thanks for the video! Quick question - why do you have to split the data into a training/test set of 70/30 when you are going to do 10-fold cross-validation (90/10 split?) anyway later on? Are these two different things?

    • @ravi281381
      @ravi281381 5 years ago

      I am running into a similar problem. I built a very simple model with complete cases with this data. I didn't do a test-train split, as CV was supposed to give me an out-of-sample metric. I got an ROC score of 0.8. I uploaded the model and Kaggle gave an accuracy of 0.52. Now I am confused about what purpose CV served.

  • @nikhitharajashekar1637
    @nikhitharajashekar1637 7 years ago +2

    Great video, easy to understand!
    But I have a doubt: how are the resampling results across tuning parameters selected?

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Nikhitha Rajashekar - As I mention in the video, while caret can perform stratified cross validation, the video does not demonstrate this. As coded, the video illustrates using cross validation with simple random sampling to create each of the 10 folds for each of the repeats (i.e., 30 total folds each created with random sampling).
      HTH,
      Dave
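      The resampling setup Dave describes (10-fold cross validation repeated 3 times, folds drawn with simple random sampling) can be sketched in caret as follows; the object name is an assumption, not taken from the talk:

```r
library(caret)

# 10-fold cross validation, repeated 3 times: 30 resamples in total,
# each fold created by simple random sampling (not stratified).
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              search = "grid")
```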

  • @aakashchugh9
    @aakashchugh9 6 years ago

    Great video.. caret is amazing.. one question though... If we are doing stratified sampling, do we still have to balance the data? Because if we don't balance the data the outcome will be biased, and if we balance the data it will be manipulation.

    • @gregorkvas6332
      @gregorkvas6332 3 years ago

      The result of stratified sampling with respect to the outcome (survived/non-survived) is a balanced training and test set with respect to the outcome (survived/non-survived). Best, Gregory
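      The stratified split Gregor describes is what caret's createDataPartition() performs; a minimal sketch, where the titanic data frame and the Survived column are assumptions:

```r
library(caret)

# createDataPartition() samples within each level of the outcome, so
# the survived/non-survived proportions are preserved in both splits.
indexes <- createDataPartition(titanic$Survived,
                               times = 1,    # one split
                               p = 0.7,      # 70% to training
                               list = FALSE)
titanic.train <- titanic[indexes, ]
titanic.test  <- titanic[-indexes, ]
```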

  • @yanivtubul
    @yanivtubul 5 years ago +2

    Thanks a lot! Doing my first steps into R and Machine Learning. This talk is exactly what I needed

  • @neuro1152
    @neuro1152 7 years ago +2

    Hi Dave first of all thanks for the video. AWESOME stuff! One question, are the hyperparameters for the xgboost algorithm universal or are they tuned specifically to this training set? Could I get the reference for the hyperparameters it was cut off in the code editor screen. Thanks again.

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Overlooking the Obvious - This is an important question. While the list of hyperparameters for any algorithm will always be the same, the values of each individual hyperparameter are tuned in the context of a particular data set. For example, you may find some values that are optimal for your training data set. You then perform feature engineering and add a new feature. There is no assurance that the previous hyperparameter values are still optimal, hence it is common practice to tune later in the project cycle when you arrive at a stable list of features. Here's a link to a great reference on xgboost hyperparameter tuning:
      www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1
      HTH,
      Dave
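      To illustrate what "tuning in the context of a data set" looks like, a caret tuning grid for xgboost might resemble the sketch below; the specific candidate values are assumptions, not the ones used in the talk:

```r
# Candidate hyperparameter values for caret's "xgbTree" method; train()
# fits a model for every row of this grid on every resample. 3^5 = 243
# combinations here, since gamma and subsample are held fixed.
tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
                         nrounds = c(50, 75, 100),
                         max_depth = 6:8,
                         min_child_weight = c(2.0, 2.25, 2.5),
                         colsample_bytree = c(0.3, 0.4, 0.5),
                         gamma = 0,
                         subsample = 1)
```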

  • @seanpitcher8957
    @seanpitcher8957 1 year ago

    Oh. My. God. THIS... THIS!!!!! This literally changes everything.

  • @pipertripp
    @pipertripp 2 years ago +1

    This was excellent; I've learnt quite a lot and have a few new books for the reading list. Many thanks!

  • @StockSpotlightPodcast
    @StockSpotlightPodcast 4 years ago

    Great video! Only one question. When you say that set.seed(54321) is not random, what do you mean? I thought whatever we put in set.seed could be anything, e.g., set.seed (321). What is the meaning behind your 54321? You sorta glanced over that part and I'd love to dive a little deeper into that.

  • @alisterdcruz1667
    @alisterdcruz1667 3 years ago

    I noticed that the other columns with a large number of NAs were removed, and while imputing the Age variable all the other factors had no NAs. What should I do if the variables that are critical for imputing Age also have NAs? I'm a noob, so please correct me if there is a lack of logic in my doubt.

  • @paulvictor3316
    @paulvictor3316 7 years ago +1

    Great video! In regards to preProcess(..., method = "bagImpute") what's your definition of SMALL DATA? Would 5000 rows with 10 columns be small?

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Paul Victor - Glad you liked the video. Your question is apt as "big data" vs. "small data" is a subjective measure that depends on the situation. In this particular case, caret will create N bagged decision tree models where N is the number of predictors in your data frame. A 5000x10 matrix would be fine, but you certainly wouldn't want to use this functionality on other problems like text analytics where you could have tens of thousands of rows and tens of thousands of columns.
      HTH,
      Dave
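      The bagged-tree imputation under discussion can be sketched as below, assuming dummy.data is an all-numeric matrix (e.g. the output of dummyVars(); preProcess() does not handle factors):

```r
library(caret)

# Fit one bagged decision tree model per column, then use those models
# to fill in the missing values (e.g., the missing Age entries).
pre.process <- preProcess(dummy.data, method = "bagImpute")
imputed.data <- predict(pre.process, newdata = dummy.data)
```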

  • @JerryWho49
    @JerryWho49 4 years ago

    Isn't there some sort of data leakage? You're imputing the missing ages using the entire data set. So the training set "knows" something about the test set. That's not good. I think you should split first and then use two pipelines for training and testing. Is there support for pipelines in caret?

  • @fredasefamilia
    @fredasefamilia 4 years ago

    Thank you for this. However, I tried implementing the code as written in IntroToMachineLearning.R and I get an error at line 159. I have tried it several times and the error message I get is:
    Loading required package: plyr
    Error in train.default(x, y, weights = w, ...) :
    The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight
    This is all confusing, given that all the columns specified are included in the code. Could this be the result of a bug? I'd appreciate a prompt answer. Thanks

  • @arindambpcsrkm
    @arindambpcsrkm 3 years ago

    @dave, I understood how you imputed Age. However, if we have, say, 200 missing values in the Embarked variable, will the same method used for imputing Age work? I mean, isn't it possible that for some rows both the Q and S dummy columns might have values close to 1? What to do in that case?

  • @farhanadham7237
    @farhanadham7237 3 years ago

    Wow, the parameter tuning for xgboost is amazing, because we know xgboost always takes a lot of time to train.

  • @sbdavid123
    @sbdavid123 4 years ago

    Great video, I have watched it several times at this point to get a better understanding of the caret package. It helped me out a lot. However, I have one question: why do you split into train and validation sets and then use cross validation on the train set? I always thought that cross validation was a repeated train-test split. That way you avoid evaluating your model on only one split, which by chance might be easy (or super hard) to predict, e.g. because the test subset contains more extreme values or the train subset contains more of the imputed instances. By repeating the process of splitting the data into train and test several times and averaging the performance metrics over all these splits, you get a better view of the real performance of the model. So why split into train and test subsets and then use cross validation on train? As I understand it now, it looks like you are reintroducing the problem cross validation is trying to solve. Would it not be better to skip the train/test split and use k-fold cross validation (which is basically a repeated split into train and test)? Thanks!

  • @a.useronly2266
    @a.useronly2266 2 years ago +1

    Great 👍🏻

  • @shorthand1121
    @shorthand1121 5 years ago

    If you get "subscript out of bounds" in the train() function, change the parallelization engine over to the future engine as it is better at exporting environments:
    library(parallel)
    library(future)
    library(doFuture)
    plan("multisession") #if you're seeing this error, you're likely on a Windows machine anyway
    registerDoFuture()
    And also comment out the makeCluster, registerDoSNOW, and stopCluster lines.

  • @coolhead8686
    @coolhead8686 4 years ago

    Are you hiring? I am the same as you: I have spent over 20 years doing system development, programmer analyst, data analyst, and data scientist work.

  • @atlantaguitar9689
    @atlantaguitar9689 4 years ago

    Great video... Do you feel it is necessary to use dummyVars before doing the imputation? Isn't it sufficient to do the imputation within the call to the train function as part of the preProcess argument? That is, is the conversion to one-hot encoding outside of the call to train strictly necessary?

  • @mahdip.4674
    @mahdip.4674 5 years ago

    Thanks for the tutorial. Talking about model-based imputation, let's say we have 3 numeric variables to impute.
    How does the imputation work when we want to impute the first variable? Will caret use a complete-case approach for the rest of the data? If so, how will it then impute the first variable if, for a given record, the second or third variable also has a missing value?
    What is the procedure here?
    Thanks.

  • @collinsouru3629
    @collinsouru3629 4 years ago

    I am reproducing your example but I'm stuck at training the model; it returns an error like: Error: The tuning parameter grid should have columns nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample. What could I be doing wrong?
    Here is the part which is returning the error:
    caret.cv

  • @bobbird4957
    @bobbird4957 7 years ago +1

    Dear David, great talk, thank you very much. I have a short question:
    how do I know which factors are included in the "best" model? Thus, which factors are most predictive in separating survivors from non-survivors?
    Thank you in advance!
    Best,
    Bob

    • @bobbird4957
      @bobbird4957 7 years ago +1

      Thanks Dave!

    • @Datasciencedojo
      @Datasciencedojo  7 years ago +3

      @Robert Daihatsu - If you mean individual factor levels, then that can be difficult to get from the models. Finding feature importance, however, is far easier. For example, the following code can be added to the end of the Meetup code file to get the feature importance:
      xgb.importance(feature_names = colnames(titanic.train),
      model = caret.cv$finalModel)
      HTH,
      Dave

  • @julianonas
    @julianonas 7 years ago +2

    Thank you for sharing ! Amazing Video and Instructions.

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Juliano Nascimento - Glad you liked the video!

  • @rajkamalsrivastav7696
    @rajkamalsrivastav7696 7 years ago +1

    Hi David, thanks for this session!!
    One question: is it always good to impute using caret (e.g. bagged decision trees for imputing age), or should we first do some EDA, such as finding a pattern in age using Pclass/Sex aggregation, and then impute age with that value?

    • @Datasciencedojo
      @Datasciencedojo  7 years ago +1

      @Raj kamal Srivastav - I tend to shy away from terms like "always" and "never" when it comes to data science. The only universal answer I've found is "it depends". :-)
      To answer your specific question, I always strongly suggest doing exploratory data analysis - in fact we spend a good chunk of day 1 in our bootcamp discussing EDA. However, it is often the case that even after EDA you might need a ML model to most accurately impute ages due to the underlying complexities in the patterns in the data.
      HTH,
      Dave

  • @bljangir7450
    @bljangir7450 4 years ago

    Simply excellent. I could not hold myself back from commenting even though a few minutes are still left. You are a genius at making things so interesting.

  • @Datasciencedojo
    @Datasciencedojo  7 years ago

    Meetup Starts at: 2:57

  • @aman_mashetty5185
    @aman_mashetty5185 6 years ago

    Thanks for the amazing video, Dave. It's going to help in the future too, but I don't know the grid search concept: what does it do? Can you explain to me in simple terms how it works and how it helps in tuning the model?

  • @acada
    @acada 2 years ago

    Excellent presentation, you are a great teacher. Thank you

  • @shaunoconnell9506
    @shaunoconnell9506 1 year ago

    very nice, i just used this package for an assignment. this got me enthusiastic to learn more

  • @NikosKatsikanis
    @NikosKatsikanis 7 years ago

    Hi, I am a js expert wanting to get into DS. What tools do you advise me to learn?

    • @navjotsingh2251
      @navjotsingh2251 6 years ago

      @Quantum Information - Learn what you need for the job you want. Different jobs require different tools for different tasks. Figure out what you want to do, then figure out what tools will get you there.

  • @TIKITAKANEWS
    @TIKITAKANEWS 6 years ago

    Thanks very much. Now I have somewhere to start and will see it through to the end.. Great!!

  • @henrique6748
    @henrique6748 3 years ago

    Thank you so much for sharing this!

  • @yannelfersi3510
    @yannelfersi3510 6 years ago

    Great video @Dave. Super helpful; I love the step-by-step Q&A.
    Just curious: is it 'good' practice to include the test set when imputing data? shouldn't it be done on the train set only?

  • @antzlck
    @antzlck 5 years ago

    Brilliant and great advert for your bootcamps!

  • @tamafun4745
    @tamafun4745 2 years ago

    Thank you very much Dave & team. Really enjoy the whole presentation and learn a lot!

    • @Datasciencedojo
      @Datasciencedojo  2 years ago

      Glad you liked it, Tama. Keep following us for more content.

  • @apoorvspydy
    @apoorvspydy 6 years ago

    Thanks! This was very helpful. Where can I get the rest of the videos on Machine Learning?

    • @Datasciencedojo
      @Datasciencedojo  6 years ago

      You can watch more of our Machine Learning tutorials here: tutorials.datasciencedojo.com/azure-machine-learning-tutorial-part-1/
      You can also find our other meetups here as well: tutorials.datasciencedojo.com/categories/community-talks/

  • @reubenschneider3921
    @reubenschneider3921 5 years ago

    Great guide, I was really struggling with a ML assignment and didn't realise what an absolute unit 'caret' is!

  • @CK-vy2qv
    @CK-vy2qv 6 years ago

    By far the best video out there for ML in R

  • @wereskiryan
    @wereskiryan 3 years ago

    Amazing video!

  • @joseramon4301
    @joseramon4301 4 years ago

    Thank you so much!

  • @drnabinpaudel6984
    @drnabinpaudel6984 4 years ago

    So, how do we implement this model to a new dataset ?

    • @gregorkvas6332
      @gregorkvas6332 3 years ago

      Using the predict() function on the trained caret model object. Best, Gregory
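      A sketch of that, assuming caret.cv is the object returned by train() and new.data has the same columns and preprocessing as the training data:

```r
# predict() dispatches on the caret model object and applies the final
# tuned model to the new rows, returning one prediction per row.
preds <- predict(caret.cv, newdata = new.data)
```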

  • @venustat
    @venustat 7 years ago +2

    Great video

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Statsvenu Manneni - Glad you liked the video!

  • @hasthigiSrivaradhan1
    @hasthigiSrivaradhan1 6 years ago

    thank you.

  • @junaideffendi4860
    @junaideffendi4860 7 years ago +3

    Great video, but I didn't see the use of train.dummy? You worked on the train dataset, which has the imputed age but not the dummy columns; please clear this up for me.

    • @junaideffendi4860
      @junaideffendi4860 7 years ago

      I get it now: the dummy variables were calculated in order to do the imputation. In this case you did the dummy variable step to teach, because there weren't any missing values in columns other than Age, so in reality we can skip the dummy part; correct me if I am wrong.
      Also, I actually thought that the dummy variables were created for the training part too. A question arises from that:
      how does the machine learning handle the categorical variables? Is it converting them into numerical values automatically, like one-hot encoding, or processing them directly as in a simple decision tree?

    • @junaideffendi4860
      @junaideffendi4860 7 years ago +1

      Thanks :)

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Junaid Effendi - If I understand your question correctly the following lines of code use train.dummy to generate a new matrix with all missing Age values imputed:
      pre.process

    • @Datasciencedojo
      @Datasciencedojo  7 years ago

      @Junaud Effendi - If you use caret's imputation feature via the preProcess() function then you need to convert to dummy variables. As I mention in the video, the preProcess() function does not work with factor variables. You would not want to skip this step as you are losing potential features that the bagged decision trees could use to potentially build more accurate imputation models.
      To answer your second question, "it depends". For example, in the case of the mighty Random Forest factors can be used directly so caret will do nothing. However, xgboost does not support factors by default. In this case caret is transforming the factors behind the scenes for you.
      HTH,
      Dave
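      The conversion to dummy variables Dave refers to can be sketched with caret's dummyVars(); the titanic data frame, and the idea that its first column is the Survived outcome, are assumptions here:

```r
library(caret)

# Build the one-hot encoding recipe from every predictor (the outcome
# column is dropped), then apply it to get an all-numeric matrix that
# preProcess() can work with.
dummy.vars <- dummyVars(~ ., data = titanic[, -1])
dummy.data <- predict(dummy.vars, newdata = titanic[, -1])
```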

  • @yishengkim9081
    @yishengkim9081 6 years ago

    Great video, but waiting ~ 5 mins to be recognized as having a question is troubling (between ~55:00 - 59:00). #WomeninDataScience

    • @Ivansnooze
      @Ivansnooze 6 years ago

      Troubling? Half the time she does not even put her hand up, and furthermore she keeps taking it down. It seems a bit of a stretch to make this a gender issue. I know there is probably a bias against women in this field, but not every situation should be used as a call to arms.

    • @yishengkim9081
      @yishengkim9081 6 years ago

      @@Ivansnooze Isn't that what men ALWAYS do, minimize the gender concerns of women? I'm a part-time chemistry professor, I don't need my students to keep their hands up for the entire lecture to recognize their questions. And when I see a student put their hand down after having raised it earlier, I'll double back to ask them if they still have a question. That's what good lecturers do.

    • @mushroomdew
      @mushroomdew 4 years ago

      @@yishengkim9081 he did speak to her, her question wasn't missed. Are you sure the causation isn't due to her being at the back of the room?