Take my courses at mlnow.ai/!
Super helpful!
Thanks; glad to hear it!
Hello,
I've done this project (thank you for it!), but I have a question:
You said that when you standardize the numerical features of the training set, the means and standard deviations calculated there should also be used for the test set.
What if I apply the transformation to the full data first, then split it randomly 80-20, train on the 80%, and test on the 20%?
I mean, this shouldn't matter, since the split is random and the sample means and standard deviations should therefore be close? Or am I making a mistake here?
Many thanks in advance!
Roland J
I've always wondered about this. I personally agree with you that this should be fine. Thanks!
@GregHogg Thank you for the confirmation. I got back a hit rate slightly above 80%, so I think it should be fine. Bottom line: never use different means and stds for train and test, which can't happen if we normalize the data before splitting :)
Yes, I believe this is correct!
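For anyone following this thread, here is a minimal sketch of the two approaches being compared, using scikit-learn's `StandardScaler` and `train_test_split` on toy data (the feature matrix is made up for illustration). The conventional approach fits the scaler on the training split only and reuses its learned statistics on the test split; with a random split, those statistics end up very close to the full-data ones, which is the point being made above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric feature matrix (hypothetical data, not the Titanic set)
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(1000, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit on the training split only, then reuse its learned
# means/standard deviations to transform the test split.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Because the 80-20 split is random, the train statistics are close to
# the full-data statistics, so the two approaches give nearly identical
# scaled values here. The train-only fit is still the safer habit: it
# generalizes to cases where the test data arrives later.
print(scaler.mean_, X.mean(axis=0))
```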
I tried to submit gender_submission.csv and accuracy on the test data was 0.765, which is even higher than with sub.csv :) But it is a good starting point to increase results.
That's funny, the base solution is better than the logistic regression? Lmao
Why not use pandas' get_dummies function for encoding?
pd.get_dummies(train_pd.Pclass, drop_first=True)
Sure, there are many different ways to do these things; go for it if you prefer :)
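For anyone curious what the suggested one-liner does, here is a quick sketch on a toy stand-in for the Titanic `Pclass` column (made-up values, not the actual dataset):

```python
import pandas as pd

# Toy stand-in for the Titanic Pclass column
train_pd = pd.DataFrame({"Pclass": [1, 2, 3, 3, 1]})

# One-hot encode the column; drop_first=True drops the first level
# (class 1) so the remaining columns aren't perfectly collinear,
# which matters for models like logistic regression.
dummies = pd.get_dummies(train_pd.Pclass, drop_first=True)
print(dummies.columns.tolist())  # [2, 3]
```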