Machine Learning Tutorial Python - 9 Decision Tree

  • Published: 22 Jul 2024
  • The decision tree algorithm is used to solve classification problems in machine learning. In this tutorial we will solve an employee salary prediction problem using a decision tree. First we will go over some theory and then do some coding practice. At the end, I have a very interesting exercise for you to solve.
    #MachineLearning #PythonMachineLearning #MachineLearningTutorial #Python #PythonTutorial #PythonTraining #MachineLearningCource #DecisionTree #sklearntutorials #scikitlearntutorials
    Code: github.com/codebasics/py/blob...
    csv file for exercise: github.com/codebasics/py/blob...
    Exercise solution: github.com/codebasics/py/blob...
    Topics that are covered in this Video:
    0:00 - How to solve classification problem using decision tree algorithm?
    0:26 - Theory (Explain rationale behind decision tree using a use case of predicting salary based on department, degree and company that a person is working for)
    2:10 - How do you select ordering of features? High vs low information gain and entropy
    3:52 - Gini impurity
    4:28 - Coding (start)
    9:11 - Create sklearn model using DecisionTreeClassifier
    13:32 - Exercise (Find out the survival rate of Titanic passengers using a decision tree)
    Do you want to learn technology from me? Check codebasics.io/?... for my affordable video courses.
    Next Video:
    Machine Learning Tutorial Python - 10 Support Vector Machine (SVM): • Machine Learning Tutor...
    Popular Playlists:
    Data Science Full Course: • Data Science Full Cour...
    Data Science Project: • Machine Learning & Dat...
    Machine learning tutorials: • Machine Learning Tutor...
    Pandas: • Python Pandas Tutorial...
    matplotlib: • Matplotlib Tutorial 1 ...
    Python: • Why Should You Learn P...
    Jupyter Notebook: • What is Jupyter Notebo...
    Tools and Libraries:
    Scikit learn tutorials
    Sklearn tutorials
    Machine learning with scikit learn tutorials
    Machine learning with sklearn tutorials
    To download csv and code for all tutorials: go to github.com/codebasics/py, click on a green button to clone or download the entire repository and then go to relevant folder to get access to that specific file.
    🌎 My Website For Video Courses: codebasics.io/?...
    Need help building software or data analytics and AI solutions? My company www.atliq.com/ can help. Click on the Contact button on that website.
    #️⃣ Social Media #️⃣
    🔗 Discord: / discord
    📸 Dhaval's Personal Instagram: / dhavalsays
    📸 Codebasics Instagram: / codebasicshub
    🔊 Facebook: / codebasicshub
    📱 Twitter: / codebasicshub
    📝 Linkedin (Personal): / dhavalsays
    📝 Linkedin (Codebasics): / codebasics
    🔗 Patreon: www.patreon.com/codebasics?fa...

Comments • 1K

  • @codebasics
    @codebasics  2 years ago +13

    Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced

    • @honeymilongton8401
      @honeymilongton8401 2 years ago

      It would be better for us if you could provide the slides, sir. Can you please send the slides as well?

    • @adiflorense1477
      @adiflorense1477 1 year ago

      Cool

    • @kisholoymukherjee
      @kisholoymukherjee 1 year ago

      Hi Dhaval sir, please note I tried to register in Python course. But the link is not working on the site

    • @Swormy097
      @Swormy097 10 months ago

      @codebasics
      Hello Sir, Regarding the encoding approach (label encoding) used in the video, I read on the sklearn documentation that it should be used only on the target variable (output "y") and not the input feature ("x"). The documentation stated that for input feature one should use either onehotencoder, ordinalencoder, or dummy variable encoding.
      Also, I was expecting that you use onehotencoder(OHE) since the input features (company, job and degree) are nominal and not ordinal variables. Is it best practice to use OHE for nominal variables or it just doesn't matter?
      Please could you clarify for me???
      Thank you.
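A minimal sketch of the one-hot alternative this comment asks about. The rows below are invented for illustration; only the column names `company`, `job` and `degree` follow the video, and `salary_more_than_100k` is an assumed target name. `pandas.get_dummies` creates one 0/1 column per category, so no artificial order is implied between the companies.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy rows, invented for illustration; only the feature names follow the video.
df = pd.DataFrame({
    "company": ["google", "google", "facebook", "facebook", "abc pharma", "abc pharma"],
    "job": ["sales executive", "business manager", "sales executive",
            "business manager", "sales executive", "business manager"],
    "degree": ["bachelors", "masters", "bachelors", "masters", "bachelors", "masters"],
    "salary_more_than_100k": [0, 1, 1, 1, 0, 1],
})

# One-hot encode the nominal input features: one 0/1 column per category.
X = pd.get_dummies(df[["company", "job", "degree"]])
y = df["salary_more_than_100k"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(X.columns.tolist())
```

With three companies, two jobs and two degrees this yields seven input columns; on a real dataset with many categories the column count grows accordingly.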

  • @Koome777
    @Koome777 7 months ago +8

    My model got a score of 98.6%. I dropped all the Age Na values which reduced the sample size from 812 to 714. I label-encoded the Sex column and then used a test size of 0.2 with the remainder of 0.8 as the training size. I am all smiles. Thanks @codebasics

  • @codebasics
    @codebasics  4 years ago +15

    Step by step roadmap to learn data science in 6 months: ruclips.net/video/H4YcqULY1-Q/видео.html
    Exercise solution: github.com/codebasics/py/blob/master/ML/9_decision_tree/Exercise/9_decision_tree_exercise.ipynb
    Complete machine learning tutorial playlist: ruclips.net/video/gmvvaobm7eQ/видео.html
    5 FREE data science projects for your resume with code: ruclips.net/video/957fQCm5aDo/видео.html

    • @bestineouya5716
      @bestineouya5716 4 years ago +2

      97.97% accurate

    • @rahulpatidar9905
      @rahulpatidar9905 4 years ago

      @@bestineouya5716 i also got the same accuracy

    • @praveenkamble89
      @praveenkamble89 4 years ago

      Great Explanation Sir, Thanks a lot for your efforts and help. I got 97.76% accuracy. I did not map male and female to 1, 2 instead used as it is. Is it necessary to do that ? is there any significance of it?

    • @harris7294
      @harris7294 3 years ago

      Exercise result: accuracy 0.8229665071770335
      Actually, I used your CSV file for training and the test.csv provided on Kaggle as test data, which:
      >> increased my training data (it would have been smaller if I had split my data)
      >> increased accuracy (as we have more data to train on)
      >> reduced the chance of the overfitting I would have had if I had used the same data for both training and testing
      Thank you for the great video.

    • @anonym9158
      @anonym9158 3 years ago

      0.98

  • @kartikeyamishra4641
    @kartikeyamishra4641 5 years ago +114

    This is by far the most straight forward and amazing video on decision trees I have come across! Keep making more videos Sir! I am totally hooked to your channel :) :)

    • @codebasics
      @codebasics  5 years ago +8

      Thanks kartikeya for your valuable feedback. 👍

  • @proplayerzone5122
    @proplayerzone5122 2 years ago +24

    Hi sir, I am a 10th grade student and I am learning ML and in the exercise My model got 81% accuracy😀 sir. Will Make many models while learning and share with you. Thanks for the tutorials sir.

    • @codebasics
      @codebasics  2 years ago +55

      It is ok to learn ML, but make sure you find time for outdoor activities, sports and some fun things. Childhood will never come back, so do not waste it in search of some shiny career. If you are that concerned, I would advise focusing on math and statistics at this stage and worrying about ML later.

    • @proplayerzone5122
      @proplayerzone5122 2 years ago +2

      @@codebasics ok sir. Thanks for guidance!

    • @kalaipradeep2753
      @kalaipradeep2753 9 months ago

      Hi bro now what doing....

    • @kalaipradeep2753
      @kalaipradeep2753 9 months ago

      How do I fill the empty values in the Age feature?
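Two common options mentioned elsewhere in this thread are filling the gaps with the median or dropping the incomplete rows. A minimal sketch on invented data, assuming only that the exercise CSV has an `Age` column with missing values:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; the exercise CSV has an Age column with gaps.
df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0]})

# Option 1: fill the gaps with the column median (less sensitive to outliers than the mean).
filled = df["Age"].fillna(df["Age"].median())

# Option 2: drop the rows where Age is missing instead.
dropped = df.dropna(subset=["Age"])

print(filled.tolist())
print(len(dropped))
```

Filling keeps every row for training, while dropping shrinks the dataset; which works better is something to check on held-out data.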

    • @toxiclegacy5948
      @toxiclegacy5948 5 months ago +5

      @@codebasics Absolutely correct. It's great to learn new things, but this is not the right age to learn all of this. Make more and more memories in childhood. I am 23 and trust me, life is very painful…

  • @ansh6848
    @ansh6848 2 years ago +5

    Actually this man has made learning Machine Learning easy for everyone whereas if you will see other channels they show big mathematical equations and formulas..which makes beginners uncomfortable in learning ML.
    But thanks to this channel.♥️🥰

    • @bhawnaverma5532
      @bhawnaverma5532 2 years ago

      very True. Complex concept explained in very understanding way. Hats off really

  • @oatilemothuloe9178
    @oatilemothuloe9178 2 years ago

    Played with the test_size a bit and I managed to push out a max score of 87%. Appreciate the tutorial a lot!

  • @nikhilrana668
    @nikhilrana668 3 years ago +17

    For those wondering what 'information gain' is, it is just the measure of decrease of entropy after the dataset is split.
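The definition above can be sketched in a few lines of plain Python. The helper names `entropy` and `information_gain` and the label lists are made up for illustration; this is not scikit-learn's internal code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Invented labels: 1 = salary above 100k, 0 = not.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
pure_split = [[1, 1, 1, 1], [0, 0, 0, 0]]    # each group has a single class
mixed_split = [[1, 1, 0, 0], [1, 1, 0, 0]]   # each group is still 50/50

print(information_gain(parent, pure_split))
print(information_gain(parent, mixed_split))
```

A split that separates the classes completely recovers the full parent entropy as gain, while a split that leaves both groups 50/50 gains nothing, which is why the tree prefers the first kind of feature near the root.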

  • @WestCoastBrothers_
    @WestCoastBrothers_ 3 years ago +7

    Incredible video! Thank you for sharing your knowledge. Scored 83.15%. I changed the hyperparameter "criterion" to entropy instead of gini and it consistently performed better. Looking forward to seeing how changing other hyperparameters affects accuracy.
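A hedged sketch of how such a comparison could be made less dependent on one lucky split. It uses synthetic stand-in data from `make_classification` rather than the Titanic CSV, and 5-fold cross-validation is an assumption of this example, not something shown in the video.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; substitute the encoded Titanic features and target.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

scores = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # 5-fold cross-validation is less noisy than a single train/test split.
    scores[criterion] = cross_val_score(model, X, y, cv=5).mean()

print(scores)
```

Which criterion wins depends on the data, so the printed numbers should be read as a comparison for one dataset only.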

    • @codebasics
      @codebasics  3 years ago +3

      That’s the way to go niko, good job working on that exercise

  • @WorldsTuber13
    @WorldsTuber13 4 years ago +4

    Your videos are absolutely awesome.... Those who want a career transition into DS basically spend more than 3k US dollars on their certification, and what they ultimately get is a diploma or degree certificate in data science, not an understanding of what is actually happening in data science. But when a scholar like you trains us, we come to know what's happening in it.

    • @codebasics
      @codebasics  4 years ago +2

      K Prabhu, thanks for your kind words of appreciation.

  • @dataguy7013
    @dataguy7013 1 year ago

    Best description of Information gain, your explanation is really the only resource that explains the intuition well

  • @nnasirhussain
    @nnasirhussain 1 year ago +1

    Excellent tutorial. In the exercise I used three different methods to fill Age:
    1- Backward, Forward, Median of Age
    2- Median of Female_Survive to fill Female_Survive_Age and Median of Male_Survive to fill Male_Survive_Age and same for Not survive.
    3- Interpolate Method.
    Using train_test_split of 0.3 test size. I get max of 82% accuracy and I also change gini to entropy for each approach

  • @larrybuluma2458
    @larrybuluma2458 3 years ago +4

    Thanks for this tutorial mate, it is the best straight forward DTC tutorial.
    Using entropy i got an 81% accuracy and,
    using gini i have a 78% accuracy

    • @codebasics
      @codebasics  3 years ago +3

      That’s the way to go Larry, good job working on that exercise

  • @sujankatwal9255
    @sujankatwal9255 4 years ago +8

    Thank you so much for the tutorial. I'm doing all the exercises. I got an accuracy of 81% on the Titanic dataset.

    • @codebasics
      @codebasics  4 years ago

      Sujan, that's a decent score. Good job 👍👏

  • @franky0226
    @franky0226 4 years ago +8

    Got an accuracy of 78.92
    Thanks for the Lovely tutorial !

  • @aayushichaudhari9357
    @aayushichaudhari9357 1 month ago

    Hello sir, I received an accuracy of 97.97% for the given exercise. Thank you for the wonderful tutorials, all of them are very helpful and I am performing all exercises that you give at the end of the video.

  • @stephenngumbikiilu3988
    @stephenngumbikiilu3988 2 years ago +4

    Thanks for these awesome videos. I have been learning a lot through your ML tutorials.
    I replaced the missing values in the 'Age' column with the median. My test set was 20% and my accuracy on test data was 99.44%.

    • @AnanyaRay-ct8nx
      @AnanyaRay-ct8nx 1 year ago +1

      how? can u share the solution?

    • @vikassengupta8427
      @vikassengupta8427 3 months ago

      There is high chance that the model is overfitted, it is not generalized

    • @vikassengupta8427
      @vikassengupta8427 3 months ago

      And chances are that your model has already seen your test data. Better rerun from the first cell once and check...

  • @rahulkambadur147
    @rahulkambadur147 5 years ago +4

    Do you have anything related to sentiment analysis / text mining / text analysis? Please make a tutorial on text analytics, as the other videos are so good.
    I also request you to create charts for AUC and also a model evaluation according to the CRISP-DM model.

  • @DataScienceHarrison
    @DataScienceHarrison 5 months ago +1

    Thanks for the video. My model got an accuracy of 83.5%. Glad to be this far with the data science roadmap. Continue with good work sir.

  • @minsaralokunarangoda4251
    @minsaralokunarangoda4251 1 month ago +1

    Thanks for the awesome tutorial....
    Dropped all na values in Age column which reduced the sample size from 812 to 714 and ran the model couple times, the best accuracy I got was 83.21%

  • @g.scholtes
    @g.scholtes 2 years ago +17

    In cell In [8] you use the "le_company" LabelEncoder object 3 times and never use the "le_job" and "le_degree" objects. It still works, so my guess would be that you only need one LabelEncoder object to do the job.

    • @rajubhatt2
      @rajubhatt2 1 year ago +1

      label encoder basically converts the categorical to numerical, since job and degree are categorical you still need them to be LabelEncoded. and he used them see carefully using fit_transform().

    • @omdusane8685
      @omdusane8685 1 year ago

      @@rajubhatt2 he encoded them using company object Only though

    • @AkhileshKumar-mg9vs
      @AkhileshKumar-mg9vs 1 year ago

      Well, here it worked because sir used fit_transform, but if he had split the data into train and test sets, he would have had to use transform on the remaining test set, and for that a separate instance would be required for each column.
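The point made in this reply can be sketched as follows. The category values are illustrative, and the variable names `shared`, `le_company` and `le_job` are chosen for this example.

```python
from sklearn.preprocessing import LabelEncoder

companies = ["google", "facebook", "abc pharma"]
jobs = ["sales executive", "business manager", "computer programmer"]

# Reusing one encoder: each fit_transform call refits it, so only the
# last column's classes remain stored on the object.
shared = LabelEncoder()
shared.fit_transform(companies)
shared.fit_transform(jobs)
print(shared.classes_)  # job titles only; the company mapping is gone

# One encoder per column keeps every mapping available for later
# transform() or inverse_transform() calls on new data.
le_company = LabelEncoder().fit(companies)
le_job = LabelEncoder().fit(jobs)
print(le_company.transform(["google"]))
```

So a single object is enough when every column is encoded once with `fit_transform` and never revisited, which is the situation in the video; separate objects matter as soon as the fitted mapping has to be reused.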

    • @PAWANKELA-rh7yj
      @PAWANKELA-rh7yj 1 month ago

      When I use only one object, the first 2 rows are dropped from my dataset. Why??

  • @kuldeepsharma7924
    @kuldeepsharma7924 4 years ago +5

    Got an accuracy of 97.20%
    Dropped all rows whose values were missing.
    Thank you, Dhaval sir..

    • @codebasics
      @codebasics  4 years ago +4

      Kuldeep, that is indeed a nice score. good job buddy.

    • @elvenkim
      @elvenkim 2 years ago

      Mine is 98.459%. Likewise I removed all missing data for Age.

    • @ShubhamSharma-qb1bw
      @ShubhamSharma-qb1bw 2 years ago

      @@elvenkim Why are you removing the missing values when it is possible to fill them with either the mean or the median? Which one to use depends on the outliers present in the Age column.

  • @Pacificatorrr
    @Pacificatorrr 8 days ago

    Hi! Thank you for this playlist

  •  2 years ago +1

    Help me understand this: I used different methods to encode the strings in the Sex column (1: LabelEncoder, 2: get_dummies, 3: map), then filled NaN values with the mean() method and used the same test_size for all three encoding methods, BUT I got different accuracies. Can you tell me why?

  • @pablu_7
    @pablu_7 4 years ago +4

    I got 98.4 % in titanic data set . Thank you Sir , you are the best.

    • @codebasics
      @codebasics  4 years ago +1

      Oh wow, good job arnab 👍😊

    • @jayrathod2172
      @jayrathod2172 3 years ago +5

      I don't want to hurt your feelings, but 98.4% is only possible if you are checking the model score on the train data instead of the test data.
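A minimal sketch of the difference this reply describes, using synthetic stand-in data with deliberate label noise (`flip_y`) rather than the Titanic CSV. An unconstrained tree memorises the rows it was fitted on, so scoring on those rows says little about unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y), not the Titanic CSV.
X, y = make_classification(n_samples=600, n_features=8, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # rows the tree has already seen
test_acc = model.score(X_test, y_test)     # held-out rows

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

If a reported score is close to 100%, checking which of these two calls produced it is a quick first diagnostic.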

    • @blaze9558
      @blaze9558 6 months ago

      true@@jayrathod2172

    • @vikassengupta8427
      @vikassengupta8427 3 months ago

      @@jayrathod2172 Yes, I was about to say that. It is also possible if you have changed the random state multiple times, so your model has seen all your data and is now overfitted.

  • @alexplastow9496
    @alexplastow9496 3 years ago +3

    Thanks for helping me get my homework done, by God it was a mistake to wait till the last day

    • @yeru2480
      @yeru2480 3 years ago

      oh i couldn't agree more

  • @dwisetyoaji5007
    @dwisetyoaji5007 3 years ago +1

    I got a warning and hit a dead end when label encoding input['Fare']:
    "TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'method']"
    I tried to solve it by converting to int or string but failed. Can anyone help me?

  • @anitama5377
    @anitama5377 5 years ago +2

    Very nice video! Can you use anything as target or does it need to be binary? so is it also possible to use the salary for example?

  • @noorameera26
    @noorameera26 3 years ago +3

    Will never get tired to say thank you at every video I watched but honestly, you're the best! :) Keep posting great videos

    • @codebasics
      @codebasics  3 years ago +4

      I am happy this was helpful to you.

  • @abhishekkhare6175
    @abhishekkhare6175 3 years ago +4

    got 97.4% accuracy filled the empty blocks in age with mean.
    thanks a lot for perfect tutorial

    • @nitinmalusare6763
      @nitinmalusare6763 2 years ago

      How to calculate accuracy for the above dataset mentioned in the video

    • @muskanagrawal9428
      @muskanagrawal9428 4 months ago

      thanks it helped me increase my accuracy

  • @suleyman4166
    @suleyman4166 3 years ago

    Is there a way to get the mapping after using LabelEncoder? I used LabelEncoder in a larger, different dataset from this one. After I fit the model, I want to try predictions but I don't know which string belongs to which code. Help please.. Thanks.
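One way to recover the mapping, sketched on invented company names: a fitted `LabelEncoder` stores the original strings in its `classes_` attribute, ordered by the integer code assigned to each, and `inverse_transform` converts codes back to strings.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["google", "facebook", "abc pharma", "google"])

# classes_ holds the original strings, ordered by the integer assigned to each.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original strings.
print(le.inverse_transform([2, 0]))
```

This only works while the fitted encoder object is still around, so it helps to keep one encoder per column instead of refitting a shared one.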

  • @ybezginova
    @ybezginova 1 year ago

    Hey! Where can I get a dataset for the exercise? I cannot find it anywhere in the description

  • @krijanprajapati6816
    @krijanprajapati6816 4 years ago +3

    Thank you so much sir, I really appreciate your tutorial, I learnt a lot

    • @codebasics
      @codebasics  4 years ago +2

      Krijancool, thanks for the comment. By the way your name is really cool 😎

  • @KallolMedhi
    @KallolMedhi 5 years ago +5

    can anyone tell me why didn't we use OneHotEncoding in this example????
    does it mean that we need dummy variable only in Regression algorithms???

    • @daisydiary1895
      @daisydiary1895 5 years ago

      I also got the same question. I appreciate if somebody help.

    • @daisydiary1895
      @daisydiary1895 5 years ago

      Maybe here is the answer: "Still there are algorithms like decision trees and random forests that can work with categorical variables just fine". datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor

    • @swatiamonkar9827
      @swatiamonkar9827 4 years ago

      use pandas.get_dummies

    • @amalsunil4722
      @amalsunil4722 4 years ago

      Using One hot encoding worsens the accuracy of trees...therefore it's recommended to use label encoding

  • @aryanmadhavverma
    @aryanmadhavverma 3 years ago

    I filled the null values in Age column to Age.mean(). I also label encoded the 'Sex' column followed by a train_test_split and normalization using Standard Scaler. When I checked my accuracy score, it came out to be just 77%. What did I do wrong?

  • @ShivamYadav-sb3vc
    @ShivamYadav-sb3vc 3 years ago

    in the inputs do we have to take all the columns except the target or we can take any number of columns

  • @ganeshyugesh9559
    @ganeshyugesh9559 2 years ago +4

    i have only started to learn about data science using python and i have a question: Why use labelencoder rather than getting dummy variables for the categorical variables? Is it more efficient using labelencoder?

    • @yourskoolboy
      @yourskoolboy 11 months ago

      I prefer the .get_dummies()

  • @ajaykumaars2154
    @ajaykumaars2154 4 years ago +7

    Hi Sir, Thanks for the great video.
    I've a question, why didn't we use one hot encoding here for our categorical variables?

    • @codebasics
      @codebasics  4 years ago +3

      We can but for decision tree it doesn't make much difference that's why I didn't use it

    • @ajaykumaars2154
      @ajaykumaars2154 4 years ago

      @@codebasics Ohh, OK Sir. Thank you

    • @whatever_5913
      @whatever_5913 3 years ago +1

      @@codebasics But then doesn't the model give a higher priority(value) to Facebook than to google on the basis of the number assigned in Label Encoding ...just confused here.

  • @snehasneha9290
    @snehasneha9290 4 years ago

    sir in the PClass columns the values are present like this 1st,2nd,3rd...... so how to change these values into an integer when I am using labelEncoder() I got an error

  • @RA-pi1lg
    @RA-pi1lg 4 years ago

    Great video! Thank you. I have a question: I am working on a logistic regression problem and trying to feature-engineer some text columns. Do you think LabelEncoder can serve my purpose?
    Thanks Again!!

  • @naveenkalhan95
    @naveenkalhan95 4 years ago +12

    really appreciate your work. learning a lot... just want to confirm something from the tutorial @7:40 you are using fit_transform with le_company object for all the other columns and did not use le_job object and le_degree object. is it ok? or should we do it? Thank you very much again.

    • @sadiqabbas5239
      @sadiqabbas5239 3 years ago +1

      That's just the variable name you can use that way too..

  • @usmanasad3146
    @usmanasad3146 5 years ago +3

    As usual, all your videos are awesome to watch. Thanks for the same :)

  • @mithunjain4834
    @mithunjain4834 3 years ago +2

    can you please say why fit_transform () is used with labelencoder?

  • @danielnderitu5886
    @danielnderitu5886 2 years ago

    Learning quite a lot through these wonderful tutorials, thanks Codebasic.

  • @udaysai2647
    @udaysai2647 5 years ago +6

    Great Tutorials keep going but I have a doubt why haven't you used onehotencoder for company here as it is nominal variable? and please make a tutorial on what exactly these parameters are and on random forests

    • @swatiamonkar9827
      @swatiamonkar9827 4 years ago +1

      true, one hot encoding is better than labelEncoder as assigning categories would results in errors in prediction if that feature is chosen, because higher category is considered better over the others. so in this case if google =0 and Fb =1 , then FB>Google.

    • @aravindabilash151
      @aravindabilash151 3 years ago

      @@swatiamonkar9827 Thank you for the clarification, actually i was trying it with OneHotEncoder and resulted in mis-prediction.

  • @valapoluprudhviraj9778
    @valapoluprudhviraj9778 4 years ago +18

    Hurray! Sir i got an accuracy of 97.38% by using interpolate method for Age column.😍✨

    • @HipHop-cz6os
      @HipHop-cz6os 3 years ago +5

      Did u use train_test_split method

    • @codebasics
      @codebasics  3 years ago +3

      Good job Prudhvi, that’s a pretty good score. Thanks for working on the exercise

    • @jixa2109
      @jixa2109 1 year ago +2

      It was easy.. i got 98.3%

  • @helloWorldPlus
    @helloWorldPlus 2 years ago

    Hello! What can be done if we have an overfitting situation? (training set accuracy of 0.88 vs testing set accuracy of 0.4)
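One common remedy for an overfitted tree is to constrain its growth. A hedged sketch on synthetic stand-in data; the values `max_depth=3` and `min_samples_leaf=10` are arbitrary starting points chosen for this example, not recommendations from the video.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with label noise; substitute your own features and target.
X, y = make_classification(n_samples=600, n_features=8, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree keeps splitting until it memorises the training rows.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Capping depth and leaf size forces the tree to learn coarser rules.
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
shallow.fit(X_train, y_train)

for name, model in (("deep", deep), ("shallow", shallow)):
    print(name, model.get_depth(),
          round(model.score(X_train, y_train), 3),
          round(model.score(X_test, y_test), 3))
```

The thing to watch is the gap between the train and test columns: a constrained tree usually narrows it, though whether test accuracy itself improves depends on the data.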

  • @Egitam-ow7ih
    @Egitam-ow7ih 3 months ago +1

    A question I have: aren't we supposed to use one-hot encoding since the variables are not ordinal? Or is it that decision trees take care of it, since they don't consider the magnitude of the features but rather the values of the features to determine the rules?

  • @moeintorabi2205
    @moeintorabi2205 4 years ago +8

    There are some NaN values in the Age column. I filled them through padding. Also, I split my data for testing and in the end I got an accuracy of 0.8.

    • @piyushtale0001
      @piyushtale0001 2 years ago

      Use fillna with median and accuracy will be 0.9777 by normal method

    • @tejassrivastava6971
      @tejassrivastava6971 2 years ago

      @@piyushtale0001 i have used median() for Pclass, Age and Fare but got score = 78 around. How to improve?

  • @irmscher9
    @irmscher9 5 years ago +8

    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    for x in features.columns:
        features[x] = le.fit_transform(features[x])

    • @prabur3296
      @prabur3296 4 years ago

      How to write the predicted values into a csv file
      For eg: model.predict(test_data), I want the output array in a csv file submission.csv
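A minimal sketch of one way to do this with pandas. The tiny training and test frames are invented for illustration; `test_data` and `submission.csv` follow the names used in the question.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny invented training and test frames, for illustration only.
train = pd.DataFrame({"Pclass": [1, 3, 2, 3], "Age": [38, 22, 30, 25], "Survived": [1, 0, 1, 0]})
test_data = pd.DataFrame({"Pclass": [3, 1], "Age": [24, 40]})

model = DecisionTreeClassifier(random_state=0)
model.fit(train[["Pclass", "Age"]], train["Survived"])

# Wrap the prediction array in a DataFrame, then write it without the index column.
submission = pd.DataFrame({"Survived": model.predict(test_data)})
submission.to_csv("submission.csv", index=False)
```

If the submission needs an identifier column as well, it can be added as another key in the dictionary before calling `to_csv`.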

  • @bhumitbedse8156
    @bhumitbedse8156 2 years ago +2

    Hello sir, at 7:50 LabelEncoder objects are created for all the columns (company, job and degree), but when we call fit_transform, why is only le_company used?
    Should we write le_job.fit_transform() and le_degree.fit_transform() for job and degree?
    Am I right? Please answer 😶

  • @maruthiprasad8184
    @maruthiprasad8184 2 years ago

    Got accuracy as 76.22 %. Tried by tweaking train data & test data but no significant difference. Thank you very much for simple & clear explanation.

    • @kalaipradeep2753
      @kalaipradeep2753 9 months ago

      How do I fill the empty values in the Age feature?

  • @mohammedalshen3147
    @mohammedalshen3147 4 years ago +4

    Thank you so much for making it very simple. As an ML learner, will do we need to understand the code behind each of these sklearn functions ?

    • @codebasics
      @codebasics  4 years ago +4

      Not necessary. If you know the math and internal details, it can help if you want to write your own customised ML algorithm, but otherwise no.

    • @areejbasudan4732
      @areejbasudan4732 1 year ago

      @@codebasics can you recommend videos for understanding the math behind it, thanks

  • @ss57hd
    @ss57hd 5 years ago +5

    Your VIdeos are always Awesome!
    Can u suggest me some websites where I can find Questions like those in ur Excercises and all?

    • @codebasics
      @codebasics  5 years ago +6

      Hey, honestly I am not aware of any good resource for this. Kaggle.com is there but it is for competition and little more advanced level. Try googling it. Sorry.

  • @haziq7885
    @haziq7885 2 years ago

    hi wouldnt label encoder means you're assigning some sort of ordering to the values ?

  • @zainhana2968
    @zainhana2968 2 years ago

    i start to learn about machine learning and your video help me so much to make understanding

  • @gaganbansal386
    @gaganbansal386 3 years ago +6

    Why we have not created dummy variables here as we have done in Logistic Regression using OneHotEncoder

    • @mohitb5230
      @mohitb5230 3 years ago

      In the one-hot encoding tutorial you mentioned it's better because then we don't have encodings that are related to each other. Please clarify. These videos are teaching me a lot.

    • @anshulagarwal6682
      @anshulagarwal6682 2 years ago

      Yes same doubt. Have you cleared your doubt? If yes, then please tell.

    • @anshulagarwal6682
      @anshulagarwal6682 2 years ago

      I think company should be given one hot encoding while job and degree should be label encoded.

  • @tejobhiru1092
    @tejobhiru1092 3 years ago +3

    thank you for such amazing, well detailed and easy to understand tutorial(s) !
    im following your channel exclusively for learning ML, along with kaggle competitions.
    also recommending your channel to my peers.
    great work..!
    PS - i got 75.8% as the score of my model in for the exercise.
    any tips to improve the score?

    • @shreyansengupta2594
      @shreyansengupta2594 2 years ago

      take test_size=0.5 it increases to 78.15%

    • @pranav9339
      @pranav9339 1 year ago

      re execute the test train split function as it generates rows randomly. Then Again fit the model and execute. Continue this for 4-5 time until u get somewhere around 95% accuracy. So this set of data is the most accurate for training the model.

  • @rdwmuzic
    @rdwmuzic 3 years ago

    Hello, thank you for creating this content. I am specifically looking for something that will help me do this task: an Excel sheet has a lot of raw data for x, y and z, where z = x - 2y. Can I query the dataset if I know z but need to find out which combinations of x and y will yield that particular z?

  • @rishabh9410
    @rishabh9410 5 years ago

    Can we convert these labels with get_dummies instead of the label encoder? Please explain if there is any difference.

  • @anujack7023
    @anujack7023 3 years ago +7

    I got 74.4% accuracy. it is good to do everything by my own....

    • @codebasics
      @codebasics  3 years ago +1

      That’s the way to go anujack, good job working on that exercise

  • @kirankumarb2190
    @kirankumarb2190 3 years ago +4

    Why didn't we use dummy column concept here like we did for linear regression?

    • @naveedarif6285
      @naveedarif6285 3 years ago +1

      As in trees we have many levels so here dummy variables concept doesnt work well so we try to avoid it

    • @snehagupta-xz1fs
      @snehagupta-xz1fs 3 years ago

      @@naveedarif6285 how can we train and split dataset in this? Please help

  • @leoadi3833
    @leoadi3833 3 years ago

    Sir i want to ask. Is it possible to get multiple target values. E.g if i have more than 1 target values on my input. This tree is giving only 1 .

  • @blaze9558
    @blaze9558 6 months ago

    thanks a lot sir i just learnt so many things without getting bored(usually we don't get to do hands on for these topic), this was super helpful

  • @musicsense2799
    @musicsense2799 3 years ago +8

    Amazing Video! But I have some doubts please help me here:
    1. We made three Label encoder instances here. Cant we use just one to encode all three?
    2. We use label encoding and not one-hot encoding; however, the latter would make more sense, as our model might assume that our variables have some order/precedence.
    It would be great if you clarify my doubts. Thanks!

    • @paulkornreich9806
      @paulkornreich9806 2 years ago +4

      It is necessary to understand the underlying logic of the algorithm. In regression, the algorithm tries to fit a line, a curve (or a higher-dimensional object in SVM), so the relative value (the order, or where it is on the axis) matters. In a decision tree, the algorithm is just asking yes/no questions, such as "Is the company Facebook?" or "Does the employee have only a bachelor's degree?", so the order is not significant. Therefore, the label encoder is valid for a decision tree.
      While it could have been possible to lump the label encoders into one, say by using a power of 10 to distinguish them, it would have given too much weight to the highest power of 10 (the algorithm understands numbers, so it is going to ask >/< /= questions), but the whole point of using decision tree was for *the algorithm* to find the precedence of features that will give the quickest prediction. Therefore it is better to have more features (i.e. more Label encoders).
      Then, if more features is better, one could re-ask the question of why not one-hot encoding, that would give even more encoders. Now, the issue is the tradeoff of accuracy vs conciseness. Here, there were only 3 companies, but there could be a case where a problem was examining over 100 companies. Having a one-hot encoder for all the companies would get quite cumbersome.
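To see what kind of questions the tree actually asks of a label-encoded column, the fitted rules can be printed with `export_text`. This is a sketch on invented data in which only one company is in the positive class. scikit-learn's trees split on numeric thresholds, so the "Is the company Facebook?" question is expressed as threshold tests on the integer code.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: only employees of "facebook" are in the positive class.
companies = ["abc pharma", "facebook", "google", "abc pharma", "facebook", "google"]
target = [0, 1, 0, 0, 1, 0]

le = LabelEncoder()
X = le.fit_transform(companies).reshape(-1, 1)  # abc pharma=0, facebook=1, google=2

model = DecisionTreeClassifier(random_state=0).fit(X, target)

# The printed rules are threshold tests on the integer codes. Isolating the
# middle code (facebook=1) takes two splits, one on each side of it.
print(export_text(model, feature_names=["company_code"]))
```

The tree still classifies every row correctly, but it needs an extra level to do so, which is one practical cost of integer codes for nominal features.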

  • @eliashossain9849
    @eliashossain9849 4 years ago +6

    Exercise result for the titanic dataset: Score: 0.77 (using Decision Tree Classifier)

    • @cyberversary262
      @cyberversary262 3 years ago

      DUDE CAN U PLS SHARE ME THE CODE.... IM GETTING ACCURACY 1.0

    • @prakashdolby2031
      @prakashdolby2031 3 years ago

      @@cyberversary262 you are giving entire dataset to get trained ,
      Better try with test_size != 1 (use 0.3-0.2 ) to get better results

    • @cyberversary262
      @cyberversary262 3 years ago +1

      @@prakashdolby2031 dude I have asked this question 3 months ago 😂😂😂

  • @krupagajjar5410
    @krupagajjar5410 1 year ago

    What is the datatype of the target variable? I executed model.fit(inputs_n, target) and it threw the error below: ValueError: Unknown label type: 'unknown'. Please help.

  • @prateeksinha08
    @prateeksinha08 11 months ago +1

    ValueError: could not broadcast input array from shape (2,712) into shape (1,712)
    I'm getting this error whenever I try to fit (xtrain, ytrain) in the model.
    Can anyone please resolve it??

  • @patelshivam1965
    @patelshivam1965 5 years ago +4

    Please can any one tell me how to increase our model's accuracy? i.e. Score

    • @codebasics
      @codebasics 5 years ago +2

      Increasing the score is an art as well as a science. If your question is specific to decision trees, try fine-tuning model parameters such as criterion, tree depth etc. You can also try some feature engineering and see if it helps.
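
One possible way to try those parameters systematically is a small grid search. This is a sketch on synthetic data, and the grid values are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in your own inputs and target.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Candidate values for the split criterion and the tree depth.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, None],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```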

    • @samitpatra8615
      @samitpatra8615 4 years ago +2

      I tried increasing the training data and the score increased.

  • @ritamsadhu2873
    @ritamsadhu2873 1 year ago +3

    Score is 97.75% for exercise dataset. Filled the null values in Age column with median value

    • @RohithS-ig4hl
      @RohithS-ig4hl 1 year ago

      I did the same thing, but I still get accuracy around 79%. Any suggestions?

    • @istiakahmed3033
      @istiakahmed3033 8 months ago

      @RohithS-ig4hl Hey, I got 80% accuracy. I also got low accuracy like you.

  • @nihalchidambaram3395
    @nihalchidambaram3395 2 years ago

    Hello Sir,
    Great tutorial. My model's accuracy for the titanic dataset came out to be 82%. Thank you.

  • @sadbinshakil1955
    @sadbinshakil1955 3 years ago

    I truly appreciate your effort, sir. please make more videos. Take love from Bangladesh

  • @learnerlearner4090
    @learnerlearner4090 4 years ago +3

    Thanks so much for these tutorials! These are the best tutorials I've found so far. The code you shared for the examples and exercises is very helpful.
    I got a score of 76% for the exercise. How is it possible to get a different score for the same model and the same data? The steps I followed are the same too.

    • @codebasics
      @codebasics 4 years ago +4

      train_test_split generates different samples every time, so running your code multiple times gives different scores. Specify random_state in the train_test_split call, let's say 10, and you will get the same score on every run, because your train and test samples are now the same between runs.
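
A minimal sketch of that behaviour, using a tiny made-up array rather than the exercise data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two separate calls with the same random_state produce the same split.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)

same_split = np.array_equal(a_test, b_test) and np.array_equal(a_train, b_train)
print(same_split)
```

The tree itself can also involve some randomness when candidate splits tie, so passing random_state to DecisionTreeClassifier as well is a reasonable extra step if scores still vary.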

    • @learnerlearner4090
      @learnerlearner4090 4 years ago +1

      @codebasics Got it. Thanks!

    • @anujvyas9493
      @anujvyas9493 4 years ago +1

      Same, I too got an accuracy of 76%, but I was aware of the random_state parameter! :)

  • @yashchavan1350
    @yashchavan1350 3 years ago +3

    Sir, in the exercise you used map on the Sex column and I did it using LabelEncoder. I like it when you show us a different approach to perform the same task. One more question, sir: instead of the mean, why can't we use the mode on the Age column? By the way, my score is 79%.

  • @yazdanmovahedi6418
    @yazdanmovahedi6418 1 year ago

    Can't download the exercise dataset from GitHub. Getting this error: "ParserError: Error tokenizing data. C error: Expected 1 fields in line 28, saw 384"

  • @muntazirali2195
    @muntazirali2195 11 months ago

    Why does this warning show up whenever I try to predict with my model? "D:\Software\INSTALLED\Annaconda\Lib\site-packages\sklearn\base.py:464: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
    warnings.warn("
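
This warning typically appears when a model is fitted on a pandas DataFrame but asked to predict on a plain list or array. A minimal sketch of one way to avoid it, assuming made-up Titanic-style columns: pass a DataFrame with the same column names to predict.

```python
import warnings

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows with Titanic-style column names (illustrative only).
X = pd.DataFrame({"Pclass": [1, 3, 2, 3], "Age": [38.0, 22.0, 30.0, 4.0]})
y = [1, 0, 1, 0]

# Fitting on a DataFrame makes the model remember the column names.
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Build the new sample as a DataFrame with the same columns.
new_row = pd.DataFrame([[3, 24.0]], columns=X.columns)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    prediction = model.predict(new_row)

# Collect any feature-name warnings raised during predict.
name_warnings = [w for w in caught if "feature names" in str(w.message)]
print(prediction, len(name_warnings))
```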

  • @mihirsheth9918
    @mihirsheth9918 3 years ago +1

    Sir, I have a doubt regarding the .score() method of sklearn.tree.DecisionTreeClassifier and accuracy_score() from sklearn.metrics.
    You computed the performance of the model with .score(). What if we compute it with accuracy_score()? Are they the same?
    And what if, for a certain classifier, accuracy is not the best metric to measure performance, i.e. the best metric might be precision or recall or something else?
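
For scikit-learn classifiers, .score() reports mean accuracy on the data passed in, so it should match accuracy_score() on the same predictions. A minimal sketch on synthetic data (an assumption, not the exercise dataset), which also computes precision and recall for comparison:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Two routes to the same accuracy number.
via_score = model.score(X_test, y_test)
via_metric = accuracy_score(y_test, y_pred)

# Other metrics are computed from the same predictions.
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(via_score, via_metric, precision, recall)
```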

  • @seiv-
    @seiv- 2 years ago

    Is it possible to make a prediction for inputs never seen in the dataframe? I mean, what if we give as input a sample of a male, age 44, fare 30, pclass 2, which is not in the current dataset? What will the model predict, and how is this done?

  • @dhrumil5811
    @dhrumil5811 2 years ago

    Dhaval sir, just to know: with this model, every time we want to predict for a combination, we have to find the code assigned to each category, and it becomes difficult to identify. Can you clarify how we can make this easier? Thanks.
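
One way to avoid looking codes up by hand is to keep the fitted encoder and let it translate category names. A minimal sketch, assuming a LabelEncoder named le_company fitted on the three company names from the video:

```python
from sklearn.preprocessing import LabelEncoder

# Assumed setup: an encoder fitted on the company names.
le_company = LabelEncoder()
le_company.fit(["google", "abc pharma", "facebook"])

# Translate a category name to its code, and back again.
code = le_company.transform(["google"])[0]
name = le_company.inverse_transform([code])[0]

# Full name -> code table, built from the fitted classes.
mapping = dict(zip(le_company.classes_,
                   le_company.transform(le_company.classes_)))
print(code, name, mapping)
```

LabelEncoder assigns codes in sorted order of the category names, which is why the table can be rebuilt from classes_.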

  • @aslammahmood5850
    @aslammahmood5850 4 years ago +1

    I am unable to download the CSV file from github
    please provide a CSV file link

  • @krishnasahu2935
    @krishnasahu2935 4 years ago +1

    where can I find machine learning questions for python ?

  • @BilalKhan-uc4xm
    @BilalKhan-uc4xm 1 year ago

    How can you connect the scatter plot explanation to this dataset? Because then we can use a decision tree here.

  • @amanyadav411
    @amanyadav411 4 years ago +1

    My model accuracy is 79.32
    Thanks for the nice data science series🙏

  • @lakshsinghania
    @lakshsinghania 2 months ago

    Sir, why is one-hot encoding not done for the first example, given that all the values there are categorical?

  • @ximul61
    @ximul61 4 years ago

    array([0.75])
    is the prediction of model.predict([[3,0,24,7.225]]). Is this OK? I am using titanic.csv as my dataset.

  • @abhisheksharma1031
    @abhisheksharma1031 2 years ago

    Such a nice explanation, now I don't need to watch any further videos. This video was very satisfactory and convincing!!

  • @dwisetyoaji5007
    @dwisetyoaji5007 3 years ago

    After trying, I gave up and opened the exercise solution. I ran exactly the same code in my own Jupyter notebook and got an error:
    "float() argument must be a string or a number, not 'method'"
    I think it's because the dataset was updated. Can anyone help me?

  • @sonal9792
    @sonal9792 2 months ago

    Hello,
    At timestamp 8:09, should inputs['jobs_n'] not be le_job.fit_transform(inputs['job']), and similarly inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])?
    The video used the le_company object for all three encodings. Is it not necessary to create three different objects? If yes, why were these objects not used?
    Please explain.
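
A minimal sketch of what happens when a single LabelEncoder is reused across columns. The values are made up; the point is that fit_transform refits the encoder on each call, so the encoded numbers come out right, but the object only remembers the last column it saw.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# First column: the encoder learns the company names.
company_codes = le.fit_transform(["google", "facebook", "google"])

# Second column: fit_transform refits, replacing the company mapping.
job_codes = le.fit_transform(["manager", "programmer", "manager"])

# The encoder now only knows the job categories.
remembered = list(le.classes_)
print(company_codes, job_codes, remembered)
```

Keeping one encoder per column matters mostly when you later need to encode new samples or decode predictions for each column separately.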

  • @MunnaSingh-dx3or
    @MunnaSingh-dx3or 4 years ago +1

    Simple explanation, thank you! The exercise you gave got a score of 98.18%... And it's predicting pretty well 👍 Thank you once again

    • @niyazahmad9133
      @niyazahmad9133 4 years ago

      Best_params_ plz

    • @user-fz9ni1ff6x
      @user-fz9ni1ff6x 4 years ago

      This is unbelievable. I saw someone use Random Forest, SVM, Gradient Boosting etc., and the best score on testing data was 84%. With a simple Decision Tree, the best score would be around 82%, I think.

  • @aakashp7808
    @aakashp7808 4 years ago +1

    I didn't get the label encoder part. Could you explain that in a comment?

  • @AnilAnvesh
    @AnilAnvesh 2 years ago +1

    Thanks for this video. I have used train and test csv files of titanic. Cleaned both datasets and implemented Decision Tree Classifier and got a test score of 0.74 ❤️

    • @codebasics
      @codebasics 2 years ago

      That’s the way to go Anil, good job working on that exercise

  • @mingzeli1770
    @mingzeli1770 3 years ago

    Should we drop the NA rows in the exercise? The ages are not correlated with each other, and in my opinion fillna with the mean value may affect the accuracy of the final model.
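
For comparison, a minimal pandas sketch of the three options raised in this thread (mean fill, median fill, dropping rows). The ages are made up for illustration, and which option scores best on the real dataset is something to test rather than assume.

```python
import pandas as pd

# Made-up Age column with two missing values.
ages = pd.Series([22.0, None, 38.0, 4.0, None, 60.0])

# Option 1: fill missing values with the mean of the known ages.
mean_filled = ages.fillna(ages.mean())

# Option 2: fill with the median, which is less sensitive to outliers.
median_filled = ages.fillna(ages.median())

# Option 3: drop the rows with missing ages (fewer training rows).
dropped = ages.dropna()

print(mean_filled.tolist())
print(median_filled.tolist())
print(len(dropped))
```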

  • @MultiBoringlife
    @MultiBoringlife 3 years ago

    Hi sir, for the titanic dataset I did not use the train/test split. I used the entire dataset as training data and I am getting a 0.9845 model score using the decision tree algorithm. Is this the correct way of doing prediction? My final row count came down to 714 from the original 891 rows after removing NaN values.

  • @kabirnarayanjha
    @kabirnarayanjha 5 years ago +2

    Wohooooo once again new video thank you so much sir

  • @regithabaiju
    @regithabaiju 3 years ago +1

    Thanks for sharing this awesome video. I have learned more about ML using this.

  • @aaditstudent
    @aaditstudent 1 year ago

    Will I get better accuracy if I use a one-hot encoder instead of a label encoder?

  • @aadarsh14
    @aadarsh14 5 months ago

    Thank you very much for this course! Super helpful. I was able to get an accuracy of 83.24%

  • @ShubhamKumar-sn9ri
    @ShubhamKumar-sn9ri 2 years ago

    Can we use the get_dummies method instead of label encoding?

  • @sdc5574
    @sdc5574 4 years ago

    How do I download the csv file? Or how do I read the csv file from its URL?

  • @ghzich017
    @ghzich017 2 years ago +1

    At 7:28, why do you have to create multiple LabelEncoder() objects?