Titanic Survival Prediction in Python - Machine Learning Project

  • Published: Feb 8, 2025

Comments • 115

  • @timvielhauer1231 1 year ago +70

    The latest pandas version no longer ignores string values in the .corr() function. Just add numeric_only=True and it will work again.

    • @ronie-i1q 1 year ago +4

      thank you so much! i was looking how to resolve this issue

    • @hk6926 9 months ago

      For people who are dumb like me, here is what it means :) sns.heatmap(train_data.corr(numeric_only=True), cmap='YlGnBu')

    • @crux_X_shh 8 months ago +2

      Thank you so much bro I was trying to solve this for 2 days continuously and nothing worked..🥹

    • @moody_moony123 7 months ago +2

      thank you life saver!

    • @white-ts5np 1 month ago

      import seaborn as sns
      sns.heatmap(titanic_data.corr(numeric_only=True), cmap="YlGnBu")
      plt.show()
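
The fix discussed in this thread can be checked without the plotting step. The sketch below uses a small made-up DataFrame (the column values are assumptions, not the real Kaggle data) to show that `numeric_only=True` makes `corr()` skip the string column instead of raising.

```python
import pandas as pd

# Toy stand-in for the Titanic training data; the values are invented.
titanic_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Name": ["A", "B", "C", "D", "E"],  # string column that breaks corr() in pandas 2.x
    "Fare": [7.25, 71.28, 7.92, 13.0, 53.1],
})

# Restrict the correlation matrix to numeric columns only.
corr = titanic_data.corr(numeric_only=True)
print(corr.columns.tolist())
```

The resulting matrix can then be passed to `sns.heatmap` exactly as in the comments above.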

  • @muratsahin1978 2 years ago +16

    I was pretty confused when I saw 100% accuracy lol, thanks for the explanation.

    • @MohammedAhmed-y9r 6 months ago +1

      I knew right away it was cheating, especially since the data contains the specific names of the people on the Titanic.

  • @saya5664 2 years ago +13

    Great tutorial video! helped me to understand how pipeline in ML works, hope there will be more Kaggle competition walkthrough like this from you soon! :)

  • @paralogyX 3 years ago +14

    Good video, but: 1) What was the purpose of the test set? You didn't use it to evaluate your model; you used cross-validation instead. 2) You shouldn't fit StandardScaler on the Kaggle test set. Only transform it with the same scaler you fitted on the training data, because if the features are distributed a bit differently, the scaling will differ and your model will get different numbers for an otherwise identical passenger. It would be nice if you paid attention to these details, because they are really important. But generally, the video is nice and useful.

    • @jaysoncastillo2593 1 year ago

      Got the same comment. Test set shouldn’t be fitted anymore but only transformed.

    • @jaysoncastillo2593 1 year ago

      Do you know any yt channel solving the titanic dataset for reference?

    • @JunaidAnsari-my2cx 5 months ago

      @@jaysoncastillo2593 Did u find anything?
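
The scaling point raised in this thread can be illustrated with a minimal sketch. The ages below are made up; the point is only that a 30-year-old passenger gets the same scaled value as in training when the training scaler is reused, and a different one when a new scaler is fitted on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented single-feature data: ages in the training and test sets.
train_ages = np.array([[20.0], [30.0], [40.0]])
test_ages = np.array([[30.0], [50.0], [70.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_ages)  # learn mean and std from training data
test_scaled = scaler.transform(test_ages)        # reuse them, no refitting

# What happens if the test set gets its own scaler instead.
refit_scaled = StandardScaler().fit_transform(test_ages)

print("age 30 in training:       ", train_scaled[1, 0])
print("age 30 in test, transform:", test_scaled[0, 0])
print("age 30 in test, refitted: ", refit_scaled[0, 0])
```

With `transform`, the same age maps to the same number in both sets; with a refit, it does not, which is the inconsistency the comment describes.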

  • @benjamindeporte3806 2 years ago +1

    Nice "real life" example of the scikit pipeline. Helped me a lot, thanks.

  • @jaym0ney_ 3 years ago +6

    This is a great video, I’ve been trying to find a good place that would show the code behind creating a basic ML pipeline, or show some beginner feature engineering and whatnot, but I haven’t found anything as straightforward as this. A lot of other people have a lot of fluff in their tutorials, but you just show it straight up, which I really appreciate. Do you have any recommendations for textbooks/articles for a beginner wanting to get into Machine Learning? I have a strong math/programming background, so that’s not an issue, I just need something that will comprehensively explain all the main components of making an ML project. Thanks in advance and keep up the good work!

  • @ishansharma5787 3 hours ago

    Thanks for the video. Could you do a deep dive into the current solutions and make a video on how to iterate on the current solution to get to an even better prediction accuracy?

  • @shashvatsinghal2574 2 years ago +1

    This is the best video I have ever watched on data science and ML to date.

  • @cryptigo 3 years ago

    This is actually such a good idea. A lot of python program / resume ideas are boring. Thanks!

  • @jomp6141 9 months ago

    Man your video was awesome. Easy to follow and replicate, plus you explain the key insights for those of us who have only a little knowledge of data analysis. Thanks a lot!

  • @statistikochspss-hjalpen8335 1 year ago +4

    11:45 You can't use Pearson correlation coefficient for nominal/ordinal data.
    12:49 you need to create dummy variables for each class.

    • @unfff 1 year ago

      Hey, I see he addresses the Pearson correlation coefficient issue later on, where he uses one-hot encoding to turn the data from ordinal to discrete. Is there a better way to visualize correlation even when you use this method? Or would doing the one-hot encoding first and then the correlation heat map be best practice?

    • @statistikochspss-hjalpen8335 1 year ago +2

      @@unfff doing one hot encoding and choosing the right correlation coefficient are two separate things. One hot encoding has nothing to do with correlation analysis. One hot encoding is just a transformation of a variable that can be used for multiple purposes.

    • @Summer-of8zk 1 year ago +1

      To fix the fact that corr() doesn't work with words, you can do "df.corr(numeric_only=True)", where df is your data. That will give the correlation for your data, but you do lose the non-numeric columns.

    • @statistikochspss-hjalpen8335 1 year ago

      @@Summer-of8zk You are talking about a technical solution. What do you mean by "doesn't work"? Every statistical software package will produce a correlation coefficient as long as your columns contain digits. I'm talking about what's theoretically (in)correct.
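
As a minimal sketch of the "dummy variables for each class" suggestion from 12:49, assuming a toy frame with invented values, `pd.get_dummies` produces one indicator column per category. Whether Pearson is then the appropriate coefficient remains the separate, theoretical question raised in this thread; the code only shows the mechanics.

```python
import pandas as pd

# Invented rows; only the column names follow the Titanic dataset.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Embarked": ["C", "S", "C", "S", "Q", "S"],
})

# One 0/1 indicator column per embarkation port.
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)
encoded = pd.concat([df[["Survived"]], dummies], axis=1)

# Correlation of each indicator with the target (mechanics only).
print(encoded.corr()["Survived"])
```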

  • @RivinduBRO 6 months ago

    thankyou very much for this tutorial cuz i was like mentally down as i got 0.75 accuracy at my first try and also there were many people with 1.0 accuracy. so i was thinking why i can't. but now i understood the thing. thankyou soo much for this lesson.

  • @aflahalabri6331 1 year ago

    I don't think there was a need to create the AgeImputer class, at least in the latest versions; using the SimpleImputer class directly is probably sufficient. But it's a good learning tip on how to create a custom class.
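
A minimal sketch of that suggestion, assuming the custom class only fills missing ages with the column mean (the ages below are invented):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented ages with one missing value.
X = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 30.0]})

# SimpleImputer learns the mean of the observed ages and fills the gap with it.
imputer = SimpleImputer(strategy="mean")
X["Age"] = imputer.fit_transform(X[["Age"]]).ravel()

print(X["Age"].tolist())
```

Because `SimpleImputer` is already a scikit-learn transformer, it can be used as a pipeline step as is; with several columns it would typically be wrapped in a `ColumnTransformer`.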

  • @valentinmagis6743 1 year ago +2

    Why are you scaling the variables when using a tree-based model? Scaling is done to Normalize data so that priority is not given to a particular feature. Scaling is mostly important in algorithms that are distance based and require Euclidean Distance. Random Forest is a tree-based model and hence does not require feature scaling.
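
One way to check this claim empirically is to train the same forest on raw and on standardized features and compare the predictions. The sketch below uses synthetic data (an assumption, not the Titanic set) and fixed seeds so both forests draw the same bootstrap samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic features and a simple rule-based label.
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 3)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed; only the feature scale differs.
raw_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
scaled_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)

agreement = np.mean(raw_model.predict(X) == scaled_model.predict(X_scaled))
print(f"Share of identical predictions: {agreement:.3f}")
```

Standardization is a monotonic transformation of each feature, so the split thresholds move but the resulting partitions should stay essentially the same.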

  • @jeremyheng8573 2 years ago

    Thank you for great tutorial! Do you have more Kaggle competition walkthrough?

  • @vivekthumu8992 1 year ago

    Thank u so much for providing this video helped me to understand a lot

  • @soorajsridhar3279 1 year ago +6

    I followed the code as said in the video and came across an error when we fit_transform with the strat_test_set. The error was that the 'Embarked' column was missing. I think it is because we drop it in featuredropper function, but in the pipeline as we process it all over again , I guess we get this error. Can you help me fix it asap???

  • @fizipcfx 3 years ago +3

    This is strange but, if you add the name length as a column it helps. The name length has 0.332350 correlation with the Survived column :)

    • @paralogyX 3 years ago +4

      Correlation is not causation. Very good example!
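
For anyone who wants to try the name-length idea, a minimal sketch follows. The two rows are only illustrative and `NameLength` is a hypothetical column name; the 0.33 figure quoted above would have to be checked on the full training set.

```python
import pandas as pd

# Two illustrative rows in the Kaggle "Name" format.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Survived": [0, 1],
})

# Hypothetical engineered feature: number of characters in the name.
df["NameLength"] = df["Name"].str.len()
print(df[["Name", "NameLength"]])
```

On the real data, `df[["NameLength", "Survived"]].corr()` would give the correlation the comment refers to.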

  • @armantech5926 1 year ago

    Great Video, thank you!

  • @wasgeht2409 2 years ago +1

    Thank you... I have one question: why did you pick these models? Which KPIs do you base your choice of model on for different kinds of problems? That would be very interesting for me.

  • @yogeshwarkethepalli4234 1 year ago

    I get this error message: "sparse matrix length is ambiguous; use getnnz() or shape[0]". How can I solve this? It is raised by the for loop in this code:
    column_names = ["C", "S", "Q", "N"]
    for i in range(len(matrix.T)):
        X[column_names[i]] = matrix.T[i]

    • @wbdhh317 1 year ago

      me too how to solve
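
This error typically means `matrix` is still a SciPy sparse matrix, which is what `OneHotEncoder` returns by default. A minimal sketch of one fix, assuming the matrix comes from `OneHotEncoder.fit_transform`, is to convert it with `.toarray()` before looping. The `Embarked_` column names are an illustrative choice built from `encoder.categories_`, not the names used in the video.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented rows; only the column name follows the Titanic dataset.
X = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

encoder = OneHotEncoder()
# .toarray() turns the sparse result into a dense NumPy array, so len() works.
matrix = encoder.fit_transform(X[["Embarked"]]).toarray()

# Take the names from the encoder so they match its column order.
column_names = [f"Embarked_{c}" for c in encoder.categories_[0]]
for i in range(len(matrix.T)):
    X[column_names[i]] = matrix.T[i]

print(X)
```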

  • @supremenp 1 year ago +7

    sns.heatmap(titanic_data.corr(), cmap="YlGnBu")
    plt.show()
    This gives error: could not convert string to float: 'Braund, Mr. Owen Harris'
    shouldn't the titanic_data.corr() drop the string columns automatically?

    • @heisgiovann 1 year ago

      How did you solve this error?

    • @unfff 1 year ago +16

      Use sns.heatmap(titanic_data.corr(numeric_only=True), cmap="YlGnBu") instead of sns.heatmap(titanic_data.corr(), cmap="YlGnBu") at 11:50.
      I assume numeric_only defaulted to True when this video was made and the default was changed later. The correlation function can't compute a correlation for anything that isn't quantitative, so you have to tell it to look only at numerical features.

    • @TheShakour 1 year ago

      @@unfff tnx bro... it helped

    • @sushre10 10 months ago

      yes this same error exist to me also

    • @mahis7232 10 months ago

      @@unfff tysm 🥰

  • @Animax590 1 year ago

    I just used logistic regression and got 0.7655 taking only gender & Pclass. Thanks for your clarification about 100% accuracy though.

  • @novagamings4505 1 year ago

    I am new in the field of data science in terms of experience. I have completed paid skill course from IBM though. In my first attempt of this project which is my first project i got an accuracy of 78%. Is it good enough and should i move on to next project or try to refine my model for better accuracy. Please suggest someone with experience

  • @Warclimb64 8 months ago

    I had a problem at 42:05.
    I solved it by selecting only the numeric columns:
    X_test_numeric = X_test.select_dtypes(include=[np.number])

    • @SaurabhSah-x7w 6 months ago

      Bro, how did you solve the problem at 32:00? 🙄

    • @SaurabhSah-x7w 6 months ago

      Can you help me by sharing the code you used to solve it?

    • @Warclimb64 6 months ago

      @@SaurabhSah-x7w Yeah sure, i dont remember right now, but i will check my code tomorrow and write you back

  • @pravachanpatra4012 3 years ago +2

    Can you make a tutorial on an AI that plays a game using the NEAT module in python and pygame???

  • @mertmunuklu7732 1 year ago

    Thanks, it is a great tutorial

  • @paulbuono5088 1 year ago +1

    Interesting where at 15:10 you said you don't want to look too much at your training set so you don't get biased. It seems everyone else I hear says to examine it as much as possible....is there something I'm misinterpreting from you or them?

    • @alimemon9942 11 months ago

      He said testing dataset not the training dataset.

  • @marcosamuel17 26 days ago

    I'm stuck in the following code:
    X_final_test = final_data
    X_final_test = X_final_test.fillna(method="ffill")
    scaler = StandardScaler()
    X_data_final_test = scaler.fit_transform(X_final_test)
    the message error: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
    What should i do guys?
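
The warning itself names the replacement: call `ffill()` directly instead of `fillna(method="ffill")`. A minimal sketch with an invented `final_data` frame:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the processed Kaggle test data.
final_data = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Fare": [7.25, 8.05, np.nan],
})

# Forward-fill missing values; equivalent to the deprecated fillna(method="ffill").
X_final_test = final_data.ffill()
print(X_final_test)
```

The rest of the snippet in the comment can stay as it is, although other comments on this page argue for reusing the scaler fitted on the training data rather than fitting a new one here.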

  • @ChristianA.Bradna 6 months ago

    I am confused as to when I should use fit_transform and when I should use transform only. Previously, I understood that when using the former, you are calibrating the estimator, so to speak, to a particular set of data, so that if you wanted to use that estimator later and have it perform in exactly the same way, you should not refit it but only use its transform method. In this video, however, you used fit_transform every time and still got it to perform the same on every data set. Could you tell me a little bit about how that works?

  • @shanondalmeida7235 1 year ago +3

    Correlation doesn't work for string values. How did you do it? 🤔

  • @philjoseph3252 10 months ago

    Is there a difference between one-hot encoding in pandas and in sklearn? The process is so much easier with pandas; is there a particular reason why he used sklearn?
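
One practical difference, sketched below under the assumption that the training and test sets do not contain the same categories: `pd.get_dummies` encodes only what it sees in the frame it is given, while a fitted `OneHotEncoder` remembers the training categories and produces the same columns for any later data. The rows are invented.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Embarked": ["C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "S"]})  # "C" and "Q" never appear here

# pandas: the set of columns depends on the data passed in.
print(pd.get_dummies(train["Embarked"]).columns.tolist())
print(pd.get_dummies(test["Embarked"]).columns.tolist())

# sklearn: fit once on the training data, then reuse the same layout.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["Embarked"]])
test_encoded = encoder.transform(test[["Embarked"]]).toarray()
print(test_encoded.shape)
```

A model trained on three indicator columns needs three columns at prediction time, which is why the fitted encoder is the safer choice inside a pipeline.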

  • @tgmbrett 2 years ago +2

    At 32:00, how is he passing strat_train_set to the pipeline.fit_transform function when the variable doesn't exist yet?

    • @90cijdixke 1 year ago

      Did u find the answer?😬

    • @sayuri_20 8 months ago

      @@90cijdixke Did you find yet ?

  • @emmaoye2704 1 year ago +4

    Am I the only one stuck at 32:31? I keep getting this error: AttributeError: 'FeatureEncoder' object has no attribute 'transform'

    • @aidaosmonova4798 1 year ago

      could you solve this?

    • @lemanosmanli2006 8 months ago

      @@aidaosmonova4798 hi could you solve it?

    • @jeeaspirant7890 8 months ago

      Please tell how to fix this

    • @jamesrosicky2912 21 days ago

      The error points to a problem in the FeatureEncoder class, specifically its transform method: the variable column_names is used before it is defined in the second part of the method.
      Try running this code instead:
      from sklearn.base import BaseEstimator, TransformerMixin
      from sklearn.preprocessing import OneHotEncoder

      class FeatureEncoder(BaseEstimator, TransformerMixin):
          def fit(self, X, y=None):
              return self

          def transform(self, X):
              encoder = OneHotEncoder(handle_unknown="ignore")
              # Encode "Embarked". OneHotEncoder sorts categories alphabetically
              # (C, Q, S) and puts missing values last, hence "N" at the end.
              embarked_matrix = encoder.fit_transform(X[["Embarked"]]).toarray()
              embarked_column_names = ["C", "Q", "S", "N"]
              for i in range(len(embarked_matrix.T)):
                  X[embarked_column_names[i]] = embarked_matrix.T[i]
              # Encode "Sex" (sorted order: female, male)
              sex_matrix = encoder.fit_transform(X[["Sex"]]).toarray()
              sex_column_names = ["Female", "Male"]
              for i in range(len(sex_matrix.T)):
                  X[sex_column_names[i]] = sex_matrix.T[i]
              return X

  • @yashtysingh1171 1 year ago +2

    Sir my updated sklearn version doesn't have fit_transform.. Please guide what should I do!

  • @TheErick211_ 10 months ago

    Is there a video in which you give a deep explanation of classes, __init__ and everything related to these methods?

    • @rizwan_sayyad 5 months ago

      Yes u search for OOP in python

  • @MohammedAhmed-y9r 6 months ago

    Why did you fit your pipeline on the test.csv data

  • @lemanosmanli2006 8 months ago +1

    Hello, thanks for this video, but strat_train_set = pipeline.fit(strat_train_set) gives an AttributeError: 'DataFrame' object has no attribute 'toarray'

  • @谷歌账户-d2d 9 months ago

    Thank you for your teaching video, it is very good for beginners.

  • @abhinavchoudhary6849 3 years ago

    Awesome bro

  • @angelamaharjan2054 2 months ago

    Does anyone know how to do MSE error for this dataset?

  • @jsemslava7880 1 year ago

    A little bit fast(especially typing xD), but good tutorial; I got 79,42%, thanks!

  • @AzureCz 3 years ago

    I'm curious, how do I know the accuracy percentage inside the notebook, comparing the prediction with the dataset that we have, and not just uploading to kaggle.
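
One common way to get an accuracy figure inside the notebook is to hold out part of the labelled training data and score predictions against it. The sketch below uses synthetic features and labels (an assumption, standing in for the processed Titanic columns) and `accuracy_score` from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed features and the "Survived" labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

# Keep 20% of the labelled rows aside for evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")
```

The Kaggle test.csv has no labels, so it cannot be scored this way; only a held-out slice of the training data (or cross-validation) can.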

  • @TheErick211_ 10 months ago

    Can we download your Jupyter notebook from somewhere?

  • @TheNewfacto 1 year ago

    I just submitted mine today and I got a score of 0.78229 but then I saw all those 1s and I was like "just how did they do that"😂

  • @juanmariomorenochaparro127 1 year ago

    Thanks, very interesting video, new subscriber.

  • @Vikraman99 5 months ago

    The Embarked column in the test set has no N value and I am not able to use your pipeline code because of it. Is there a way to overcome this?

    • @Vikraman99 5 months ago

      Ok, got it: I didn't write errors="ignore" in the FeatureDropper section.
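
A minimal sketch of what such a dropper might look like. The list of dropped columns is an assumption for illustration, not the exact list from the video; the relevant part is `errors="ignore"`, which lets `DataFrame.drop` skip columns that are absent, such as an "N" column that never gets created when no Embarked value is missing.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureDropper(BaseEstimator, TransformerMixin):
    """Drops columns that should not reach the model (hypothetical column list)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # errors="ignore" skips names that are not present instead of raising KeyError.
        return X.drop(
            ["Embarked", "Name", "Ticket", "Cabin", "Sex", "N"],
            axis=1,
            errors="ignore",
        )


# Invented row that lacks most of the columns listed above.
df = pd.DataFrame({"Age": [22.0], "Sex": ["male"], "Embarked": ["S"]})
dropped = FeatureDropper().fit_transform(df)
print(dropped.columns.tolist())
```

`fit_transform` is not defined in the class itself; it comes from `TransformerMixin`, which calls `fit` and then `transform`.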

  • @cristhianriverajurado7497 2 years ago +1

    I was following your tutorial and got this error: "ValueError: Input contains NaN" after this line: strat_train_set = pipeline.fit_transform(strat_train_set)

    • @yashp5341 2 years ago

      I got the same error, did you perhaps get the answer?

    • @francoramirezcastillo8075 2 years ago

      @@yashp5341 I solved it, but I don't know if you have the same error. It kept pointing at this line: X[column_names[i]] = matrix.T(i), which should be X[column_names[i]] = matrix.T[i]. I had to replace the parentheses with square brackets [ ]. I hope it helps.

  • @komalrehman7173 9 months ago

    I am getting an error on the strat data step, and after that everything gives an error. Can anyone explain why?

  • @anotherone8256 3 years ago

    Nice video.

  • @dragosdalta4317 1 year ago

    Can't import BaseEstimator, can anyone help?

  • @ParthivShah 8 months ago

    nice

  • @kianestrera-hr5vt 8 months ago

    I see, they were probably cheating. I lost confidence when I saw some 100% scores while I only got 0.76, which I think is not bad.

  • @whilstblower901 1 year ago

    Give the notebook

  • @mtk-0_0 1 year ago

    decent vid

  • @pogus3229 3 years ago +4

    lol

    • @HypnosisBear 3 years ago +1

      Even I laughed at the title.

  • @HypnosisBear 3 years ago +2

    Lol

  • @aleks.na.vse.100 3 years ago

    Very interesting. But please translate your video in Russian

    • @quasii7 3 years ago +1

      No offence, but the generally accepted language of computer science is English. It would be hard to translate everything, and I am saying this as a non native speaker.

    • @aleks.na.vse.100 3 years ago +2

      @@quasii7 oh, okay then

    • @paralogyX 3 years ago

      I am also Russian, but all computer science literature etc is mostly in English, so better to get used to it.
