Tutorial 11 - Exploratory Data Analysis (EDA) of Titanic dataset

  • Published: 9 Feb 2025
  • Here is a detailed explanation of Exploratory Data Analysis of the Titanic dataset. Finally, we apply Logistic Regression to predict the Survived column.
    Github url: github.com/kri...
    References from: Jose Portilla's EDA materials and Kaggle
    ⭐ Kite is a free AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while you're typing. I've been using Kite for a few months and I love it! www.kite.com/g...
    Stats playlist: • Population vs Sample i...
    You can buy my book, where I have provided a detailed explanation of how we can use Machine Learning and Deep Learning in finance using Python.
    Packt url: prod.packtpub....
    Amazon url: www.amazon.com...

Comments • 290

  • @aakritiroy7336
    @aakritiroy7336 4 years ago +70

    After so much struggle with my LMS, I was finally able to understand the entire EDA within 30 minutes. Thank you.🙏👍

  • @VVV-wx3ui
    @VVV-wx3ui 5 years ago +25

    Doing the job of a true Guru; Ekalavyas are all around, raring for such knowledge-impartation. Thanks much, Krish.

  • @Esha25ghosh
    @Esha25ghosh 4 years ago +15

    You are awesome sir! Not only are you a great mentor, but also a great motivator. Thanks for all the great work you have been doing. Stay blessed!

    • @chaos8514
      @chaos8514 2 years ago

      I am learning this for a data analyst role, but I'm not sure what more I should learn to get a job ASAP. If you can help, we can connect on Instagram.

  • @classicemmaeasy2292
    @classicemmaeasy2292 2 years ago +2

    I was trying to understand data analysis with Python just a couple of days ago.
    You actually make it simpler and beginner-friendly. More unction to function, sir.

  • @souvikdas3905
    @souvikdas3905 5 years ago +3

    What a beautiful video for a beginner who is just getting his hands on data science.

  • @aayushshukla342
    @aayushshukla342 6 months ago

    Loved the video; in fact, the entire playlist gives an amazing approach to the intricacies of Machine Learning. Thank you, Sir.

  • @sunnychandra5064
    @sunnychandra5064 5 years ago +6

    You have actually cleared up the EDA concept for me. Thanks a lot!!

    • @ShivamChaudhary-jn4kw
      @ShivamChaudhary-jn4kw 1 year ago

      Why are 0 and 1 used for cols when the column indexes are 2 and 5? Can you clarify?

  • @aliakbarrayhan6389
    @aliakbarrayhan6389 5 years ago +3

    Sir, I'm very impressed by such an amazing video. Though I am very weak in programming, I now feel I should start my programming journey again, because I have someone like you who can explain anything in a very simple way.

  • @thePrabhuChannel
    @thePrabhuChannel 4 years ago +30

    21:30 The median age of passengers travelling in each Pclass can be calculated with the code below instead of reading the boxplot and guessing the number.
    df[df['Pclass']==1]['Age'].median()
    df[df['Pclass']==2]['Age'].median()
    df[df['Pclass']==3]['Age'].median()
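
    The same three medians can also be computed in one call with groupby; a minimal sketch, assuming df is the Kaggle train DataFrame (the filename below is only illustrative):
    import pandas as pd
    df = pd.read_csv('titanic_train.csv')        # hypothetical path to the Kaggle train file
    print(df.groupby('Pclass')['Age'].median())  # median age for each passenger class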

    • @viveksingh881
      @viveksingh881 4 years ago +2

      Good one, brother. I was thinking the same: why guess it when we can actually calculate it?

    • @tusharmahuri2439
      @tusharmahuri2439 3 years ago

      An error comes up when I try to use sns.countplot: "could not interpret input 'survived'".

    • @yashikaarora8573
      @yashikaarora8573 2 years ago +1

      @@tusharmahuri2439 Bro, copy the headers from the dataset instead of typing them; the column names are case-sensitive.
      It is 'Survived', not 'survived'.
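
      For reference, a minimal sketch of the corrected call, assuming the DataFrame is named train as in the video (the filename is illustrative):
      import pandas as pd
      import seaborn as sns
      import matplotlib.pyplot as plt
      train = pd.read_csv('titanic_train.csv')  # hypothetical path to the Kaggle train file
      sns.countplot(x='Survived', data=train)   # column names are case-sensitive: 'Survived', not 'survived'
      plt.show()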

  • @vital4statistix
    @vital4statistix 3 years ago

    Krish, This material is FIRST CLASS. Appreciate it very much.

  • @sudeeprajput21
    @sudeeprajput21 3 years ago +1

    You are amazing brother. Your videos are helping me gain confidence in ML. Keep up the good work

  • @MuhammadAwais-n2b
    @MuhammadAwais-n2b 4 months ago +1

    3:37 That ad, hahahaha. Great learning experience, love you brother.

  • @imranullah7355
    @imranullah7355 4 years ago +1

    Thanks a lot, Sir... You've explained it in a great way... Love from Pakistan.

  • @PiyushSingh-cq2xv
    @PiyushSingh-cq2xv 3 years ago

    This is one of the best datasets for understanding how to fix nulls. Great job, and thank you.

  • @akanshabhandari1062
    @akanshabhandari1062 4 years ago

    Very helpful. You did a lot of hard work for us. Thank you so much, sir 🙌🙌🙏🙏. And your way of teaching, starting from the basics, is very good.

  • @ManishKumar-gg2vm
    @ManishKumar-gg2vm 5 years ago +6

    Awesome explanation. I really can't stop myself from commenting on this video. One of the great videos on data visualization.

  • @vinothv8514
    @vinothv8514 5 years ago +3

    Nice work Mr. Krish...... It's really helpful

  • @VengalraoPachavaedu
    @VengalraoPachavaedu 6 years ago +3

    I have seen some of your videos, excellent work. I really appreciate your work Mr. Krish Naik.

  • @premkishanmishra1574
    @premkishanmishra1574 1 year ago

    Loved your video, far better than the uni teachers :P

  • @GauravVerma-jk6cf
    @GauravVerma-jk6cf 3 years ago

    This was really some of the most useful stuff available!

  • @theayodejipopshow
    @theayodejipopshow 2 years ago

    This video is amazing. Thanks so much for sharing your wealth of knowledge.

  • @girishmahamuni1830
    @girishmahamuni1830 4 years ago

    Thank you for providing knowledge in a simple way.

  • @rupeshnandanyadav8108
    @rupeshnandanyadav8108 3 years ago

    Awesome tutorial on Exploratory Data Analysis ❤️❤️

  • @aination7302
    @aination7302 4 years ago +9

    Neither imputing nor simply dropping missing values (NaN) is good practice with real-world data. A better approach is to derive a new field indicating missingness: 1 for missing, else 0, because sometimes a missing value is information in itself.
    Just sharing some learning from my job :)
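
    A minimal sketch of that missing-indicator idea, assuming a train DataFrame loaded from the Kaggle train file (filename illustrative):
    import pandas as pd
    train = pd.read_csv('titanic_train.csv')                   # hypothetical path
    train['Age_missing'] = train['Age'].isnull().astype(int)   # 1 where Age was missing, else 0
    train['Age'] = train['Age'].fillna(train['Age'].median())  # then impute, keeping the flag as an extra feature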

    • @okonvictor8711
      @okonvictor8711 2 years ago

      Hi, do you mind sharing how to do that here? Or can I reach you via email?

    • @waqarmehdi4394
      @waqarmehdi4394 2 years ago

      Yes, it depends upon the dataset and problem you want to solve. In this case, dropping the null value is the best possible option in my opinion.

  • @RajatSharma-ct6ie
    @RajatSharma-ct6ie 5 years ago +1

    Great work sir, learning a lot from your videos, please upload more videos on EDA..

  • @sowjanyadharmavarapu2653
    @sowjanyadharmavarapu2653 3 years ago +11

    Sir, I really liked your video. But according to the roadmap video, you asked us to watch Python lectures 1-24 first; in this EDA video you mention some new terms like get_dummies and a few others, so I got stuck on the last 10 minutes of the explanation. Everything else is really clear and understandable. Thanks for all the effort.

    • @dynamictechnocrat
      @dynamictechnocrat 2 years ago

      get_dummies is used in pandas.

    • @ashridas9896
      @ashridas9896 2 years ago

      It is basically one-hot encoding.
      Encoding techniques are used to convert categorical data into numerical data.
      Here it is applied to the 'Embarked' column.
      ruclips.net/video/OTPz5plKb40/видео.html
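
      A small sketch of what pd.get_dummies does to the categorical columns, assuming the train DataFrame from the video (filename illustrative):
      import pandas as pd
      train = pd.read_csv('titanic_train.csv')                     # hypothetical path
      sex = pd.get_dummies(train['Sex'], drop_first=True)          # one 0/1 column: male
      embark = pd.get_dummies(train['Embarked'], drop_first=True)  # two 0/1 columns: Q, S
      train = pd.concat([train, sex, embark], axis=1)              # append the encoded columns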

  • @pravinmore434
    @pravinmore434 4 years ago

    Thanks a lot for the very detailed lesson, Sir. It was really fruitful and helped me complete one of my projects. Thanks a ton.

  • @tumul1474
    @tumul1474 5 years ago +1

    This is beyond amazing... an amazing place to learn and to revise the important techniques.

  • @garvitjain4106
    @garvitjain4106 3 years ago

    @Krish You are doing an amazing job.

  • @AshishRoy
    @AshishRoy 2 years ago

    Very nicely explained. Awesome

  • @naveenrawat6505
    @naveenrawat6505 3 years ago

    loving the playlist :)))))

  • @MrKmdmustaq
    @MrKmdmustaq 5 years ago +7

    Can you please make a video on treating outliers? That would help us a lot in solving problems.

  • @venkatadeviprasadkankanala7387
    @venkatadeviprasadkankanala7387 5 years ago

    Very nice one, thank you very much for sharing valuable information.

  • @abhinavmahajan448
    @abhinavmahajan448 4 years ago

    Thanks for the detailed video. Really helpful :)

  • @ifhamaslam9088
    @ifhamaslam9088 4 years ago

    Superb explanations..
    And interesting to learn from.

  • @naveenrawat6505
    @naveenrawat6505 3 years ago

    Great video :)
    I have a suggestion:
    we can drop PassengerId to increase the accuracy score, because it doesn't contribute to the dependent variable (see the sketch below).
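
    A minimal sketch of that suggestion, assuming the cleaned train DataFrame from the video (filename illustrative):
    import pandas as pd
    train = pd.read_csv('titanic_train.csv')             # hypothetical path; assume cleaning is already done
    X = train.drop(['Survived', 'PassengerId'], axis=1)  # drop the label and the uninformative ID column
    y = train['Survived']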

    • @tusharmahuri2439
      @tusharmahuri2439 3 years ago

      @naveen rawat
      An error comes up when I try to use sns.countplot: "could not interpret input 'survived'".

    • @naveenrawat6505
      @naveenrawat6505 3 years ago

      @@tusharmahuri2439 Show me the line of code.

  • @siddhisingh4713
    @siddhisingh4713 1 year ago +1

    Every time I import data, it shows the error "file not found":
    import pandas as pd
    data=pd.read_csv('C:\Users\Siddhi Singh\Desktop\Iris.csv')
    print(data)

    • @Kishor_D7
      @Kishor_D7 1 year ago

      Actually, you should restart the laptop, because if any file named pandas is found, an error will be encountered. Otherwise, download the dataset, upload it into your Jupyter Notebook, and copy the path from there.

    • @krishs7244
      @krishs7244 25 days ago

      You can try using Google Colab.
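
      Another common cause is the backslashes in the Windows path being read as escape sequences; a hedged sketch of the usual fixes (raw string or forward slashes):
      import pandas as pd
      # raw string so '\U' and '\D' are not treated as escape sequences
      data = pd.read_csv(r'C:\Users\Siddhi Singh\Desktop\Iris.csv')
      # equivalently, forward slashes also work on Windows:
      # data = pd.read_csv('C:/Users/Siddhi Singh/Desktop/Iris.csv')
      print(data.head())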

  • @lavanyameesa6432
    @lavanyameesa6432 3 years ago

    Wonderful explanation.

  • @arniloy9358
    @arniloy9358 3 years ago

    There is another null left in the Embarked column, at the 831st entry. It still shows in the heatmap, while in the video it doesn't show (25:07).
    If I continue down this path, do I apply the same method used for the Age nulls (defining a function per class), or should I just set that one cell directly to a typical value by indexing the null, since it is just a single cell?

  • @pedrocrespo2681
    @pedrocrespo2681 4 years ago

    Pretty nice explanation !

  • @GreatHimalayanAsmr
    @GreatHimalayanAsmr 4 years ago

    Thank you, sir, it is very helpful 😊.

  • @saylisuryawanshi3989
    @saylisuryawanshi3989 4 years ago

    Great job, sir. Please do make more such videos for beginners to practise with.

  • @pandian3731
    @pandian3731 5 years ago

    Another great video, a very useful one bro, like the NLP ones. 📍

  • @ShubhamJain-in6sz
    @ShubhamJain-in6sz 4 years ago

    Great work sir!!👍🏻👍🏻

  • @unnatiraut9553
    @unnatiraut9553 2 years ago

    Great and easy to understand. Thanks a lot.

  • @warmachinex5330
    @warmachinex5330 3 months ago +1

    that notification in the 3:39 part 🤣🤣😂😂

  • @ganeshrao405
    @ganeshrao405 3 years ago

    Really helpful, thank you so much.

  • @ds-hy9nc
    @ds-hy9nc 4 years ago +1

    When I try to apply my function (23:20), it shows "unexpected EOF while parsing".
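
    That error usually just means a bracket or the function body was left unfinished. For comparison, a sketch of the kind of impute function used at that point in the tutorial, assuming the 'Age'/'Pclass' columns and the rough per-class medians of 37, 29 and 24 (filename illustrative):
    import pandas as pd
    train = pd.read_csv('titanic_train.csv')  # hypothetical path
    def impute_age(cols):
        # cols[0] is 'Age' and cols[1] is 'Pclass', in the order they are passed below
        age, pclass = cols[0], cols[1]
        if pd.isnull(age):
            if pclass == 1:
                return 37
            elif pclass == 2:
                return 29
            else:
                return 24
        return age
    # axis=1 makes apply() work row by row, handing each ['Age', 'Pclass'] pair to the function
    train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)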

  • @vinayaksharma6349
    @vinayaksharma6349 4 years ago +8

    Sir, how did you get to know that Age has a relation with Pclass (how, and which analysis did you do)?

    • @ashishmeher216
      @ashishmeher216 4 years ago

      @Vinayak sharma you can relate any column with any other column.

    • @SravanKumar-td5im
      @SravanKumar-td5im 3 years ago

      You could do a heatmap of all features and get their correlations, from which you can see which feature depends on what.
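
      A short sketch of that heatmap idea, restricted to numeric columns and assuming the train DataFrame (filename illustrative):
      import pandas as pd
      import seaborn as sns
      import matplotlib.pyplot as plt
      train = pd.read_csv('titanic_train.csv')        # hypothetical path
      corr = train.corr(numeric_only=True)            # pairwise correlations of the numeric columns
      sns.heatmap(corr, annot=True, cmap='coolwarm')  # Age and Pclass tend to be negatively correlated here
      plt.show()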

  • @sulaimankhan8033
    @sulaimankhan8033 4 years ago

    Krish, thank you for the EDA.
    Please throw some light on storytelling: if we had to conclude the EDA, then theoretically, in layman's terms, we must do the storytelling. Correct me if I am wrong.

  • @gkmadhav
    @gkmadhav 4 years ago +4

    Is there a part 2 and 3 for this video, about feature engineering on the same dataset?

  • @tusharikajoshi8410
    @tusharikajoshi8410 2 years ago +1

    Hey @Krish! Should we do this data visualization for each and every column, or do we do it after feature selection? If we are supposed to do it for each column, wouldn't the code get too big and complex for data with hundreds or thousands of features?

  • @umeshrbaidya
    @umeshrbaidya 4 years ago +4

    Great video, Sir. I just have two doubts: first, why did you not use get_dummies on "Pclass", since it is also categorical data? And second, why did you not normalize the "Fare" and "Age" columns, since their values might overpower the results?

    • @bharathb3946
      @bharathb3946 4 years ago +1

      Same doubt bro

    • @harshmakwana8001
      @harshmakwana8001 4 years ago

      If you type "train.info()" you will see the dtypes of all the columns. I don't know if this helps, but I think get_dummies() is meant for object columns, as they do not represent any numerical value the system can compute with; get_dummies() converts those objects into numerical indicator columns. Please correct me if I am wrong, as I am also confused about this; if you agree or have a different insight, please tell me.
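
      If one did want to one-hot encode Pclass and scale Age/Fare, a hedged sketch of that alternative (this is not what the video does; filename illustrative):
      import pandas as pd
      from sklearn.preprocessing import StandardScaler
      train = pd.read_csv('titanic_train.csv')                                     # hypothetical path
      pclass = pd.get_dummies(train['Pclass'], prefix='Pclass', drop_first=True)   # get_dummies works on numeric columns too
      train = pd.concat([train, pclass], axis=1)
      train['Age'] = train['Age'].fillna(train['Age'].median())                    # ensure Age has no NaNs before scaling
      scaler = StandardScaler()
      train[['Age', 'Fare']] = scaler.fit_transform(train[['Age', 'Fare']])        # put Age and Fare on a comparable scale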

  • @naveengoud3264
    @naveengoud3264 4 years ago

    Best explanation

  • @mssnal
    @mssnal 3 years ago

    Great one Krish. Basically covers most of the things a beginner needs to understand.

  • @ashishgoyal7020
    @ashishgoyal7020 3 years ago

    Thank you Krish.

  • @sghosh5904
    @sghosh5904 7 days ago

    Very simple and lucid videos; they encourage me to practice along without getting into too many details at the beginning. Superb, and stay blessed 👏
    One question: at timestamp 27:09, the code without dtype=int displays the boolean values in integer form, but in my case it shows True or False unless dtype is explicitly set to int. Is this something to do with Python/pandas updates?
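
    Most likely yes: newer pandas releases return boolean dummy columns by default, so passing dtype=int restores the 0/1 output seen in the video. A small sketch (filename illustrative):
    import pandas as pd
    train = pd.read_csv('titanic_train.csv')                                # hypothetical path
    embark = pd.get_dummies(train['Embarked'], drop_first=True, dtype=int)  # 0/1 instead of True/False
    print(embark.head())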

  • @piyush_paul_
    @piyush_paul_ 5 months ago +1

    3:35 the ad 🫠💀

  • @aaryangoyal5595
    @aaryangoyal5595 3 years ago

    best video

  • @RahulRoy-qy8rk
    @RahulRoy-qy8rk 4 years ago

    This was so helpful. Thank You

  • @Sab_Moh_Maya_Hal
    @Sab_Moh_Maya_Hal 4 years ago

    Very knowledgeable, thanks man :)

  • @mustafaraza6107
    @mustafaraza6107 5 months ago +1

    16:15 We now have displot() [without the t].
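
    Right: distplot is deprecated in recent seaborn versions, and displot or histplot are the usual replacements. A small sketch (filename illustrative):
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    train = pd.read_csv('titanic_train.csv')                 # hypothetical path
    sns.histplot(train['Age'].dropna(), bins=40, kde=False)  # axes-level replacement for distplot
    plt.show()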

  • @mohamedshathik8045
    @mohamedshathik8045 3 years ago

    Hi Krish,
    you didn't drop the PassengerId column before fitting the logistic regression model, even though it doesn't contain any useful information.

  • @Parshant17
    @Parshant17 3 years ago

    Are you sure that is the average in the boxplot near the 20th minute? When we talk about percentiles, the 50th percentile should be the median.

  • @devanshusharma9386
    @devanshusharma9386 5 years ago

    very helpful for beginners

  • @shubhamthapa7586
    @shubhamthapa7586 4 years ago +1

    I have a question: why is he not using the SimpleImputer class from scikit-learn, instead of finding a relation to fill the NaN values? We can easily do it through the sklearn module.
    And also, why isn't he using LabelEncoder for the binary values?
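
    Those sklearn tools would work too; a hedged sketch of that alternative (note it uses one global median, not the per-Pclass medians from the video; filename illustrative):
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import LabelEncoder
    train = pd.read_csv('titanic_train.csv')                                                # hypothetical path
    train['Age'] = SimpleImputer(strategy='median').fit_transform(train[['Age']]).ravel()   # fill NaN ages with the median
    train['Sex'] = LabelEncoder().fit_transform(train['Sex'])                               # female/male -> 0/1 (alphabetical order)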

  • @aradhyakanth8409
    @aradhyakanth8409 3 years ago

    Sir, what is the need to visualise the data in this problem? You haven't used any insight extracted from the visualisation to help with the data cleaning.

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks Krish

  • @honey9111
    @honey9111 4 years ago +1

    Thanks a lot, Krish. The EDA was well explained. I could not understand the last part, starting from the confusion matrix: how do we read the final result of the analysis?

  • @babupatil2416
    @babupatil2416 5 years ago +1

    Hi Krish,
    Please create some more videos on EDA, it will be helpful.

  • @abdullahkidwai7222
    @abdullahkidwai7222 2 years ago

    I am getting a KeyError after executing the following code:
    sns.distplot(train['Age'].dropna(),kde=False,color='darkred',bins=40)
    Any suggestion/idea as to what should be done to stop getting this error?
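
    A KeyError on train['Age'] usually means the column is not present under that exact, case-sensitive name (or was dropped earlier in the notebook); a quick check, assuming the train DataFrame (filename illustrative):
    import pandas as pd
    train = pd.read_csv('titanic_train.csv')  # hypothetical path
    print(train.columns.tolist())             # confirm 'Age' exists with this exact spelling and was not dropped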

  • @bharath_v
    @bharath_v 3 years ago

    Good One!

  • @Kk-gi4uw
    @Kk-gi4uw 6 months ago

    I understood everything up to splitting the training and output data, but from there the logistic regression and confusion matrix parts are very difficult to understand. I found a theoretical explanation of logistic regression, but I can't find videos explaining the code, the syntax, and its application. Could anyone help with links to understand these two concepts? Thanks.
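
    A condensed sketch of that final block, assuming X (features) and y (the Survived labels) have already been built from the cleaned train DataFrame as in the video; the variable names are illustrative:
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, accuracy_score
    # hold out 30% of the rows for testing, as in the video
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
    logmodel = LogisticRegression(max_iter=1000)  # max_iter raised to help convergence
    logmodel.fit(X_train, y_train)                # learn the survival model on the training split
    predictions = logmodel.predict(X_test)        # predict Survived for the held-out rows
    print(confusion_matrix(y_test, predictions))  # rows = actual 0/1, columns = predicted 0/1
    print(accuracy_score(y_test, predictions))    # fraction of correct predictions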

  • @anjalis4016
    @anjalis4016 2 years ago

    Sir, can we only use seaborn with the built-in datasets available in seaborn? After data cleaning I am unable to use seaborn; please help me.

  • @nabeelsj3631
    @nabeelsj3631 4 years ago

    Hi Krish,
    Upon analysing the Titanic data, I could see that there is a missing value in the Embarked column. Since so few values are missing, it was hard to find via visualisation. On checking the percentage of null values, I found the following:
    data.isnull().mean() * 100
    PassengerId 0.000000
    Survived 0.000000
    Pclass 0.000000
    Name 0.000000
    Sex 0.000000
    Age 19.865320
    SibSp 0.000000
    Parch 0.000000
    Ticket 0.000000
    Fare 0.000000
    Cabin 77.104377
    Embarked 0.224467
    dtype: float64
    Could you please confirm whether this can be ignored?
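
    If one wants to handle it rather than ignore it, a common option is to fill with the most frequent port; a minimal sketch, assuming the same data DataFrame:
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])  # fill with the most common value
    print(data['Embarked'].isnull().sum())                                  # should now be 0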

  • @jagadeeshabburi570
    @jagadeeshabburi570 3 years ago

    Kind of a fantastic video, bro, but it needs 2-3 watches for a crystal-clear understanding.

  • @buzzfeedRED
    @buzzfeedRED 1 year ago

    @Krish: Arrange your complete ML playlist videos into a roadmap playlist, from start to end, for becoming a data scientist.

  • @121horaa
    @121horaa 4 years ago +1

    Sir, I didn't get why you filled the missing Age values with the average age of each Pclass.
    Can't we simply replace the NaN values with the median of the Age column, as in: train['Age']=train['Age'].fillna(train['Age'].median())

    • @glenn8781
      @glenn8781 3 years ago +2

      In practical reality, every person has an age value, but that data is missing for some people in the Titanic dataset. Our goal is not just to fill in any random age where the age is missing but to fill in an educated guess/estimate of the missing age of a person so that it can be a close representative of the true ages of those people. Of course, like you mentioned, the median of the entire age column could be used as an estimate, but would that be a good representative value for ALL missing ages? Some people would have ages far above or below the median age. So on further exploration we notice that the median age for each Passenger Class is different, which would mean that in reality, people from a certain p-class would more likely be of a certain age than someone who belongs to another p-class. And this difference is considerable (37 vs 29 vs 24). So by using p-class to estimate age, we're just making a more educated guess for missing age values. You could of course go several steps further and consider other factors (like maybe SibSp, Parch etc.) in order to get a higher-probability age estimate.
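
      A compact way to express that per-class estimate, as a sketch (assuming the train DataFrame; filename illustrative):
      import pandas as pd
      train = pd.read_csv('titanic_train.csv')  # hypothetical path
      # fill each missing Age with the median Age of that passenger's Pclass
      train['Age'] = train['Age'].fillna(train.groupby('Pclass')['Age'].transform('median'))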

  • @asfandyarsaeed6402
    @asfandyarsaeed6402 2 years ago

    Hi Krish, do I need to do a Shapiro-Wilk test to check normality? The Age column is not normal if you apply this test to it.

  • @yashaskumargb3827
    @yashaskumargb3827 2 years ago

    Sir, the playlist is the best,
    but please share the link to the dataset you downloaded for every video,
    so that we can do what you explained in the video.

  • @diprajkadlag
    @diprajkadlag 3 years ago

    One note: in a boxplot, the middle line inside the box is the median value, not the mean value.

  • @gunjanmishra6673
    @gunjanmishra6673 3 years ago

    Hi, can you suggest some other dataset that can be used for practising all these functionalities?

  • @aryanrana5658
    @aryanrana5658 3 years ago

    My doubt is:
    when you apply that function to 'Age' and 'Pclass', what is the use of axis=1? Could you please explain that?

  • @joelbraganza3819
    @joelbraganza3819 4 years ago

    Why do we need dummy variables for binary class variables like Sex and Embarked, and why didn't we treat the variable Pclass with one-hot encoding? Is it because we are treating it as ordinal? But wouldn't that cause problems when applying linear regression and DNN algorithms over it? Let me know, Sir. Thanks.

  • @dipeshlimaje8998
    @dipeshlimaje8998 2 years ago

    Sir, I'm confused: we are predicting survival, so it is 0 and 1, which means it's categorical data, and yet we are solving it with "regression".

  • @MrArvindSaha
    @MrArvindSaha 4 years ago

    At In[26], in the boxplot results, for the straight line (2nd quartile, or 50th percentile) inside the rectangular box you say "mean value". Is it the mean or the median?

  • @bhavanshah1368
    @bhavanshah1368 3 years ago

    @Krish Naik: Hi Krish, could you please explain why Age is assigned cols[0] and Pclass cols[1]? I have not understood this.

  • @aasthasingh67
    @aasthasingh67 3 years ago +1

    How do you know exactly which plot to use for a given kind of result?

  • @balajiabhi9039
    @balajiabhi9039 3 years ago

    @Krish Naik, what is that test_size=0.30 and why did you use it? From the beginning of the video everything was very good, but at the end I couldn't understand X_train, y_train, test_size, what that accuracy of 0.7190 means, etc. Please tell me, sir, else your efforts will go to waste...

  • @matinpathan5186
    @matinpathan5186 4 years ago +2

    "Input contains NaN, infinity or a value too large for dtype('float64')" after logmodel.fit(x_train, y_train)... any solutions?

    • @shalinishashi3521
      @shalinishashi3521 3 years ago

      Actually, here the male, Q and S columns contain 884 null values, so if we remove these columns from train we can remove this error; alternatively, we can use a statistical concept such as mean or mode imputation to remove it. You can try this; hope you will be able to find your answer.
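
      Whatever the cause, a quick way to locate the offending values before calling fit, assuming X_train/y_train came from train_test_split earlier in the notebook:
      print(X_train.isnull().sum())                    # NaN count per feature column
      mask = X_train.notnull().all(axis=1)             # rows with no missing values
      X_train, y_train = X_train[mask], y_train[mask]  # drop those rows, keeping X and y aligned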

  • @yashkhilavdiya5693
    @yashkhilavdiya5693 2 years ago

    Thank You So Much

  • @hrcnszn
    @hrcnszn 2 years ago

    Totally unrelated to the topic, but how did you get your taskbar to look like that?

  • @fancy4926
    @fancy4926 4 years ago

    In some cases, I use label encoding etc. to change a character column into numbers. When using dtypes, it says that column is int32 (or int64 or float); I think it should actually be categorical so that I can use it for ML. Is it right that I should use astype('category') to convert the format and then I can use ML?

  • @vamshikrishna5333
    @vamshikrishna5333 3 years ago

    Hi, can anyone help me with the difference between the EDA of this Titanic dataset and the EDA for housing price prediction? They follow different steps and I am quite confused. Will really appreciate any help.

  • @kasturidas4081
    @kasturidas4081 3 years ago

    Where are the previous and next videos in this series?
    I couldn't find them.
    Someone help me, please.

  • @biswajitsahoo1542
    @biswajitsahoo1542 4 years ago

    Fantastic

  • @KimJennie-fl3sg
    @KimJennie-fl3sg 4 years ago +5

    20:20 Hey, umm... the 50th percentile gives us the MEDIAN age of people in 1st class... so we are using the MEDIAN value instead of the MEAN, right?
    Very helpful video for me to understand EDA.

    • @sharathkumar8422
      @sharathkumar8422 4 years ago

      You're right, 50%ile is the median. I think you should check out the definition of median and percentiles on this page - www.statisticshowto.com/probability-and-statistics/percentiles-rank-range/#:~:text=The%2050th%20percentile%20is%20generally,quartiles%20is%20the%20interquartile%20range.
      That should clear your doubt.

  • @adeniyi5875
    @adeniyi5875 1 year ago

    I like the video, but how did you know exactly which graphical representation to use? I mean, why countplot and not jointplot? Why a line plot and not a boxplot?
    I hope you really understand my questions, sir.

  • @rhitwijmukherjee7589
    @rhitwijmukherjee7589 4 years ago

    LogisticRegression(): nothing is showing inside the parentheses in the output. I tried your code, sir, but it's still not showing anything inside the parentheses. What is the problem?

  • @milindankur
    @milindankur 1 year ago

    Don't we do hypothesis testing on the dependency with respect to each of the features? I see we have taken all the features simply based on visual cues; is that a normal thing to do in data science? I thought data science folks perform rigorous feature engineering based on multiple t-tests/chi-squared tests/ANOVA/correlation etc. to statistically establish the dependencies.
    Please correct me if this is an incorrect assumption from the outside.