Data Analyst Portfolio Project | Correlation in Python | Project 4/4

  • Published: 21 Nov 2024

Comments • 506

  • @danielbristow6954
    @danielbristow6954 3 years ago +391

    Update: Alex, I just accepted my first job as a junior data analyst. This completes my 6-month journey to learn data analytics and change careers, and I could not have done it without your excellent Portfolio videos. Thank you so much for making these available to your viewers for free. After I built my portfolio, companies started taking a second look at my resume and inviting me to interviews. BEFORE the portfolio, I received ONLY rejection emails. Thank you, thank you, thank you!

    • @AlexTheAnalyst
      @AlexTheAnalyst 3 years ago +14

      Congratulations!!

    • @Datalover-Analysts
      @Datalover-Analysts 3 years ago +5

      I am getting only rejection

    • @danielbristow6954
      @danielbristow6954 3 years ago +26

      @@Datalover-Analysts Hi Pooja, I’m sorry. I know rejection can be discouraging. I received over 100 rejection emails from job applications before I finally started getting interviews. Without knowing the details of your situation, I can only encourage you to keep trying and don’t give up. Everyone’s journey to data is different, but I don’t think it is ever easy, especially if you’re trying to change careers, which is what I was doing. I wish you the best.

    • @Datalover-Analysts
      @Datalover-Analysts 3 years ago +2

      @@danielbristow6954 Sure, I am making portfolio with the help of Alex videos. Did some certification from coursera and Azure Fundamentals too

    • @thehash8
      @thehash8 3 years ago

      Hey Daniel, congratulations!!! Can you please also mention which certifications you did? I am building my portfolio with the help of Alex’s platform. I also completed my MSCS degree this month.

  • @neella97
    @neella97 2 years ago +75

    If anyone else is having issues due to IntCastingNanError, I advise to try the following:
    df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(int)
    df['gross'] = pd.to_numeric(df['gross'], errors='coerce').fillna(0).astype(int)
    it worked! :) Thank you Alex for your amazing videos!

    • @SearchingforScraps
      @SearchingforScraps 1 year ago +3

      Thank you !!! I almost gave up as i am not too versed in python to make these changes as the data for the original set he worked on has changed.

    • @esrakareem7071
      @esrakareem7071 1 year ago +1

      thank you ❤

    • @SquooHipPa
      @SquooHipPa 1 year ago +3

      Thank you so much for this, I was trying to google it before realizing it might be in the comments. If you have the time can you explain this part of the code? errors='coerce').fillna(0).astype(int)
      I did look it up but was getting a little confused by it. Thank you again :)

    • @IsmailAbdikadir-lk6lf
      @IsmailAbdikadir-lk6lf 1 year ago +1

      Thanks, this was a lot of help. I used this to avoid the int32 and it worked.
      df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(np.int64)

    • @victorbegnini5754
      @victorbegnini5754 1 year ago +1

      Top notch! Thanks
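
The question in this thread about what `errors='coerce').fillna(0).astype(int)` does can be answered step by step. The following is a minimal sketch, assuming a made-up three-value `budget` column (the values are hypothetical and only there for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one valid number, one missing value, one bad string.
budget = pd.Series([19000000.0, np.nan, "unknown"], dtype="object")

# Step 1: errors='coerce' turns anything that cannot be parsed as a number
# into NaN instead of raising an exception.
as_numbers = pd.to_numeric(budget, errors="coerce")

# Step 2: fillna(0) replaces every NaN with 0, so the column has no gaps.
no_gaps = as_numbers.fillna(0)

# Step 3: astype('int64') casts the gap-free floats to 64-bit integers.
# 'int64' is spelled out because a plain `int` can map to a 32-bit type
# on some platforms and NumPy versions.
budget_int = no_gaps.astype("int64")

print(budget_int.tolist())
```

Note that filling with 0 treats an unknown budget as a budget of zero, which can distort later statistics. Dropping those rows is the other option mentioned in these comments.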

  • @izzyinsync
    @izzyinsync 2 years ago +398

    Hello! The dataset appears to have been updated on Kaggle, so anyone new will run into some issues that need fixing in order to follow along.
    1. Missing data. Unlike in this video, the dataset now has missing values, so you will need to handle them. There are many ways to handle missing values, but for the sake of time I decided to drop all rows that have missing data. You will have about 71% of your data remaining. Run the following if your dataframe is named df.
    df = df.dropna()
    2. Extracting the year is different as the formatting is different. Running the following should extract the correct year.
    df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(int)
    3. Duplicates, there aren't any in this dataset so you should be fine on that.
    I hope this helps anyone that is working on this and best of luck on your analytics journey!

    • @Leo28039
      @Leo28039 2 years ago +7

      Sir you are a hero

    • @ishaangupta6915
      @ishaangupta6915 2 years ago

      Thank you. I just made the amendments after reading your comment and the other comments, and found a way to check and verify the status before and after execution. Thanks.

    • @_danfiz
      @_danfiz 2 years ago +2

      Thank you so much!

    • @aliceemma135
      @aliceemma135 2 years ago +1

      Thanks, I was stuck at extracting the correct year and now can finally solve it!

    • @RaNdUmBjAkE
      @RaNdUmBjAkE 2 years ago

      Thank you. I could not figure the year out.
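
The two fixes from this thread (dropping rows with missing values, then pulling the four-digit year out of `released`) can be sketched together. The rows below are hypothetical stand-ins for the updated dataset, not real records:

```python
import pandas as pd

# Hypothetical rows imitating the updated 'released' format.
df = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C"],
    "released": ["June 13, 1980 (United States)", None, "July 2, 1980 (United States)"],
    "gross": [46998772.0, 58853106.0, None],
})

# Drop every row that has at least one missing value and keep the result
# by reassigning it to df.
df = df.dropna()

# The pattern captures the first run of four digits in each string.
# expand=False returns a Series instead of a one-column DataFrame.
df["yearcorrect"] = df["released"].str.extract(r"([0-9]{4})", expand=False).astype(int)

print(df[["name", "yearcorrect"]])
```

Running `dropna()` first matters here: a missing `released` value would produce NaN after the extract, and the final `astype(int)` would then fail.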

  • @woahnelly3286
    @woahnelly3286 2 years ago +61

    Hey all, just a "stats" heads up/correction you might want to make for your portfolio:
    In this video, Alex wanted to see if the company was "correlated" with gross revenue. What he did was assign values (randomly, I think) to companies, countries, etc. Then he tried to see if those values were related to the gross revenue.
    Those randomly assigned values are "measuring" the company, country, etc at the Nominal scale. In other words, they're essentially just being used as a numeric "name"-the values themselves don't mean anything. What that means is that one value being higher than the other doesn't represent an increase in the thing being measured (for example, the USA was assigned a 54 and the UK was assigned a 53. Those are just names... the USA isn't one more of something than the UK).
    Because the values themselves don't represent anything, it doesn't make sense to do a correlation with them.
    Correlations tell us, as one variable increases, what happens to the other? So in the first question, as the budget increased, what happened to revenue? It increased. But with country, company, or other categorical variables, correlations don't make sense. The values for country and company are random, so the numbers that represent them going up doesn't tell us anything. It's no wonder then, that the correlations weren't large.
    Instead, it would make sense to do a t-test or ANOVA and compare means. In that case, the question would be, "Do some companies tend to produce higher revenue than others?" Or, "Do some countries tend to produce higher revenue?" etc. (For more discussion, see: www.webpages.uidaho.edu/~stevel/519/Correlation%20between%20a%20nominal%20(IV)%20and%20a%20continuous%20(DV)%20variable.html).
    Since this is a portfolio project and you want to show potential employers the result, maybe just take that part out-you wouldn't want to make a mistake like that in an application to a potential employer!
    (Alex, thanks so much for doing these videos! They're super helpful and I'm very very grateful!)
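
The comparison of group means suggested above can be sketched as a one-way ANOVA F statistic computed by hand with pandas. The company labels and gross values are made up for illustration. In practice a statistics library such as SciPy would also give a p-value, which this sketch does not compute.

```python
import pandas as pd

# Hypothetical data: three companies, three films each.
df = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "gross":   [10.0, 12.0, 14.0, 20.0, 22.0, 24.0, 30.0, 32.0, 34.0],
})

groups = df.groupby("company")["gross"]
grand_mean = df["gross"].mean()
k = groups.ngroups   # number of groups
n = len(df)          # total number of observations

# Between-group sum of squares: how far each group mean sits from the
# grand mean, weighted by group size.
ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()

# Within-group sum of squares: how far each film sits from its own group mean.
ss_within = ((df["gross"] - groups.transform("mean")) ** 2).sum()

# F statistic: between-group variance divided by within-group variance.
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_stat)
```

A large F value means the group means differ more than the spread inside each group would explain, which is the question "do some companies tend to produce higher revenue than others?"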

    • @vishalmalar8406
      @vishalmalar8406 1 year ago +5

      Thank you so much for that clarification. I was so much confused and spent a lot of time wondering how random values made sense in determining a correlation.

    • @InsightIntoLife
      @InsightIntoLife 1 year ago +2

      where do we make the corrections?

    • @somtoobi-ym8ic
      @somtoobi-ym8ic 1 year ago +1

      I noticed that in my dataset, avatar has a gross revenue of -2,147,483,648, and it just feels wrong. Is there something I am not doing right?

    • @somtoobi-ym8ic
      @somtoobi-ym8ic 1 year ago +1

      I just noticed that converting to int type gave me this error
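
The value -2,147,483,648 reported above is exactly the smallest 32-bit integer. That suggests the cast used a 32-bit integer type and the real figure did not fit, though this is an inference and not something the thread confirms. A minimal sketch, assuming a hypothetical gross figure of 2.8 billion:

```python
import numpy as np
import pandas as pd

# The range a 32-bit signed integer can hold.
int32_info = np.iinfo(np.int32)
print(int32_info.min, int32_info.max)

# Hypothetical gross figures; the first one is above the int32 maximum.
gross = pd.Series([2_800_000_000.0, 46_998_772.0])

# Check whether the largest value would fit in 32 bits before casting.
fits_int32 = gross.max() <= int32_info.max

# Requesting int64 explicitly avoids depending on what `int` means
# on a given platform.
gross_int64 = gross.astype("int64")
print(fits_int32, gross_int64.tolist())
```

This is also why several comments here recommend `astype('int64')` or `astype(np.int64)` over `astype(int)`.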

  • @darkavenger100
    @darkavenger100 3 years ago +66

    I can't wait for the beginner, intermediate, and advanced Python series by Alex the Analyst. It's what the people want, besides a happy Alex.

  • @gastonsuarez5320
    @gastonsuarez5320 1 year ago +1

    I really appreciate the fact that you did not edit out the parts where you made "mistakes" and actually fixed them.

  • @rebekhathangam7466
    @rebekhathangam7466 1 year ago +2

    If you are facing an error in datatype change, try the following
    df_copy = df.copy()
    df_copy['budget'] = df_copy['budget'].astype('int64')
    df_copy['gross'] = df_copy['gross'].astype('int64')
    df_copy
    Thank you Alex for this amazing video

  • @rickydonne802
    @rickydonne802 2 years ago +14

    At 11:08, instead of printing the null percentage, we can use:
    for col in df.columns:
        print(df[col].isnull().value_counts(), "\n")
    This prints how many values are null in each column. That matters because you might have 1 missing value in 10k, and a percentage would need many decimal places to show it.

  • @omashan6634
    @omashan6634 3 years ago +5

    Honestly, you're an absolute legend. You really break down some of the technical barriers that exist for people entering the field of data science. You really are a gem to the community.

  • @moushmi_nishiganddha
    @moushmi_nishiganddha 3 years ago +17

    there are some missing value in this dataset
    Alex try this instead of that for loop statement
    df.isnull().sum()
    this will give total number of nulls for every column/variable
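
Both ways of summarising missing values mentioned in these comments (counts and percentages) can be shown side by side. A minimal sketch on a made-up four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing budget value.
df = pd.DataFrame({
    "budget": [1.0, np.nan, 3.0, 4.0],
    "gross": [10.0, 20.0, 30.0, 40.0],
})

# Count of nulls per column: isnull() gives True/False, sum() counts the Trues.
null_counts = df.isnull().sum()

# Share of nulls per column as a percentage: the mean of True/False values
# is the fraction that is True.
null_percent = df.isnull().mean() * 100

print(null_counts)
print(null_percent)
```

Counts are easier to read when only a handful of values are missing, since the percentage may round to 0.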

  • @snudgegalbraith3447
    @snudgegalbraith3447 3 years ago +1

    I have recently decided on becoming a data analyst and your videos are really helping me understand what i need to do and keep me motivated on that goal which will improve my life. I want to say thank you for your content and your honest helpfulness.

  • @purneswarprasad4710
    @purneswarprasad4710 3 years ago +22

    In 29:48 , I think x should be 'Budget' and y as 'Gross earning'

  • @tylerlaquinta2996
    @tylerlaquinta2996 3 years ago +60

    Hey guys, the dataset got updated since this video was posted. While I was going through the project I was able to google the problems as they came up.
    In case you guys get stumped, here's what I found that works:
    This will drop any rows with null values
    df = df.dropna(how='any',axis=0)
    This will extract the year from the released column into a separate column
    df['yearcorrect'] = df['released'].astype(str).str.split(', ').str[-1].astype(str).str[:4]
    Let me know if that works for y'all

    • @IProXie
      @IProXie 3 years ago +1

      I still get the error name 'df' is not defined

    • @cuoofyisme4468
      @cuoofyisme4468 3 years ago +1

      thank you so much for the solution for updated dataset. Your solution save me from struggling on updated release date

    • @shyamkumar6009
      @shyamkumar6009 2 years ago

      I dropped the rows, but I think it's only dropping them temporarily, because if I make a scatter plot afterwards it still shows NA values.

    • @meghanakurada7242
      @meghanakurada7242 2 years ago +1

      @@shyamkumar6009 I think you should use df.dropna(how='any', axis=0, inplace=True) to drop the null values permanently. Don't assign the result back to df when using inplace=True, because the call then returns None.

    • @kizzyleee
      @kizzyleee 2 years ago +2

      Also the released changed forms again and I used this to fix it
      # fix the date released format
      df['release_date'] = df.apply(lambda x: x['released'][0:x['released'].find(' (')],axis=1)
      df['release_date'] = pd.to_datetime(df['release_date'], infer_datetime_format=True)
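
The "dropping temporarily" question raised earlier in this thread comes down to how `dropna` returns its result. A minimal sketch on a hypothetical three-row frame, showing the two ways to make the drop stick:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"budget": [1.0, np.nan, 3.0], "gross": [10.0, 20.0, np.nan]})

# Option 1: dropna returns a new DataFrame and leaves df untouched,
# so the result has to be stored somewhere.
cleaned = df.dropna(how="any", axis=0)

# Option 2: inplace=True changes the frame itself and returns None.
df_inplace = df.copy()
result = df_inplace.dropna(how="any", axis=0, inplace=True)

print(len(df), len(cleaned), len(df_inplace), result)
```

Combining the two, as in `df = df.dropna(..., inplace=True)`, replaces `df` with `None`, which is why only one of the options should be used at a time.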

  • @vishakhasingh3162
    @vishakhasingh3162 1 year ago +7

    At 13:40, if you are facing an error in the datatype change, try the following:
    df['budget'] = df['budget'].round().astype('Int64')
    Hope it helps.

  • @reezalzainudin8097
    @reezalzainudin8097 2 years ago +16

    Hey guys, at 46:24, we can simply use the .copy() method to create a new variable, so the for loop can iterate over the copy without affecting the original dataframe, df:
    df_numerized = df.copy()
    for col_name in df_numerized.columns:
        if df_numerized[col_name].dtype == 'object':
            df_numerized[col_name] = df_numerized[col_name].astype('category')
            df_numerized[col_name] = df_numerized[col_name].cat.codes
    df_numerized

    • @mikeneumann5611
      @mikeneumann5611 2 years ago

      I don't know how much time this saved me but it would have been a lot.

  • @freidkholy
    @freidkholy 1 year ago +8

    To whoever noticed at 15:27 that the 'released' column is not in the same format as Alex's and is getting errors because of that:
    I've been where you are. It took me 4 days to figure this out. Here is the line of code you need:
    df['released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format = '%B %d, %Y')
    Hope it helps.

    • @karanikabj4422
      @karanikabj4422 1 year ago +1

      that worked! Thankyou. But can you explain your code pls?

    • @freidkholy
      @freidkholy 1 year ago

      @@karanikabj4422 Believe me when I tell you I wish I could xd
      This is basically my very first time using Python, so I don't fully understand it, unlike R and SQL, which I know well. But at least I know how to research.
      What I understand is that we use the pandas function to_datetime, exclude the '(United States)' part, and set the format to match the rest of the date. This is applied to the whole column, and the result is assigned to a column with the same name, which overwrites the original column.

    • @Driven-dave
      @Driven-dave 1 year ago

      I don’t get the year like Alex does only the months

    • @c.obazeIII
      @c.obazeIII 1 year ago

      Man i'm so grateful, you won't believe how much time i was stuck on this.

    • @vishakhasingh3162
      @vishakhasingh3162 1 year ago

      thanks☺☺
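
The one-liner from this thread can be unpacked into separate steps, which may help with the request to explain it. This is a minimal sketch on two made-up `released` strings. It assumes every row follows the "Month day, year (Country)" pattern, which may not hold for the whole dataset.

```python
import pandas as pd

# Hypothetical values in the updated 'released' format.
released = pd.Series(["June 13, 1980 (United States)", "July 2, 1980 (United States)"])

# Regex: \w+ matches the month name, \d+ the day, then a comma and a space,
# then \d+ for the year. The parentheses capture only that date part,
# so the trailing "(United States)" is left out.
date_text = released.str.extract(r"(\w+ \d+, \d+)", expand=False)

# %B = full month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(date_text, format="%B %d, %Y")

# Once the column is a real datetime, the year is available via .dt.year.
year = parsed.dt.year

print(date_text.tolist(), year.tolist())
```

Taking the year from `.dt.year` avoids slicing strings, which is what returns the month name when the text starts with the month.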

  • @yuli3435
    @yuli3435 2 years ago +2

    Thank you very much for the video! I want to change my career path to data analytics, and your videos have been a very good learning material. Although the data has been updated and some of the methods in this video do not work anymore, it is a fantastic guidance (and, ultimately, to become good at something, you have to do a fair share of self-study).
    One thing to note though: I don't think the pearson correlation coefficient can be used to check the relationship between a categorical and a continuous variable. So, the low correlation coefficient for company, for example, might be misleading. Since, after all, the numeric ID assigned to the string values does not necessarily increase with the size of the company.

  • @OmarJimenez-dq8sr
    @OmarJimenez-dq8sr 1 year ago +34

    The dataset is updated and is not the same as the one in the video, if you guys have problems in the 'Create correct year section' you can do a split of the data to get only the year
    df['yearcorrect'] = df['released'].astype(str).str.split().str[2]

  • @eeshangautam
    @eeshangautam 1 year ago

    whoever is coming here after completing their portfolios watching all the 3 videos and here for the 4th...
    I WISH YOU ALL THE BEST!
    with so much love and gratitude for Alex!

  • @srimonmahapatra4667
    @srimonmahapatra4667 2 years ago +1

    I am working through this dataset today.
    Tip:
    Before converting the 'budget' and 'gross' columns, check for null values. I downloaded the dataset recently and had some, and they threw an error during the conversion. Make sure the NaN values in both columns are set to 0 before converting.
    When creating the 'yearcorrect' column, try splitting with .str.split(',', n=1, expand=True) and then use df['yearcorrect'] = year.astype(str).str[:5]
    This gets the year out of the released column. I did it the way shown in the video but got the month, so if the video's way works for you that's fine; otherwise try the method above.
    That gets things done.
    Thank you.
    And thank you too, Alex, you are doing a great job🙂

  • @busarakummusikaput534
    @busarakummusikaput534 1 year ago +1

    Thank you so much for your videos, they helped me a lot. I finally got a job as a Data Engineer with no prior experience in the role, after learning from your channel for 1 month! I gained a lot of knowledge. I really appreciate your support. Thank you very, very much!

    • @TalesHQ
      @TalesHQ 1 year ago

      Congratulations

  • @Major_Data
    @Major_Data 3 years ago +13

    As The Rock says; "FINALLY!"
    I'm a bit embarrassed by how excited I get when an ATA video clocks in at over an hour...

  • @hazimrashid1231
    @hazimrashid1231 3 years ago +3

    Hi Alex, just finished the project. It’s awesome. Thanks for everything. I pray for your success in the future.

  • @xman9087
    @xman9087 3 years ago +7

    The only thing that encourages me to watch is your smile
    Keep smiling 🙏❤️

    • @AlexTheAnalyst
      @AlexTheAnalyst 3 years ago

      Haha I hope the high quality video content also makes you smile 😁

  • @ompandey2012
    @ompandey2012 3 years ago +8

    Hey Alex, please make videos on, how to handle missing/null values in python.

  • @davidyolchuyev2905
    @davidyolchuyev2905 3 years ago +2

    This video came in just on time.
    I finished building my portfolio yesterday. Thank you for the tips.

    • @umairahmed2418
      @umairahmed2418 7 months ago

      Can you provide some tips? I have worked on projects but having troubles with how to present and display the projects? Can you share your link

  • @Dpereira96
    @Dpereira96 3 years ago +6

    Man, thank you so much for this, I know you've put a lot of effort into this project serie and I can definitely say that i'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people to pursuit their dreams! Greetings from Brazil :)

  • @hoonzip
    @hoonzip 1 year ago +3

    I can't believe I followed along and understood everything. I wasn't even sure if I would be able to before I started. Thank you so much Alex! With your help, I've gained more confidence in pursuing a career in data analytics. I'm definitely going to do more of your projects and hope to be able to land a full-time data analytics job this year. Thanks again!

    • @AlexTheAnalyst
      @AlexTheAnalyst 1 year ago +1

      Woohoo! You're doing great!

    • @adolfvictor1500
      @adolfvictor1500 1 year ago

      i see its barely 4 months since you made this comment. i have an issue with the scatter plot, can you help me out?

  • @naincypushpad2093
    @naincypushpad2093 1 year ago +40

    If df.corr() raises an error saying a string can't be converted to a number, pass the numeric_only parameter: df.corr(numeric_only=True)

  • @shanali3473
    @shanali3473 2 years ago +2

    Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from the UK :)

  • @okojieamos82
    @okojieamos82 3 years ago +2

    Hi Alex, thank you so much for all the videos. Here is the thing: I haven't taken this class yet. I was actually learning Python before finding your page and decided to learn SQL. I went through all your SQL videos and the first 3 portfolio projects on SQL and Tableau. Then I went to StrataScratch and registered for the free option, which gave me access to 50 interview questions: some easy, some medium, and some hard. The interesting thing is, I was only able to answer 1 of the easy questions and couldn't answer the others. That almost discouraged me, but I think I just need to spend more time on SQL tutorials before moving back to Python. I would like to hear your opinion, and from others who had a similar experience. Again, thank you so much for all your effort, you are touching lives!

  • @kshitijsingh7176
    @kshitijsingh7176 2 years ago +4

    All you folks getting data analyst jobs left and right, could you give a glimpse of how you presented this project on your CVs? Alex some help would be great.

  • @MO-fo7on
    @MO-fo7on 3 years ago +3

    Hi Alex. Thank you for the portfolio project series.
    For the missing values, I think the 0.0% it showed for every column has been approximated. If you use describe() and info(), you will notice some null values.
    Thanks again for the videos, they are really helpful.

    • @synaestheticVI
      @synaestheticVI 2 years ago +1

      that's what I thought at first, too, but the data set has simply changed since he uploaded the video (or he used an already edited one). So now there are a few columns that even have values like 0.28%...

    • @sonalrao7656
      @sonalrao7656 2 years ago +1

      @@synaestheticVI yes can anyone help that what should we do in that situation?

  • @priankakibria6976
    @priankakibria6976 2 years ago +1

    Very well instructed! Way better than any of the BootCamp lectures I had gotten previously. Perfect for a refresher and portfolio work. Thank you!

  • @d_dharawat
    @d_dharawat 3 years ago +2

    Soooooooooooooooooooooooo excited for the last website video to come out
    For the first time in my 19 years of living, i feel pretty confident of making something to its perfection by myself (and your help too🙌)

  • @mikeg4691
    @mikeg4691 1 year ago +2

    5:20 Faster way to do this is to shift right-click the file and copy as path.
    The "apostrophes" are just called single quotes

  • @nikosbako1974
    @nikosbako1974 10 months ago

    ' ' -> single quotes and {} -> curly brackets. Just in case you have not already received a similar answer. Either way, keep up the great work :)

  • @netol02
    @netol02 1 year ago

    This 4 part tutorial is pure gold! After your announcement that you were launching your version of data analyst course/certification, can’t wait for when it goes live, as to follow up in more depth for the concepts presented in this series. Really appreciate the time, dedication and quality of content you produce Alex.

  • @danielbristow6954
    @danielbristow6954 3 years ago +4

    Another excellent portfolio project from Alex! My portfolio is starting to look very good, and I finally have something to upload to job applications that request a portfolio! Thank you, Alex!!

    • @AlexTheAnalyst
      @AlexTheAnalyst 3 years ago

      That's great! Glad to hear it's been helpful!

  • @modi_cl
    @modi_cl 1 year ago +7

    The 'released' column has been updated: it now contains a text-format date and the country of release. What I did was split the column in two: release date and release country.
    The code I used was this:
    df[['released','country_release']] = df['released'].str.split(r' \(',n=1,expand=True)
    Then you have to clean up the 'country_release' column a little with:
    df['country_release'] = df['country_release'].str.replace(')','',regex=False)
    And finally give the 'released' column the datetime format with this:
    df['released'] = pd.to_datetime(df['released'],format='mixed')
    For some reason using format = 'mixed' did the trick for me. I tried '%B %d, %Y' but it never worked.
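
The split-then-parse approach described above can be sketched end to end. The two rows are hypothetical and both follow the "Month day, year (Country)" pattern, so an explicit format works here. On the real dataset some rows may not match, which could be why the explicit format failed for the commenter (an assumption, not something the comment confirms).

```python
import pandas as pd

# Hypothetical rows in the updated 'released' format.
df = pd.DataFrame({
    "released": ["June 13, 1980 (United States)", "July 2, 1980 (United States)"],
})

# Split once on " (" to separate the date from the country.
# The pattern is a regular expression, so the parenthesis is escaped.
parts = df["released"].str.split(r" \(", n=1, expand=True)

# Left part is the date text; right part still ends with ")".
df["release_date"] = parts[0]
df["country_release"] = parts[1].str.replace(")", "", regex=False)

# Parse the date text with an explicit format.
df["release_date"] = pd.to_datetime(df["release_date"], format="%B %d, %Y")

print(df[["release_date", "country_release"]])
```

If some rows do not match the format, passing `errors="coerce"` to `pd.to_datetime` turns them into NaT instead of raising, which makes the odd rows easy to inspect.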

  • @newyork397
    @newyork397 2 years ago +1

    Great community in the comment section. Thanks for this analysis Alex! I couldn't make my career pivot without your help

  • @surbhivishwakarma8007
    @surbhivishwakarma8007 6 months ago

    I made it to the 4th part. Thanks, Alex, for this tutorial. If you can't get the release date on its own, you can use this code: df['format_released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format='%B %d, %Y')

  • @sarahcongcongyang
    @sarahcongcongyang 1 year ago +7

    Thank you, Alex! I learned so much.
    Anyone's correlation matrix doesn't work? Need to add 'numeric_only = True'. Now the default is false.
    correlation_matrix = df.corr(method = 'pearson',numeric_only = True)

  • @matts9577
    @matts9577 1 year ago +1

    Thank you so much again! Portfolio completed thanks to you! My first résolution of the year is done thanks to you! Happy new year 🎉🎉🎉

  • @motlatsimoea6901
    @motlatsimoea6901 1 year ago

    This project taught me a number of new things about pandas. Very helpful! Thank you Alex!

  • @nidhigupta7606
    @nidhigupta7606 3 years ago +2

    Thanks Alex, I have been waiting so long for this #4 video. Thanks for sharing.

  • @Kangae-Ashi
    @Kangae-Ashi 2 years ago

    Now I have multiple new projects I can add to my portfolio. Thanks, Alex!

  • @ezhankhan1035
    @ezhankhan1035 9 months ago

    This one was a cool challenge! Required a bit of research on my part to better understand some steps taken (python for data analysis/visualization is not really my area), but well worth it. Great video Alex, thank you!

  • @videohub9521
    @videohub9521 3 years ago +5

    Sir, I am a civil engineer doing my master's. During my thesis I got some machine learning work, and once I presented my data in Tableau, which I learned through your channel, my supervisor gave me extra credit. Thank you! Now I am thinking of switching to this field. Thank you for your efforts.
    My question is: after learning the required skills, how can I start applying to companies? Second, please start an interview series where you discuss how and what type of technical questions are asked in interviews.

    • @AlexTheAnalyst
      @AlexTheAnalyst 3 years ago +1

      That’s great! I have a few videos on how to work with recruiters and prep for interviews - I think those would be helpful to you 👍

    • @videohub9521
      @videohub9521 3 years ago +3

      @@AlexTheAnalyst I am comparatively new to the channel. I will surely go through those videos. Thanks from the whole student community for your great contribution to our guidance and learning. Looking forward to learning more insights from the channel 🙌🏽
      Love from India 🇮🇳 to Alex the Analyst.

  • @JuanPerez-iu9vk
    @JuanPerez-iu9vk 7 months ago

    Hi Alex, great job you are doing in your channel, thank you very much. I just wanted to say, if it might help anybody who is watching you (because I believe that you already know it after two years), that a correlation measures the reaction of one value against the movement of another value. For this, both values need to be able to move and get bigger or smaller, something that letters (and names, in consequence) cannot do, neither can letters or names disguised as numbers, because those are static too, hence, the "non numeric" correlations showed in the video, are false.
    This problem could just be a mess up in any project, but, in a project that is our portfolio, our showroom, it will only show how much of a data analyst we are NOT.
    Kindly study a work around to this statistical problem, which is not to change letters by numbers. So many great developers behind Pandas would have implemented it many years ago. 😀

  • @shirt59
    @shirt59 3 years ago +2

    Almost done with this fantastic series. Excited for your upcoming video on data scraping. For future videos in this series, could you possibly do one on APIs (making a project using some public API) and maybe something on big data?

  • @victorbegnini5754
    @victorbegnini5754 1 year ago +1

    You're a monster, Alex!
    Thanks a million (how they say here in Ireland)
    God bless, man

  • @yousifaldossary1903
    @yousifaldossary1903 3 years ago +2

    I support your channel all the way. keep up the projects

  • @NroShock
    @NroShock 3 years ago +1

    Thank you Alex, this has been a great project! You are a great teacher and this has been very helpful. Looking forward to everything you release in the future!

  • @mikeramirez7238
    @mikeramirez7238 1 year ago +5

    To everyone getting error for df.corr()
    this was my fix:
    # since pandas version 2.0.0 now you need to add numeric_only=True param to avoid issue
    df.corr(method='pearson', numeric_only=True) #pearson, kendall, spearman
    ---
    correlation_matrix = df.corr(method='pearson', numeric_only=True)
    sns.heatmap(correlation_matrix, annot=True)
    plt.show()

  • @JenOween
    @JenOween 1 year ago

    @5:20 You can also just highlight+copy the path in the address bar above instead of taking the extra steps to right click and go into properties to select the path. Much more efficient.

  • @ishitatandon4890
    @ishitatandon4890 3 years ago +3

    Hey Alex. Kudos for the good work you do! Can't believe all these resources are free!
    I had a doubt actually,
    Am I the only one, or is it true that the dataset on Kaggle is slightly different than used in the projects? The columns are still the same but values have changed!

  • @aliceemma135
    @aliceemma135 2 years ago

    Thank you so much, Alex! I've taken a few courses about Python and yours is clear and awesome!

  • @prajjvalverma
    @prajjvalverma 1 year ago +1

    Here at 11:00 when finding missing values write command as ---
    pct_missing = np.mean(df[col].isnull())
    OR you can write
    pct_missing = df.isnull().mean().sort_values(ascending=False)
    If there are missing values in Your dataset try to fill it up with 0, here as ----
    df = df.fillna(0)
    at 16:50 to get the Released year only write command as ----
    df['yearCorrect'] = df['released'].astype(str).str.split(',').str[1].str.split('(').str[0]
    at 28:20 to get the scatter plot first try to replace the Null with 0 using code ----
    df.fillna(0, inplace=True)

  • @akinakin4920
    @akinakin4920 2 years ago +4

    Somewhere around 14:05, where Alex converts the gross and budget columns to int64, his code wouldn't work for me, but after some research I found that this works:
    df['gross'] = df['gross'].fillna(0).astype('int64')
    df['budget'] = df['budget'].fillna(0).astype('int64')

    • @jameslemley2258
      @jameslemley2258 2 years ago +1

      Thanks a bunch for this!

    • @saketsharma7413
      @saketsharma7413 2 years ago

      or you can use df['gross'] = df['gross'].astype('Int64')
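
The two options in this thread differ in what happens to missing values: `fillna(0)` followed by `'int64'` replaces them with zero, while the nullable `'Int64'` dtype (capital I) keeps them as missing. A minimal sketch on a made-up `gross` column:

```python
import numpy as np
import pandas as pd

# Hypothetical gross column with one missing value.
gross = pd.Series([46998772.0, np.nan, 83453539.0])

# Option 1: replace the missing value with 0, then cast to a plain
# 64-bit integer. A plain int64 column cannot hold NaN.
filled = gross.fillna(0).astype("int64")

# Option 2: pandas' nullable integer dtype stores whole numbers and
# still marks the missing entry as <NA>.
nullable = gross.astype("Int64")

print(filled.tolist())
print(nullable)
```

Keeping the value missing avoids treating an unknown gross as a gross of zero in later calculations.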

  • @teddybagadonuts2617
    @teddybagadonuts2617 3 years ago +2

    Can't wait to finally finish everything! Thanks for creating these awesome guides Alex!

  • @mayurshinde7443
    @mayurshinde7443 3 years ago +2

    Thank you so so much Alex!! You have been a guiding light!!

  • @pythonking1705
    @pythonking1705 3 years ago +2

    You are the best plz do more videos about data cleaning with SQL server 😍😍😍

  • @unclehorouzoezie8744
    @unclehorouzoezie8744 3 years ago +5

    Thanks Alex. You are still the best.

  • @AlejandroFernandez-of9jy
    @AlejandroFernandez-of9jy 6 months ago

    Not apostrophe. It is single quotes. Thank you for the great tutorial!

  • @LeviSouder-b7r
    @LeviSouder-b7r 1 year ago +2

    Came here to say that if you're trying to run the df.corr() and it's trying to run the correlation math on string data columns, simply add in the argument df.corr(numeric_only=True)

    • @BrunoRamos-v7k
      @BrunoRamos-v7k 1 year ago

      Life saver, thank you mate! any idea why this is happening?

    • @HaticeMlikzade
      @HaticeMlikzade 10 months ago

      thankkkkkk you
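
On why this happens: as other comments here note, newer pandas versions no longer skip non-numeric columns in `corr()` by default, so a text column such as the film name makes the call fail. A minimal sketch on a made-up frame, showing the `numeric_only=True` fix next to an equivalent that selects the numeric columns first:

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
df = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "gross": [2.0, 4.0, 6.0, 8.0],
})

# Fix from the comments: tell corr() to use numeric columns only.
corr = df.corr(method="pearson", numeric_only=True)

# Equivalent alternative: pick the numeric columns explicitly first.
corr_alt = df.select_dtypes(include="number").corr(method="pearson")

print(corr)
```

In this toy data gross is exactly twice the budget, so the two columns are perfectly correlated.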

  • @tiffanyder2377
    @tiffanyder2377 3 years ago +11

    Hey Alex - Thank you for this. Right around 49:25 you talk about the correlation matrix of the df_numerized dataframe that is being shown as a heatmap. I do have a question about that....: when you did .cat.codes in the cells above, how did the category values of the previous objects (company, country, director) represent any value that can be correlated? For instance, using one row as an example, I'm confused how index 6380 at the top of the dataframe has a company categorical value of 1428. Is this by random or did the code construct some sort of logical thinking and gave a numeric value based on other data patterns?? .... Sorry if I am confusing you, it's just when I got to the heatmap part of the df_numerized dataframe I was kind of lost as to how categories can actually represent correlations if the categorical value given to it was completely random. thanks,

    • @Lmoriond
      @Lmoriond 2 years ago +4

      i thought the same. There is no correlation between a random number with gross. only possible if by any luck you get a random higher number for your movie that has also a high budget. then=high correlation
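On the question of where the numbers come from: to my understanding, .cat.codes is not random. pandas sorts the distinct values and the code is simply each value's position in that sorted list. A small sketch with invented company names:

```python
import pandas as pd

# Invented sample; in the video the column would be df['company'].
companies = pd.Series([
    "Warner Bros.",
    "Columbia Pictures",
    "Universal Pictures",
    "Columbia Pictures",
])

cat = companies.astype("category")

# Categories are the distinct values in sorted order.
print(cat.cat.categories.tolist())
# ['Columbia Pictures', 'Universal Pictures', 'Warner Bros.']

# Each code is the index of the value in that sorted list.
print(cat.cat.codes.tolist())
# [2, 0, 1, 0]
```

So the codes are deterministic, but they only reflect alphabetical order. They carry no information about budget or gross, which is why a correlation computed on them is hard to interpret.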

  • @madhumitachaudhary6270
    @madhumitachaudhary6270 1 year ago

    Your videos have been so helpful. Just dropped in to say thank you !

  • @spedies12
    @spedies12 3 years ago +1

    I was looking forward to the last one. You are the best.

  • @lucaspassosbarreto
    @lucaspassosbarreto 2 years ago +5

    If you downloaded the dataset after the video was made, it seems it might have some missing values that prevent you from converting columns into integers. Use df = df.fillna(0)

    • @beardedmtbr
      @beardedmtbr 2 years ago

      should we just ignore the missing entries or did you delete them?
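Both approaches from this thread can work; which one fits depends on the analysis. A hedged sketch on invented data showing three options: filling with 0, dropping the rows, or keeping the gaps with pandas' nullable Int64 dtype (the third option is my addition, not something from the video):

```python
import numpy as np
import pandas as pd

# Invented budget values with one missing entry.
df = pd.DataFrame({"budget": [1000.0, np.nan, 2500.0]})

# Option 1: replace missing values with 0, then cast.
filled = df["budget"].fillna(0).astype("int64")

# Option 2: drop rows with a missing budget, then cast.
dropped = df.dropna(subset=["budget"])["budget"].astype("int64")

# Option 3: keep the gaps by using the nullable integer dtype.
nullable = df["budget"].astype("Int64")

print(filled.tolist(), dropped.tolist(), nullable.isna().sum())
```

Keep in mind that filling with 0 adds artificial zero budgets, which can distort a correlation, while dropping rows shrinks the dataset.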

  • @Wei-HsuanTseng
    @Wei-HsuanTseng 1 year ago +1

    An easier way to calculate the percentage of missing values:
    df.isnull().sum().sort_values(ascending=False)/len(df)*100
    Extract the year from released:
    df['yearcorrect'] = df['released'].str.extract(pat=r'([0-9]{4})', expand=False).astype(float)
    df['yearcorrect'] = df['yearcorrect'].fillna(0)
    df['yearcorrect'] = df['yearcorrect'].astype(int)

    • @tranthituyetmai8037
      @tranthituyetmai8037 1 year ago

      Alternative for calculating the % of missing values: df.isnull().mean().sort_values(ascending=False) * 100
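The year-extraction step from this thread can be tried in isolation. This sketch assumes the released column holds strings such as "June 13, 1980 (United States)"; the sample values are invented:

```python
import pandas as pd

# Invented sample of the 'released' column, including one missing value.
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
    None,
])

# Capture the first run of four digits; missing rows stay NaN.
year = released.str.extract(r"([0-9]{4})", expand=False).astype(float)

# Replace NaN with 0 so the column can be cast to integers.
year_int = year.fillna(0).astype(int)

print(year_int.tolist())
# [1980, 1980, 0]
```

The detour through float is needed because a column containing NaN cannot be cast straight to a plain int dtype.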

  • @wesleydavis3387
    @wesleydavis3387 2 years ago

    Good video. I learned how to use correlation matrices, which is new to me. The whole np.mean(df['col'].isnull()) is something I'm still trying to wrap my head around but for now I'll just hit the easy button on it.
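For anyone else puzzling over np.mean(df['col'].isnull()): isnull() returns True/False per row, and averaging booleans treats True as 1 and False as 0, so the mean is the share of missing rows. A tiny sketch with an invented column:

```python
import numpy as np
import pandas as pd

# Invented column with two missing values out of four.
col = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = col.isnull()          # False, True, False, True
pct_missing = np.mean(mask)  # (0 + 1 + 0 + 1) / 4

print(pct_missing)
# 0.5
```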

  • @akshayrawal5584
    @akshayrawal5584 2 years ago

    Thank you so much sir for taking time out from such a busy schedule and coming up with an initiative of making such videos in order to help many people around the world interested in starting and developing a career in data analytics :-)

  • @girishbhagwanani1604
    @girishbhagwanani1604 11 months ago +1

    At 13:40, if you are facing an error in the datatype change, try the following:
    # Convert gross to numeric; values that cannot be parsed become NaN
    df['gross'] = pd.to_numeric(df['gross'], errors='coerce', downcast='integer')
    # Inspect how many values are missing and which rows they are in
    df['gross'].isna().sum()
    df[df['gross'].isna()]
    # Fill the gaps with 0, then cast both columns to integers
    df['gross'] = df['gross'].fillna(0)
    df['budget'] = df['budget'].fillna(0)
    df['budget'] = df['budget'].astype('int64')
    df['gross'] = df['gross'].round().astype('int64')

  • @augustmee
    @augustmee 1 year ago +7

    I started doing the project and noticed that the Kaggle data is slightly different from the data in the video. There were some negative numbers in the gross column. To change them to positive I had to run this code:
    # apply conditional function to the column containing negative numbers
    df['gross'] = df['gross'].apply(lambda x: abs(x) if x < 0 else x)

    • @SquooHipPa
      @SquooHipPa 1 year ago

      Thank you so much for this!

    • @daisywestern9290
      @daisywestern9290 1 year ago

      Thank you for this, my scatter plot was not showing properly because of the negative numbers and this fixed it. Again, thank you!

    • @victorbegnini5754
      @victorbegnini5754 1 year ago

      Great, man! Thanks!
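The lambda in this thread can also be written with the vectorized Series.abs(). A minimal sketch on invented values that counts the negatives first, so you can see how many rows are affected before changing them:

```python
import pandas as pd

# Invented gross values with one negative entry.
df = pd.DataFrame({"gross": [46998772, -83453539, 39846344]})

# Count the negative rows before changing anything.
n_negative = int((df["gross"] < 0).sum())

# Take the absolute value of the whole column at once.
df["gross"] = df["gross"].abs()

print(n_negative, df["gross"].tolist())
```

Flipping the sign assumes the magnitudes are correct and only the sign is wrong; it is worth checking a few of the affected rows against the source data.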

  • @e.ghelbur
    @e.ghelbur 7 months ago

    Looks like there's been a mix-up with the axis labels on the graph. The 'budget' and 'revenue' labels are swapped. The 'budget' should actually be labeled as 'revenue' (250 million) and vice versa for the 'revenue' (a billion). Thanks!

  • @shreyapatel6556
    @shreyapatel6556 1 year ago

    Thanks Alex, Really good video. I just want to know if you could give some bullet points about this project to add in the resume?

  • @norilouis
    @norilouis 2 years ago +1

    Thank you Alex! These projects really help!

  • @valentineonyemeziri9396
    @valentineonyemeziri9396 4 months ago

    I learnt so much from this. Thanks Alex

  • @sohrabkhan9590
    @sohrabkhan9590 1 year ago

    Thank You so much Alex for creating such amazing content. You are the best Teacher anyone could ask for.

  • @seritiaymen4140
    @seritiaymen4140 2 years ago +16

    08/02/2022 - I'm using the dataset as of this date; there have been many changes and unfortunately the pct_missing is not 0.0%.
    For me, I copied the content of df into a new dataframe that I called Newdf, with the rows containing missing values deleted:
    Newdf = df.dropna(axis=0)
    print(Newdf.isnull().sum(), '\n')

  • @michaelcollins3685
    @michaelcollins3685 2 years ago +3

    I might be missing something, please correct me if I'm wrong as I'm tired as I type:
    Hasn't Alex mislabeled the first scatterplot? Isn't budget on x and gross on y? Whereas he has labelled the opposite. This is around the 30:00 mark.

  • @huongle-np7rm
    @huongle-np7rm 1 year ago

    Thank you Alex for making this project free. I am making a career change and am pretty new to this field. I am wondering if this level of project is sufficient for an entry-level position, or does it need to be trickier? I hope that it is enough for us to start applying for jobs. Thanks a ton.

  • @rajschauhan
    @rajschauhan 3 years ago +1

    Was waiting for this one and looking forward to more such projects.
    Thank you so much :D

  • @abhishekpancholi9484
    @abhishekpancholi9484 3 years ago +1

    Much Awaited! Very Excited

  • @pradipbhatta4986
    @pradipbhatta4986 2 years ago

    Thank you so much for this awesome class, Just subscribed and added to my portfolio. Thanks Heaps.

  • @samuelmey2052
    @samuelmey2052 3 years ago +2

    Hey Alex,
    Thank you so much for these videos, they’re incredibly helpful for aspiring data analysts like myself! I have an interview in two weeks for an entry level data analyst position and I’m pretty nervous not having any previous experience as a data analyst… I was wondering if you offered any consulting services ? Thank you again !

    • @AlexTheAnalyst
      @AlexTheAnalyst  3 years ago

      I'm so glad to hear that! I do, but I'm quite booked lately so I don't know if I can fit anything in in the next 2 weeks. You can always email me at AlexTheAnalyst95@gmail.com and we can chat about it.

  • @Arctect
    @Arctect 1 year ago

    Super mega awesome helpful and useful. Thank you very much for these video series!!!

  • @datping7377
    @datping7377 3 months ago

    35:46 if anyone is stuck like I was with the df.corr(method ='pearson') you can try this:
    numeric_df = df.select_dtypes(include=[np.number])
    correlation_matrix = numeric_df.corr(method='pearson')
    print(correlation_matrix)

  • @StatiQQQ
    @StatiQQQ 3 years ago

    Thanks for the time that you put into all of your content, this video helped a lot. I hope all is well and once again Thank you

  • @oubayryuuk
    @oubayryuuk 2 years ago

    Thank you for the effort you are putting into these videos, it's really helpful.

  • @candacedillon97
    @candacedillon97 3 years ago

    I actually decided not to do this one..... Yet. I have ZERO experience with Python so I want to get familiar with it first. I will be back for this one though!

  • @MohammadKhan-l8j
    @MohammadKhan-l8j 9 months ago

    At 46:00, if the 'name' column hasn't been numerized after running the code, change the datatype of name in the csv to string. For example, if the name is 21 in the csv, change it to '21' with the quotation marks so that the value becomes a string. Do this for all numbers so that they become strings. Save. Then re-run the code. That should fix it.

  • @rajeshn5006
    @rajeshn5006 2 years ago +1

    Thank you sir ... Nice explanation ...

  • @kateryna6700
    @kateryna6700 2 years ago

    For the Pearson correlation there are a couple of assumptions about the time series under consideration. The time series should be normally distributed, linearly dependent and homoscedastic. In the video you've only checked the linearity. What about the other assumptions? Thanks.

  • @menaa843
    @menaa843 3 years ago +8

    I don't think that correlation with categorical data will work. Even after being turned into numbers, correlation and regression won't work in this case. The only way to introduce categorical data into correlation or regression is to turn it into multiple dummy variables.
    Thanks for the awesome video. 4/4, what does this mean? The series is done :(

    • @tahsinserkanyaman3459
      @tahsinserkanyaman3459 2 years ago

      He doesn't know what he is doing. He is just directing people to the wrong lanes.

    • @jacerains
      @jacerains 2 years ago

      @@tahsinserkanyaman3459 I don't think it's right to say he doesn't know what he's doing. That's a little ridiculous. But yeah, I don't think the correlation and linear regression really work well here with the categorical data.
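The dummy-variable approach mentioned in this thread can be sketched with pd.get_dummies. The data below is invented and the genre column is only an illustration, not the video's method:

```python
import pandas as pd

# Invented sample: one categorical column and one numeric column.
df = pd.DataFrame({
    "genre": ["Action", "Comedy", "Action", "Drama"],
    "gross": [300, 100, 350, 120],
})

# One 0/1 indicator column per category.
dummies = pd.get_dummies(df["genre"], prefix="genre", dtype=int)

# Correlate gross with each indicator instead of an arbitrary code.
encoded = pd.concat([df[["gross"]], dummies], axis=1)
corr = encoded.corr()

print(corr["gross"])
```

Each indicator's correlation with gross then has a direct reading, for example whether being an Action movie goes along with a higher gross in this sample. With thousands of distinct companies or directors this creates one column per value, so it may only be practical for low-cardinality columns.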

  • @avinashbisram4946
    @avinashbisram4946 3 years ago +3

    Hi Alex, can you clarify how cat.codes work? I tried researching more about them online but couldn't really wrap my head around it. They all look like random numbers. How can we be confident that our final correlation matrix actually worked the way we wanted to? Also do the cat codes take into account very similar names like the multiple variations of "Walt Disney". Thanks so much!
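On the "Walt Disney" part of the question: as far as I can tell, .cat.codes matches strings exactly, so each spelling variant gets its own code. A short sketch with invented variants:

```python
import pandas as pd

# Invented variants of a similar company name.
companies = pd.Series([
    "Walt Disney Pictures",
    "Walt Disney Productions",
    "Walt Disney Pictures",
    "Pixar",
])

codes = companies.astype("category").cat.codes

print(codes.tolist())
# [1, 2, 1, 0]
```

The two Disney spellings end up with different codes (1 and 2), so similar names would have to be merged by hand before encoding if they should count as one company.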

  • @kuldeepsinghdudi2679
    @kuldeepsinghdudi2679 3 years ago +1

    Looking forward to more 🥰

  • @tendaimoyo
    @tendaimoyo 3 years ago +1

    Thank you for another tutorial, Alex. Really appreciate the effort you put in your explanations. I'm really looking forward to the next video in the series. Watching from halfway across the world in Cape Town. #savingsoulsfromworkplaceembarassement

  • @elisdavanzo2854
    @elisdavanzo2854 3 years ago +1

    Thank you so much for this video!! Really appreciated