Data Analyst Portfolio Project | Correlation in Python | Project 4/4

  • Published: Sep 6, 2024

Comments • 493

  • @danielbristow6954
    @danielbristow6954 3 years ago +384

    Update: Alex, I just accepted my first job as a junior data analyst. This completes my 6-month journey to learn data analytics and change careers, and I could not have done it without your excellent Portfolio videos. Thank you so much for making these available to your viewers for free. After I built my portfolio, companies started taking a second look at my resume and inviting me to interviews. BEFORE the portfolio, I received ONLY rejection emails. Thank you, thank you, thank you!

    • @AlexTheAnalyst
      @AlexTheAnalyst 3 years ago +14

      Congratulations!!

    • @Datalover-Analysts
      @Datalover-Analysts 3 years ago +5

      I am getting only rejection

    • @danielbristow6954
      @danielbristow6954 3 years ago +26

      @Datalover-Analysts Hi Pooja, I’m sorry. I know rejection can be discouraging. I received over 100 rejection emails from job applications before I finally started getting interviews. Without knowing the details of your situation, I can only encourage you to keep trying and don’t give up. Everyone’s journey to data is different, but I don’t think it is ever easy, especially if you’re trying to change careers, which is what I was doing. I wish you the best.

    • @Datalover-Analysts
      @Datalover-Analysts 3 years ago +2

      @danielbristow6954 Sure, I am making a portfolio with the help of Alex's videos. I also did some certifications from Coursera and Azure Fundamentals.

    • @thehash8
      @thehash8 2 years ago

      Hey Daniel, congratulations!!! Can you please also mention what certifications you did? I am building my portfolio with the help of Alex’s platform. I also completed my MSCS degree this month.

  • @neella97
    @neella97 1 year ago +72

    If anyone else is having issues due to IntCastingNaNError, I advise trying the following:
    df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(int)
    df['gross'] = pd.to_numeric(df['gross'], errors='coerce').fillna(0).astype(int)
    It worked! :) Thank you Alex for your amazing videos!

    • @SearchingforScraps
      @SearchingforScraps 1 year ago +3

      Thank you!!! I almost gave up, as I am not well versed enough in Python to make these changes now that the data has changed from the original set he worked on.

    • @esrakareem7071
      @esrakareem7071 1 year ago +1

      thank you ❤

    • @SquooHipPa
      @SquooHipPa 1 year ago +2

      Thank you so much for this, I was trying to google it before realizing it might be in the comments. If you have the time can you explain this part of the code? errors='coerce').fillna(0).astype(int)
      I did look it up but was getting a little confused by it. Thank you again :)

    • @IsmailAbdikadir-lk6lf
      @IsmailAbdikadir-lk6lf 1 year ago +1

      Thanks, this was a lot of help. I used this to avoid int32 and it worked.
      df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(np.int64)

    • @victorbegnini5754
      @victorbegnini5754 10 months ago +1

      Top notch! Thanks
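
For anyone asking what `errors='coerce').fillna(0).astype(int)` does in the fix above, here is a minimal sketch that runs the chain one step at a time. The three-row frame is made up for illustration and is not taken from the Kaggle dataset; `'int64'` is used instead of `int`, following the reply above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one clean value, one missing value, one non-numeric string.
df = pd.DataFrame({"budget": [19000000.0, np.nan, "unknown"]})

# Step 1: errors='coerce' turns anything that cannot be parsed into NaN.
as_numbers = pd.to_numeric(df["budget"], errors="coerce")

# Step 2: fillna(0) replaces every NaN with 0, so no missing values remain.
filled = as_numbers.fillna(0)

# Step 3: astype('int64') is now safe because there is no NaN left to cast.
df["budget"] = filled.astype("int64")

print(df["budget"].tolist())
```

Note that filling with 0 treats an unknown budget as a budget of zero, which can pull correlations around; dropping those rows instead is a reasonable alternative.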

  • @woahnelly3286
    @woahnelly3286 1 year ago +57

    Hey all, just a "stats" heads up/correction you might want to make for your portfolio:
    In this video, Alex wanted to see if the company was "correlated" with gross revenue. What he did was assign values (randomly, I think) to companies, countries, etc. Then he tried to see if those values were related to the gross revenue.
    Those randomly assigned values are "measuring" the company, country, etc at the Nominal scale. In other words, they're essentially just being used as a numeric "name"-the values themselves don't mean anything. What that means is that one value being higher than the other doesn't represent an increase in the thing being measured (for example, the USA was assigned a 54 and the UK was assigned a 53. Those are just names... the USA isn't one more of something than the UK).
    Because the values themselves don't represent anything, it doesn't make sense to do a correlation with them.
    Correlations tell us, as one variable increases, what happens to the other? So in the first question, as the budget increased, what happened to revenue? It increased. But with country, company, or other categorical variables, correlations don't make sense. The values for country and company are random, so the numbers that represent them going up doesn't tell us anything. It's no wonder then, that the correlations weren't large.
    Instead, it would make sense to do a t-test or ANOVA and compare means. In that case, the question would be, "Do some companies tend to produce higher revenue than others?" Or, "Do some countries tend to produce higher revenue?" etc. (For more discussion, see: www.webpages.uidaho.edu/~stevel/519/Correlation%20between%20a%20nominal%20(IV)%20and%20a%20continuous%20(DV)%20variable.html).
    Since this is a portfolio project and you want to show potential employers the result, maybe just take that part out-you wouldn't want to make a mistake like that in an application to a potential employer!
    (Alex, thanks so much for doing these videos! They're super helpful and I'm very very grateful!)
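
The "compare means" idea in the comment above can be sketched with a groupby and a hand-computed one-way ANOVA F statistic. The company names and gross values below are invented for illustration; in practice a library routine such as `scipy.stats.f_oneway` would also give a p-value, but it is not used here so the sketch only needs pandas.

```python
import pandas as pd

# Hypothetical toy data: gross revenue (in millions) for three made-up companies.
df = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "gross":   [10.0, 12.0, 14.0, 20.0, 22.0, 24.0, 30.0, 32.0, 34.0],
})

groups = df.groupby("company")["gross"]
group_means = groups.mean()
grand_mean = df["gross"].mean()

# Between-group sum of squares: how far each group mean sits from the grand mean.
ss_between = (groups.size() * (group_means - grand_mean) ** 2).sum()
# Within-group sum of squares: spread of the values around their own group mean.
ss_within = ((df["gross"] - groups.transform("mean")) ** 2).sum()

k = groups.ngroups          # number of groups
n = len(df)                 # number of observations
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print(group_means)
print("F statistic:", f_stat)
```

A large F statistic means the group means differ by more than the within-group spread would suggest, which is the question "do some companies tend to produce higher revenue than others?" in numeric form.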

    • @vishalmalar8406
      @vishalmalar8406 1 year ago +5

      Thank you so much for that clarification. I was so much confused and spent a lot of time wondering how random values made sense in determining a correlation.

    • @InsightIntoLife
      @InsightIntoLife 1 year ago +2

      where do we make the corrections?

    • @somtoobi-ym8ic
      @somtoobi-ym8ic 10 months ago +1

      I noticed that in my dataset, avatar has a gross revenue of -2,147,483,648, and it just feels wrong. Is there something I am not doing right?

    • @somtoobi-ym8ic
      @somtoobi-ym8ic 10 months ago +1

      I just noticed that converting to int type gave me this error
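
A value of exactly -2,147,483,648 is the smallest number a 32-bit signed integer can hold, so one plausible explanation (an assumption, not confirmed in the thread) is that `astype(int)` resolved to int32 on that setup and the gross figure overflowed. The sketch below uses an assumed gross of about 2.85 billion to show why the value does not fit and how asking for int64 explicitly avoids the problem.

```python
import numpy as np

# Assumed value: a gross figure of about 2.85 billion, larger than int32 can hold.
gross = np.array([2847246203.0])

int32_max = np.iinfo(np.int32).max   # largest value a 32-bit signed integer can store
print(int32_max)
print(gross[0] > int32_max)          # the value does not fit in 32 bits

# Asking for a 64-bit integer explicitly avoids relying on the platform default.
safe = gross.astype(np.int64)
print(safe[0])
```

If this is the cause, casting with `astype('int64')` (or `np.int64`) instead of `astype(int)` should keep the original values.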

  • @izzyinsc
    @izzyinsc 2 years ago +387

    Hello! The dataset appears to be updated on Kaggle and for anyone new, you will run into some issues that you need to fix to follow along.
    1. Missing data. Unlike in this video, the dataset now has missing values, so you will need to handle them. There are many ways to handle missing values but for the sake of time, I decided to drop all rows that have missing data. You will have about 71% of your data remaining. You will need to run the following if your dataframe is named df.
    df = df.dropna()
    2. Extracting the year is different as the formatting is different. Running the following should extract the correct year.
    df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(int)
    3. Duplicates, there aren't any in this dataset so you should be fine on that.
    I hope this helps anyone that is working on this and best of luck on your analytics journey!

    • @Leo28039
      @Leo28039 2 years ago +7

      Sir you are a hero

    • @ishaangupta6915
      @ishaangupta6915 2 years ago

      Thank you. I just made the amendments after reading your comment and others, and found a way to check and verify the status before and after execution. Thanks.

    • @_danfiz
      @_danfiz 2 years ago +2

      Thank you so much!

    • @aliceemma135
      @aliceemma135 2 years ago +1

      Thanks, I was stuck at extracting the correct year and now can finally solve it!

    • @RaNdUmBjAkE
      @RaNdUmBjAkE 2 years ago

      Thank you. I could not figure the year out.
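
The drop-and-extract steps from the top comment in this thread can be tried on a small frame first. The rows below are hypothetical examples of the newer "date (country)" layout; `expand=False` is an addition so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical rows in the newer 'released' format: date followed by a country.
df = pd.DataFrame({"released": ["June 13, 1980 (United States)",
                                "July 2, 1980 (United States)",
                                None]})

# Drop rows with any missing value, then reassign so the change sticks.
df = df.dropna()

# The pattern grabs the first run of four digits, which is the year in this layout.
# expand=False returns a Series, so the result can be cast and assigned directly.
df["yearcorrect"] = df["released"].str.extract(r"(\d{4})", expand=False).astype("int64")

print(df["yearcorrect"].tolist())
```

This relies on the year being the first four-digit run in the string, which holds for the layout above but is worth checking against the real column.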

  • @darkavenger100
    @darkavenger100 3 years ago +66

    I can't wait for the beginner, intermediate, and advanced Python series by Alex the Analyst. It's what the people want, besides a happy Alex.

  • @naincypushpad2093
    @naincypushpad2093 11 months ago +36

    If df.corr() shows an error that a string can't be converted to a number, pass the parameter: df.corr(numeric_only=True)
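
A minimal sketch of that fix, assuming pandas 1.5 or newer (where `numeric_only` is accepted by `corr`). The column names mirror the project, but the four rows are made up.

```python
import pandas as pd

# Hypothetical mini dataset with one text column and two numeric columns.
df = pd.DataFrame({
    "name":   ["a", "b", "c", "d"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "gross":  [2.0, 4.0, 6.0, 8.0],
})

# numeric_only=True (capital T, Python's boolean) skips the text column
# instead of trying to convert it to a number.
corr = df.corr(method="pearson", numeric_only=True)
print(corr)
```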

  • @rickydonne802
    @rickydonne802 2 years ago +13

    At 11:08, instead of printing the null percentage, we can use:
    for col in df.columns:
        print(df[col].isnull().value_counts(), "\n")
    This will print how many values are null, because you might have 1 missing value in 10k, and you would need high decimal precision to see it as a percentage.
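
A compact way to get both views at once, sketched on an invented four-row frame: `isnull().sum()` gives the absolute count per column and `isnull().mean()` gives the share.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'budget' has one missing value out of four rows.
df = pd.DataFrame({"budget": [1.0, np.nan, 3.0, 4.0],
                   "gross":  [5.0, 6.0, 7.0, 8.0]})

null_counts = df.isnull().sum()          # absolute number of nulls per column
null_share = df.isnull().mean() * 100    # the same information as a percentage

print(null_counts)
print(null_share)
```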

  • @OmarJimenez-dq8sr
    @OmarJimenez-dq8sr 1 year ago +33

    The dataset has been updated and is not the same as the one in the video. If you have problems in the 'Create correct year' section, you can split the data to get only the year:
    df['yearcorrect'] = df['released'].astype(str).str.split().str[2]

  • @tylerlaquinta2996
    @tylerlaquinta2996 3 years ago +59

    Hey guys the info got updated since this video was posted. While I was going through the project I was able to google the problems as they came up.
    In case you guys get stumped here's what I found that works:
    This will drop any rows with null values
    df = df.dropna(how='any',axis=0)
    This will extract the year from the released column into a separate column
    df['yearcorrect'] = df['released'].astype(str).str.split(', ').str[-1].astype(str).str[:4]
    Let me know if that works for y'all

    • @IProXie
      @IProXie 3 years ago +1

      I still get the error name 'df' is not defined

    • @cuoofyisme4468
      @cuoofyisme4468 2 years ago +1

      Thank you so much for the solution for the updated dataset. It saved me from struggling with the updated release date.

    • @shyamkumar6009
      @shyamkumar6009 2 years ago

      I dropped the rows, but I think it's only dropping them temporarily, because if I make the scatter plot after that it still shows NA values.

    • @meghanakurada7242
      @meghanakurada7242 2 years ago +1

      @shyamkumar6009 I think you should use df.dropna(how='any', axis=0, inplace=True) to drop the null values permanently (without assigning the result back to df, since inplace=True returns None).

    • @kizzyleee
      @kizzyleee 2 years ago +2

      Also, the released column changed format again and I used this to fix it
      # fix the date released format
      df['release_date'] = df.apply(lambda x: x['released'][0:x['released'].find(' (')],axis=1)
      df['release_date'] = pd.to_datetime(df['release_date'], infer_datetime_format=True)
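
On the "dropping temporarily" question in this thread: `dropna` returns a new frame by default, so the result has to be assigned back, while `inplace=True` changes the frame itself and returns `None`. A minimal sketch on made-up data showing both options (pick one, not both):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"gross": [1.0, np.nan, 3.0]})  # hypothetical data

# Option 1: dropna returns a new frame, so keep the result in a variable.
cleaned = df.dropna(how="any", axis=0)

# Option 2: inplace=True changes df itself and returns None.
result = df.dropna(how="any", axis=0, inplace=True)

print(len(cleaned), len(df), result)
```

Writing `df = df.dropna(..., inplace=True)` combines the two and leaves `df` set to `None`.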

  • @rebekhathangam7466
    @rebekhathangam7466 1 year ago +2

    If you are facing an error in datatype change, try the following
    df_copy = df.copy()
    df_copy['budget'] = df_copy['budget'].astype('int64')
    df_copy['gross'] = df_copy['gross'].astype('int64')
    df_copy
    Thank you Alex for this amazing video

  • @freidkholy
    @freidkholy 1 year ago +8

    To whoever noticed at 15:27 that the 'released' column is not in the same format as Alex's and is getting errors because of that:
    I've been where you are. It took me 4 days just to figure this out. Here is the line of code you need:
    df['released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format = '%B %d, %Y')
    hope it helped..

    • @karanikabj4422
      @karanikabj4422 1 year ago +1

      That worked! Thank you. But can you explain your code, please?

    • @freidkholy
      @freidkholy 1 year ago

      @karanikabj4422 Believe me when I tell you I wish I could, xd.
      This is basically my very first time using Python, so I don't fully understand it, unlike R and SQL, which I know well. But at least I know how to research.
      What I understand is that we use the pandas function to_datetime, exclude the '(United States)' part, and set the format to match the rest of the date. This is applied to the whole column, and the result is assigned to a column with the same name, which overwrites the original column.

    • @Driven-dave
      @Driven-dave 1 year ago

      I don’t get the year like Alex does, only the months.

    • @c.obazeIII
      @c.obazeIII 1 year ago

      Man i'm so grateful, you won't believe how much time i was stuck on this.

    • @vishakhasingh3162
      @vishakhasingh3162 10 months ago

      thanks☺☺
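
For those asking how the one-liner at the top of this thread works, here is a sketch that splits it into two steps on hypothetical values in the "Month day, year (Country)" layout. The final `.dt.year` call is an addition showing one way to get just the year once the column is a real datetime.

```python
import pandas as pd

# Hypothetical values in the "Month day, year (Country)" layout.
released = pd.Series(["June 13, 1980 (United States)",
                      "December 25, 2009 (Japan)"])

# (\w+ \d+, \d+) captures "Month day, year" and leaves the "(Country)" part out.
date_text = released.str.extract(r"(\w+ \d+, \d+)", expand=False)

# %B = full month name, %d = day of month, %Y = four-digit year.
dates = pd.to_datetime(date_text, format="%B %d, %Y")

print(dates.dt.year.tolist())
```

Rows that do not match the pattern (for example a value with only a year and a country) come out of `str.extract` as missing, so it is worth checking for nulls after this step.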

  • @vishakhasingh3162
    @vishakhasingh3162 10 months ago +7

    At 13:40 If you are facing an error in datatype change, try the following :-
    df['budget'].round().astype('Int64')
    df['budget']=df['budget'].astype('Int64')
    hope it will help uh

  • @reezalzainudin8097
    @reezalzainudin8097 2 years ago +15

    Hey guys, at 46:24, we can simply call the .copy() method when creating our new variable if we want the for loop to modify it without affecting the original dataframe, df:
    df_numerized = df.copy()
    for col_name in df_numerized.columns:
        if df_numerized[col_name].dtype == 'object':
            df_numerized[col_name] = df_numerized[col_name].astype('category')
            df_numerized[col_name] = df_numerized[col_name].cat.codes
    df_numerized

    • @mikeneumann5611
      @mikeneumann5611 2 years ago

      I don't know how much time this saved me but it would have been a lot.

  • @moushmi_nishiganddha
    @moushmi_nishiganddha 3 years ago +17

    there are some missing value in this dataset
    Alex try this instead of that for loop statement
    df.isnull().sum()
    this will give total number of nulls for every column/variable

  • @gastonsuarez5320
    @gastonsuarez5320 1 year ago +1

    I really appreciate the fact that you did not edit out the parts where you made "mistakes" and actually fixed them.

  • @omashan6634
    @omashan6634 3 years ago +5

    Honestly, you're an absolute legend. You really break down some of the technical barriers that exist for people entering the field of data science. You really are a gem to the community.

  • @Major_Data
    @Major_Data 3 years ago +13

    As The Rock says; "FINALLY!"
    I'm a bit embarrassed by how excited I get when an ATA video clocks in at over an hour...

  • @mikeramirez7238
    @mikeramirez7238 10 months ago +5

    To everyone getting error for df.corr()
    this was my fix:
    # since pandas version 2.0.0 now you need to add numeric_only=True param to avoid issue
    df.corr(method='pearson', numeric_only=True) #pearson, kendall, spearman
    ---
    correlation_matrix = df.corr(method='pearson', numeric_only=True)
    sns.heatmap(correlation_matrix, annot=True)
    plt.show()

  • @sarahcongcongyang
    @sarahcongcongyang 11 months ago +5

    Thank you, Alex! I learned so much.
    Is anyone's correlation matrix not working? You need to add numeric_only=True; the default is now False.
    correlation_matrix = df.corr(method = 'pearson',numeric_only = True)

  • @purneswarprasad4710
    @purneswarprasad4710 3 years ago +21

    At 29:48, I think x should be 'Budget' and y should be 'Gross earnings'

  • @snudgegalbraith3447
    @snudgegalbraith3447 3 years ago +1

    I have recently decided on becoming a data analyst and your videos are really helping me understand what i need to do and keep me motivated on that goal which will improve my life. I want to say thank you for your content and your honest helpfulness.

  • @busarakummusikaput534
    @busarakummusikaput534 1 year ago +1

    Thank you so much for your videos; they helped me a lot, and I finally got a job as a data engineer with no experience in the role, after learning from your channel for 1 month! I gained a lot of knowledge. I really appreciate your support. Thank you very, very much!

    • @TalesHQ
      @TalesHQ 11 months ago

      Congratulations

  • @modi_cl
    @modi_cl 1 year ago +7

    The 'released' column has been updated; it now contains a text-format date plus the country of release. What I did was split the column in two: release date and release country.
    The code I used was this:
    df[['released','country_release']] = df['released'].str.split(r' \(', n=1, expand=True)
    Then you have to clean up the 'country_release' column a little with:
    df['country_release'] = df['country_release'].str.replace(')', '', regex=False)
    And finally give the 'released' column the datetime format with this:
    df['released'] = pd.to_datetime(df['released'], format='mixed')
    For some reason using format='mixed' did the trick for me; I tried '%B %d, %Y' but it never worked.

  • @srimonmahapatra4667
    @srimonmahapatra4667 2 years ago +1

    I am working on this dataset as of today.
    Tip:
    FYI, before converting the 'budget' and 'gross' columns, look for any null values; I downloaded the dataset recently and it had some.
    They threw an error during the conversion, so make sure the NaN values in both columns are set to 0 before converting.
    When creating the 'yearcorrect' column, try splitting with .str.split(',', n=1, expand=True) and then use df['yearcorrect'] = year.astype(str).str[:5]
    This is for getting the year out of the released column. I did it the way shown in the video but got the month, so if that works for you, fine; otherwise try the method above.
    This gets things done.
    Thank you.
    And also thank you, Alex, you are doing a great job🙂

  • @seritiaymen4140
    @seritiaymen4140 2 years ago +16

    08/02/2022 - I'm using the dataset as of this date; there have been many changes and unfortunately pct_missing is not 0.0%.
    I copied the content of df into a new dataframe called Newdf with the rows containing nulls deleted:
    Newdf = df.dropna(axis=0)
    print(Newdf.isnull().sum(), '\n')

  • @Dpereira96
    @Dpereira96 3 years ago +6

    Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people to pursue their dreams! Greetings from Brazil :)

  • @xman9087
    @xman9087 3 years ago +7

    The only thing that encourages me to watch is your smile
    Keep smiling 🙏❤️

    • @AlexTheAnalyst
      @AlexTheAnalyst 3 years ago

      Haha I hope the high quality video content also makes you smile 😁

  • @yuli3435
    @yuli3435 1 year ago +2

    Thank you very much for the video! I want to change my career path to data analytics, and your videos have been a very good learning material. Although the data has been updated and some of the methods in this video do not work anymore, it is a fantastic guidance (and, ultimately, to become good at something, you have to do a fair share of self-study).
    One thing to note though: I don't think the pearson correlation coefficient can be used to check the relationship between a categorical and a continuous variable. So, the low correlation coefficient for company, for example, might be misleading. Since, after all, the numeric ID assigned to the string values does not necessarily increase with the size of the company.

  • @MO-fo7on
    @MO-fo7on 2 years ago +3

    Hi Alex. Thank you for the portfolio project series.
    For the missing values, I think the 0.0% it showed for every column has been approximated. If you use describe() and info(), you will notice some null values.
    Thanks again for the videos, they are really helpful.

    • @synaestheticVI
      @synaestheticVI 2 years ago +1

      that's what I thought at first, too, but the data set has simply changed since he uploaded the video (or he used an already edited one). So now there are a few columns that even have values like 0.28%...

    • @sonalrao7656
      @sonalrao7656 2 years ago +1

      @synaestheticVI Yes, can anyone explain what we should do in that situation?

  • @d_dharawat
    @d_dharawat 3 years ago +2

    Soooooooooooooooooooooooo excited for the last website video to come out
    For the first time in my 19 years of living, i feel pretty confident of making something to its perfection by myself (and your help too🙌)

  • @mikeg4691
    @mikeg4691 1 year ago +2

    5:20 Faster way to do this is to shift right-click the file and copy as path.
    The "apostrophes" are just called single quotes

  • @shanali3473
    @shanali3473 2 years ago +2

    Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from the UK :)

  • @okojieamos82
    @okojieamos82 3 years ago +2

    Hi Alex, thank you so much for all the videos. OK, here is the thing: I haven't taken this class. I was actually learning Python before seeing your page and decided to learn SQL. I took all the videos you have on SQL and the first 3 portfolio projects on SQL and Tableau. Then I went to StrataScratch and registered for the free option; they gave me access to 50 interview questions, some easy, some medium, and some hard. But the interesting thing is, I was only able to answer 1 question from the easy ones, and the others I couldn't answer. That almost made me feel discouraged, but I am thinking I just need to spend more time on SQL tutorials before moving back to Python. I would like to hear your opinion and that of others who had a similar experience. Again, thank you so much for all your effort, you are touching lives!

  • @hazimrashid1231
    @hazimrashid1231 3 years ago +3

    Hi Alex, just finished the project. It’s awesome. Thanks for everything. I pray for your success in the future.

  • @davidyolchuyev2905
    @davidyolchuyev2905 3 years ago +2

    This video came in just on time.
    I finished building my portfolio yesterday. Thank you for the tips.

    • @umairahmed2418
      @umairahmed2418 5 months ago

      Can you provide some tips? I have worked on projects but having troubles with how to present and display the projects? Can you share your link

  • @kshitijsingh7176
    @kshitijsingh7176 2 years ago +4

    All you folks getting data analyst jobs left and right, could you give a glimpse of how you presented this project on your CVs? Alex some help would be great.

  • @danielbristow6954
    @danielbristow6954 3 years ago +4

    Another excellent portfolio project from Alex! My portfolio is starting to look very good, and I finally have something to upload to job applications that request a portfolio! Thank you, Alex!!

    • @AlexTheAnalyst
      @AlexTheAnalyst 3 years ago

      That's great! Glad to hear it's been helpful!

  • @netol02
    @netol02 1 year ago

    This 4 part tutorial is pure gold! After your announcement that you were launching your version of data analyst course/certification, can’t wait for when it goes live, as to follow up in more depth for the concepts presented in this series. Really appreciate the time, dedication and quality of content you produce Alex.

  • @ompandey2012
    @ompandey2012 3 years ago +8

    Hey Alex, please make videos on, how to handle missing/null values in python.

  • @girishbhagwanani1604
    @girishbhagwanani1604 8 months ago +1

    At 13:40 If you are facing an error in datatype change, try the following
    df['gross'] = pd.to_numeric(df['gross'], errors='coerce', downcast='integer')
    df['gross'].isna().sum()
    df[df['gross'].isna()]
    df['gross'] = df['gross'].fillna(0)
    df['budget'] = df['budget'].fillna(0)
    df['budget'] = df['budget'].astype('int64')
    df['gross'] = df['gross'].round().astype('int64')

  • @surbhivishwakarma8007
    @surbhivishwakarma8007 4 months ago

    I made it to the 4th part. Thanks, Alex, for this tutorial. For those who don't get the released date, you can use this code: df['format_released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format='%B %d, %Y')

  • @lucaspassosbarreto
    @lucaspassosbarreto 2 years ago +5

    If you downloaded the dataset after the video was made, it may have some missing values that prevent you from converting columns into integers. Use df = df.fillna(0)

    • @beardedmtbr
      @beardedmtbr 2 years ago

      should we just ignore the missing entries or did you delete them?

  • @akinakin4920
    @akinakin4920 2 years ago +4

    Somewhere around 14:05, where Alex was converting the gross and budget columns to int64, his code wouldn't work for me, but after some research I found this to work:
    df['gross'] = df['gross'].fillna(0).astype('int64')
    df['budget'] = df['budget'].fillna(0).astype('int64')

  • @hoonzip
    @hoonzip 1 year ago +3

    I can't believe I followed along and understood everything. I wasn't even sure if I would be able to before I started. Thank you so much Alex! With your help, I've gained more confidence in pursuing a career in data analytics. I'm definitely going to do more of your projects and hope to be able to land a full-time data analytics job this year. Thanks again!

    • @AlexTheAnalyst
      @AlexTheAnalyst 1 year ago +1

      Woohoo! You're doing great!

    • @adolfvictor1500
      @adolfvictor1500 1 year ago

      I see it's barely 4 months since you made this comment. I have an issue with the scatter plot, can you help me out?

  • @priankakibria6976
    @priankakibria6976 2 years ago +1

    Very well instructed! Way better than any of the BootCamp lectures I had gotten previously. Perfect for a refresher and portfolio work. Thank you!

  • @JuanPerez-iu9vk
    @JuanPerez-iu9vk 4 months ago

    Hi Alex, great job you are doing in your channel, thank you very much. I just wanted to say, if it might help anybody who is watching you (because I believe that you already know it after two years), that a correlation measures the reaction of one value against the movement of another value. For this, both values need to be able to move and get bigger or smaller, something that letters (and names, in consequence) cannot do, neither can letters or names disguised as numbers, because those are static too, hence, the "non numeric" correlations showed in the video, are false.
    This problem could just be a mess up in any project, but, in a project that is our portfolio, our showroom, it will only show how much of a data analyst we are NOT.
    Kindly study a work around to this statistical problem, which is not to change letters by numbers. So many great developers behind Pandas would have implemented it many years ago. 😀

  • @nikosbako1974
    @nikosbako1974 7 months ago

    ' ' -> single quotes and {} -> curly brackets. Just in case you have not already received a similar answer. Either way, keep up the great work :)

  • @aparna1498
    @aparna1498 1 year ago +4

    Hi, I faced an issue with the command at 13:48, so this might help someone.
    For me Jupyter gives *ValueError: Cannot convert non-finite values (NA or inf) to integer*
    Instead, you can use a capital *I* (Int64 rather than int64) in the same command
    *df['budget'] = df['budget'].astype('Int64')*

    • @grammytailor7715
      @grammytailor7715 1 year ago +1

      Thank you I was stuck here. 😀

    • @user-fw3vv4id5z
      @user-fw3vv4id5z 1 year ago

      Thank you, I was stuck there trying to understand what I was doing wrong, I'd never figure out it was just I instead of i 🤣
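
The difference between lowercase `int64` and capital `Int64` in this thread can be seen on a small made-up Series: the first is NumPy's integer type, which has no way to store a missing value, and the second is pandas' nullable integer type. A minimal sketch:

```python
import numpy as np
import pandas as pd

budget = pd.Series([19000000.0, np.nan, 4500000.0])  # hypothetical values

# Lowercase 'int64' cannot represent a missing value,
# so this cast raises an error while a NaN is present.
try:
    budget.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# Capital-I 'Int64' is pandas' nullable integer type: the gap is kept as <NA>.
nullable = budget.astype("Int64")

print(cast_failed)
print(nullable)
```

The missing values are still there after the cast, so later steps (plots, correlations) may still need them dropped or filled.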

  • @tiffanyder2377
    @tiffanyder2377 3 years ago +11

    Hey Alex - Thank you for this. Right around 49:25 you talk about the correlation matrix of the df_numerized dataframe that is being shown as a heatmap. I do have a question about that....: when you did .cat.codes in the cells above, how did the category values of the previous objects (company, country, director) represent any value that can be correlated? For instance, using one row as an example, I'm confused how index 6380 at the top of the dataframe has a company categorical value of 1428. Is this by random or did the code construct some sort of logical thinking and gave a numeric value based on other data patterns?? .... Sorry if I am confusing you, it's just when I got to the heatmap part of the df_numerized dataframe I was kind of lost as to how categories can actually represent correlations if the categorical value given to it was completely random. thanks,
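
On how `.cat.codes` picks its numbers: they are not random, but they are not meaningful for revenue either. For plain text columns, pandas sorts the unique values and uses each value's position in that sorted list as its code. A sketch with made-up company names:

```python
import pandas as pd

# Hypothetical company names in arbitrary row order.
company = pd.Series(["Warner Bros.", "Columbia Pictures", "Universal Pictures",
                     "Columbia Pictures"])

as_category = company.astype("category")

# For plain strings, categories are the sorted unique values, and each code is
# the position of the value in that sorted list.
print(list(as_category.cat.categories))
print(as_category.cat.codes.tolist())
```

So a company's code reflects where its name falls alphabetically among the other names, which is why a correlation between that code and gross revenue says little.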

  • @newyork397
    @newyork397 2 years ago +1

    Great community in the comment section. Thanks for this analysis Alex! I couldn't make my career pivot without your help

  • @augustmee
    @augustmee 1 year ago +5

    Started doing the project and noticed the Kaggle data is slightly different from the one in the video. There were some negative numbers in the gross column. To change them to positive I had to run this code:
    # apply conditional function to the column containing negative numbers
    df['gross'] = df['gross'].apply(lambda x: abs(x) if x < 0 else x)

    • @SquooHipPa
      @SquooHipPa 1 year ago

      Thank you so much for this!

    • @daisywestern9290
      @daisywestern9290 1 year ago

      Thank you for this, my scatter plot was not showing properly because of the negative numbers and this fixed it. Again, thank you!

    • @victorbegnini5754
      @victorbegnini5754 10 months ago

      Great, man! Thanks!

  • @videohub9521
    @videohub9521 3 years ago +5

    Sir, I am a civil engineer doing my master's. During my thesis I got some machine learning work, and when I presented my data in Tableau, which I learned through your channel, my supervisor gave me extra credit. Thank you. Now I am thinking of switching to this field. Thank you for your efforts.
    My question is: after learning the required skills, how can I start applying to companies? Second, please start an interview series where you discuss how and what type of technical questions are asked in interviews.

    • @AlexTheAnalyst
      @AlexTheAnalyst 3 years ago +1

      That’s great! I have a few videos on how to work with recruiters and prep for interviews - I think those would be helpful to you 👍

    • @videohub9521
      @videohub9521 3 years ago +3

      @AlexTheAnalyst I am comparatively new to the channel; I will surely go through those videos. Thanks from the whole student community for your great contribution to our guidance and learning. Looking forward to learning more insights from the channel 🙌🏽
      Love from India 🇮🇳 to Alex the Analyst.

  • @eeshangautam
    @eeshangautam 1 year ago

    whoever is coming here after completing their portfolios watching all the 3 videos and here for the 4th...
    I WISH YOU ALL THE BEST!
    with so much love and gratitude for Alex!

  • @unclehorouzoezie8744
    @unclehorouzoezie8744 3 years ago +5

    Thanks Alex. You are still the best.

  • @user-iy9xv9tb9y
    @user-iy9xv9tb9y 10 months ago +2

    Came here to say that if you're trying to run the df.corr() and it's trying to run the correlation math on string data columns, simply add in the argument df.corr(numeric_only=True)

    • @user-ls4xk3zd6n
      @user-ls4xk3zd6n 10 months ago

      Life saver, thank you mate! any idea why this is happening?

    • @user-sz8gw5yn6c
      @user-sz8gw5yn6c 8 months ago

      thankkkkkk you

  • @michaelcollins3685
    @michaelcollins3685 2 years ago +3

    I might be missing something, please correct me if I'm wrong as I'm tired as I type:
    Hasn't Alex mislabeled the first scatterplot? Isn't budget on x and gross on y? Whereas he has labelled the opposite. This is around the 30:00 mark.

  • @matts9577
    @matts9577 1 year ago +1

    Thank you so much again! Portfolio completed thanks to you! My first résolution of the year is done thanks to you! Happy new year 🎉🎉🎉

  • @NroShock
    @NroShock 3 years ago +1

    Thank you Alex, this has been a great project! You are a great teacher and this has been very helpful. Looking forward to everything you release in the future!

  • @p3shitonlin3
    @p3shitonlin3 3 years ago +3

    I love you so much. You really make my life easier. Thank you for putting out all of those helpful video. You are the best!

  • @shirt59
    @shirt59 3 years ago +2

    Almost done with this fantastic series. Excited for your upcoming video on data scraping. For future videos in this series could you possibly do one on APIs (making a project using some public API) and something on big data maybe?

  • @prajjvalverma
    @prajjvalverma 1 year ago +1

    Here at 11:00 when finding missing values write command as ---
    pct_missing = np.mean(df[col].isnull())
    OR you can write
    pct_missing = df.isnull().mean().sort_values(ascending=False)
    If there are missing values in Your dataset try to fill it up with 0, here as ----
    df = df.fillna(0)
    at 16:50 to get the Released year only write command as ----
    df['yearCorrect'] = df['released'].astype(str).str.split(',').str[1].str.split('(').str[0]
    at 28:20 to get the scatter plot first try to replace the Null with 0 using code ----
    df.fillna(0, inplace=True)
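
A small sketch of the missing-value check above, assuming a toy frame with invented values. It also shows one thing to keep in mind: filling with 0 makes a missing budget count as a real zero, which shifts the mean (and can shift correlations too), so dropping the incomplete rows is another option.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are invented for illustration).
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "budget": [10.0, np.nan, 30.0, np.nan],
    "gross": [5.0, 6.0, np.nan, 8.0],
})

# Share of missing values per column, as a percentage, worst column first.
pct_missing = df.isnull().mean().sort_values(ascending=False) * 100
print(pct_missing)

# The mean skips NaN by default; after fillna(0) the zeros are averaged in.
mean_skipping_nan = df["budget"].mean()
mean_after_fill = df["budget"].fillna(0).mean()

# Alternative: keep only rows where both numeric columns are present.
df_complete = df.dropna(subset=["budget", "gross"])
```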

  • @metecakmak376
    @metecakmak376 2 years ago +3

    Why is this code not working? Is there anyone who knows the reason? Please tell us!
    for col_name in df_numerized.columns:
        if(df_numerized[col_name].dtype == 'object'):
            df_numerized[col_name] = df_numerized[col_name].astype('category')
            df_numerized[col_name] = df_numerized[col_name]
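
One possible reason, assuming the goal is the video's step of turning text columns into integer codes: the last line assigns the column back to itself, so nothing is ever converted. It probably needs .cat.codes at the end, and the loop body has to be indented. A minimal sketch on a toy frame (the company names and numbers are placeholders; the is_numeric_dtype check is my swap for the dtype == 'object' test, since newer pandas versions may not store text as 'object'):

```python
import pandas as pd

# Toy frame standing in for the movie data (values are placeholders).
df = pd.DataFrame({
    "company": ["Warner Bros.", "Universal", "Warner Bros."],
    "budget": [1, 2, 3],
})

df_numerized = df.copy()
for col_name in df_numerized.columns:
    # Convert every non-numeric column; numeric columns are left alone.
    if not pd.api.types.is_numeric_dtype(df_numerized[col_name]):
        as_category = df_numerized[col_name].astype("category")
        # .cat.codes is the step that actually produces the integers.
        df_numerized[col_name] = as_category.cat.codes

print(df_numerized)
```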

  • @hollandanalytics
    @hollandanalytics 1 year ago +2

    Alex, when I run sort_values, it throws Avatar, Titanic, and Avengers: Endgame into a negative gross. How do I fix this? I don't want to drop the whole column, but leaving the negatives skews the data.
    Update: I got it fixed. But now when I try to build the correlation matrix, I get a ValueError saying it could not convert string to float: 'The Shining'. I am not sure what this means.

    • @victorbegnini5754
      @victorbegnini5754 10 months ago +1

      Hi, Hollanda!
      Try this. Worked for me :)
      correlation_matrix = df.corr(method = 'pearson',numeric_only = True)
      sns.heatmap(correlation_matrix, annot = True)
      plt.title("Correlation matrix for Numeric Features")
      plt.xlabel("Movie features")
      plt.ylabel("Movie features")
      plt.show()

  • @menaa843
    @menaa843 3 years ago +8

    I don't think that correlation with categorical data will work. Even after the categories are turned into numbers, correlation and regression won't work in this case. The only way to introduce categorical data into correlation or regression is to turn it into multiple dummy variables.
    Thanks for the awesome video. 4/4, what does this mean? Is the series done? :(

    • @tahsinserkanyaman3459
      @tahsinserkanyaman3459 2 years ago

      He doesn't know what he is doing. He is just directing people to wrong lanes.

    • @jacerains
      @jacerains 2 years ago

      @@tahsinserkanyaman3459 I don't think it's right to say he doesn't know what he's doing. That's a little ridiculous. But yeah, I don't think the correlation and linear regression really work well here with the categorical data.
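
For anyone who wants to try the dummy-variable route mentioned in this thread, here is a minimal sketch using pd.get_dummies. The frame, the genre labels and the gross values are all invented for illustration; it only shows the mechanics, not a result about the real dataset.

```python
import pandas as pd

# Toy frame: one categorical column and one numeric column (invented values).
df = pd.DataFrame({
    "genre": ["Action", "Comedy", "Action", "Drama"],
    "gross": [9.0, 3.0, 8.0, 2.0],
})

# One 0/1 column per genre instead of a single arbitrary integer code.
dummies = pd.get_dummies(df, columns=["genre"], dtype=int)

# Correlation of each indicator column with gross.
corr_with_gross = dummies.corr()["gross"].drop("gross")
print(corr_with_gross)
```

Each indicator answers a yes/no question ("is this an Action movie?"), so its correlation with gross has a direct reading, unlike a code number whose size carries no meaning.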

  • @nidhigupta7606
    @nidhigupta7606 3 years ago +2

    Thanks Alex, I've been waiting so long for this #4 video. Thanks for sharing.

  • @motlatsimoea6901
    @motlatsimoea6901 1 year ago

    This project taught me a number of new things about pandas. Very helpful! Thank you Alex!

  • @tinuolaabolaji8643
    @tinuolaabolaji8643 2 years ago +3

    Problem: the scatter plot of budget vs gross raises TypeError: float() argument must be a string or a number, not 'NAType'
    Solution: x=df['budget'].astype('float')
    y=df['gross'].astype('float')
    plt.scatter(x,y)

  • @Kangae-Ashi
    @Kangae-Ashi 2 years ago

    Now I have multiple new projects I can add to my portfolio. Thanks, Alex!

  • @aliceemma135
    @aliceemma135 2 years ago

    Thank you so much, Alex! I've taken a few courses about Python and yours is clear and awesome!

  • @victorbegnini5754
    @victorbegnini5754 10 months ago +1

    You're a monster, Alex!
    Thanks a million (as they say here in Ireland)
    God bless, man

  • @ishitatandon4890
    @ishitatandon4890 3 years ago +3

    Hey Alex. Kudos for the good work you do! Can't believe all these resources are free!
    I had a doubt actually.
    Am I the only one, or is it true that the dataset on Kaggle is slightly different from the one used in the projects? The columns are still the same but the values have changed!

  • @mayurshinde7443
    @mayurshinde7443 2 years ago +2

    Thank you so so much Alex!! You have been a guiding light!!

  • @MarekSnip3r
    @MarekSnip3r 3 years ago +4

    Hello Alex! Can you elaborate on the meaning of the correlation matrix which uses df_numerized (48:50)? The numbers assigned to company or country for example, in my understanding, cannot be used for correlation like that.

    • @DustInTheAir
      @DustInTheAir 2 years ago

      That's what I was thinking too. Since in the video he assigned random numbers to replace company names, writers and stars, how can you figure out any correlation from those random values?

  • @JenOween
    @JenOween 1 year ago

    @5:20 You can also just highlight+copy the path in the address bar above instead of taking the extra steps to right click and go into properties to select the path. Much more efficient.

  • @ezhankhan1035
    @ezhankhan1035 7 months ago

    This one was a cool challenge! Required a bit of research on my part to better understand some steps taken (python for data analysis/visualization is not really my area), but well worth it. Great video Alex, thank you!

  • @cheesywombat2661
    @cheesywombat2661 2 months ago

    Anyone getting an issue where in the correlation graph only the top row is filling out the numbers? I looked it up and to have it fill out you need to write annot = True, but still getting issues. Only the name row is filled out.

  • @AlejandroFernandez-of9jy
    @AlejandroFernandez-of9jy 4 months ago

    Not apostrophe. It is single quotes. Thank you for the great tutorial!

  • @e.ghelbur
    @e.ghelbur 4 months ago

    Looks like there's been a mix-up with the axis labels on the graph. The 'budget' and 'revenue' labels are swapped. The 'budget' should actually be labeled as 'revenue' (250 million) and vice versa for the 'revenue' (a billion). Thanks!

  • @pythonking1705
    @pythonking1705 3 years ago +2

    You are the best plz do more videos about data cleaning with SQL server 😍😍😍

  • @qlintdwayne9044
    @qlintdwayne9044 8 months ago +1

    min 35:35 looking at correlation, it returns ValueError, anybody find out why? Or the solution?

  • @datping7377
    @datping7377 1 month ago

    35:46 if anyone is stuck like I was with the df.corr(method ='pearson') you can try this:
    numeric_df = df.select_dtypes(include=[np.number])
    correlation_matrix = numeric_df.corr(method='pearson')
    print(correlation_matrix)

  • @akshayrawal5584
    @akshayrawal5584 2 years ago

    Thank you so much sir for taking time out from such a busy schedule and coming up with an initiative of making such videos in order to help many people around the world interested in starting and developing a career in data analytics :-)

  • @Wei-HsuanTseng
    @Wei-HsuanTseng 1 year ago +1

    An easier way to calculate the percentage of missing values:
    df.isnull().sum().sort_values(ascending=False)/len(df)*100
    Extract the year from released:
    df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(float)
    df['yearcorrect'] = df['yearcorrect'].fillna(0)
    df['yearcorrect'] = df['yearcorrect'].astype(int)

    • @tranthituyetmai8037
      @tranthituyetmai8037 1 year ago

      Alternative for calculating the share of missing values (a fraction, so multiply by 100 for a percentage): df.isnull().mean().sort_values(ascending=False)

  • @shehzanshaikh6904
    @shehzanshaikh6904 2 years ago +2

    Hey Alex,
    I'm getting an error when using cat.codes at 45:41.
    It's showing this error ( 'CategoricalDtype' object has no attribute 'cat' ).
    A little help would be much appreciated.

    • @nishipishi
      @nishipishi 5 months ago

      I'm having the same problem. Did you figure it out?

  • @wesleydavis3387
    @wesleydavis3387 2 years ago

    Good video. I learned how to use correlation matrices, which is new to me. The whole np.mean(df['col'].isnull()) is something I'm still trying to wrap my head around but for now I'll just hit the easy button on it.

  • @yousifaldossary1903
    @yousifaldossary1903 3 years ago +2

    I support your channel all the way. keep up the projects

  • @RongHammer
    @RongHammer 2 months ago

    I'm wondering: the x label should be budget, right? Alex set x=df['budget'] but labeled the x axis as gross earnings. Here is how I labeled it:
    # correlation exploration
    # scatter plot: budget vs gross
    plt.scatter(x= df_cleaned['budget'] ,y=df_cleaned['gross'])
    plt.title('Relationship between Budget and Gross')
    plt.xlabel('Budget')
    plt.ylabel('Gross earning')

  • @avinashbisram4946
    @avinashbisram4946 3 years ago +3

    Hi Alex, can you clarify how cat.codes work? I tried researching more about them online but couldn't really wrap my head around it. They all look like random numbers. How can we be confident that our final correlation matrix actually worked the way we wanted to? Also do the cat codes take into account very similar names like the multiple variations of "Walt Disney". Thanks so much!
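
A short sketch of what cat.codes does, assuming a toy Series (the company strings are illustrative, not taken from the dataset). As far as I understand it, pandas sorts the distinct values and each code is just the position of a value in that sorted list, so the numbers are not random, but they are also not meaningful quantities. Spelling variants are different strings, so they get different codes.

```python
import pandas as pd

# Toy Series with two spellings that a human would treat as one studio.
company = pd.Series([
    "Walt Disney Pictures",
    "Universal Pictures",
    "Walt Disney Studios",
    "Universal Pictures",
])

as_category = company.astype("category")

# The distinct values, in sorted order.
categories = list(as_category.cat.categories)

# Each code is the index of the row's value within that sorted list.
codes = as_category.cat.codes.tolist()

print(categories)
print(codes)
```

Merging near-duplicate names would have to happen before the conversion, for example with a manual mapping.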

  • @newbtechhelp
    @newbtechhelp 2 years ago +1

    Is anybody having issues with importing seaborn into Jupyter Notebook? I can't get it to recognize the module.

  • @valentineonyemeziri9396
    @valentineonyemeziri9396 1 month ago

    I learnt so much from this. Thanks Alex

  • @spedies12
    @spedies12 3 years ago +1

    I was looking forward to the last one. You are the best.

  • @jahelramirez3729
    @jahelramirez3729 2 years ago +1

    The columns of the dataset in the Kaggle link are very different now. Does anybody know where I can find the original dataset?

  • @abhishekpancholi9484
    @abhishekpancholi9484 3 years ago +1

    Much awaited! Very excited

  • @xsangrezulx
    @xsangrezulx 2 years ago +2

    Hey Alex, just a heads up. The data set that is available in the link, is currently a little different from what you worked on. I keep getting errors because the data set that I downloaded has several NaN values.

  • @stevenjeong3331
    @stevenjeong3331 3 years ago +4

    Hi Alex, thanks for the helpful video.
    I was wondering if the labels on scatter plot (30:00) for x and y should be reversed as x takes budget and y takes gross as an input.

    • @AlexTheAnalyst
      @AlexTheAnalyst  3 years ago

      You are absolutely right - woops!

    • @stevenjeong3331
      @stevenjeong3331 3 years ago

      @@AlexTheAnalyst I've followed along with the video and now I want to try this with my own dataset, but I'm worried that my dataset is too small compared to this example. What would be a good sample size for saying that there is a correlation between two variables?

  • @rajschauhan
    @rajschauhan 3 years ago +1

    was waiting for this one and looking forward for such projects
    thank you so much :D

  • @PurpleTurkeyPatty
    @PurpleTurkeyPatty 3 years ago +2

    the dataset in that link is different from yours just a FYI

  • @sohrabkhan9590
    @sohrabkhan9590 1 year ago

    Thank You so much Alex for creating such amazing content. You are the best Teacher anyone could ask for.

  • @donrickles845
    @donrickles845 3 years ago +1

    Can someone help me? When I create yearcorrect @16:56 I'm getting the Month 'June' returned back instead of the year
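
A guess at the cause, assuming the released column in the current download looks like "June 13, 1980 (United States)" (the sample strings below are my assumption, so check them against your own file): taking the first four characters then returns the month name, not the year. Pulling out the first four-digit run with a regular expression avoids depending on the position.

```python
import pandas as pd

# Assumed format of the "released" column; one missing value for realism.
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
    None,
])

# Slicing the first four characters picks up the month name.
first_four = released.astype(str).str[:4]

# Extract the first run of four digits; rows without a match stay missing.
year_text = released.str.extract(r"(\d{4})", expand=False)

# Nullable integer type keeps missing years as <NA> instead of forcing a 0.
year = pd.to_numeric(year_text).astype("Int64")
print(year)
```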

  • @teddybagadonuts2617
    @teddybagadonuts2617 3 years ago +2

    Can't wait to finally finish everything! Thanks for creating these awesome guides Alex!