Also, the 'released' format changed again, and I used this to fix it:
# fix the date released format
df['release_date'] = df.apply(lambda x: x['released'][0:x['released'].find(' (')], axis=1)
df['release_date'] = pd.to_datetime(df['release_date'], infer_datetime_format=True)
If you are facing an error in the datatype change, try the following:
df_copy = df.copy()
df_copy['budget'] = df_copy['budget'].astype('int64')
df_copy['gross'] = df_copy['gross'].astype('int64')
df_copy
Thank you Alex for this amazing video
To whoever noticed that the 'released' column is not in the same format that Alex has (15:27) and is getting errors because of that: I've been where you are. It took me 4 days just to figure this out. Here is the line of code you need:
df['released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format='%B %d, %Y')
Hope it helps.
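For anyone who wants to see what that one-liner does step by step, here is a minimal sketch. The two sample strings are made up to mimic the newer 'released' format, and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical sample values mimicking the newer 'released' format.
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
])

# Step 1: keep only the "Month day, year" part; expand=False returns a Series.
date_text = released.str.extract(r"(\w+ \d+, \d+)", expand=False)

# Step 2: parse that text with an explicit format.
release_date = pd.to_datetime(date_text, format="%B %d, %Y")

# Step 3: the year is then available through the .dt accessor.
year = release_date.dt.year

print(date_text.tolist())
print(year.tolist())
```

If some rows do not match the pattern, the extract step leaves them missing; passing errors='coerce' to pd.to_datetime is one way to turn unparseable values into NaT instead of raising.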
@@karanikabj4422 Believe me when I tell you I wish I could, xd, but it's basically my very first time using Python, so I don't fully understand it, unlike R and SQL, which I know well. At least I know how to research. What I understand is that we use the pandas function to_datetime, exclude the '(United States)' part, and set the format to match the rest of the date. That is applied to the whole column, and the result is assigned to a column with the same name, which overwrites the original column.
At 13:40, if you are facing an error in the datatype change, try the following:
df['budget'] = df['budget'].round().astype('Int64')
Hope it helps.
Hey guys, at 46:24, we can simply call .copy() when creating our new variable if we want to loop over it without affecting the original dataframe (df):
df_numerized = df.copy()
for col_name in df_numerized.columns:
    if df_numerized[col_name].dtype == 'object':
        df_numerized[col_name] = df_numerized[col_name].astype('category')
        df_numerized[col_name] = df_numerized[col_name].cat.codes
df_numerized
There are some missing values in this dataset. Alex, try this instead of that for loop:
df.isnull().sum()
This will give the total number of nulls for every column/variable.
Honestly, you're an absolute legend. You really break down some of the technical barriers that exist for people entering the field of data science. You really are a gem to the community.
To everyone getting an error for df.corr(), this was my fix:
# since pandas version 2.0.0 you need to add the numeric_only=True param to avoid the issue
df.corr(method='pearson', numeric_only=True)  # pearson, kendall, spearman
---
correlation_matrix = df.corr(method='pearson', numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.show()
Thank you, Alex! I learned so much. Is anyone's correlation matrix not working? You need to add numeric_only=True, because the default is now False.
correlation_matrix = df.corr(method='pearson', numeric_only=True)
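A small sketch of the two fixes mentioned in the comments, using a made-up dataframe (the column names and values are assumptions, not the Kaggle data). Both routes leave the text column out of the correlation.

```python
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "gross": [2.0, 4.0, 6.0, 8.0],
})

# Option 1: tell corr() to ignore non-numeric columns.
corr_a = df.corr(method="pearson", numeric_only=True)

# Option 2: select the numeric columns first, then correlate.
corr_b = df.select_dtypes(include="number").corr(method="pearson")

print(corr_a)
```

Either way, the string column never reaches the correlation math, which is what the "could not convert string to float" error is complaining about.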
I have recently decided on becoming a data analyst and your videos are really helping me understand what i need to do and keep me motivated on that goal which will improve my life. I want to say thank you for your content and your honest helpfulness.
Thank you so much for your videos. They helped me a lot, and I finally got a job as a data engineer with no experience in this role, having learned from your channel in 1 month! I gained a lot of knowledge. I really appreciate your support. Thank you very, very much!
The 'released' column has been updated; it now contains a text-format date plus the country of release. What I did was split the column in two: release date and release country. The code I used was this:
df[['released','country_release']] = df['released'].str.split(r' \(', n=1, expand=True)
Then you have to clean up the 'country_release' column a little with:
df['country_release'] = df['country_release'].str.replace(')', '', regex=False)
And finally give the 'released' column the datetime format with this:
df['released'] = pd.to_datetime(df['released'], format='mixed')
For some reason format='mixed' did the trick for me; I tried '%B %d, %Y' but it never worked.
I am working through this dataset as of today. Tip: before converting the 'budget' and 'gross' columns, look for any null values. I downloaded the dataset recently and had some, and they threw an error during the conversion, so just make sure the NaN values in both columns are set to 0 before converting. When creating the 'yearcorrect' column, try splitting with .str.split(',', n=1, expand=True) and then use df['yearcorrect'] = year astype(str).str[:5] to get the year out of the 'released' column. I did it the way shown in the video but got the month, so if that works for you, fine; otherwise try the method above. This gets things done. Thank you, and thank you too Alex, you are doing a great job🙂
08/02/2022 - I'm using the dataset as of this date; there have been many changes, and unfortunately pct_missing is not 0.0%. I copied the contents of df into a new dataframe that I called Newdf and dropped the rows with missing values:
Newdf = df.dropna(axis=0)
print(Newdf.isnull().sum(),' ')
Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from Brazil :)
Thank you very much for the video! I want to change my career path to data analytics, and your videos have been a very good learning material. Although the data has been updated and some of the methods in this video do not work anymore, it is a fantastic guidance (and, ultimately, to become good at something, you have to do a fair share of self-study). One thing to note though: I don't think the pearson correlation coefficient can be used to check the relationship between a categorical and a continuous variable. So, the low correlation coefficient for company, for example, might be misleading. Since, after all, the numeric ID assigned to the string values does not necessarily increase with the size of the company.
Hi Alex. Thank you for the portfolio project series. For the missing values, I think the 0.0% it showed for every column has been approximated. If you use describe() and info(), you will notice some null values. Thanks again for the videos, they are really helpful.
that's what I thought at first, too, but the data set has simply changed since he uploaded the video (or he used an already edited one). So now there are a few columns that even have values like 0.28%...
Soooooooooooooooooooooooo excited for the last website video to come out For the first time in my 19 years of living, i feel pretty confident of making something to its perfection by myself (and your help too🙌)
Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from the UK :)
Hi Alex, thank you so much for all the videos. OK, here is the thing: I haven't taken this class. I was actually learning Python before seeing your page and decided to learn SQL. I watched all the videos you have on SQL and did the first 3 portfolio projects on SQL and Tableau. Then I went to StrataScratch and registered for the free option, which gave me access to 50 interview questions, some easy, some medium, and some hard. The interesting thing is, I was only able to answer 1 question from the easy ones and couldn't answer the others. That almost made me feel discouraged, but I am thinking I just need to spend more time on SQL tutorials before moving back to Python. I would like to hear your opinion and that of others who had a similar experience. Again, thank you so much for all your effort, you are touching lives!
All you folks getting data analyst jobs left and right, could you give a glimpse of how you presented this project on your CVs? Alex some help would be great.
Another excellent portfolio project from Alex! My portfolio is starting to look very good, and I finally have something to upload to job applications that request a portfolio! Thank you, Alex!!
This 4 part tutorial is pure gold! After your announcement that you were launching your version of data analyst course/certification, can’t wait for when it goes live, as to follow up in more depth for the concepts presented in this series. Really appreciate the time, dedication and quality of content you produce Alex.
At 13:40, if you are facing an error in the datatype change, try the following:
df['gross'] = pd.to_numeric(df['gross'], errors='coerce', downcast='integer')
df['gross'].isna().sum()
df[df['gross'].isna()]
df['gross'].fillna(0, inplace=True)
df['budget'] = df['budget'].fillna(0)
df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].round().astype('int64')
I made it to the 4th part. Thanks Alex for this tutorial. For those who can't get just the release date, you can use this code:
df['format_released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format='%B %d, %Y')
If you downloaded the dataset after the video was made, it seems it might have some missing values that prevent you from converting columns into integers. Use:
df = df.fillna(0)
Somewhere around 14:05, where Alex was converting the gross and budget columns to int64, his code wouldn't work for me, but after some research I found this to work:
df['gross'] = df['gross'].fillna(0).astype('int64')
df['budget'] = df['budget'].fillna(0).astype('int64')
I can't believe I followed along and understood everything. I wasn't even sure if I would be able to before I started. Thank you so much Alex! With your help, I've gained more confidence in pursuing a career in data analytics. I'm definitely going to do more of your projects and hope to be able to land a full-time data analytics job this year. Thanks again!
Hi Alex, great job you are doing in your channel, thank you very much. I just wanted to say, if it might help anybody who is watching you (because I believe that you already know it after two years), that a correlation measures the reaction of one value against the movement of another value. For this, both values need to be able to move and get bigger or smaller, something that letters (and names, in consequence) cannot do, neither can letters or names disguised as numbers, because those are static too, hence, the "non numeric" correlations showed in the video, are false. This problem could just be a mess up in any project, but, in a project that is our portfolio, our showroom, it will only show how much of a data analyst we are NOT. Kindly study a work around to this statistical problem, which is not to change letters by numbers. So many great developers behind Pandas would have implemented it many years ago. 😀
Hi, I faced an issue with the command at 13:48, so this might help someone. For me, Jupyter gives *ValueError: Cannot convert non-finite values (NA or inf) to integer*. Instead, you can use a capital *I* (Int64 rather than int64) in the same command: *df['budget'] = df['budget'].astype('Int64')*
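A quick sketch of why the capital letter matters, using a made-up series with one missing value. In pandas, 'int64' is the plain NumPy integer type, which cannot hold missing values, while 'Int64' is the nullable integer type that stores them as <NA>.

```python
import numpy as np
import pandas as pd

# Hypothetical budget values with one missing entry.
budget = pd.Series([19000000.0, np.nan, 4500000.0])

# Lowercase 'int64' cannot represent NaN, so the cast raises an error.
try:
    budget.astype("int64")
    plain_cast_failed = False
except ValueError as exc:
    plain_cast_failed = True
    print("int64 cast failed:", exc)

# Capital 'Int64' is pandas' nullable integer dtype; NaN becomes <NA>.
nullable = budget.astype("Int64")
print(nullable)
```

The trade-off is that the missing value is kept as missing rather than silently replaced with 0, which is usually the more honest choice for a budget column.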
Hey Alex - Thank you for this. Right around 49:25 you talk about the correlation matrix of the df_numerized dataframe that is being shown as a heatmap. I do have a question about that....: when you did .cat.codes in the cells above, how did the category values of the previous objects (company, country, director) represent any value that can be correlated? For instance, using one row as an example, I'm confused how index 6380 at the top of the dataframe has a company categorical value of 1428. Is this by random or did the code construct some sort of logical thinking and gave a numeric value based on other data patterns?? .... Sorry if I am confusing you, it's just when I got to the heatmap part of the df_numerized dataframe I was kind of lost as to how categories can actually represent correlations if the categorical value given to it was completely random. thanks,
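On how .cat.codes picks its numbers, here is a minimal sketch with made-up company names. The codes are not random: pandas sorts the unique labels and numbers them from 0. They are still arbitrary as measurements, though, since alphabetical position says nothing about revenue.

```python
import pandas as pd

# Hypothetical company names, with one repeated.
company = pd.Series([
    "Warner Bros.",
    "Columbia Pictures",
    "Universal Pictures",
    "Columbia Pictures",
])

as_category = company.astype("category")

# Each row gets the position of its label in the sorted category list.
codes = as_category.cat.codes

# Mapping from code back to label, handy for checking what a number means.
lookup = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(lookup)
```

Note that each distinct spelling is its own category, so variations of the same studio name would receive different codes unless they are cleaned up first.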
Started doing the project and noticed the Kaggle data is slightly different from the one in the video. There were some negative numbers in the gross column. To change them to positive, I had to run this code:
# apply conditional function to the column containing negative numbers
df['gross'] = df['gross'].apply(lambda x: abs(x) if x < 0 else x)
Sir, I am a civil engineer doing my master's. During my thesis I got some machine learning work, and then, thanks to your channel, I presented my data in Tableau and my supervisor gave me extra credit. Thank you! Now I am thinking of switching to this field. Thank you for your efforts. My question is: after learning the required skills, how can I start applying to companies? Second, please start an interview series where you discuss how and what type of technical questions are asked in interviews.
@@AlexTheAnalyst I am comparatively new to the channel. I will surely go through those videos. Thanks from the whole student community for your great contribution to our guidance and learning. Looking forward to learning more insights from the channel 🙌🏽 Love from India 🇮🇳 to Alex the Analyst.
whoever is coming here after completing their portfolios watching all the 3 videos and here for the 4th... I WISH YOU ALL THE BEST! with so much love and gratitude for Alex!
Came here to say that if you're trying to run the df.corr() and it's trying to run the correlation math on string data columns, simply add in the argument df.corr(numeric_only=True)
I might be missing something, please correct me if I'm wrong as I'm tired as I type: Hasn't Alex mislabeled the first scatterplot? Isn't budget on x and gross on y? Whereas he has labelled the opposite. This is around the 30:00 mark.
Thank you Alex, this has been a great project! You are a great teacher and this has been very helpful. Looking forward to everything you release in the future!
Almost done with this fantastic series. Excited for your upcoming video on data scraping. For future videos in this series could you possibly do one on APIs (making a project using some public API) and something on big data maybe?
At 11:00, when finding missing values, write the command as:
pct_missing = np.mean(df[col].isnull())
OR you can write:
pct_missing = df.isnull().mean().sort_values(ascending=False)
If there are missing values in your dataset, try filling them with 0:
df = df.fillna(0)
At 16:50, to get the release year only, write the command as:
df['yearCorrect'] = df['released'].astype(str).str.split(',').str[1].str.split('(').str[0]
At 28:20, to get the scatter plot, first replace the nulls with 0 using:
df.fillna(0, inplace=True)
Why is this code not working? If anyone knows the reason, please tell us!
for col_name in df_numerized.columns:
    if df_numerized[col_name].dtype == 'object':
        df_numerized[col_name] = df_numerized[col_name].astype('category')
        df_numerized[col_name] = df_numerized[col_name]
Alex, when I run sort_values, it puts Avatar, Titanic, and Avengers: Endgame at a negative gross. How do I fix this? I don't want to drop the whole column, and leaving the negatives skews the data. Update: I got it fixed. But now when I try to do the correlation matrix, I get a ValueError stating could not convert string to float: 'The Shining'. I am not sure what this means.
I don't think that correlation with categorical data will work. Even after being turned into numbers, correlation and regression won't work in this case. The only way to introduce categorical data into correlation or regression is to turn it into multiple dummy variables. Thanks for the awesome video. 4/4, what does this mean? The series is done :(
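A minimal sketch of the dummy-variable idea, using a made-up frame (the countries and gross values are assumptions). Each category becomes its own 0/1 column, so no artificial ordering is implied.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "country": ["USA", "UK", "USA", "France"],
    "gross": [100.0, 80.0, 120.0, 60.0],
})

# One 0/1 indicator column per country.
dummies = pd.get_dummies(df["country"], prefix="country", dtype=int)

# Combine the indicators with the numeric column for later modelling.
encoded = pd.concat([df[["gross"]], dummies], axis=1)

print(encoded)
```

For a column like company with thousands of distinct values this produces thousands of columns, so in practice it may make sense to keep only the most frequent categories and group the rest.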
@@tahsinserkanyaman3459 I don't think its right to say he doesn't know what he's doing. That's a little ridiculous. But yeah I don't think the correlation and linear regression really work well here with the categorical data.
Problem: scatter plot with budget vs gross raises TypeError: float() argument must be a string or a number, not 'NAType'
Solution:
x = df['budget'].astype('float')
y = df['gross'].astype('float')
plt.scatter(x, y)
Hey Alex. Kudos for the good work you do! Can't believe all these resources are free! I had a doubt actually, Am I the only one, or is it true that the dataset on Kaggle is slightly different than used in the projects? The columns are still the same but values have changed!
Hello Alex! Can you elaborate on the meaning of the correlation matrix which uses df_numerized (48:50)? The numbers assigned to company or country for example, in my understanding, cannot be used for correlation like that.
that's what I was thinking too, since on the video, he assigned random numbers to replace company names, writers and stars, how can you figure out any correlation on those random values?
@5:20 You can also just highlight+copy the path in the address bar above instead of taking the extra steps to right click and go into properties to select the path. Much more efficient.
This one was a cool challenge! Required a bit of research on my part to better understand some steps taken (python for data analysis/visualization is not really my area), but well worth it. Great video Alex, thank you!
Anyone getting an issue where in the correlation graph only the top row is filling out the numbers? I looked it up and to have it fill out you need to write annot = True, but still getting issues. Only the name row is filled out.
Looks like there's been a mix-up with the axis labels on the graph. The 'budget' and 'revenue' labels are swapped. The 'budget' should actually be labeled as 'revenue' (250 million) and vice versa for the 'revenue' (a billion). Thanks!
35:46 If anyone is stuck like I was with df.corr(method='pearson'), you can try this:
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr(method='pearson')
print(correlation_matrix)
Thank you so much sir for taking time out from such a busy schedule and coming up with an initiative of making such videos in order to help many people around the world interested in starting and developing a career in data analytics :-)
An easier way to calculate the percentage of missing values:
df.isnull().sum().sort_values(ascending=False)/len(df)*100
Extract the year from released:
df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(float)
df['yearcorrect'].fillna(0, inplace=True)
df['yearcorrect'] = df['yearcorrect'].astype(int)
Hey Alex, I'm getting an error when using cat.codes at 45:41. It's showing this error: 'CategoricalDtype' object has no attribute 'cat'. A little help would be much appreciated.
Good video. I learned how to use correlation matrices, which is new to me. The whole np.mean(df['col'].isnull()) is something I'm still trying to wrap my head around but for now I'll just hit the easy button on it.
I am wondering, the x label should be budget, right? Alex used x=df['budget'] but the xlabel is gross earnings.
# correlation exploration
# scatter plot: budget vs gross
plt.scatter(x=df_cleaned['budget'], y=df_cleaned['gross'])
plt.title('Relationship between Budget and Gross')
plt.xlabel('Budget')
plt.ylabel('Gross earning')
Hi Alex, can you clarify how cat.codes work? I tried researching more about them online but couldn't really wrap my head around it. They all look like random numbers. How can we be confident that our final correlation matrix actually worked the way we wanted to? Also do the cat codes take into account very similar names like the multiple variations of "Walt Disney". Thanks so much!
Hey Alex, just a heads up. The data set that is available in the link, is currently a little different from what you worked on. I keep getting errors because the data set that I downloaded has several NaN values.
Hi Alex, thanks for the helpful video. I was wondering if the labels on scatter plot (30:00) for x and y should be reversed as x takes budget and y takes gross as an input.
@@AlexTheAnalyst I've followed through the video and now I want to try this with my own dataset, but worried if my dataset is too small compared to this example. What would be the good number to say that there is correlation between two variables?
Update: Alex, I just accepted my first job as a junior data analyst. This completes my 6-month journey to learn data analytics and change careers, and I could not have done it without your excellent Portfolio videos. Thank you so much for making these available to your viewers for free. After I built my portfolio, companies started taking a second look at my resume and inviting me to interviews. BEFORE the portfolio, I received ONLY rejection emails. Thank you, thank you, thank you!
Congratulations!!
I am getting only rejection
@@Datalover-Analysts Hi Pooja, I’m sorry. I know rejection can be discouraging. I received over 100 rejection emails from job applications before I finally started getting interviews. Without knowing the details of your situation, I can only encourage you to keep trying and don’t give up. Everyone’s journey to data is different, but I don’t think it is ever easy, especially if you’re trying to change careers, which is what I was doing. I wish you the best.
@@danielbristow6954 Sure, I am making portfolio with the help of Alex videos. Did some certification from coursera and Azure Fundamentals too
Hey Daniel, congratulations.!!! Can you please also mention what certifications you did? With the help of Alex’s platform i am building my portfolio.Also, completed my degree in MSCS this month.
If anyone else is having issues due to IntCastingNanError, I advise to try the following:
df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(int)
df['gross'] = pd.to_numeric(df['gross'], errors='coerce').fillna(0).astype(int)
it worked! :) Thank you Alex for your amazing videos!
Thank you !!! I almost gave up as i am not too versed in python to make these changes as the data for the original set he worked on has changed.
thank you ❤
Thank you so much for this, I was trying to google it before realizing it might be in the comments. If you have the time can you explain this part of the code? errors='coerce').fillna(0).astype(int)
I did look it up but was getting a little confused by it. Thank you again :)
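Here is a rough step-by-step breakdown of that chain, in case it helps. The sample values are made up to stand in for the 'budget' column, and the step names are only for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the 'budget' column.
budget = pd.Series(["19000000", "4500000.0", None, "unknown"])

# Step 1: errors='coerce' parses each value as a number and turns anything
# that cannot be parsed (or is missing) into NaN instead of raising an error.
step1 = pd.to_numeric(budget, errors="coerce")

# Step 2: fillna(0) replaces every NaN with 0 so no missing values remain.
step2 = step1.fillna(0)

# Step 3: the integer cast only works once the NaNs are gone.
step3 = step2.astype("int64")

print(step1.tolist())
print(step3.tolist())
```

Keep in mind that filling with 0 treats an unknown budget as a budget of zero, which can pull averages and correlations down; dropping those rows is another option.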
Thanks, this was a lot of help. I used this to avoid the int32 and it worked.
df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(np.int64)
Top notch! Thanks
Hey all, just a "stats" heads up/correction you might want to make for your portfolio:
In this video, Alex wanted to see if the company was "correlated" with gross revenue. What he did was assign values (randomly, I think) to companies, countries, etc. Then he tried to see if those values were related to the gross revenue.
Those randomly assigned values are "measuring" the company, country, etc. at the nominal scale. In other words, they're essentially just being used as a numeric "name"; the values themselves don't mean anything. What that means is that one value being higher than the other doesn't represent an increase in the thing being measured (for example, the USA was assigned a 54 and the UK was assigned a 53. Those are just names... the USA isn't one more of something than the UK).
Because the values themselves don't represent anything, it doesn't make sense to do a correlation with them.
Correlations tell us, as one variable increases, what happens to the other? So in the first question, as the budget increased, what happened to revenue? It increased. But with country, company, or other categorical variables, correlations don't make sense. The values for country and company are random, so the numbers that represent them going up doesn't tell us anything. It's no wonder then, that the correlations weren't large.
Instead, it would make sense to do a t-test or ANOVA and compare means. In that case, the question would be, "Do some companies tend to produce higher revenue than others?" Or, "Do some countries tend to produce higher revenue?" etc. (For more discussion, see: www.webpages.uidaho.edu/~stevel/519/Correlation%20between%20a%20nominal%20(IV)%20and%20a%20continuous%20(DV)%20variable.html).
Since this is a portfolio project and you want to show potential employers the result, maybe just take that part out; you wouldn't want to make a mistake like that in an application to a potential employer!
(Alex, thanks so much for doing these videos! They're super helpful and I'm very very grateful!)
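To make the "compare means" suggestion concrete, here is a minimal sketch on made-up data. It computes the group means and a one-way ANOVA F statistic by hand with pandas; the company labels and gross values are assumptions, not the Kaggle data.

```python
import pandas as pd

# Hypothetical toy data: three companies, gross revenue in millions.
df = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "gross": [10.0, 12.0, 14.0, 20.0, 22.0, 24.0, 30.0, 32.0, 34.0],
})

groups = df.groupby("company")["gross"]
group_means = groups.mean()
grand_mean = df["gross"].mean()

# Variation of the group means around the overall mean.
ss_between = (groups.size() * (group_means - grand_mean) ** 2).sum()

# Variation of individual films around their own group mean.
ss_within = ((df["gross"] - groups.transform("mean")) ** 2).sum()

k = groups.ngroups  # number of groups
n = len(df)         # number of observations

# One-way ANOVA F statistic: between-group variance over within-group variance.
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print(group_means)
print(f_stat)
```

A large F value suggests the group means differ by more than the spread within groups would explain. If SciPy is installed, scipy.stats.f_oneway gives the same statistic along with a p-value.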
Thank you so much for that clarification. I was so much confused and spent a lot of time wondering how random values made sense in determining a correlation.
where do we make the corrections?
I noticed that in my dataset, avatar has a gross revenue of -2,147,483,648, and it just feels wrong. Is there something I am not doing right?
I just noticed that converting to int type gave me this error
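That -2,147,483,648 figure is exactly the smallest value a 32-bit integer can hold, so a likely cause is that the cast landed on int32 and the largest grosses did not fit. A small sketch with made-up gross values shows how to check for this and cast to 64-bit explicitly.

```python
import numpy as np
import pandas as pd

# Made-up gross values; the first two are larger than int32 can store.
gross = pd.Series([2847246203.0, 2201647264.0, 46998772.0])

# The range a 32-bit integer can represent.
int32_info = np.iinfo(np.int32)

# Flag the rows that would not fit into int32.
exceeds_int32 = gross > int32_info.max

# Asking for 'int64' by name avoids depending on the platform default.
gross_int64 = gross.astype("int64")

print(int32_info.min, int32_info.max)
print(exceeds_int32.tolist())
print(gross_int64.tolist())
```

If that is the cause, taking the absolute value of the negative numbers would not recover the real gross; recasting the original column to int64 would.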
Hello! The dataset appears to be updated on Kaggle and for anyone new, you will run into some issues that you need to fix to follow along.
1. Missing data. Unlike in the video, there are now missing values, so you will need to handle them. There are many ways to handle missing values, but for the sake of time I decided to drop all rows that have missing data. You will have about 71% of your data remaining. You will need to run the following if your dataframe is named df.
df = df.dropna()
2. Extracting the year is different as the formatting is different. Running the following should extract the correct year.
df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(int)
3. Duplicates, there aren't any in this dataset so you should be fine on that.
I hope this helps anyone that is working on this and best of luck on your analytics journey!
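A small sketch of the year extraction from point 2, on made-up 'released' strings. It differs slightly from the line above: it assumes you want to keep rows where no year is found, so it ends with the nullable Int64 type instead of a plain int cast.

```python
import pandas as pd

# Hypothetical sample values, including one missing entry.
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
    None,
])

# Grab the first run of four digits; expand=False returns a Series.
year_text = released.str.extract(r"([0-9]{4})", expand=False)

# Convert to numbers, then to nullable integers so the missing row survives.
yearcorrect = pd.to_numeric(year_text).astype("Int64")

print(yearcorrect)
```

If the rows with missing values have already been dropped with df.dropna(), the plain .astype(int) from the comment works just as well.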
Sir you are a hero
Thank you, I have just made amends after reading your comments and other comments, and found a way to check and verify the status before and after execution. Thanks.
Thank you so much!
Thanks, I was stuck at extracting the correct year and now can finally solve it!
Thank you. I could not figure the year out.
I can't wait for the beginner, intermediate, and advanced Python series by Alex the Analyst. It's what the people want, besides a happy Alex.
They're coming! :D
@@AlexTheAnalyst
Hey Alex, please, some of us newbies are still waiting for your Python for beginners series
@@AlexTheAnalyst when ?🥺
Did they come already?
@@salehfiroozabadi8068 it is happening in the coming months. Alex is posting at the moment the Power BI series.
if df.corr() shows the error that a string variable can't be converted into a number, pass the parameter df.corr(numeric_only=True)
df.corr(numeric_only=True)
Thanks
love you soooo
@@ajibadeabdulateef2818 Hero
This was super helpful for me, thanks. I used df.corr(numeric_only=True)
At 11:08, instead of printing null percent, we can use:
for col in df.columns:
    print(df[col].isnull().value_counts(), "\n")
This will print how many values are null. Cause you might have 1 missing in 10k values, and you will need high precision in decimals.
The dataset is updated and is not the same as the one in the video, if you guys have problems in the 'Create correct year section' you can do a split of the data to get only the year
df['yearcorrect'] = df['released'].astype(str).str.split().str[2]
Thanks mate
Fantastic, bro! Thanks!
what about min 35:35 looking at correlation, it returns ValueError, anybody find out why? Or the solution?
Sweet!
Thank you bro i was stuck in it for a long time
Hey guys the info got updated since this video was posted. While I was going through the project I was able to google the problems as they came up.
In case you guys get stumped here's what I found that works:
This will drop any rows with null values
df = df.dropna(how='any',axis=0)
This will add the released date column into a separate column
df['yearcorrect'] = df['released'].astype(str).str.split(', ').str[-1].astype(str).str[:4]
Let me know if that works for y'all
I still get the error name 'df' is not defined
thank you so much for the solution for updated dataset. Your solution save me from struggling on updated release date
i droped the rows but i think it s just dropping temporarily, because if i scatterplot after that it is still showing it has na values.
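@@shyamkumar6009 I think you should use df = df.dropna(how='any', axis=0) to drop the null values permanently. You can also write df.dropna(how='any', axis=0, inplace=True) without the "df =" part, but don't combine the two, because with inplace=True the method returns None.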
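A quick sketch of the two ways to make `dropna` stick, using a made-up frame. The point is that with `inplace=True` the call returns `None`, so its result should not be assigned back to `df`.

```python
import pandas as pd

df = pd.DataFrame({"budget": [1.0, None, 3.0], "gross": [10.0, 20.0, None]})

# Option 1: reassign the returned frame (the original df is not modified)
cleaned = df.dropna(how="any", axis=0)

# Option 2: modify df itself; the return value is None
result = df.dropna(how="any", axis=0, inplace=True)

print(len(cleaned), len(df), result)
```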
Also the released changed forms again and I used this to fix it
# fix the date released format
df['release_date'] = df.apply(lambda x: x['released'][0:x['released'].find(' (')],axis=1)
df['release_date'] = pd.to_datetime(df['release_date'], infer_datetime_format=True)
If you are facing an error in datatype change, try the following
df_copy = df.copy()
df_copy['budget'] = df_copy['budget'].astype('int64')
df_copy['gross'] = df_copy['gross'].astype('int64')
df_copy
Thank you Alex for this amazing video
to whom ever noticed that the 'released' column we have is not in the same format that Alex have and getting errors because of that; 15:27
i've been where you were, it took me 4 days just to figure this out, here is the line of code you need:
df['released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format = '%B %d, %Y')
hope it helped..
that worked! Thankyou. But can you explain your code pls?
@@karanikabj4422 believe me when I tell you I wish I can xd
but because it's basically my very first time using python, i don't really fully understand it, unlike R and SQL that i know well, but at least i know how to research
but what i understand is that basically we use the pandas function to_datetime, exclude the '(United States)' part, and set the format to the one we have in the rest of the date, applying it to the whole column and at the same time assigning that outcome to a column under the same name, which basically overwrites the original column
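The explanation above can be sketched step by step. The two sample strings are made up to imitate the format, and the intermediate names `date_text`, `parsed` and `year` are only for illustration.

```python
import pandas as pd

released = pd.Series([
    "June 13, 1980 (United States)",
    "July 25, 1980 (United States)",
])

# Step 1: keep only the "Month day, year" part; the country in parentheses is dropped
date_text = released.str.extract(r"(\w+ \d+, \d+)", expand=False)

# Step 2: parse that text as a date (%B = full month name, %d = day, %Y = 4-digit year)
parsed = pd.to_datetime(date_text, format="%B %d, %Y")

# Step 3: once it is a real datetime column, the year is available through .dt.year
year = parsed.dt.year
print(year.tolist())
```

If you only see months, it may be that the year is still being sliced out of the original text instead of being taken from the parsed column with `.dt.year`.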
I don’t get the year like Alex does only the months
Man i'm so grateful, you won't believe how much time i was stuck on this.
thanks☺☺
At 13:40 If you are facing an error in datatype change, try the following :-
df['budget'].round().astype('Int64')
df['budget']=df['budget'].astype('Int64')
hope it will help uh
Hey guys, at 46:24, we can simply assign .copy() method to our new variable if we want to use for loop to iterate over our new variable without affecting the original dataframe or df:
df_numerized = df.copy()
for col_name in df_numerized.columns:
    if(df_numerized[col_name].dtype == 'object'):
        df_numerized[col_name] = df_numerized[col_name].astype('category')
        df_numerized[col_name] = df_numerized[col_name].cat.codes
df_numerized
I don't know how much time this saved me but it would have been a lot.
there are some missing value in this dataset
Alex try this instead of that for loop statement
df.isnull().sum()
this will give total number of nulls for every column/variable
I really appreciate the fact that you did not edit out the parts were you made "mistakes" and actually fixed them.
Honestly, you're an absolute legend. You really break down some of the technical barriers that exist for people entering the field of data science. You really are gem to the community.
As The Rock says; "FINALLY!"
I'm a bit embarrassed by how excited I get when an ATA video clocks in at over an hour...
Hahaha 😁
To everyone getting error for df.corr()
this was my fix:
# since pandas version 2.0.0 now you need to add numeric_only=True param to avoid issue
df.corr(method='pearson', numeric_only=True) #pearson, kendall, spearman
---
correlation_matrix = df.corr(method='pearson', numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.show()
Thank you, Alex! I learned so much.
Anyone's correlation matrix doesn't work? Need to add 'numeric_only = True'. Now the default is false.
correlation_matrix = df.corr(method = 'pearson',numeric_only = True)
Pure gold, man. Saved the day! Thanks!
In 29:48 , I think x should be 'Budget' and y as 'Gross earning'
You are absolutely right! Woops!
I have recently decided on becoming a data analyst and your videos are really helping me understand what i need to do and keep me motivated on that goal which will improve my life. I want to say thank you for your content and your honest helpfulness.
Thank you so much for your videos, they helped me a lot, and I finally got a job as a Data Engineer with no experience in this role, having learned from your channel in 1 month! I gained a lot of knowledge. I really appreciate your support. Thank you very, very much!
Congratulations
The 'released' column is updated, now it comes with a text format date and the country of release. What I did was to split the column in two : Release date and Country release.
The code I used was this:
df[['released','country_release']] = df['released'].str.split(r' \(',n=1,expand=True)
Then you have to clean the 'country_release' column a little bit with:
df['country_release'] = df['country_release'].str.replace(')','',regex=False)
And finally give the 'released' column the datetime format with this:
df['released'] = pd.to_datetime(df['released'],format='mixed')
For some reason using format = 'mixed' did the magic trick for me, i tried '%B %d, %Y' but It never worked.
Jus on today date i am doing this data set.
Tooltip:
FYI, before converting the 'budget' and 'gross' columns, look for any null values. I downloaded the dataset recently and had some.
They threw an error during the conversion, so make sure the NaN values in both columns are set to 0 before converting.
And when creating the 'year corrected' column, try splitting with .str.split(',', n=1, expand=True) and then use df['yearcorrect'] = year.astype(str).str[:5]
This is for getting the year out of the released column. I did it the same way as shown but got the month, so if the video's way works for you that's fine, otherwise try the method above.
This get things done
Thank u
And also thank u Alex you are doing a great job🙂
08/02/2022 - I'm using the dataset as of this date and there have been many changes, and unfortunately the pct_missing is not 0.0%.
For me I copied the content of df in a new dataframe that I called Newdf and then deleted the rows:
Newdf = df.dropna(axis=0)
print(Newdf.isnull().sum(),'\n')
Thank you.
Worked!!
good job
Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from Brazil :)
The only thing that encourages me to watch is your smile
Keep smiling 🙏❤️
Haha I hope the high quality video content also makes you smile 😁
Thank you very much for the video! I want to change my career path to data analytics, and your videos have been a very good learning material. Although the data has been updated and some of the methods in this video do not work anymore, it is a fantastic guidance (and, ultimately, to become good at something, you have to do a fair share of self-study).
One thing to note though: I don't think the pearson correlation coefficient can be used to check the relationship between a categorical and a continuous variable. So, the low correlation coefficient for company, for example, might be misleading. Since, after all, the numeric ID assigned to the string values does not necessarily increase with the size of the company.
Hi Alex. Thank you for the portfolio project series.
For the missing values, I think the 0.0% it showed for every column has been approximated. If you use describe() and info(), you will notice some null values.
Thanks again for the videos, they are really helpful.
that's what I thought at first, too, but the data set has simply changed since he uploaded the video (or he used an already edited one). So now there are a few columns that even have values like 0.28%...
@@synaestheticVI yes can anyone help that what should we do in that situation?
Soooooooooooooooooooooooo excited for the last website video to come out
For the first time in my 19 years of living, i feel pretty confident of making something to its perfection by myself (and your help too🙌)
5:20 Faster way to do this is to shift right-click the file and copy as path.
The "apostrophes" are just called single quotes
Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from the UK :)
Hi Alex, thank you so much for all the videos. OK, here is the thing: I haven't taken this class. I was actually learning Python before seeing your page and decided to learn SQL. I went through all the videos you have on SQL and the first 3 portfolio projects on SQL and Tableau. Then I went to stratascratch and registered for the free option, which gave me access to 50 interview questions, some easy, some medium, and some hard. But the interesting thing is, I was only able to answer 1 question from the easy ones and couldn't answer the others. That almost made me feel discouraged, but I am thinking I just need to spend more time on SQL tutorials before moving back to Python. I would like to hear your opinion and that of others who had a similar experience. Again, thank you so much for all your effort, you are touching lives!
Hi Alex, just finished the project. It’s awesome. Thanks for everything. I pray for your success in the future.
This video came in just on time.
I finished building my portfolio yesterday. Thank you for the tips.
Can you provide some tips? I have worked on projects but having troubles with how to present and display the projects? Can you share your link
All you folks getting data analyst jobs left and right, could you give a glimpse of how you presented this project on your CVs? Alex some help would be great.
Another excellent portfolio project from Alex! My portfolio is starting to look very good, and I finally have something to upload to job applications that request a portfolio! Thank you, Alex!!
That's great! Glad to hear it's been helpful!
This 4 part tutorial is pure gold! After your announcement that you were launching your version of data analyst course/certification, can’t wait for when it goes live, as to follow up in more depth for the concepts presented in this series. Really appreciate the time, dedication and quality of content you produce Alex.
Hey Alex, please make videos on, how to handle missing/null values in python.
At 13:40 If you are facing an error in datatype change, try the following
df['gross'] = pd.to_numeric(df['gross'], errors='coerce', downcast='integer')
df['gross'].isna().sum()
df[df['gross'].isna()]
df['gross'].fillna(0, inplace=True)
df['budget'] = df['budget'].fillna(0)
df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].round().astype('int64')
I made it to the 4th part. Thanks Alex for this tutorial. For those who don't get the release date only, you can use this code: df['format_released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format='%B %d, %Y')
If you downloaded the dataset after the video was made, it seems it might have some missing values that prevent you from converting columns into integers. Use df = df.fillna(0)
should we just ignore the missing entries or did you delete them?
Somewhere around 14:05, where Alex was converting the gross and budget columns to int64, his code wouldn't work for me, but after some research I found this to work:
df['gross'] = df['gross'].fillna(0).astype('int64')
df['budget'] = df['budget'].fillna(0).astype('int64')
Thanks a bunch for this!
or you can use df['gross'] = df['gross'].astype('Int64')
I can't believe I followed along and understood everything. I wasn't even sure if I would be able to before I started. Thank you so much Alex! With your help, I've gained more confidence in pursuing a career in data analytics. I'm definitely going to do more of your projects and hope to be able to land a full-time data analytics job this year. Thanks again!
Woohoo! You're doing great!
i see its barely 4 months since you made this comment. i have an issue with the scatter plot, can you help me out?
Very well instructed! Way better than any of the BootCamp lectures I had gotten previously. Perfect for a refresher and portfolio work. Thank you!
Hi Alex, great job you are doing in your channel, thank you very much. I just wanted to say, if it might help anybody who is watching you (because I believe that you already know it after two years), that a correlation measures the reaction of one value against the movement of another value. For this, both values need to be able to move and get bigger or smaller, something that letters (and names, in consequence) cannot do, neither can letters or names disguised as numbers, because those are static too, hence, the "non numeric" correlations showed in the video, are false.
This problem could just be a mess up in any project, but, in a project that is our portfolio, our showroom, it will only show how much of a data analyst we are NOT.
Kindly study a work around to this statistical problem, which is not to change letters by numbers. So many great developers behind Pandas would have implemented it many years ago. 😀
' ' -> single quotes and {} -> curly brackets. Just in case you have not already received a similar answer. Either way, keep up the great work :)
Hi, I faced an issue for command at 13:48, so this might help someone
For me jupyter gives *ValueError: Cannot convert non-finite values (NA or inf) to integer*
Instead, you can use *I* for int64 in the same command
*df['budget'] = df['budget'].astype('Int64')*
Thank you I was stuck here. 😀
Thank you, I was stuck there trying to understand what I was doing wrong, I'd never figure out it was just I instead of i 🤣
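For anyone wondering why the capital I matters: `'int64'` is the regular NumPy integer type, which cannot hold missing values, while `'Int64'` is pandas' nullable integer type, which can. A minimal sketch with made-up budget values:

```python
import pandas as pd

# Hypothetical budget column with one missing value
budget = pd.Series([19000000.0, None, 4500000.0])

# Lowercase 'int64' has no way to represent NaN, so the cast raises an error
try:
    budget.astype("int64")
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# Capital 'Int64' keeps the gap as <NA> instead of failing
nullable = budget.astype("Int64")
print(plain_cast_failed, nullable.tolist())
```

The missing value is still missing after the cast, so later steps (plots, for example) may still need a `dropna()` or `fillna()` depending on what you want to do with those rows.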
Hey Alex - Thank you for this. Right around 49:25 you talk about the correlation matrix of the df_numerized dataframe that is being shown as a heatmap. I do have a question about that....: when you did .cat.codes in the cells above, how did the category values of the previous objects (company, country, director) represent any value that can be correlated? For instance, using one row as an example, I'm confused how index 6380 at the top of the dataframe has a company categorical value of 1428. Is this by random or did the code construct some sort of logical thinking and gave a numeric value based on other data patterns?? .... Sorry if I am confusing you, it's just when I got to the heatmap part of the df_numerized dataframe I was kind of lost as to how categories can actually represent correlations if the categorical value given to it was completely random. thanks,
Great community in the comment section. Thanks for this analysis Alex! I couldn't make my career pivot without your help
started doing the project and noticed Kaggle data slightly different from one in the video. there were some negative numbers in the gross column. to change that to positive had to run this code
# apply conditional function to the column containing negative numbers
df['gross'] = df['gross'].apply(lambda x: abs(x) if x < 0 else x)
Thank you so much for this!
Thank you for this, my scatter plot was not showing properly because of the negative numbers and this fixed it. Again, thank you!
Great, man! Thanks!
Sir, I am a civil engineer doing my masters during my thesis I got some work of machine learning and then through your channel once I presented my data in tableau my supervisor gave me extra credit thankyou to you...now I am thinking of switching to this field thankyou for your efforts.
My question is after learning the required skills how can I start applying in the companies? Second please start an interview series where you discuss how and what type of technical questions are asked in the interview.
That’s great! I have a few videos on how to work with recruiters and prep for interviews - I think those would be helpful to you 👍
@@AlexTheAnalyst I am comparatively new to the channel. I will surely go through those videos. Thanks from the whole student community for your great contribution to our guidance and learning. Looking forward to learning more insights from the channel 🙌🏽
Love from India 🇮🇳 to alex the analyst.
whoever is coming here after completing their portfolios watching all the 3 videos and here for the 4th...
I WISH YOU ALL THE BEST!
with so much love and gratitude for Alex!
Thanks Alex. You are still the best.
Thanks for watching!
Came here to say that if you're trying to run the df.corr() and it's trying to run the correlation math on string data columns, simply add in the argument df.corr(numeric_only=True)
Life saver, thank you mate! any idea why this is happening?
thankkkkkk you
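On the "why is this happening" question: as another comment here notes, since pandas 2.0 `numeric_only` defaults to False, so `df.corr()` tries to use every column and fails on text columns such as the movie name. A small sketch with made-up data showing two equivalent ways around it:

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "gross": [2.0, 4.1, 5.9, 8.2],
})

# Option 1: tell corr() to skip non-numeric columns
corr_a = df.corr(method="pearson", numeric_only=True)

# Option 2: select the numeric columns yourself, then correlate
corr_b = df.select_dtypes(include="number").corr(method="pearson")

print(corr_a)
```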
I might be missing something, please correct me if I'm wrong as I'm tired as I type:
Hasn't Alex mislabeled the first scatterplot? Isn't budget on x and gross on y? Whereas he has labelled the opposite. This is around the 30:00 mark.
Thank you so much again! Portfolio completed thanks to you! My first résolution of the year is done thanks to you! Happy new year 🎉🎉🎉
Thank you Alex, this has been a great project! You are a great teacher and this has been very helpful. Looking forward to everything you release in the future!
I love you so much. You really make my life easier. Thank you for putting out all of those helpful video. You are the best!
Almost done with this fantastic series. Excited for your upcoming video on data scraping. For future videos in this series could you possibly do one on APIs (making a project using some public API) and something on big data maybe?
Here at 11:00 when finding missing values write command as ---
pct_missing = np.mean(df[col].isnull())
OR you can write
pct_missing = df.isnull().mean().sort_values(ascending=False)
If there are missing values in Your dataset try to fill it up with 0, here as ----
df = df.fillna(0)
at 16:50 to get the Released year only write command as ----
df['yearCorrect'] = df['released'].astype(str).str.split(',').str[1].str.split('(').str[0]
at 28:20 to get the scatter plot first try to replace the Null with 0 using code ----
df.fillna(0, inplace=True)
Why is this code not working? İs there anyone who knows the reason please tell us!
for col_name in df_numerized.columns:
    if(df_numerized[col_name].dtype == 'object'):
        df_numerized[col_name] = df_numerized[col_name].astype('category')
        df_numerized[col_name] = df_numerized[col_name]
I am also wondering why
I have exactly the same question!
Alex, when I run the sort_values, it throws Avatar, Titanic, and Avengers: Endgame in to a negative gross. How do I fix this? I don't want to drop the whole column, leaving the negatives skews the data.
update: I got it fixed. But now when I am trying to do the correlation matrix, I get a ValueError stating could not convert string to float 'The Shinning'. I am not sure what this means.
Hi, Hollanda!
Try this. Worked for me :)
correlation_matrix = df.corr(method = 'pearson',numeric_only = True)
sns.heatmap(correlation_matrix, annot = True)
plt.title("Correlation matrix for Numeric Features")
plt.xlabel("Movie features")
plt.ylabel("Movie features")
plt.show()
I don't think that correlation with categorical data will work. Even after being turned into numbers, correlation and regression won't work in this case. The only way to introduce categorical data into correlation or regression is to turn it into multiple dummy variables.
Thanks for the awesome video. 4/4 what does this mean? the series is done :(
He doesnt know what he is doing. He is just directing People to wrong lanes.
@@tahsinserkanyaman3459 I don't think its right to say he doesn't know what he's doing. That's a little ridiculous. But yeah I don't think the correlation and linear regression really work well here with the categorical data.
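For reference, the dummy-variable idea mentioned above can be sketched with `pd.get_dummies`. The countries and gross values below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({
    "country": ["USA", "UK", "USA", "France"],
    "gross": [100.0, 80.0, 120.0, 60.0],
})

# Replace 'country' with one 0/1 column per country
dummies = pd.get_dummies(df, columns=["country"], dtype=int)
print(dummies)
```

Each new column answers a yes/no question ("is this movie from the USA?"), so a 1 versus a 0 actually means something, unlike an arbitrary code such as 53 or 54. With columns that have thousands of distinct values, like company, this creates a very wide table, so it may only be practical for the most common categories.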
Thanks Alex, waiting from so long for this#4 video.Thanks for sharing.
This project taught me a number of new things about pandas. Very helpful! Thank you Alex!
Problem: scatter plot with budget vs gross : TypeError: float( ) argument must be a string or a number, not 'NAType'
Solution: x=df['budget].astype('float')
y=df['gross'].astype('float')
plt.scatter(x,y)
missing : apostrophe at the end of budget
Now I have multiple new projects I can add to my portfolio. Thanks, Alex!
Thank you so much, Alex! I've taken a few courses about Python and yours is clear and awesome!
You're a monster, Alex!
Thanks a million (how they say here in Ireland)
God bless, man
Hey Alex. Kudos for the good work you do! Can't believe all these resources are free!
I had a doubt actually,
Am I the only one, or is it true that the dataset on Kaggle is slightly different than used in the projects? The columns are still the same but values have changed!
I have experienced this too.
I am facing this issue too
Thank you so so much Alex!! You have been a guiding light!!
Hello Alex! Can you elaborate on the meaning of the correlation matrix which uses df_numerized (48:50)? The numbers assigned to company or country for example, in my understanding, cannot be used for correlation like that.
that's what I was thinking too, since on the video, he assigned random numbers to replace company names, writers and stars, how can you figure out any correlation on those random values?
@5:20 You can also just highlight+copy the path in the address bar above instead of taking the extra steps to right click and go into properties to select the path. Much more efficient.
This one was a cool challenge! Required a bit of research on my part to better understand some steps taken (python for data analysis/visualization is not really my area), but well worth it. Great video Alex, thank you!
Anyone getting an issue where in the correlation graph only the top row is filling out the numbers? I looked it up and to have it fill out you need to write annot = True, but still getting issues. Only the name row is filled out.
Not apostrophe. It is single quotes. Thank you for the great tutorial!
Looks like there's been a mix-up with the axis labels on the graph. The 'budget' and 'revenue' labels are swapped. The 'budget' should actually be labeled as 'revenue' (250 million) and vice versa for the 'revenue' (a billion). Thanks!
You are the best plz do more videos about data cleaning with SQL server 😍😍😍
min 35:35 looking at correlation, it returns ValueError, anybody find out why? Or the solution?
35:46 if anyone is stuck like I was with the df.corr(method ='pearson') you can try this:
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr(method='pearson')
print(correlation_matrix)
Thank you so much sir for taking time out from such a busy schedule and coming up with an initiative of making such videos in order to help many people around the world interested in starting and developing a career in data analytics :-)
An easier way to calculate the percentage of missing value
df.isnull().sum().sort_values(ascending=False)/len(df)*100
Extract the year from released
df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(float)
df['yearcorrect'].fillna(0, inplace=True)
df['yearcorrect'] = df['yearcorrect'].astype(int)
alternative for calculating % of missing values: df.isnull().mean().sort_values(ascending=False)
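Both one-liners can be checked on a tiny made-up frame: `isnull().sum()` gives the count of missing values per column, and `isnull().mean()` gives the fraction, which becomes a percentage when multiplied by 100.

```python
import pandas as pd

# Hypothetical frame: one of four budget values is missing
df = pd.DataFrame({
    "budget": [1.0, None, 3.0, 4.0],
    "gross": [10.0, 20.0, 30.0, 40.0],
})

null_counts = df.isnull().sum()          # number of missing values per column
pct_missing = df.isnull().mean() * 100   # percentage of missing values per column

print(null_counts)
print(pct_missing.sort_values(ascending=False))
```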
Hey Alex,
I'm getting an error for using cat.codes at 45.41 ,
Its showing this error ( 'CategoricalDtype' object has no attribute 'cat' ).
A little help would be much appreciated.
i'm having the same problem. did you figure it out?
Good video. I learned how to use correlation matrices, which is new to me. The whole np.mean(df['col'].isnull()) is something I'm still trying to wrap my head around but for now I'll just hit the easy button on it.
I support your channel all the way. keep up the projects
I am wondering the x label is budget, right? As Alex did, x=df ['budget'] and xlabel is gross earning
# correlation exploration
# scatter plot: budget vs gross
plt.scatter(x= df_cleaned['budget'] ,y=df_cleaned['gross'])
plt.title('Relationship between Budget and Gross')
plt.xlabel('Budget')
plt.ylabel('Gross earning')
Hi Alex, can you clarify how cat.codes work? I tried researching more about them online but couldn't really wrap my head around it. They all look like random numbers. How can we be confident that our final correlation matrix actually worked the way we wanted to? Also do the cat codes take into account very similar names like the multiple variations of "Walt Disney". Thanks so much!
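A small sketch of what `cat.codes` does, using made-up company names. As far as I know the codes are not random: pandas sorts the distinct values (alphabetically for text) and each code is simply the position of the value in that sorted list.

```python
import pandas as pd

# Hypothetical company names, including two spellings of Disney
company = pd.Series(["Warner", "Disney", "Walt Disney Pictures", "Disney"])

as_category = company.astype("category")
categories = list(as_category.cat.categories)  # sorted distinct values
codes = as_category.cat.codes.tolist()         # position of each row's value in that list

print(categories)
print(codes)
```

So identical strings always get the same code, but slightly different spellings such as "Disney" and "Walt Disney Pictures" are treated as unrelated categories. And because the order is just alphabetical, a higher code does not mean "more" of anything, which is why several comments here question using these codes in a correlation matrix.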
Is anybody having issues with importing seaborn into Jupyter Notebook? I can't get it to recognize the module.
I learnt so much from this. Thanks Alex
I was looking forward for last one. You are the best.
The columns of the dataset in the Kaggle link are very different now. Does anybody know where I can find the original dataset?
Much Awaited! Very Excited
Hey Alex, just a heads up. The data set that is available in the link, is currently a little different from what you worked on. I keep getting errors because the data set that I downloaded has several NaN values.
Hi Alex, thanks for the helpful video.
I was wondering if the labels on scatter plot (30:00) for x and y should be reversed as x takes budget and y takes gross as an input.
You are absolutely right - woops!
@@AlexTheAnalyst I've followed through the video and now I want to try this with my own dataset, but worried if my dataset is too small compared to this example. What would be the good number to say that there is correlation between two variables?
was waiting for this one and looking forward for such projects
thank you so much :D
the dataset in that link is different from yours just a FYI
Thank You so much Alex for creating such amazing content. You are the best Teacher anyone could ask for.
Can someone help me? When I create yearcorrect @16:56 I'm getting the Month 'June' returned back instead of the year
Can't wait to finally finish everything! Thanks for creating these awesome guides Alex!