Also the released changed forms again and I used this to fix it # fix the date released format df['release_date'] = df.apply(lambda x: x['released'][0:x['released'].find(' (')],axis=1) df['release_date'] = pd.to_datetime(df['release_date'], infer_datetime_format=True)
At 13:40 If you are facing an error in datatype change, try the following :- df['budget'].round().astype('Int64') df['budget']=df['budget'].astype('Int64') hope it will help uh
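A note on the 'Int64' (capital I) dtype used above: it is pandas' nullable integer type, so missing values can stay missing instead of being replaced with 0. A minimal sketch on made-up numbers (the sample values are illustrative and not taken from the Kaggle file):

```python
import pandas as pd

# Illustrative budgets; None stands in for a missing value in the CSV.
budget = pd.Series([1_500_000.0, None, 237_000_000.0], name="budget")

# round() removes any fractional part so the cast to a nullable integer is safe.
budget_int = budget.round().astype("Int64")

print(budget_int.dtype)         # nullable integer dtype
print(budget_int.isna().sum())  # the missing row is still counted as missing
```

Whether to keep the gaps or fill them with 0 is a judgment call; keeping them avoids treating an unknown budget as a budget of zero.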
Hey guys, at 46:24, we can simply assign .copy() method to our new variable if we want to use for loop to iterate over our new variable without affecting the original dataframe or df: df_numerized = df.copy() for col_name in df_numerized.columns: if(df_numerized[col_name].dtype == 'object'): df_numerized[col_name] = df_numerized[col_name].astype('category') df_numerized[col_name] = df_numerized[col_name].cat.codes df_numerized
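The loop above is easier to read laid out over several lines. Here is a small runnable sketch of the same idea on a made-up two-column frame. One swapped-in detail, flagged as an assumption: it checks "not numeric" via pd.api.types.is_numeric_dtype instead of comparing the dtype to 'object', since that check should also work where text columns are not stored as object.

```python
import pandas as pd

# Tiny stand-in for the movies dataframe (made-up rows).
df = pd.DataFrame({
    "company": ["Warner Bros.", "Columbia Pictures", "Warner Bros."],
    "gross": [46_998_772, 58_853_106, 83_453_539],
})

df_numerized = df.copy()  # work on a copy so df keeps its text columns

for col_name in df_numerized.columns:
    if not pd.api.types.is_numeric_dtype(df_numerized[col_name]):
        # Turn each distinct string into an integer code.
        df_numerized[col_name] = df_numerized[col_name].astype("category").cat.codes

print(df_numerized)
```

Because of the .copy(), df still holds the company names afterwards, so the codes can be compared with the original text.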
To whoever noticed that the 'released' column is not in the same format that Alex has and is getting errors because of that (15:27): I've been where you are. It took me 4 days just to figure this out. Here is the line of code you need: df['released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format = '%B %d, %Y') Hope it helps.
@@karanikabj4422 Believe me when I tell you I wish I could, xd, but it's basically my very first time using Python, so I don't fully understand it, unlike R and SQL, which I know well. At least I know how to research. What I understand is that we use the pandas function to_datetime, exclude the '(United States)' part, and set the format to match the rest of the date. This is applied to the whole column, and the result is assigned back to a column with the same name, which basically overwrites the original column.
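To make that explanation concrete, here is the same line split into two steps on a couple of made-up rows in the newer 'released' layout. The sample strings are assumptions for illustration; only the regex and the format string come from the comment above.

```python
import pandas as pd

# Made-up sample in the newer 'released' layout: a date followed by a country.
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 2, 1980 (United States)",
])

# 1) Keep only the "Month day, year" part; the country in parentheses is dropped.
date_text = released.str.extract(r"(\w+ \d+, \d+)", expand=False)

# 2) Parse that text: %B = full month name, %d = day, %Y = four-digit year.
parsed = pd.to_datetime(date_text, format="%B %d, %Y")

print(parsed.dt.year.tolist())
```

Once the column is a real datetime, parsed.dt.year gives the year directly, so no further string slicing is needed.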
Thank you very much for the video! I want to change my career path to data analytics, and your videos have been a very good learning material. Although the data has been updated and some of the methods in this video do not work anymore, it is a fantastic guidance (and, ultimately, to become good at something, you have to do a fair share of self-study). One thing to note though: I don't think the pearson correlation coefficient can be used to check the relationship between a categorical and a continuous variable. So, the low correlation coefficient for company, for example, might be misleading. Since, after all, the numeric ID assigned to the string values does not necessarily increase with the size of the company.
The dataset is updated and is not the same as the one in the video, if you guys have problems in the 'Create correct year section' you can do a split of the data to get only the year df['yearcorrect'] = df['released'].astype(str).str.split().str[2]
whoever is coming here after completing their portfolios watching all the 3 videos and here for the 4th... I WISH YOU ALL THE BEST! with so much love and gratitude for Alex!
I am doing this data set as of today. Tip: before converting the 'budget' and 'gross' columns, look for any null values. I downloaded the data set recently and had some, and it threw an error during the conversion, so just make sure the NaN values in both columns are set to 0 before converting. When creating the 'yearcorrect' column, try splitting with .str.split(',', n=1, expand=True) and then use df['yearcorrect'] = year.astype(str).str[:5] to get the year out of the released column. I did it the way shown in the video but got the month, so if that works for you it's fine; otherwise try the method above. This gets things done. Thank you, and thank you Alex, you are doing a great job🙂
Thank you so much for your videos, they helped me a lot, and I finally got a job as a Data Engineer with no experience in this role, having learned from your channel in 1 month! I gained a lot of knowledge. I really appreciate your support. Thank you very, very much!
Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from Brazil :)
I can't believe I followed along and understood everything. I wasn't even sure if I would be able to before I started. Thank you so much Alex! With your help, I've gained more confidence in pursuing a career in data analytics. I'm definitely going to do more of your projects and hope to be able to land a full-time data analytics job this year. Thanks again!
Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from the UK :)
Hi Alex, thank you so much for all the videos. OK, here is the thing: I haven't taken this class. I was actually learning Python before seeing your page and decided to learn SQL. I watched all the videos you have on SQL and did the first 3 portfolio projects on SQL and Tableau. Then I went to StrataScratch and registered for the free option, which gave me access to 50 interview questions: some easy, some medium, and some hard. The interesting thing is, I was only able to answer 1 question from the easy ones and couldn't answer the others. That almost made me feel discouraged, but I think I just need to spend more time on SQL tutorials before moving back to Python. I would like to hear your opinion and that of others who had a similar experience. Again, thank you so much for all your effort, you are touching lives!
All you folks getting data analyst jobs left and right, could you give a glimpse of how you presented this project on your CVs? Alex some help would be great.
Hi Alex. Thank you for the portfolio project series. For the missing values, I think the 0.0% it showed for every column has been approximated. If you use describe() and info(), you will notice some null values. Thanks again for the videos, they are really helpful.
that's what I thought at first, too, but the data set has simply changed since he uploaded the video (or he used an already edited one). So now there are a few columns that even have values like 0.28%...
Soooooooooooooooooooooooo excited for the last website video to come out For the first time in my 19 years of living, i feel pretty confident of making something to its perfection by myself (and your help too🙌)
This 4 part tutorial is pure gold! After your announcement that you were launching your version of data analyst course/certification, can’t wait for when it goes live, as to follow up in more depth for the concepts presented in this series. Really appreciate the time, dedication and quality of content you produce Alex.
Another excellent portfolio project from Alex! My portfolio is starting to look very good, and I finally have something to upload to job applications that request a portfolio! Thank you, Alex!!
The 'released' column is updated; it now comes with a text-format date and the country of release. What I did was split the column in two: release date and country of release. The code I used was this: df[['released','country_release']] = df['released'].str.split(r' \(',n=1,expand=True) Then you have to clean up the 'country_release' column a little bit with: df['country_release'] = df['country_release'].str.replace(')','',regex=False) And finally give the 'released' column the datetime format with this: df['released'] = pd.to_datetime(df['released'],format='mixed') For some reason using format = 'mixed' did the magic trick for me; I tried '%B %d, %Y' but it never worked.
I made it to the 4th part. Thanks Alex for this tutorial. For those who can't get the released date, you can use this code: df['format_released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format='%B %d, %Y')
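A variation on the split described above, as a rough sketch: one regex with two capture groups pulls out the date text and the country in a single pass, and errors="coerce" turns rows that do not match the date format into NaT instead of raising. The second sample row is hypothetical; rows without a month and day would be one possible reason a strict '%B %d, %Y' parse fails, but that has not been checked against the dataset here. The column names release_date_text and release_date are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"released": [
    "June 13, 1980 (United States)",
    "1982 (Japan)",  # hypothetical row without month and day
]})

# Two capture groups: the text before " (" and the text inside the parentheses.
parts = df["released"].str.extract(r"^(.*?) \((.*)\)$")
df["release_date_text"] = parts[0]
df["country_release"] = parts[1]

# Rows that do not follow "Month day, year" become NaT rather than raising an error.
df["release_date"] = pd.to_datetime(
    df["release_date_text"], format="%B %d, %Y", errors="coerce"
)

print(df[["release_date", "country_release"]])
```

Counting df["release_date"].isna().sum() afterwards shows how many rows did not follow the expected pattern.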
Thank you, Alex! I learned so much. Anyone's correlation matrix doesn't work? Need to add 'numeric_only = True'. Now the default is false. correlation_matrix = df.corr(method = 'pearson',numeric_only = True)
This one was a cool challenge! Required a bit of research on my part to better understand some steps taken (python for data analysis/visualization is not really my area), but well worth it. Great video Alex, thank you!
Sir, I am a civil engineer doing my masters during my thesis I got some work of machine learning and then through your channel once I presented my data in tableau my supervisor gave me extra credit thankyou to you...now I am thinking of switching to this field thankyou for your efforts. My question is after learning the required skills how can I start applying in the companies? Second please start an interview series where you discuss how and what type of technical questions are asked in the interview.
@@AlexTheAnalyst I am comparatively new to the channel; I will surely go through those videos. Thanks from all the student community to you for your great contribution in our guidance and learning. Looking forward to learning more insights from the channel 🙌🏽 Love from India 🇮🇳 to Alex the Analyst.
Hi Alex, great job you are doing in your channel, thank you very much. I just wanted to say, if it might help anybody who is watching you (because I believe that you already know it after two years), that a correlation measures the reaction of one value against the movement of another value. For this, both values need to be able to move and get bigger or smaller, something that letters (and names, in consequence) cannot do, neither can letters or names disguised as numbers, because those are static too, hence, the "non numeric" correlations showed in the video, are false. This problem could just be a mess up in any project, but, in a project that is our portfolio, our showroom, it will only show how much of a data analyst we are NOT. Kindly study a work around to this statistical problem, which is not to change letters by numbers. So many great developers behind Pandas would have implemented it many years ago. 😀
Almost done with this fantastic series. Excited for your upcoming video on data scraping. For future videos in this series could you possibly do one on APIs (making a project using some public API) and something on big data maybe?
Thank you Alex, this has been a great project! You are a great teacher and this has been very helpful. Looking forward to everything you release in the future!
To everyone getting error for df.corr() this was my fix: # since pandas version 2.0.0 now you need to add numeric_only=True param to avoid issue df.corr(method='pearson', numeric_only=True) #pearson, kendall, spearman --- correlation_matrix = df.corr(method='pearson', numeric_only=True) sns.heatmap(correlation_matrix, annot=True) plt.show()
@5:20 You can also just highlight+copy the path in the address bar above instead of taking the extra steps to right click and go into properties to select the path. Much more efficient.
Hey Alex. Kudos for the good work you do! Can't believe all these resources are free! I had a doubt actually, Am I the only one, or is it true that the dataset on Kaggle is slightly different than used in the projects? The columns are still the same but values have changed!
Here at 11:00 when finding missing values write command as --- pct_missing = np.mean(df[col].isnull()) OR you can write pct_missing = df.isnull().mean().sort_values(ascending=False) If there are missing values in Your dataset try to fill it up with 0, here as ---- df = df.fillna(0) at 16:50 to get the Released year only write command as ---- df['yearCorrect'] = df['released'].astype(str).str.split(',').str[1].str.split('(').str[0] at 28:20 to get the scatter plot first try to replace the Null with 0 using code ---- df.fillna(0, inplace=True)
Somewhere around 14:05, where Alex was converting the gross and budget columns to int64, his code wouldn't work for me, but after some research I found this to work: df['gross'] = df['gross'].fillna(0).astype('int64') df['budget'] = df['budget'].fillna(0).astype('int64')
Came here to say that if you're trying to run the df.corr() and it's trying to run the correlation math on string data columns, simply add in the argument df.corr(numeric_only=True)
Hey Alex - Thank you for this. Right around 49:25 you talk about the correlation matrix of the df_numerized dataframe that is being shown as a heatmap. I do have a question about that....: when you did .cat.codes in the cells above, how did the category values of the previous objects (company, country, director) represent any value that can be correlated? For instance, using one row as an example, I'm confused how index 6380 at the top of the dataframe has a company categorical value of 1428. Is this by random or did the code construct some sort of logical thinking and gave a numeric value based on other data patterns?? .... Sorry if I am confusing you, it's just when I got to the heatmap part of the df_numerized dataframe I was kind of lost as to how categories can actually represent correlations if the categorical value given to it was completely random. thanks,
i thought the same. There is no correlation between a random number with gross. only possible if by any luck you get a random higher number for your movie that has also a high budget. then=high correlation
If you downloaded the dataset after the video, it seems it might have some missing values that prevent you from converting columns into integers. Use df = df.fillna(0)
An easier way to calculate the percentage of missing value df.isnull().sum().sort_values(ascending=False)/len(df)*100 Extract the year from released df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(float) df['yearcorrect'].fillna(0, inplace=True) df['yearcorrect'] = df['yearcorrect'].astype(int)
Good video. I learned how to use correlation matrices, which is new to me. The whole np.mean(df['col'].isnull()) is something I'm still trying to wrap my head around but for now I'll just hit the easy button on it.
Thank you so much sir for taking time out from such a busy schedule and coming up with an initiative of making such videos in order to help many people around the world interested in starting and developing a career in data analytics :-)
At 13:40 If you are facing an error in datatype change, try the following df['gross'] = pd.to_numeric(df['gross'], errors='coerce', downcast='integer') df['gross'].isna().sum() df[df['gross'].isna()] df['gross'].fillna(0, inplace=True) df['budget'] = df['budget'].fillna(0) df['budget'] = df['budget'].astype('int64') df['gross'] = df['gross'].round().astype('int64')
started doing the project and noticed Kaggle data slightly different from one in the video. there were some negative numbers in the gross column. to change that to positive had to run this code # apply conditional function to the column containing negative numbers df['gross'] = df['gross'].apply(lambda x: abs(x) if x < 0 else x)
Looks like there's been a mix-up with the axis labels on the graph. The 'budget' and 'revenue' labels are swapped. The 'budget' should actually be labeled as 'revenue' (250 million) and vice versa for the 'revenue' (a billion). Thanks!
08/02/2022 - I'm using the dataset as of this date and there have been many changes; unfortunately the pct_missing is not 0.0%. I copied the content of df into a new dataframe that I called Newdf and then deleted the rows: Newdf = df.dropna(axis=0) print(Newdf.isnull().sum(), '\n')
I might be missing something, please correct me if I'm wrong as I'm tired as I type: Hasn't Alex mislabeled the first scatterplot? Isn't budget on x and gross on y? Whereas he has labelled the opposite. This is around the 30:00 mark.
Thank you Alex for making this project free. I am making a career change and am pretty new to this field. I am wondering if this level of project is sufficient for an entry-level position yet, or does it need to be trickier? I hope that it is enough for us to start applying for jobs. Thanks a ton.
Hey Alex, Thank you so much for these videos, they’re incredibly helpful for aspiring data analysts like myself! I have an interview in two weeks for an entry level data analyst position and I’m pretty nervous not having any previous experience as a data analyst… I was wondering if you offered any consulting services ? Thank you again !
I'm so glad to hear that! I do, but I'm quite booked lately so I don't know if I can fit anything in in the next 2 weeks. You can always email me at AlexTheAnalyst95@gmail.com and we can chat about it.
35:46 if anyone is stuck like I was with the df.corr(method ='pearson') you can try this: numeric_df = df.select_dtypes(include=[np.number]) correlation_matrix = numeric_df.corr(method='pearson') print(correlation_matrix)
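A self-contained version of that fix on a made-up frame, for anyone who wants to see it run before trying it on the real data. The column values are invented; select_dtypes keeps only the numeric columns so the text column never reaches corr().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],   # text column that trips up a plain df.corr()
    "budget": [10, 20, 30, 40],
    "gross": [25, 45, 70, 85],
})

# Keep numeric columns only, then compute the Pearson correlation matrix.
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr(method="pearson")

print(correlation_matrix)
```

On recent pandas versions, df.corr(numeric_only=True), as mentioned in other comments, should give the same matrix without the extra step.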
I actually decided not to do this one..... Yet. I have ZERO experience with Python so I want to get familiar with it first. I will be back for this one though!
At 46:00, if after running the code, 'name' column hasn't numerized, change the datatype of name in the csv to string. For example, if the name is 21 in the csv, change it to '21' with the quotation marks so that the value becomes a string. Do this for all numbers so that they become string. Save. Then re-run the code. Should fix it.
For the Pearson correlation there is a couple of assumptions to timeseries under consideration. The timeseries should be normally distributed, linearly dependent and homoscedastic. In the video you've only checked the linearity. What about other assumptions? Thanks.
I don't think that correlation with categorical data will work. Even after being turned into numbers, correlation and regression won't work in this case. The only way to introduce categorical data into correlation or regression is to turn it into multiple dummy variables. Thanks for the awesome video. 4/4, what does this mean? The series is done :(
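For anyone unsure what "dummy variables" look like in pandas, here is a minimal sketch using pd.get_dummies on invented rows. Each country gets its own 0/1 column instead of one arbitrary integer code.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "United States", "France"],
    "gross": [300, 120, 280, 90],
})

# One 0/1 column per country instead of a single arbitrary integer code.
dummies = pd.get_dummies(df["country"], prefix="country", dtype=int)
df_dummies = pd.concat([df[["gross"]], dummies], axis=1)

print(df_dummies)
```

With hundreds of companies this creates hundreds of columns, so it may only be practical for the smaller categorical columns.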
@@tahsinserkanyaman3459 I don't think its right to say he doesn't know what he's doing. That's a little ridiculous. But yeah I don't think the correlation and linear regression really work well here with the categorical data.
Hi Alex, can you clarify how cat.codes work? I tried researching more about them online but couldn't really wrap my head around it. They all look like random numbers. How can we be confident that our final correlation matrix actually worked the way we wanted to? Also do the cat codes take into account very similar names like the multiple variations of "Walt Disney". Thanks so much!
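On how cat.codes works: the codes are not random. When a text column is converted with astype('category'), the categories are the sorted unique values, and each code is that value's position in the sorted list. The order is alphabetical, which still says nothing about revenue. A small sketch with invented company names shows this, and also that similar names are treated as separate categories:

```python
import pandas as pd

company = pd.Series([
    "Walt Disney Pictures",
    "Columbia Pictures",
    "Walt Disney Productions",  # similar name, still a separate category
    "Columbia Pictures",
])

as_category = company.astype("category")

# Categories are the sorted unique values; each code is a position in that list.
print(list(as_category.cat.categories))
print(as_category.cat.codes.tolist())
```

If the Disney variants should count as one company, they would need to be merged into a single name before the conversion.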
Thank you for another tutorial, Alex. Really appreciate the effort you put in your explanations. I'm really looking forward to the next video in the series. Watching from halfway across the world in Cape Town. #savingsoulsfromworkplaceembarassement
Update: Alex, I just accepted my first job as a junior data analyst. This completes my 6-month journey to learn data analytics and change careers, and I could not have done it without your excellent Portfolio videos. Thank you so much for making these available to your viewers for free. After I built my portfolio, companies started taking a second look at my resume and inviting me to interviews. BEFORE the portfolio, I received ONLY rejection emails. Thank you, thank you, thank you!
Congratulations!!
I am getting only rejection
@@Datalover-Analysts Hi Pooja, I’m sorry. I know rejection can be discouraging. I received over 100 rejection emails from job applications before I finally started getting interviews. Without knowing the details of your situation, I can only encourage you to keep trying and don’t give up. Everyone’s journey to data is different, but I don’t think it is ever easy, especially if you’re trying to change careers, which is what I was doing. I wish you the best.
@@danielbristow6954 Sure, I am making portfolio with the help of Alex videos. Did some certification from coursera and Azure Fundamentals too
Hey Daniel, congratulations!!! Can you please also mention what certifications you did? With the help of Alex’s platform I am building my portfolio. Also, I completed my MSCS degree this month.
If anyone else is having issues due to IntCastingNanError, I advise to try the following:
df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(int)
df['gross'] = pd.to_numeric(df['gross'], errors='coerce').fillna(0).astype(int)
it worked! :) Thank you Alex for your amazing videos!
Thank you !!! I almost gave up as i am not too versed in python to make these changes as the data for the original set he worked on has changed.
thank you ❤
Thank you so much for this, I was trying to google it before realizing it might be in the comments. If you have the time can you explain this part of the code? errors='coerce').fillna(0).astype(int)
I did look it up but was getting a little confused by it. Thank you again :)
thank this was a lot of help. i used this to avoid the int32 and it worked.
df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(np.int64)
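Since the question above asks what errors='coerce').fillna(0).astype(int) does, here is the chain broken into its three steps on a made-up column. The values are illustrative, and np.int64 is used as in the last comment to make the integer width explicit.

```python
import numpy as np
import pandas as pd

# Made-up column: a clean number, a missing value, and a non-numeric string.
budget = pd.Series([19000000.0, np.nan, "unknown"], dtype="object")

step1 = pd.to_numeric(budget, errors="coerce")  # anything unparseable becomes NaN
step2 = step1.fillna(0)                          # NaN -> 0 so an integer cast is possible
step3 = step2.astype(np.int64)                   # explicit 64-bit integers

print(step1.tolist())
print(step3.tolist())
```

The cast fails without the middle step because NaN has no integer representation. Note that filling with 0 is a shortcut: a movie with an unknown budget then looks like a movie with a budget of 0.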
Top notch! Thanks
Hello! The dataset appears to be updated on Kaggle and for anyone new, you will run into some issues that you need to fix to follow along.
1. Missing data. Unlike in this video, there are now missing values, so you will need to fix that. There are many ways to handle missing values but for the sake of time, I decided to drop all rows that have missing data. You will have about 71% of your data remaining. You will need to run the following if your dataframe is named df.
df = df.dropna()
2. Extracting the year is different as the formatting is different. Running the following should extract the correct year.
df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(int)
3. Duplicates, there aren't any in this dataset so you should be fine on that.
I hope this helps anyone that is working on this and best of luck on your analytics journey!
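Steps 1 and 2 above, sketched on a few invented rows so the effect is visible. The data and the names clean, share_kept, and year are made up for the example; the regex is the one from the comment. The pattern takes the first run of four digits, which works here because the day is at most two digits long.

```python
import pandas as pd

df = pd.DataFrame({
    "released": [
        "June 13, 1980 (United States)",
        "July 2, 1980 (United States)",
        "May 9, 1986 (United States)",
    ],
    "budget": [19000000.0, None, 6000000.0],
})

# Step 1: drop every row that has at least one missing value.
clean = df.dropna()
share_kept = len(clean) / len(df)

# Step 2: the first run of four digits in 'released' is taken as the year.
year = clean["released"].str.extract(r"([0-9]{4})", expand=False).astype(int)
clean = clean.assign(yearcorrect=year)

print(f"{share_kept:.0%} of rows kept")
print(clean)
```

Printing share_kept on the real file is a quick way to check the "about 71%" figure against your own download.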
Sir you are a hero
thank you, I have just made amendments after reading your comment and other comments, and found a way to check and verify the status before and after execution, thanks
Thank you so much!
Thanks, I was stuck at extracting the correct year and now can finally solve it!
Thank you. I could not figure the year out.
Hey all, just a "stats" heads up/correction you might want to make for your portfolio:
In this video, Alex wanted to see if the company was "correlated" with gross revenue. What he did was assign values (randomly, I think) to companies, countries, etc. Then he tried to see if those values were related to the gross revenue.
Those randomly assigned values are "measuring" the company, country, etc at the Nominal scale. In other words, they're essentially just being used as a numeric "name"-the values themselves don't mean anything. What that means is that one value being higher than the other doesn't represent an increase in the thing being measured (for example, the USA was assigned a 54 and the UK was assigned a 53. Those are just names... the USA isn't one more of something than the UK).
Because the values themselves don't represent anything, it doesn't make sense to do a correlation with them.
Correlations tell us, as one variable increases, what happens to the other? So in the first question, as the budget increased, what happened to revenue? It increased. But with country, company, or other categorical variables, correlations don't make sense. The values for country and company are random, so the numbers that represent them going up doesn't tell us anything. It's no wonder then, that the correlations weren't large.
Instead, it would make sense to do a t-test or ANOVA and compare means. In that case, the question would be, "Do some companies tend to produce higher revenue than others?" Or, "Do some countries tend to produce higher revenue?" etc. (For more discussion, see: www.webpages.uidaho.edu/~stevel/519/Correlation%20between%20a%20nominal%20(IV)%20and%20a%20continuous%20(DV)%20variable.html).
Since this is a portfolio project and you want to show potential employers the result, maybe just take that part out-you wouldn't want to make a mistake like that in an application to a potential employer!
(Alex, thanks so much for doing these videos! They're super helpful and I'm very very grateful!)
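A tiny demonstration of the point above, on invented numbers. The same companies and the same revenues give a different Pearson coefficient as soon as the integer labels are shuffled, while the per-company means (what a t-test or ANOVA would compare) do not depend on the labels at all. The company names and the alternative labelling are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"],
    "gross": [100, 120, 10, 20, 300, 320],
})

# Labelling 1: pandas category codes (alphabetical: Alpha=0, Beta=1, Gamma=2).
codes_alpha = df["company"].astype("category").cat.codes
r_alpha = codes_alpha.corr(df["gross"])

# Labelling 2: an equally arbitrary assignment of the same three integers.
relabel = df["company"].map({"Beta": 0, "Alpha": 1, "Gamma": 2})
r_relabel = relabel.corr(df["gross"])

# What stays the same regardless of labels: mean gross per company.
group_means = df.groupby("company")["gross"].mean()

print(round(r_alpha, 3), round(r_relabel, 3))
print(group_means)
```

If a portfolio needs a statement about companies and revenue, reporting group means (or a formal ANOVA) is on safer ground than the correlation on codes.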
Thank you so much for that clarification. I was so much confused and spent a lot of time wondering how random values made sense in determining a correlation.
where do we make the corrections?
I noticed that in my dataset, avatar has a gross revenue of -2,147,483,648, and it just feels wrong. Is there something I am not doing right?
I just noticed that converting to int type gave me this error
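About the -2,147,483,648: that is exactly the smallest value a 32-bit integer can hold, which suggests (an assumption, not checked against the commenter's setup) that astype(int) resolved to int32 on that machine and the real gross did not fit. The quick check below uses an illustrative gross above two billion; asking for 'int64' explicitly, as several comments here do, avoids the problem.

```python
import numpy as np

avatar_gross = 2_847_246_203          # illustrative value above two billion

int32_max = np.iinfo(np.int32).max    # 2_147_483_647
int32_min = np.iinfo(np.int32).min    # -2_147_483_648

fits_in_int32 = avatar_gross <= int32_max
fits_in_int64 = avatar_gross <= np.iinfo(np.int64).max

print(fits_in_int32, fits_in_int64)
```

If this is the cause, taking the absolute value of the negative number would not recover the real figure; re-running the conversion with astype('int64') on the original column would.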
I can't wait for the beginner, intermediate, and advanced Python series by Alex the Analyst. It's what the people want, besides a happy Alex.
They're coming! :D
@@AlexTheAnalyst
Hey Alex, please some of we newbies are still waiting for your python for beginners series
@@AlexTheAnalyst when ?🥺
Did they come already?
@@salehfiroozabadi8068 it is happening in the coming months. Alex is posting at the moment the Power BI series.
I really appreciate the fact that you did not edit out the parts where you made "mistakes" and actually fixed them.
If you are facing an error in datatype change, try the following
df_copy = df.copy()
df_copy['budget'] = df_copy['budget'].astype('int64')
df_copy['gross'] = df_copy['gross'].astype('int64')
df_copy
Thank you Alex for this amazing video
At 11:08, instead of printing null percent, we can use:
for col in df.columns:
    print(df[col].isnull().value_counts(), "\n")
This will print how many values are null. Cause you might have 1 missing in 10k values, and you will need high precision in decimals.
Honestly, you're an absolute legend. You really break down some of the technical barriers that exist for people entering the field of data science. You really are gem to the community.
there are some missing value in this dataset
Alex try this instead of that for loop statement
df.isnull().sum()
this will give total number of nulls for every column/variable
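A small sketch of the count-based check, next to the percentage version, on made-up data (the column names are borrowed from the dataset; the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values.
df = pd.DataFrame({
    "budget": [10.0, np.nan, 30.0, 40.0],
    "gross": [1.0, 2.0, np.nan, np.nan],
})

null_counts = df.isnull().sum()          # number of nulls per column
null_percent = df.isnull().mean() * 100  # share of nulls per column, in percent
print(null_counts)
print(null_percent)
```

Counts are easier to read when only a handful of rows are missing, since a rounded percentage can show up as 0.0%.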
I have recently decided on becoming a data analyst and your videos are really helping me understand what i need to do and keep me motivated on that goal which will improve my life. I want to say thank you for your content and your honest helpfulness.
In 29:48 , I think x should be 'Budget' and y as 'Gross earning'
You are absolutely right! Woops!
Hey guys the info got updated since this video was posted. While I was going through the project I was able to google the problems as they came up.
In case you guys get stumped here's what I found that works:
This will drop any rows with null values
df = df.dropna(how='any',axis=0)
This will add the released date column into a separate column
df['yearcorrect'] = df['released'].astype(str).str.split(', ').str[-1].astype(str).str[:4]
Let me know if that works for y'all
I still get the error name 'df' is not defined
thank you so much for the solution for updated dataset. Your solution save me from struggling on updated release date
i droped the rows but i think it s just dropping temporarily, because if i scatterplot after that it is still showing it has na values.
@@shyamkumar6009 I think you should use df = df.dropna(how='any', axis=0) to drop the null values permanently (or df.dropna(how='any', axis=0, inplace=True) on its own line, without assigning the result back to df).
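On the "dropping temporarily" question: `dropna` returns a new DataFrame and leaves the original alone unless the result is assigned back. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the first row is complete.
df = pd.DataFrame({"budget": [10.0, np.nan, 30.0], "gross": [1.0, 2.0, np.nan]})

# Calling dropna() without keeping the result leaves df unchanged.
df.dropna(how="any", axis=0)
rows_before = len(df)

# Assigning the result back makes the change stick.
df = df.dropna(how="any", axis=0)
rows_after = len(df)

print(rows_before, rows_after)
```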
Also the released changed forms again and I used this to fix it
# fix the date released format
df['release_date'] = df.apply(lambda x: x['released'][0:x['released'].find(' (')],axis=1)
df['release_date'] = pd.to_datetime(df['release_date'], infer_datetime_format=True)
At 13:40 If you are facing an error in datatype change, try the following :-
df['budget'].round().astype('Int64')
df['budget']=df['budget'].astype('Int64')
hope it will help uh
Hey guys, at 46:24, we can simply assign .copy() method to our new variable if we want to use for loop to iterate over our new variable without affecting the original dataframe or df:
df_numerized = df.copy()
for col_name in df_numerized.columns:
    if df_numerized[col_name].dtype == 'object':
        df_numerized[col_name] = df_numerized[col_name].astype('category')
        df_numerized[col_name] = df_numerized[col_name].cat.codes
df_numerized
I don't know how much time this saved me but it would have been a lot.
To whoever noticed that the 'released' column we have is not in the same format that Alex has, and is getting errors because of that (15:27):
i've been where you were, it took me 4 days just to figure this out, here is the line of code you need:
df['released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format = '%B %d, %Y')
hope it helped..
that worked! Thankyou. But can you explain your code pls?
@@karanikabj4422 believe me when I tell you I wish I can xd
but because it's basically my very first time using python so i don't really fully understand it unlike R and SQL that i know well, but at least i know how to research
but what I understand is that basically we use the pandas function to_datetime, exclude the '(United States)' part, and set the format to match the rest of the date, applying it to the whole column and at the same time assigning the outcome to a column with the same name, which basically overwrites the original column
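Broken into steps, that one-liner could be read as in the sketch below. The sample strings are hypothetical values in the "Month day, year (Country)" format; the intermediate names `date_text` and `release_date` are only for illustration.

```python
import pandas as pd

# Hypothetical sample in the "Month day, year (Country)" format.
released = pd.Series(["June 13, 1980 (United States)", "July 2, 1980 (United States)"])

# Capture "Month day, year" and discard the trailing "(Country)" part.
date_text = released.str.extract(r"(\w+ \d+, \d+)", expand=False)

# Parse the captured text: %B = full month name, %d = day, %Y = four-digit year.
release_date = pd.to_datetime(date_text, format="%B %d, %Y")

# Once the column is a real datetime, the year is available directly.
year = release_date.dt.year
print(year.tolist())
```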
I don’t get the year like Alex does only the months
Man i'm so grateful, you won't believe how much time i was stuck on this.
thanks☺☺
Thank you very much for the video! I want to change my career path to data analytics, and your videos have been a very good learning material. Although the data has been updated and some of the methods in this video do not work anymore, it is a fantastic guidance (and, ultimately, to become good at something, you have to do a fair share of self-study).
One thing to note though: I don't think the pearson correlation coefficient can be used to check the relationship between a categorical and a continuous variable. So, the low correlation coefficient for company, for example, might be misleading. Since, after all, the numeric ID assigned to the string values does not necessarily increase with the size of the company.
The dataset is updated and is not the same as the one in the video, if you guys have problems in the 'Create correct year section' you can do a split of the data to get only the year
df['yearcorrect'] = df['released'].astype(str).str.split().str[2]
Thanks mate
Fantastic, bro! Thanks!
what about min 35:35 looking at correlation, it returns ValueError, anybody find out why? Or the solution?
Sweet!
Thank you bro i was stuck in it for a long time
whoever is coming here after completing their portfolios watching all the 3 videos and here for the 4th...
I WISH YOU ALL THE BEST!
with so much love and gratitude for Alex!
Jus on today date i am doing this data set.
Tooltip:
FYI, before converting the 'budget' and 'gross' columns, look for any null values; I downloaded the dataset recently and had some.
They threw an error during the conversion, so make sure the NaN values in both columns are set to 0 before converting.
And when creating the 'year corrected' column, try splitting it with .str.split(',', n=1, expand=True) and then use df['yearcorrect'] = year.astype(str).str[:5]
This is for getting the year out of the released column. I did it the way shown in the video but got the month, so if that works for you, fine; otherwise try the method above.
This get things done
Thank u
And also thank u Alex you are doing a great job🙂
Thank you so much for your videos, they helped me a lot, and I finally got a job as a data engineer with no experience in this role, after learning from your channel in 1 month! I gained a lot of knowledge. I really appreciate your support. Thank you very, very much!
Congratulations
As The Rock says; "FINALLY!"
I'm a bit embarrassed by how excited I get when an ATA video clocks in at over an hour...
Hahaha 😁
Hi Alex, just finished the project. It’s awesome. Thanks for everything. I pray for your success in the future.
The only thing that encourages me to watch is your smile
Keep smiling 🙏❤️
Haha I hope the high quality video content also makes you smile 😁
Hey Alex, please make videos on, how to handle missing/null values in python.
This video came in just on time.
I finished building my portfolio yesterday. Thank you for the tips.
Can you provide some tips? I have worked on projects but having troubles with how to present and display the projects? Can you share your link
Man, thank you so much for this, I know you've put a lot of effort into this project serie and I can definitely say that i'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people to pursuit their dreams! Greetings from Brazil :)
I can't believe I followed along and understood everything. I wasn't even sure if I would be able to before I started. Thank you so much Alex! With your help, I've gained more confidence in pursuing a career in data analytics. I'm definitely going to do more of your projects and hope to be able to land a full-time data analytics job this year. Thanks again!
Woohoo! You're doing great!
i see its barely 4 months since you made this comment. i have an issue with the scatter plot, can you help me out?
If df.corr() shows the error that a string can't be converted to a number, pass the parameter numeric_only=True:
df.corr(numeric_only=True)
Thanks
love you soooo
@@ajibadeabdulateef2818 Hero
This was super helpful for me, thanks. Used df.corr(numeric_only=True)
Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from the UK :)
Hi Alex, thank you so much for all the videos. OK, here is the thing: I haven't taken this class. I was actually learning Python before seeing your page and decided to learn SQL. I watched all the videos you have on SQL and did the first 3 portfolio projects on SQL and Tableau. Then I went to StrataScratch and registered for the free option, which gave me access to 50 interview questions: some easy, some medium, and some hard. But the interesting thing is, I was only able to answer 1 question from the easy ones and couldn't answer the others. That almost made me feel discouraged, but I am thinking I just need to spend more time on SQL tutorials before moving back to Python. I would like to hear your opinion, and from others who had a similar experience. Again, thank you so much for all your effort, you are touching lives!
All you folks getting data analyst jobs left and right, could you give a glimpse of how you presented this project on your CVs? Alex some help would be great.
Hi Alex. Thank you for the portfolio project series.
For the missing values, I think the 0.0% it showed for every column has been approximated. If you use describe() and info(), you will notice some null values.
Thanks again for the videos, they are really helpful.
that's what I thought at first, too, but the data set has simply changed since he uploaded the video (or he used an already edited one). So now there are a few columns that even have values like 0.28%...
@@synaestheticVI yes can anyone help that what should we do in that situation?
Very well instructed! Way better than any of the BootCamp lectures I had gotten previously. Perfect for a refresher and portfolio work. Thank you!
Soooooooooooooooooooooooo excited for the last website video to come out
For the first time in my 19 years of living, i feel pretty confident of making something to its perfection by myself (and your help too🙌)
5:20 Faster way to do this is to shift right-click the file and copy as path.
The "apostrophes" are just called single quotes
' ' -> single quotes and {} -> curly brackets. Just in case you have not already received a similar answer. Either way, keep up the great work :)
This 4 part tutorial is pure gold! After your announcement that you were launching your version of data analyst course/certification, can’t wait for when it goes live, as to follow up in more depth for the concepts presented in this series. Really appreciate the time, dedication and quality of content you produce Alex.
Another excellent portfolio project from Alex! My portfolio is starting to look very good, and I finally have something to upload to job applications that request a portfolio! Thank you, Alex!!
That's great! Glad to hear it's been helpful!
The 'released' column is updated, now it comes with a text format date and the country of release. What I did was to split the column in two : Release date and Country release.
The code I used was this:
df[['released','country_release']] = df['released'].str.split(r' \(', n=1, expand=True)
Then you have to clean the 'country_release' column a little bit with:
df['country_release'] = df['country_release'].str.replace(')', '', regex=False)
And finally give the 'released' column the datetime format with this:
df['released'] = pd.to_datetime(df['released'],format='mixed')
For some reason using format = 'mixed' did the magic trick for me, i tried '%B %d, %Y' but It never worked.
Great community in the comment section. Thanks for this analysis Alex! I couldn't make my career pivot without your help
I made it to the 4th part. Thanks Alex for this tutorial. For those who don't get only the release date, you can use this code: df['format_released'] = pd.to_datetime(df['released'].str.extract(r'(\w+ \d+, \d+)', expand=False), format='%B %d, %Y')
Thank you, Alex! I learned so much.
Anyone's correlation matrix doesn't work? Need to add 'numeric_only = True'. Now the default is false.
correlation_matrix = df.corr(method = 'pearson',numeric_only = True)
Pure gold, man. Saved the day! Thanks!
This needs to be pinned to the top. Thank you it works!
Thank you so much again! Portfolio completed thanks to you! My first résolution of the year is done thanks to you! Happy new year 🎉🎉🎉
This project taught me a number of new things about pandas. Very helpful! Thank you Alex!
Thanks Alex, waiting from so long for this#4 video.Thanks for sharing.
Now I have multiple new projects I can add to my portfolio. Thanks, Alex!
This one was a cool challenge! Required a bit of research on my part to better understand some steps taken (python for data analysis/visualization is not really my area), but well worth it. Great video Alex, thank you!
Sir, I am a civil engineer doing my masters during my thesis I got some work of machine learning and then through your channel once I presented my data in tableau my supervisor gave me extra credit thankyou to you...now I am thinking of switching to this field thankyou for your efforts.
My question is after learning the required skills how can I start applying in the companies? Second please start an interview series where you discuss how and what type of technical questions are asked in the interview.
That’s great! I have a few videos on how to work with recruiters and prep for interviews - I think those would be helpful to you 👍
@@AlexTheAnalyst I am comparatively new to the channel; I will surely go through those videos. Thanks from the whole student community for your great contribution to our guidance and learning. Looking forward to learning more insights from the channel 🙌🏽
Love from India 🇮🇳 to alex the analyst.
Hi Alex, great job you are doing in your channel, thank you very much. I just wanted to say, if it might help anybody who is watching you (because I believe that you already know it after two years), that a correlation measures the reaction of one value against the movement of another value. For this, both values need to be able to move and get bigger or smaller, something that letters (and names, in consequence) cannot do, neither can letters or names disguised as numbers, because those are static too, hence, the "non numeric" correlations showed in the video, are false.
This problem could just be a mess up in any project, but, in a project that is our portfolio, our showroom, it will only show how much of a data analyst we are NOT.
Kindly study a work around to this statistical problem, which is not to change letters by numbers. So many great developers behind Pandas would have implemented it many years ago. 😀
Almost done with this fantastic series. Excited for your upcoming video on data scraping. For future videos in this series, could you possibly do one on APIs (making a project using some public API) and maybe something on big data?
You're a monster, Alex!
Thanks a million (how they say here in Ireland)
God bless, man
I support your channel all the way. keep up the projects
Thank you Alex, this has been a great project! You are a great teacher and this has been very helpful. Looking forward to everything you release in the future!
To everyone getting error for df.corr()
this was my fix:
# since pandas version 2.0.0 now you need to add numeric_only=True param to avoid issue
df.corr(method='pearson', numeric_only=True) #pearson, kendall, spearman
---
correlation_matrix = df.corr(method='pearson', numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.show()
@5:20 You can also just highlight+copy the path in the address bar above instead of taking the extra steps to right click and go into properties to select the path. Much more efficient.
Hey Alex. Kudos for the good work you do! Can't believe all these resources are free!
I had a doubt actually,
Am I the only one, or is it true that the dataset on Kaggle is slightly different than used in the projects? The columns are still the same but values have changed!
I have experienced this too.
I am facing this issue too
Thank you so much, Alex! I've taken a few courses about Python and yours is clear and awesome!
Here at 11:00 when finding missing values write command as ---
pct_missing = np.mean(df[col].isnull())
OR you can write
pct_missing = df.isnull().mean().sort_values(ascending=False)
If there are missing values in Your dataset try to fill it up with 0, here as ----
df = df.fillna(0)
at 16:50 to get the Released year only write command as ----
df['yearCorrect'] = df['released'].astype(str).str.split(',').str[1].str.split('(').str[0]
at 28:20 to get the scatter plot first try to replace the Null with 0 using code ----
df.fillna(0, inplace=True)
Somewhere around 14:05, where Alex was converting the gross and budget columns to int64, his code wouldn't work for me, but after some research I found this to work:
df['gross'] = df['gross'].fillna(0).astype('int64')
df['budget'] = df['budget'].fillna(0).astype('int64')
Thanks a bunch for this!
or you can use df['gross'] = df['gross'].astype('Int64')
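The two approaches behave differently when values are missing. A minimal sketch on made-up numbers: `fillna(0)` followed by NumPy's `int64` turns missing values into zeros, while pandas' nullable `Int64` dtype (capital "I") keeps them as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical gross values with one missing entry.
gross = pd.Series([1000.0, np.nan, 3000.0])

# Option 1: replace missing values with 0, then use a plain NumPy int64.
filled = gross.fillna(0).astype("int64")

# Option 2: the nullable integer dtype keeps the missing value as <NA>.
nullable = gross.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

Which one fits depends on the analysis: zeros will pull down means and show up in scatter plots, whereas `<NA>` values are skipped by most aggregations.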
Can't wait to finally finish everything! Thanks for creating these awesome guides Alex!
Thank you so so much Alex!! You have been a guiding light!!
You are the best plz do more videos about data cleaning with SQL server 😍😍😍
Thanks Alex. You are still the best.
Thanks for watching!
Not apostrophe. It is single quotes. Thank you for the great tutorial!
Came here to say that if you're trying to run the df.corr() and it's trying to run the correlation math on string data columns, simply add in the argument df.corr(numeric_only=True)
Life saver, thank you mate! any idea why this is happening?
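As for why this happens: since pandas 2.0 the default of `numeric_only` in `DataFrame.corr` is `False`, so string columns are no longer dropped silently and the call fails when it reaches them. A minimal sketch with a made-up DataFrame containing one text column:

```python
import pandas as pd

# Hypothetical toy data: one string column, two numeric columns.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "gross": [2.0, 4.0, 6.0, 8.0],
})

# numeric_only=True skips the string column instead of raising an error.
correlation_matrix = df.corr(method="pearson", numeric_only=True)
print(correlation_matrix)
```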
thankkkkkk you
Hey Alex - Thank you for this. Right around 49:25 you talk about the correlation matrix of the df_numerized dataframe that is being shown as a heatmap. I do have a question about that....: when you did .cat.codes in the cells above, how did the category values of the previous objects (company, country, director) represent any value that can be correlated? For instance, using one row as an example, I'm confused how index 6380 at the top of the dataframe has a company categorical value of 1428. Is this by random or did the code construct some sort of logical thinking and gave a numeric value based on other data patterns?? .... Sorry if I am confusing you, it's just when I got to the heatmap part of the df_numerized dataframe I was kind of lost as to how categories can actually represent correlations if the categorical value given to it was completely random. thanks,
i thought the same. There is no correlation between a random number with gross. only possible if by any luck you get a random higher number for your movie that has also a high budget. then=high correlation
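The codes are not random, but they are not meaningful either: for a string column, pandas sorts the distinct values and numbers them in that order, so a code only reflects alphabetical position. A small sketch with hypothetical company names:

```python
import pandas as pd

# Hypothetical company names.
company = pd.Series([
    "Warner Bros.",
    "Columbia Pictures",
    "Universal Pictures",
    "Columbia Pictures",
])

as_category = company.astype("category")

# Categories are the distinct values in sorted order...
print(list(as_category.cat.categories))

# ...and each code is the position of the value in that sorted list.
codes = as_category.cat.codes
print(codes.tolist())
```

Each distinct spelling gets its own code, so two variants of the same studio name would be treated as unrelated categories.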
Your videos have been so helpful. Just dropped in to say thank you !
I was looking forward for last one. You are the best.
If you downloaded the dataset after the video, it seems it might have some missing values that prevent you from converting columns into integers. Use df = df.fillna(0)
should we just ignore the missing entries or did you delete them?
An easier way to calculate the percentage of missing value
df.isnull().sum().sort_values(ascending=False)/len(df)*100
Extract the year from released
df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(float)
df['yearcorrect'] = df['yearcorrect'].fillna(0)
df['yearcorrect'] = df['yearcorrect'].astype(int)
alternative for calculating % of missing values: df.isnull().mean().sort_values(ascending=False)
Good video. I learned how to use correlation matrices, which is new to me. The whole np.mean(df['col'].isnull()) is something I'm still trying to wrap my head around but for now I'll just hit the easy button on it.
Thank you so much sir for taking time out from such a busy schedule and coming up with an initiative of making such videos in order to help many people around the world interested in starting and developing a career in data analytics :-)
At 13:40 If you are facing an error in datatype change, try the following
df['gross'] = pd.to_numeric(df['gross'], errors='coerce', downcast='integer')
df['gross'].isna().sum()
df[df['gross'].isna()]
df['gross'] = df['gross'].fillna(0)
df['budget'] = df['budget'].fillna(0)
df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].round().astype('int64')
Started doing the project and noticed the Kaggle data is slightly different from the one in the video. There were some negative numbers in the gross column. To change them to positive I had to run this code:
# apply conditional function to the column containing negative numbers
df['gross'] = df['gross'].apply(lambda x: abs(x) if x < 0 else x)
Thank you so much for this!
Thank you for this, my scatter plot was not showing properly because of the negative numbers and this fixed it. Again, thank you!
Great, man! Thanks!
Looks like there's been a mix-up with the axis labels on the graph. The 'budget' and 'revenue' labels are swapped. The 'budget' should actually be labeled as 'revenue' (250 million) and vice versa for the 'revenue' (a billion). Thanks!
Thanks Alex, Really good video. I just want to know if you could give some bullet points about this project to add in the resume?
Thank you Alex! These projects really help!
I learnt so much from this. Thanks Alex
Thank You so much Alex for creating such amazing content. You are the best Teacher anyone could ask for.
08/02/2022 - I'm using the dataset as of this date; there have been many changes and unfortunately the pct_missing is not 0.0%.
For me I copied the content of df in a new dataframe that I called Newdf and then deleted the rows:
Newdf = df.dropna(axis=0)
print(Newdf.isnull().sum(), '\n')
Thank you.
Worked!!
good job
I might be missing something, please correct me if I'm wrong as I'm tired as I type:
Hasn't Alex mislabeled the first scatterplot? Isn't budget on x and gross on y? Whereas he has labelled the opposite. This is around the 30:00 mark.
Thank you Alex for making this project free. I am making a career change and am pretty new to this field. I am wondering if this level of project is sufficient for an entry-level position, or does it need to be trickier? I hope that it is enough for us to start applying for jobs. Thanks a ton.
was waiting for this one and looking forward for such projects
thank you so much :D
Much Awaited! Very Exited
Thank you so much for this awesome class, Just subscribed and added to my portfolio. Thanks Heaps.
Hey Alex,
Thank you so much for these videos, they’re incredibly helpful for aspiring data analysts like myself! I have an interview in two weeks for an entry level data analyst position and I’m pretty nervous not having any previous experience as a data analyst… I was wondering if you offered any consulting services ? Thank you again !
I'm so glad to hear that! I do, but I'm quite booked lately so I don't know if I can fit anything in in the next 2 weeks. You can always email me at AlexTheAnalyst95@gmail.com and we can chat about it.
Super mega awesome helpful and useful. Thank you very much for these video series!!!
35:46 if anyone is stuck like I was with the df.corr(method ='pearson') you can try this:
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr(method='pearson')
print(correlation_matrix)
Thanks for the time that you put into all of your content, this video helped a lot. I hope all is well and once again Thank you
Thank you for the effort you are putting in theses videos, it's really helpful.
I actually decided not to do this one..... Yet. I have ZERO experience with Python so I want to get familiar with it first. I will be back for this one though!
At 46:00, if after running the code, 'name' column hasn't numerized, change the datatype of name in the csv to string. For example, if the name is 21 in the csv, change it to '21' with the quotation marks so that the value becomes a string. Do this for all numbers so that they become string. Save. Then re-run the code. Should fix it.
Thank you sir ... Nice explanation ...
For the Pearson correlation there is a couple of assumptions to timeseries under consideration. The timeseries should be normally distributed, linearly dependent and homoscedastic. In the video you've only checked the linearity. What about other assumptions? Thanks.
I don't think that correlation with categorical data will work. Even after being turned into numbers, correlation and regression won't work in this case. The only way to introduce categorical data into correlation or regression is to turn it into multiple dummy variables.
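Dummy variables, as mentioned here, give each category its own 0/1 column instead of a single arbitrary code. A minimal sketch using `pd.get_dummies` on made-up data (the country labels and gross values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "country": ["US", "UK", "US", "FR"],
    "gross": [100, 80, 120, 60],
})

# One 0/1 indicator column per distinct country.
dummies = pd.get_dummies(df["country"], prefix="country", dtype=int)

# Keep the numeric target next to the indicator columns.
encoded = pd.concat([df[["gross"]], dummies], axis=1)
print(encoded)
```

For regression, one indicator column is usually dropped (for example with `drop_first=True`) so the remaining columns are not perfectly collinear.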
Thanks for the awesome video. 4/4 what does this mean? the series is done :(
He doesnt know what he is doing. He is just directing People to wrong lanes.
@@tahsinserkanyaman3459 I don't think its right to say he doesn't know what he's doing. That's a little ridiculous. But yeah I don't think the correlation and linear regression really work well here with the categorical data.
Hi Alex, can you clarify how cat.codes work? I tried researching more about them online but couldn't really wrap my head around it. They all look like random numbers. How can we be confident that our final correlation matrix actually worked the way we wanted to? Also do the cat codes take into account very similar names like the multiple variations of "Walt Disney". Thanks so much!
Looking forward for more 🥰
Thank you for another tutorial, Alex. Really appreciate the effort you put in your explanations. I'm really looking forward to the next video in the series. Watching from halfway across the world in Cape Town. #savingsoulsfromworkplaceembarassement
Thank you so much for this video!! Really appreciated