Hey, Corey, I see many people gave their own answers, which are wonderful. I want to give mine here, which looks more like a beginner's answer:
# create a new column called 'Python_use'
df['Python_use'] = df['LanguageWorkedWith'].str.contains('Python')
# value_counts() the 'Python_use' column, as it is boolean
country_grp = df.groupby(['Country'])
Python = country_grp['Python_use'].value_counts(normalize=True)
Thanks for your great lectures! I watched from part 1. I am your big fan now.
Thanks so much, Corey! It's impossible not to understand what you explain in your videos: fluid, straightforward, crystal clear! And your English: wow... Congratulations!! I wonder if I've learned more Pandas or English! 200% great!!
Thanks a lot for your teaching! Here is my solution from the end of the video:
# group_object['column'] is a Series object, so the input of the function is a Series, and the output of the function is a float
def percent_know_python_each_country(countrySeries):
    num_know_python = countrySeries.str.contains('Python').sum()
    num_all = len(countrySeries)
    percent = round((num_know_python / num_all * 100), 2)
    return percent
country_group['LanguageWorkedWith'].apply(percent_know_python_each_country).sort_values(ascending=False).head(30)
Thanks so much for this series. I started from the first video two weeks ago and am now on the 8th. This series has helped me make a lot of progress. Thanks so much. May God bless you. Love from Sri Lanka...
Hi Corey. I think this might be a relevant, simpler approach for getting the percentage. I used value_counts(normalize=True) instead of sum:
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
This of course returns both the percentage who know Python and who don't. So if we want it for a specific country, for instance Japan, then:
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True)).loc['Japan'][1]
Instead of using value_counts(), use sum() in the second line. Then you'll get the appropriate answer:
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
Perfect content. One flaw is that you never explained what a "lambda" function is and went straight to using it in your previous videos. You did the same thing here. Now I have to study lambda first.
Yeah, I had to take a look into it. For anyone curious: a lambda is just an anonymous function, so lambda and "def ... return" do the same thing, but lambdas are throwaway functions while def assigns a name that can be applied further on in the code. The def syntax is easier to read and cleaner than a lambda once the logic grows.
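To make the lambda/def equivalence concrete, here is a minimal sketch with a made-up Series (the data and the function name are just for illustration):

```python
import pandas as pd

s = pd.Series(['Python;SQL', 'Java', 'Python'])

# A named function: reusable, and easier to read once the logic grows
def count_python(langs):
    return langs.str.contains('Python').sum()

# The same logic as a throwaway lambda, applied inline
count_inline = (lambda langs: langs.str.contains('Python').sum())(s)

print(count_python(s))   # 2
print(count_inline)      # 2
```

Inside .apply() the lambda form saves a few lines, but anything with branching or multiple steps usually reads better as a def.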
Thanks, Corey. This is one of the most comprehensive pandas tutorials on YouTube. Love from India. I also noticed a subtle issue: we are counting the respondents who filled in their Country, not those who filled in LanguageWorkedWith. Someone can fill in Country but not LanguageWorkedWith.
Thank you Corey!! This is a wonderful pandas series!! You make the concepts so easy that even a Python beginner (that is me) without a programming background in college can understand!! I'd like to share my solution as well (since I don't know the concat method, I calculate the answers first, then convert them into a dataframe with a dictionary):
know_py = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
answer = df['Country'].value_counts()
per_cent = know_py / answer
result = pd.DataFrame({'answer': answer, 'Python': know_py, 'percentage': per_cent})
I see you stopped uploading new tutorials a while ago; hope everything goes well with you, and I'm strongly looking forward to hearing from you soon!! Thank you & greetings, 2022 Sep 7th :)
This video is so useful; the way Python simplifies these problems is such a help. It's the best programming language, and the professor in this video is really a genius!!! Thanks.
The 'dropna' parameter of value_counts is another parameter I explored while digging into the difference between my approach and Corey's, on top of the 'na' parameter of contains. BTW, thanks a lot, sir Corey, for all these helpful videos!
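For anyone comparing the two parameters mentioned here, a toy sketch (the three-row Series is made up) of how contains(na=...) and value_counts(dropna=...) interact:

```python
import pandas as pd

langs = pd.Series(['Python;SQL', 'Java', None])

mask = langs.str.contains('Python')            # [True, False, NaN]: NaN propagates
print(mask.value_counts())                     # NaN row dropped: True 1, False 1
print(mask.value_counts(dropna=False))         # NaN kept as its own row
print(langs.str.contains('Python', na=False))  # [True, False, False]: NaN treated as False
```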
I've watched quite a variety of different data analysis tutorials and this one was easily one of the most engaging for me. Having interesting data really helps.
Perfect explanation. Making a convoluted yet important concept crystal clear through step-by-step explaining, while also connecting it to pandas object types. Cheers!
This was a lot to learn in one video. That's why I went back and watched it again this week. At the end, I added something I think would be useful in what I do: a filter to select records where the number of respondents is >= 5.
filt = python_df['NumRespondents'] >= 5
python_df.loc[filt]
My solution to the practice question:
def pctKnowsPython(x):
    try:
        return x.str.contains('Python', na=False).value_counts(normalize=True).loc[True]
    except KeyError:
        return 0
pctKnowsPython = country_grp['LanguageWorkedWith'].apply(pctKnowsPython)
I found it amusing that you explain what a percentage is after going over the apply and lambda methods, but that's exactly the thoroughness that makes your videos so loved :)
We can do this in the lambda function for getting the percentage:
lambda x: x.str.contains('Python').value_counts(normalize=True)
And I would love to thank you for all these videos!
a = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
This is what I did. Is it correct or not? I don't know.
Hey Corey, first of all, thank you very much for all those fantastic videos. I also tried to answer the question of the percentage knowing Python for each country. I came up with the following solution:
PrctKnowPython = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python', na=False).value_counts(normalize=True)).loc[:, True]
One advantage of this approach is that you can just remove na=False to ignore NaN values in your data.
A much simpler way of finding the percentage of people from each country who know Python is by using the mean() function. Something like this:
country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').mean())
P.S. Corey's method produces a much more detailed version of this result.
Thank you for the last example (percent that knows python). It was great to see how the different methods learnt can come together in a practical example; this really helped consolidate the knowledge gained.
Hi. Again, thank you for all the good work you've done. Please keep it up. As for the ratio of Python knowers to respondents by country, I personally build the PctKnowsPython variable first, then rename it:
country_grp = df.groupby(['Country'])
# number of people who know Python by country
numpy_country = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum()).sort_values(ascending=False)
# number of respondents by country
country_resp = df['Country'].value_counts().sort_values(ascending=False)
# my version of PctKnowsPython
pynum_by_con = (numpy_country / country_resp) * 100
# the dataframe with three columns; the last column's name is 0
py_df = pd.concat([country_resp, numpy_country, pynum_by_con], axis='columns')
# show the renamed dataframe, then apply the change
py_df.rename(columns={'Country': 'NumRespondents', 'LanguageWorkedWith': 'NumPythonKnower', 0: 'PctKnowsPython'})
py_df.rename(columns={'Country': 'NumRespondents', 'LanguageWorkedWith': 'NumPythonKnower', 0: 'PctKnowsPython'}, inplace=True)
Hope this may be helpful.
I have been following your pandas series for the last 3 days and may complete it in 1 or 2 days max. I will come back to the series to revise it. A very well-made series; keep up the good work!😀
You could also solve the question in the video by using a pandas pivot table. Here is the code:
table = pd.pivot_table(df, index='Country', values=['Hobbyist', 'LanguageWorkedWith'],
                       aggfunc={'Hobbyist': 'count',
                                'LanguageWorkedWith': lambda x: x.str.contains('Python').sum()})
table['percent'] = (table['LanguageWorkedWith'] / table['Hobbyist']) * 100
table.sort_values(by='percent', ascending=False)
you teach really well, I am learning a lot...Thanks. I have also learned other topics from your videos....I was stuck on opening and reading a csv file in python, till I saw your video and learned it....I am an absolute beginner...😅...Thanks
For the question, I just tested this one and it works; just use value_counts(normalize=True) instead of sum():
country_group['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
I hope everyone had a great week! We've got a long video this week, but we go over a lot of important topics about how to analyze data in Pandas. We will learn how to answer very interesting questions such as "What is the most popular social media site by country?". I put timestamps together for this video so that you all can skip around if you need to go back and watch a specific section. Here are those timestamps:
Aggregate Column - 2:00
Aggregate DataFrame - 3:55
Value Counts - 7:51
Grouping - 12:30
Multiple Aggregates on Group - 26:00
People Who Know Python By Country - 27:20
Practice Question - 34:20
Concat Series - 37:27
Have a great weekend everybody!
Thanks so much bro, much love from Kenya
Hey, is this series gonna continue?
Can you explain all the types of joins
And if possible please explain multi-level Indexing as well
Corey, is it safe to assume that if you're coming from a SQL background, you can effectively use things like pd.concat() to replace the various join workflows in SQL (left, right, inner, etc.), and just use the SQLAlchemy or pyodbc libs to load the data and then do all the calculations in Python that you would normally do in whatever SQL dialect?
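For what it's worth, pd.concat mostly corresponds to UNION ALL (stacking rows), while pd.merge is the closer analog of SQL's joins. A minimal sketch with made-up tables:

```python
import pandas as pd

employees = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [10, 10, 20]})
depts = pd.DataFrame({'dept_id': [10, 30], 'dept_name': ['Data', 'Sales']})

# SQL: SELECT * FROM employees e LEFT JOIN depts d ON e.dept_id = d.dept_id
left = employees.merge(depts, on='dept_id', how='left')
print(left)  # dept 20 has no match, so its dept_name is NaN

# how='inner', how='right', and how='outer' cover the other SQL join types;
# stacking rows instead (UNION ALL) would be pd.concat([employees, employees])
```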
3:10 median function
5:00 describe function
7:20 count()
8:05 value_counts()
12:51 grouping the data
14:39 groupby() function
16:07 get_group(), grabbing a specific group by name
17:30 doing same by using the filters
18:40 using value_counts on filters
20:20 value_counts() for groups
21:49 using loc to find for one country
23:40 percentage by using normalize
25:00 median by country group
26:13 agg function for multiple functions
27:30 using filtering to get python users by country
30:20 error on using same approach for groups
31:40 apply method to run that on group
35:40 finding the percentage of people using python in each country(group)
37:40 using concat for combining series in a dataframe
45:30 adding percentage column
thx, this is very useful. The videos already are very concise and to the point, but if I am just looking for how to do a proper groupby quickly on my own dataset....
thanks a lot.
@@afdqwfqwqwdfqwdawdas sure
FYI, the percentage problem can be solved alternatively as follows: country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum()/x.count())
@@80expertube It throws a RuntimeWarning
I love the fact that there are no ads interrupting in the middle. So thoughtful. ❤️
Yeah, I didn't want them to ruin the flow of the videos. Glad you noticed :)
Corey Schafer OMG! Your reply just made my day!
@@coreyms Thank you so much for doing this. You are such a considerate person with a big heart.
Corey, do you have a Patreon page? Thank you for your exceptional videos; a huge help for me and so many people!
Yes please, do a video on the topic of MULTIPLE INDEXING!!
I would pay to see Corey's tutorial on this. Every time I encounter a multi-index, I'm on stack overflow. It just doesn't seem to stick.
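Until such a video exists, here is a minimal multi-index sketch (the data is made up) showing the three operations that cover most day-to-day needs:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['US', 'US', 'India', 'India'],
    'Hobbyist': ['Yes', 'No', 'Yes', 'Yes'],
    'Age': [25, 40, 22, 30],
})

# Grouping by two keys yields a Series with a 2-level MultiIndex
ages = df.groupby(['Country', 'Hobbyist'])['Age'].median()

print(ages.loc[('US', 'Yes')])  # a tuple selects one exact (Country, Hobbyist) pair: 25.0
print(ages.loc['India'])        # a single label selects everything under one outer level
print(ages.unstack())           # pivot the inner level into columns for easier reading
```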
Let's all admit that this dude is a hard working man and his work is just a wow!
I've been following him for quite some time now and I am always impressed by how thoughtful, tactical and clear his explanation is in every tutorial he makes.
Hats off to you, dude!
numeric_columns = df.select_dtypes(include="number")
medians = numeric_columns.median()
print(medians)
# This is a way of getting the medians of the numeric columns. When I used df.median() directly, it gave me a ValueError: could not convert string to float: 'I am not a student who is learning to code'. Thanks for the great work. I learn more from you than from my professors. Thank you so much for your great efforts!😎
You saved my learning session, thanks!
Thank you bro
Thank you, you save me alot of time,
I found that you can shorten it by writing numeric_columns.median()
Thanks man
I'd like to share my solution to the practice question.
ctr_knows_python = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python', na=False).value_counts(normalize=True))
ctr_knows_python.rename({False:'Don\'t know', True:'I know'}, inplace=True)
ctr_knows_python
Hope this helps too!
Nice!
Hi, can you please explain what "na=False" means here and why do we have to put this in the code? Thanks in advance
@@moushumitamanna not applicable
@@tplano3794 thanks. But why should we put na=false in this code
@@moushumitamanna When a respondent skipped the question, that cell is NaN. By default str.contains returns NaN for those rows, and na=False replaces them with False, so later operations like sum() or value_counts() treat skipped respondents as "doesn't know Python" instead of propagating NaN.
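A toy sketch of what na=False changes in str.contains (the Series here is made up):

```python
import pandas as pd

langs = pd.Series(['Python;JavaScript', None, 'HTML'])

print(langs.str.contains('Python'))            # [True, NaN, False]: NaN propagates
print(langs.str.contains('Python', na=False))  # [True, False, False]: skipped counts as "no"
```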
I'd like to share my solution to the practice question:
country_grp['LanguageWorkedWith'].apply( lambda x: x.str.contains('Python').sum() / len(x) * 100 )
As you can see, it's as simple as adding " / len(x) * 100 " in the lambda function, where len(x) is the total number of users for each country.
¿Cómo se te ocurrió esto? Ahem, I mean: how did you come up with this? Well played.
I guess x.count() in place of len(x) makes more sense, since there can be people who did not answer the language question (I highly doubt it XD)
@@BCS_FahadAhmad I think x.count() will not count NaN so I think len is better if you want to include people that skipped that question. I am assuming that was an option.
I was thinking the same, so I downloaded his repository and tried it, and it works. Came here to comment and saw your comment. So I just wasted 5-10 minutes of my day. Whatever. Hallelujah.
Damn well played
other gurus are just like "here's the code for this, copy it and don't ask why" but you are the only one who shows how things work. Thank you very much
Hey, if your df.median() doesn't work and you're getting a TypeError or ValueError, you can do df.median(numeric_only=True)
Thanks man
Thanks a bunch!
was looking for an advice on this, thanks a lot!
thanks a lot
YESSSS !!! Corey`s video for groupby. I press like before I watch it. Groupby is just a superpower. Thank you for this awesome series Corey. You are the best.
You make all your subscribers happy with just uploading your videos !!! Thanks, Corey
Thank you all for watching!
I don't care if some of your videos are long, in other channels they're just redundant but that's not your case! If you start doing short videos we may be losing all that valuable information that you provide to us. So far, this is the best Python channel I've seen. Greetings from Medellín, Colombia.
This has to be one of the best videos on youtube about Pandas, thank you so much. Greetings from Perú.
I love how you are just using the same data throughout the whole series. Thank you so much, Corey!
I just discovered that your way of teaching is to explain not only how to do it but why that is how to do it. Thumbs up!!
One of the best things that happened to me when I woke up (I am on the opposite side of the world from Corey Schafer) is finding that Corey just uploaded another Pandas tutorial video. Thank you!
For your practise question, use .mean() instead of .sum()
.mean() on a Series of bool will give you the fractions in a quick and easy way. Multiply by 100 for %.
edit: As per discussion below, .mean() ignores NA values whereas Corey's approach treats NA as '0'. An alternative is thus:
mygroups['LanguageWorkedWith'].apply(lambda x:x.str.contains('Python').fillna(0).mean())
Now, the results should be equal to Corey's.
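A quick toy check of why .mean() on a boolean Series gives the fraction, and how fillna(0) lines up with the sum-over-total approach (the four-row Series is made up):

```python
import pandas as pd

answers = pd.Series(['Python;SQL', 'Java', None, 'Python'])
knows_py = answers.str.contains('Python')  # [True, False, NaN, True]

# True counts as 1 and False as 0, so the mean is the fraction of True.
# NaN is skipped: 2 True out of 3 answered -> 0.666...
print(knows_py.mean())

# fillna(0) counts the skipped respondent as "doesn't know": 2 of 4 -> 0.5,
# which matches dividing the sum by the full group size
print(knows_py.fillna(0).mean())
print(knows_py.sum() / len(answers))
```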
It looks like a nice solution but numbers from Corey's video are slightly different than those with .mean() and I do not know why.
For example:
for Afghanistan PctKnowsPython 18.181818, with mean is 20.512821
for Albania PctKnowsPython 26.744186, with mean 27.710843
Does anyone know why?
@@davidsp7949 Yeah, I have the same doubt. I guess it's due to the fact that mean() takes into consideration only the respondents who actually answered the question, while sum() counts all respondents, even the ones with NaN for that question. So Corey's solution is the percentage over all respondents, and mean() is over only the ones who answered.
.mean() neglects those NaN responses, while if you use x.str.contains('Python').sum()/x.size instead, it counts those NaN as "no pythoner", which is what Corey was doing
Well played. For me your solution is quite elegant.
BTW, in case you wanted to treat NA as zeros (to get the same results from the video), you can simply use .mean() with its 'skipna' option disabled. This would make:
mygroups['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').mean(skipna=False))
I just changed sum to value_counts(normalize=True) and it worked
Best video in the series loving them and normally can’t wait for the next.
Hey Corey! Thanks a million for the Pandas Series. As always, very intuitive and easy to follow.
Now that you've taught Matplotlib and Pandas, would love to see a new Numpy series in order to complete the Data Science trinity. Please consider adding a Numpy Series.
For your exercise (What % knows Python) , I divided the sum in the lambda function by x.count() then multiplied by 100 :
country_group['LanguageWorkedWith'].apply(lambda x: 100 * x.str.contains('Python').sum() / x.count())
Thanks for the free content, awesome
Use len(x) instead of x.count(), because the count() function excludes respondents who didn't answer the question.
The level of my programming in Python has been substantially improved since I have started watching your great videos. Many thanks, Corey. Would you please prepare some videos regarding the networkx module as well?
Mr. Schafer, I am so happy I found your teaching. I have been on a journey to become a data analyst, and after completing the Google Analytics Course , I realized that I needed to learn much more. I am currently finishing a Python Course through Coursera offered by IBM.
Not every professional, no matter how good they are, have the natural ability to teach. Your method and technique are so amazing and helped me to overcome some of the confusions I had with coding in Python. I learned so much from just this video alone.
I will definitely visit the site you referenced, and look forward to learning more from your videos.
Thank you so much!
American?
Working on a project evaluating an employee survey, and this is just what the doctor ordered. Thanks! One of the best channels on YouTube for data analysis, hands down.
Corey again! A fantastic tutor. I press the like button before I watch.
After this lecture I started loving the complex coding of pandas and matplotlib. You really are an amazing teacher.
I signed up for brilliant org just for Corey Schafer. Thanks for sponsoring him.
Hey, Corey. I'm a network engineer and I'm learning pandas to be able to do some packet analysis and your videos really help me a bunch!
This is my solution for the coding problem from the video. Hope it helps!
know_python = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
total_respondents = country_grp['LanguageWorkedWith'].apply(len)
know_python / total_respondents
I am a GIS student and I want to thank you, because I'm doing my last assignment for university using Geopandas, matplotlib, pandas, cartopy and so forth, and you helped so much with your videos. I have built a nice map and produced different tables with my data. Thank God you are out there sharing your knowledge for free.
Thank you so much for this video. I learnt way more from this than the many hours I spent sitting in class listening to a teacher who just wanted to end the lesson early or have long lunch breaks. This is really precious. And thanks for the reassurance that if I find this difficult, there's nothing wrong with me LOL.
Hey Corey! For the question you gave: The percentage of people by country who use python. There is an efficient solution too (Without creating a separate dataframe).
country_grp["LanguageWorkedWith"].apply(lambda x:(x.str.contains("Python").sum()/len(x))*100)
Actually, what I am doing here is: in the lambda function, at the return part, I divide the number of Python users by the length of the given Series and then multiply it by 100. This gives the percent of Python users in each country. This approach might be a bit more code-efficient but can be a bit confusing for some.
len(x) might not give u total respondent since it is series on LanguageWorkedWith and people might have skipped.
@@maheshmmmec Correct
You can use len(x) in the lambda, it will include the na in the series.
You cannot use count(x) because it skips na. But you can use value_counts(x).
If you run the following code, it will verify that Prasanna's solution is correct:
len_total = country_grp["LanguageWorkedWith"].apply(lambda x: len(x)).sort_values(ascending=False)
us_no_answer = country_grp["LanguageWorkedWith"].apply(lambda x: x.isna().sum())
us_answer = country_grp["LanguageWorkedWith"].apply(lambda x: x.notna().sum())
df_counts = df['Country'].value_counts()
df_counts = pd.concat([df_counts,len_total,us_answer,us_no_answer],axis='columns')
df_counts.columns=['df_value_counts','lambda_len','user_respond','user_did_not_respond']
df_counts
produces:
df_value_counts lambda_len user_respond user_did_not_respond
United States 20949 20949 20769 180
India 9061 9061 8844 217
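The len() / count() / value_counts() distinction this thread hinges on can be checked on a tiny made-up series (a minimal sketch, not the actual survey data):

```python
import pandas as pd
import numpy as np

# One skipped answer, mimicking a respondent who left LanguageWorkedWith blank
s = pd.Series(['Python;SQL', 'Java', np.nan])

print(len(s))      # 3: len() counts every row, NaN included
print(s.count())   # 2: count() skips NaN
print(s.value_counts().sum())              # 2: value_counts() drops NaN by default
print(s.value_counts(dropna=False).sum())  # 3: dropna=False keeps the NaN row
```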
Just want to share here my solution for the practice question (but with the survey of 2022):
---
country_group['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
---
And also give thanks to your wonderful videos, Corey!
It's been 3 years and they're still among one of the best tutorials.
I thought the same thing
Perfect timing. Thank you for all the effort put into these videos.
Thanks Corey for sharing these videos. Pretty clear explanations. You are a great teacher.
I'm browsing through some of the videos to brush up on Python, and this is the first Python video that didn't get me bored. Concise and brilliant. Love your videos! Keep up the good work :)
Best pandas playlist on YouTube. A one-stop solution for Python learners
Best videos on pandas on YouTube, by Corey sir. Loving them, and I always wait for the next videos. Lots of love for you from India.
Every time I see a new post, I click the 'like' button before watching :D
I've been doing data analysis using Python & Pandas for a few years now. Still, I'm picking up new things from your videos. Very clearly explained! Thank you!
Thanks!
Thank you!
Hey Corey, I see many people gave their own answers, which are wonderful. So I want to give mine here, which looks more like a beginner's answer:
# create a new column called 'Python_use'
df['Python_use'] = df['LanguageWorkedWith'].str.contains('Python')
# value_counts() the 'Python_use' column, since it is a boolean type:
country_grp = df.groupby(['Country'])
python_pct = country_grp['Python_use'].value_counts(normalize=True)
Thanks for your great lectures! I watched from part 1. I am a big fan of yours now.
The groupby explanation comparing it to a filter is really good.
Thanks so much Corey! It's clearly impossible not to understand what you explain in all your videos: it's fluid, straightforward, crystal clear! And moreover, your English: whaoooo... Congratulations!! I wonder if I've learned more pandas or English!!
200% great !!
Every second of your Python videos is really worth it!!!
LOL @ "If I did this correctly, and it's definitely possible I made a mistake." Happy I found these videos, thanks.
You're a genius
Love From India
You mentioned India in your tutorial
Love this gesture 🇺🇸🇮🇳
Thanks a lot for your teaching! Here is my solution at the end of the video:
# country_grp['column'] is a Series object, so the input of the function is a Series, and the output of the function is a float
def percent_know_python_each_country(countrySeries):
    num_know_python = countrySeries.str.contains('Python').sum()
    num_all = len(countrySeries)
    percent = round((num_know_python / num_all * 100), 2)
    return percent
country_group['LanguageWorkedWith'].apply(percent_know_python_each_country).sort_values(ascending=False).head(30)
Thanks so much for this series. I started from the first video two weeks ago and am now on the 8th. This series has helped me make a lot of progress so far. Thanks so much, may God bless you. Love from Sri Lanka...
Man, you are one of the best teachers who ever taught me anything. Much love and support ♥️
NAMASTE!! Corey Schafer.. Love From India
Hi Corey. I think this might be a relevant simpler approach for getting percentage. I used value_counts(normalize=True) instead of sum.
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
This of course returns both the percentage who know Python and who don't. So if we want it for a specific country, for instance Japan, then:
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True)).loc['Japan'][1]
I have a similar approach too; I am happy to see that I am not the only one who always looks for the easiest ways =)))
Instead of using value_counts(), use sum() in the second line. Then you'll get the appropriate answer.
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
Great approach. Thanks for sharing
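To see concretely why the replies suggest sum() over value_counts(normalize=True), here is a minimal sketch on an invented two-country frame (the names and data are hypothetical, not from the survey):

```python
import pandas as pd

# Hypothetical mini-survey: country 'A' has 2 respondents, 'B' has 3
df = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'B'],
    'LanguageWorkedWith': ['Python', 'Java', 'Python', 'Python', 'C'],
})
grp = df.groupby('Country')['LanguageWorkedWith']

# normalize=True keeps a True AND a False row per country (MultiIndex result)
both = grp.apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
# ('A', True) -> 0.5, ('A', False) -> 0.5, ('B', True) -> 0.667, ('B', False) -> 0.333

# sum()/len() collapses it to one number per country
frac = grp.apply(lambda x: x.str.contains('Python').sum() / len(x))
# 'A' -> 0.5, 'B' -> 0.667
```

The normalize version is handy when you also want the share who do not know Python; the single-number version is easier to sort and concatenate.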
Here is my approach:
import numpy as np

filt = df['LanguageWorkedWith'].str.contains('Python', na=False)
python_count = df.loc[filt]['Country'].value_counts()
python_count.rename('p_c', inplace=True)
python_count
--
total_count = df['Country'].value_counts()
total_count.rename('t_c', inplace=True)
total_count
--
result_horizontal = pd.concat([total_count, python_count], axis=1)
result_horizontal.replace({'p_c': np.nan}, 0, inplace=True)  # countries with no Python users get 0 instead of NaN
result_horizontal['perc'] = (result_horizontal['p_c'] / result_horizontal['t_c']) * 100
result_horizontal
Perfect content. One flaw is that you never explained what a "lambda" function is and went straight to using it in your previous videos. You did the same thing here. Now I have to study lambda first.
Yeah, I had to take a look into it. For anyone curious, a lambda and a "def ... return" function do the same thing, but lambdas are throwaway anonymous functions, while "def" binds a name that can be applied further in the code. The "def" syntax is easier to read and cleaner than lambda.
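To make the reply above concrete, a minimal side-by-side (pct_python is a made-up name, not from the video):

```python
import pandas as pd

# A named 'def' function and a lambda build the same callable;
# the lambda is just anonymous and limited to a single expression.
def pct_python(x):
    return x.str.contains('Python').sum() / len(x)

pct_python_lambda = lambda x: x.str.contains('Python').sum() / len(x)

s = pd.Series(['Python', 'Java', 'Python;SQL'])
print(pct_python(s))         # 0.666...
print(pct_python_lambda(s))  # identical result
```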
Thanks, Corey. This is one of the most comprehensive pandas tutorials on YouTube. Love from India.
I also noticed a subtle issue: we are counting the respondents who filled in Country, not those who filled in LanguageWorkedWith. Someone can fill in Country but not LanguageWorkedWith.
Really appreciate the addition of practice problems. It helps me to grasp the material
47 minutes of a pure pandas tutorial from a god in python, man you're a hero🔥🔥
Thank you Corey!! This is a wonderful pandas series!! You make the concepts so easy that even a Python beginner (that is me) without a programming background in college can understand!!
I'd like to share my solution as well (since I don't know the concat method, I calculate the answers first and then convert them into a DataFrame via a dictionary):
know_py = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
answer = df['Country'].value_counts()
per_cent = know_py/ answer
result = pd.DataFrame({'answer': answer, 'Python': know_py, 'percentage': per_cent})
I noticed you stopped uploading new tutorials a long time ago; I hope everything is going well with you, and I'm strongly looking forward to hearing from you soooooon!!
Thank you & greetings, 2022 Sep 7th :)
This video is so useful; the way Python simplifies these problems is such a help. It is the best programming language, and the professor in this video is really a genius!!! Thanks.
The reassurance at the end was so appreciated as a beginner. Thank you for your help !!!
Thanks for doing this video in a detailed way, like you always do. Just under an hour is a good length for a video like this. Thanks!
Best tutorials ever. I have read a lot of books, but your technique is global and works best... keep up the good work @Corey Schafer
I just went through Part 8 a second time. Thanks a bunch!! I learned a lot.
The 'dropna' parameter of value_counts is another parameter I explored while digging into the difference between my approach and Corey's, on top of the 'na' parameter of contains.
Btw, thanks a lot, sir Corey, for all these helpful videos!
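The two parameters mentioned above can be seen side by side on a tiny made-up series:

```python
import pandas as pd
import numpy as np

s = pd.Series(['Python', np.nan, 'Java'])  # one skipped answer

# contains(): NaN stays NaN in the result unless na=False maps it to "no match"
print(list(s.str.contains('Python', na=False)))  # [True, False, False]

# value_counts(): NaN rows are dropped by default; dropna=False keeps them
print(s.str.contains('Python').value_counts().sum())              # 2
print(s.str.contains('Python').value_counts(dropna=False).sum())  # 3
```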
Nice video :) I liked the Q&A approach at the end, where you left it open.
I've watched quite a variety of different data analysis tutorials and this one was easily one of the most engaging for me. Having interesting data really helps.
Perfect explanation. You make a convoluted yet so important concept crystal clear through step-by-step explaining, while also giving connections to pandas object types. Cheers!
Thank you, I am very lucky that I found your tutorials.
One of the best and most easily understandable vid on Pandas. Thank you for creating this !!
We Love You Corey Schafer!!!!!!!!!! Lots of Love From India❤
Thanks Corey.
I have waited for this video all week.
Great explanation
This was a lot to learn in one video. That's why I went back and watched it again this week. At the end, I added something I think would be useful in what I do. I added a filter to select records where the number of respondents is >= 5.
filt = python_df['NumRespondents'] >= 5
python_df.loc[filt]
My solution to the practice question:
def pctKnowsPython(x):
    try:
        return x.str.contains('Python', na=False).value_counts(normalize=True).loc[True]
    except KeyError:
        return 0  # no respondent in this country mentioned Python

pct_knows_python = country_grp['LanguageWorkedWith'].apply(pctKnowsPython)
Heyyy Coreyyy, I got tears in my eyes after watching the way you taught... you made my day❣️✨ Love you so much Corey
Always looking forward to your videos, Corey.
I'm just diving into pandas and would like to say: a GREAT THANK YOU for such a nice and detailed explanation.
great job!
Helps me to get into my PhD. Thanks a lot for uploading this!
I found it amusing that you explained what a percentage is after going over the apply and lambda methods, but that's exactly the thoroughness that makes your videos so loved :)
Very interesting tutorial. Thumbs up! Particularly at 33:30 !!
You are just wonderful; it is so much fun watching your tutorials. I directly find the answers I need.
God bless you brother.
We can do this in the lambda function to get the percentage:
lambda x: x.str.contains('Python').value_counts(normalize=True)
And I would love to thank you for all these videos!
"a = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))"
this what i did is it correct or not i dont know??
Thanks Corey! This is pure gold!
Hey Corey,
First of all, thank you very much for all those fantastic videos. I have also tried to answer the question of the percentage knowing Python for each country. I came up with the following solution:
PrctKnowPython = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python', na=False).value_counts(normalize=True)).loc[:,True]
One advantage of this approach is that you can just remove na=False to ignore NaN values in your data.
Corey, thank you very much for your free videos!
Corey, the content of your videos is amazing. This tutorial in particular is sensational.
A much simpler way of finding the percentage of people from each country who know python is by using the mean() function
Something like this :
country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').mean())
P.S.: Corey's method produces a more detail-oriented version of this result
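To see why mean() works here: on a boolean series it is exactly sum()/len(), i.e. the fraction of True values (a minimal sketch on made-up data):

```python
import pandas as pd

s = pd.Series(['Python', 'Java', 'Python;Go', 'C'])
mask = s.str.contains('Python')  # [True, False, True, False]

print(mask.sum() / len(mask))  # 0.5
print(mask.mean())             # 0.5, the same fraction in one call
```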
Thank you for the last example (percent that knows python). It was great to see how the different methods learnt can come together in a practical example; this really helped consolidate the knowledge gained.
Really enjoying this series, thank u Corey!
Thanks a lot Corey! I got to learn complex syntax in simple ways. You are an amazing teacher.
Thank you for this. I have a clearer understanding of pandas than before. Wish you the very best.
Hi. Again, thank you for all the good work you've done. Please keep it up. As for making the ratio of Python-knowers to respondents by country, I personally make the variable first, PctKnowsPython, and then rename it.
country_grp = df.groupby(['Country'])
# number of people who know Python, by country
numpy_country = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum()).sort_values(ascending=False)
# number of respondents by country
country_resp = df['Country'].value_counts().sort_values(ascending=False)
# my version of PctKnowsPython
pynum_by_con = (numpy_country / country_resp) * 100
# DataFrame with three columns; the last column's name is 0
py_df = pd.concat([country_resp, numpy_country, pynum_by_con], axis='columns')
# show the renamed DataFrame
py_df.rename(columns={'Country': 'NumRespondents', 'LanguageWorkedWith': 'NumPythonKnower', 0: 'PctKnowsPython'})
# apply the change
py_df.rename(columns={'Country': 'NumRespondents', 'LanguageWorkedWith': 'NumPythonKnower', 0: 'PctKnowsPython'}, inplace=True)
Hope this may be helpful.
I've been following your pandas series for the last 3 days and should complete it in 1 or 2 days max. I will come back to the series to revise it. A very well-made series, keep up the good work!😀
You could also solve the question in the video by using a pandas pivot table. here is the code:
table = pd.pivot_table(df, index='Country', values=['Hobbyist', 'LanguageWorkedWith'],
aggfunc={'Hobbyist': 'count',
'LanguageWorkedWith': lambda x: x.str.contains('Python').sum()})
table['percent'] = (table['LanguageWorkedWith'] / table['Hobbyist']) * 100
table.sort_values(by='percent', ascending=False)
Nice! I’ll be covering pivot tables in a future video
Corey Schafer deserves a YouTube Teacher award
You're an amazing teacher, man. Thank you for this free content.
I really like the way you speak. Your language is clearly understandable and you have a great accent. :)
I am a huge fan of your videos. Thanks a lot for making this wonderful tutorial😇
You teach really well, I am learning a lot... thanks. I have also learned other topics from your videos... I was stuck on opening and reading a CSV file in Python until I saw your video and learned it... I am an absolute beginner 😅... Thanks
Corey, I learn a lot from your videos. But this one has been especially helpful. Thanks.
At 35:40 since you want to get the percentage of boolean values that are 1 you can pretty much just replace .sum() with .mean(). It's sum/length.
For the question, I just tested this one and it works; just use value_counts(normalize=True) instead of sum():
country_group['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
Yes Corey, having a future video on multi-index will be very helpful!