Q&A Series: Video #11 _10(TC04:36) drinks.mean() =========== Kevin says, "Why did it give us the mean of each pandas SERIES?" He goes on to say, "The default behavior of the mean is axis=0." Cf. (compare) drinks.mean() drinks.mean(axis=0) =================================== Both produce the same result. Think why?
In sklearn.preprocessing, I used normalize(), which has an 'axis=' parameter. I experimented with both axis=0 and axis=1, and the model scores are quite different. axis=1 is the default. In practical work, should I try both axes and keep whichever gives the best model score?
The axis parameter defines the direction along which the normalization should take place. Depending on your reason for normalization, there is a correct axis to use. So, I would not recommend trying both and picking which one works better. Rather, I would recommend figuring out whether you are trying to normalize samples or features, and then using the axis which does that. Hope that helps!
Sir... Love your series.... 👌 Hats off to you. I just have a question: .drop(axis=1) removes a column, whereas .sum(axis=1) gives the sum of a row. I'm a bit confused. Can you help??
I face a problem when I try to create conditional loops(if,elif,else statements) while using pandas. The error says "Truth value of a series is ambiguous". Any ideas how to fix it?
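For anyone hitting the same "Truth value of a Series is ambiguous" error: it usually appears when a whole Series is placed in an if statement, because the comparison produces one boolean per row rather than a single True/False. A minimal sketch of two common workarounds, assuming a made-up Series named beer (the name and values are hypothetical):

```python
import numpy as np
import pandas as pd

beer = pd.Series([50, 150, 250])  # made-up values

# `if beer > 100:` would raise ValueError, since `beer > 100` holds three booleans.

# Workaround 1: build the if/elif/else result element-wise with np.where
level = np.where(beer > 200, "high", np.where(beer > 100, "medium", "low"))
print(level)

# Workaround 2: reduce the boolean Series to one value before using `if`
any_high = (beer > 200).any()
if any_high:
    print("at least one value is above 200")
```

Which workaround fits depends on whether you want one answer per row (np.where) or one answer for the whole Series (.any() or .all()).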
Your tutorials are great! I am just learning Jupyter/pandas and looking forward to learning more. I have a quick question on how to drop multiple rows? For example, I have a csv file in which every other row is blank, this reads as NaN, and just clutters the DataFrame, I have not been successful in removing/deleting them.
OK, I am confused print(drinks.mean(axis=1, numeric_only = True).head()) print(drinks.drop(["country","continent"], axis=1).mean()) In the first line I have axis=1 and it gives the mean of each row In the second line I have axis=1 and it gives the mean of each column I am assuming pandas axis comes from 2D Numpy array, and axis 0=row or x value and axis 1=column or y value My confused guess is somehow in the first command we are going down the column and taking the mean of every row, so the one refers to columns we are somehow iterating through While in the second, we removed two columns that had strings, and then took the mean of each columns, which makes more sense to me. Does someone have a better way to explain this?
Q&A Series: Video #11 _6(TC02:10) drinks.drop('continent', axis=1).head() -------------------------------------------------------------- Kevin says: "I did not use the inplace parameter, so it did NOT actually REMOVE the COLUMN."
I'm not quite sure... you can definitely have a Series that contains Python objects like lists or dictionaries, however. And you can use multi-level indexing.
Hi Kevin! Thank you for the class, it was great! Interesting... You dropped row 2 with drinks.drop(2, axis=0). However, when you take the mean by row, you can still see the mean of this row. I thought that once you drop a row you would not be able to return any values for it. Apparently I was wrong, and the drop does not affect other functions like ".mean()".
Kevin, I realized that you did not make an assignment. Something like: "drinks = drinks.drop(2, axis=0)" instead of the operation "drinks.drop(2, axis=0)". Thanks again!!
That's correct, I didn't overwrite the original drinks object, or perform the operation "inplace". Thus, the DataFrame didn't change. Glad you like the video!
The df.mean() you illustrated was a great example of applying a single function to multiple columns at once. This is quite handy in many operations. On the other hand, is there an efficient way to apply multiple functions to multiple columns? Say, I wanted to do mean of beer_servings, sum of wine_servings, sd of spirit_servings and median of total alcohol?
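One approach that should cover this is DataFrame.agg with a dictionary that maps each column to the function you want. A minimal sketch, assuming a small stand-in for the drinks data (the values are invented; only the column names follow the dataset):

```python
import pandas as pd

# Made-up values; only the column names mirror the drinks dataset
drinks = pd.DataFrame({
    "beer_servings": [100, 200, 300],
    "wine_servings": [10, 20, 60],
    "spirit_servings": [10, 20, 30],
    "total_litres_of_pure_alcohol": [1.0, 5.0, 9.0],
})

# Map each column to the aggregation wanted for it
summary = drinks.agg({
    "beer_servings": "mean",
    "wine_servings": "sum",
    "spirit_servings": "std",  # sample standard deviation (ddof=1)
    "total_litres_of_pure_alcohol": "median",
})
print(summary)
```

The result is a Series indexed by column name, with one aggregated value per column.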
Can there be more than 2 axis (0 and 1) in a Pandas dataframe? I guess more fundamentally can there be more than 2 dimensional datasets and can axis be used to point to it? Maybe I'm talking about a hierarchical dataframe.
Hi Kevin, thanks for the video. As others have noted: The axis coordinates for drop() and mean() appear to have opposite behaviors. For drop('name', axis=1) the method scans in row-wise direction until it finds a column named 'name' and then drops all of values in a column-wise direction. For mean(axis =1) the method scans in a column-wise direction and then calculates the mean of all the values in a row-wise direction. Why would anyone think it would be a good idea to write methods that have opposite behaviors for referencing and operating? I'm sure there's a reason, but I can't see it.
I don't believe they have opposite behaviors, but I certainly understand that viewpoint. Here is how I think about it: 0 is the row axis, and 1 is the column axis. When you drop with axis=1, that means drop a column. When you take the mean with axis=1, that means the operation should "move across" the column axis, which produces row means. In other words, I think of a mean as an operation that "scans", and I think of a drop as an operation that "selects". However, if you think of both operations as scanning, then I agree that the behaviors would seem to be opposite. Hope that helps!
Hey Kevin, thanks for the video. I have a question, when applying a function to a df, pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html, it says: "axis : {0 or ‘index’, 1 or ‘columns’}, default 0 0 or ‘index’: apply function to each column 1 or ‘columns’: apply function to each row", How to understand this? When axis = 0, shouldn't the function be applied to each row instead of each column?
I know it's confusing! Basically, when you use apply with axis=0, you are saying you want to apply a function along axis 0, which means on each column. I talk about it more in this video: ruclips.net/video/P_q0tkYqvSk/видео.html Hope that helps!
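A small sketch of what "apply along axis 0" means in practice, using a made-up two-column DataFrame (the names df, a, and b are just for illustration). With axis=0 the function receives each column as a Series; with axis=1 it receives each row:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Hypothetical helper: the spread between the largest and smallest value
def value_range(series):
    return series.max() - series.min()

# axis=0: the function travels down the rows, so it is called once per column
per_column = df.apply(value_range, axis=0)

# axis=1: the function travels across the columns, so it is called once per row
per_row = df.apply(value_range, axis=1)

print(per_column)
print(per_row)
```

per_column has one entry per column label (a, b), and per_row has one entry per row label (0, 1, 2).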
When I try to import the CSV with drinks=pd.read_csv('bit.ly/drinksbycountry'), I am unable to load it into the DataFrame, and it throws an error, shown below, in the last line of a very long message. Any help, please, so I can use the DataFrame to follow along with the video?
0 is always for rows and 1 is always for columns. If you are confused, you can use the alternative like he suggested...use 'index' for rows and 'columns' for columns...
You need to include the argument numeric_only=True, for example: drinks.mean(numeric_only=True). This is a new requirement in pandas for cases in which you want to calculate the mean of numeric rows or columns and the DataFrame contains non-numeric data. Hope that helps!
I can't think of an efficient way to do this, without knowing more specifics about the DataFrames. Do they have the same number of rows? Columns? Same index? Same column names? And so on. The solution would depend on those factors.
The DataFrames have the same columns, but the rows might differ. I used the code below, which gives me the differences, but it takes a lot of time. If you can suggest a more efficient way than this, it would really help me a lot.
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
# We want to be able to easily tell which rows have changes
def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"
# Read in both excel files
df1 = pd.read_csv(r'FL_insurance_first.csv')#,index_col='policyID', parse_dates=True)
df2 = pd.read_csv(r'FL_insurance_second.csv')
# Make sure we order by account number so the comparisons work
df1.sort(columns="policyID")
df1=df1.reindex()
df2.sort(columns="policyID")
df2=df2.reindex()
# Create a panel of the two dataframes
diff_panel = pd.Panel(dict(df1=df1,df2=df2))
# Apply the diff function
diff_output = diff_panel.apply(report_diff, axis=0)
diff_output.to_csv(r'my-diff-1.csv',index=False)
How about just: (df1 != df2).sum() That will tell you how many differences exist in each column, as long as df1 and df2 have the same row and column labels. If that works, you should be able to use filtering to reveal the rows that are different (I'd have to experiment to figure out the exact code). Hope that helps!
How did you import a CSV file using a bit.ly link like that? Doesn't it need to be a CSV file on the local computer? I tried opening the link in the browser and it took me to your GitHub file, but the link changed to a longer one than what you wrote. How did you do that using bit.ly? It's really awesome :) Can anyone teach me?
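Here is a rough sketch of that comparison idea, with two tiny made-up DataFrames standing in for the insurance files (the premium column and its values are hypothetical). Note that == marks the cells that match, so counting differences uses != instead. It assumes both frames share the same index and columns:

```python
import pandas as pd

# Made-up data; both frames have identical row and column labels
df1 = pd.DataFrame({"policyID": [1, 2, 3], "premium": [100, 200, 300]})
df2 = pd.DataFrame({"policyID": [1, 2, 3], "premium": [100, 250, 300]})

# True wherever a cell differs between the two frames
diff_mask = df1 != df2

# Number of differing cells in each column
diff_counts = diff_mask.sum()

# Rows of df2 that have at least one changed cell
changed_rows = df2[diff_mask.any(axis=1)]

print(diff_counts)
print(changed_rows)
```

One caveat: missing values never compare equal to each other, so a NaN in the same cell of both frames would be counted as a difference here.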
@@dataschool i think i understand.. so basically bit.ly is just a shorter version of the link for the file that is hosted in the github right? Thank you for the reply :) i really appreciate it
Is the mean() method not working for you? You need to include the argument numeric_only=True, for example: drinks.mean(numeric_only=True).
This is a new requirement in pandas for cases in which you want to calculate the mean of numeric rows or columns and the DataFrame contains non-numeric data. Hope that helps!
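A minimal sketch of that fix, using a three-row stand-in for the drinks data so it runs without downloading anything (the values are made up; only the column names follow the dataset):

```python
import pandas as pd

# Made-up stand-in for the drinks dataset
drinks = pd.DataFrame({
    "country": ["A", "B", "C"],
    "beer_servings": [100, 200, 300],
    "wine_servings": [10, 20, 60],
    "continent": ["Europe", "Asia", "Africa"],
})

# Skip the string columns and average only the numeric ones
col_means = drinks.mean(numeric_only=True)
print(col_means)
```

The string columns (country and continent) are left out of the result, and the DataFrame itself is not modified.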
Thank you for your update...... Your explanation is truly awesome.......
Your explanations are so accurate and so eloquent! I feel I can't thank you enough man! Thank you very much, I appreciate your efforts so much and wish all the best for the future!
Wow, thank you so much for your kind comment! You are very welcome!
Your explanations are wonderful. You speak slowly and in a concise manner, that makes it easy to follow and understand. Thank you!
Also, your bonus tips at the end of videos are always so useful!
When people like what they do it shows. Thanks for your explanations, their clarity and the quality of your enunciation!
Thanks so much! 🙏
If mean is not working for you:
We first have to drop 'country' and 'continent' columns, these columns contain strings so we can't do mean with them.
drinks = drinks.drop(['continent','country'],axis = 1)
Thanks
Alternatively, you can include the argument numeric_only=True, for example: drinks.mean(numeric_only=True). That way, you can still perform the mean operation without dropping data that you might want to keep. Hope that helps!
Fantastic explanations Kevin! Really enjoy learning from you
Thanks!
It's 2022 and this video is 6 years old, yet it is still more accurate than others. Please upload more content. Thank you 😊
Thank you so much!
Dude I love you, I was confused and a little frustrated with this topic and you explained it perfectly
That's excellent to hear!
I'm taking an online course but I'm actually learning all the stuff from these Data School videos cause they are better.
Thank you!
Python Newb here (about one month), so please take it easy....
In the first example, we wanted to get rid of 'continent', which is a column, so we had axis=1.
but when we started getting into the mean, and we wanted to get the mean of a column, it became axis=0...
Obviously, I am missing something here, but it looks to me like it flipped... What am I missing.. I am having some difficulty with axis.
That's a great question! When you are removing something, you are specifying the axis from which you want to remove something. To remove a column, the axis from which you want to remove something is axis 1.
When you are aggregating something, you are specifying the axis along which you want the aggregation to occur. Thus if you want to aggregate all rows with the mean function, the axis along which you are aggregating is axis 0. The result is that you have a mean of each column, but the key point is that you aggregated all rows, and the row axis is 0.
Does that help?
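To make the two meanings concrete, here is a small sketch on made-up data (values are invented). drop uses axis to say where the label lives, while mean uses axis to say which direction gets collapsed:

```python
import pandas as pd

drinks = pd.DataFrame({
    "country": ["A", "B", "C"],
    "beer_servings": [100, 200, 300],
    "wine_servings": [10, 20, 60],
})

# Removing: 'country' is a label on axis 1 (the column axis)
numeric = drinks.drop("country", axis=1)

# Aggregating along axis 0 collapses the rows, leaving one mean per column
col_means = numeric.mean(axis=0)

# Aggregating along axis 1 collapses the columns, leaving one mean per row
row_means = numeric.mean(axis=1)

print(col_means)
print(row_means)
```

col_means has two entries (one per remaining column) and row_means has three (one per row).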
@@dataschool absolutely it helped..
Thanks
The way you explain things is amazing. Immediately subscribed to your channel after watching this video. Thanks for your kind support of data enthusiasts.
Awesome! Thanks for subscribing and for your kind words :)
The hand movement trick to remember about axis is really helpful. Thanks . Excellent video !!
Great to hear! It's hard for me to know if those visual tricks are useful to people, so I'm glad to hear that it works for you!
Excellent explanation in a concise way...very helpful.
Thank you!
you are an artist,I appreciate your efforts so much
Thank you!
The reason I find "index" and "column" confusing is, in a dataframe, the index of the training example is actually a (m,1) vector, where m is the total number of training examples.
On the other hand, features is a (n,1) vector where n is the total number of features in the given dataset.
So, basically we are calling the features as "index" here, which is confusing.
I will stick to 0 and 1. Thank you for the amazing explanation. :)
That was a great explanation of the axis parameter, which I was eagerly looking for, since the documentation does not explain it clearly. Thanks for your very informative videos, which support beginner-level people.
You're welcome!
Can't get better than this. You are a great teacher. One question: if the data set has a lot of NaN values in it, scattered at random, so that some of the values in a Series are NaN for some index labels and some are filled, how can we get a DataFrame without NaN and then save it as a new sheet?
I don't completely understand your question, I'm sorry!
Best classes about Pandas. Thank you!
Thank you!
axis = 0 could be seen as 'show the results for this axis', in the case 'x'. And for the other axis is the same idea. Amazing videos.
Thanks for sharing! Glad you like the videos :)
Thank you so much for your explanations. It's so easy to understand and very helpful to me!
You're very welcome!
Very useful video. Nicely explained axis=0, axis=1 and mean. If you could also explain inplace=True, I'd be very grateful.
Thanks for sharing.
Great explanation.
It's not the case anymore as pandas dropped the Panel but I feel like adding another dimension, say using numpy, clarifies it even further.
So when having 3 dimensions axis=0 refers to the list matrixes, axis=1 to the rows of the matrixes and axis=2 to the columns of the matrixes.
Then when using the axis for a reduction like 'sum' or 'mean' you will basically have a result of the remaining 2 dimensions, except the one you specified in the axis parameter.
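A quick NumPy sketch of that 3-dimensional picture, using an arbitrary array of shape (2, 3, 4), which can be read as 2 matrices with 3 rows and 4 columns each. Reducing along an axis removes that axis from the shape:

```python
import numpy as np

# 2 matrices, each with 3 rows and 4 columns
arr = np.arange(24).reshape(2, 3, 4)

sum_over_matrices = arr.sum(axis=0)  # axis 0 disappears
sum_over_rows = arr.sum(axis=1)      # axis 1 disappears
sum_over_columns = arr.sum(axis=2)   # axis 2 disappears

print(sum_over_matrices.shape)
print(sum_over_rows.shape)
print(sum_over_columns.shape)
```

The three shapes printed are what remains of (2, 3, 4) after the named axis is collapsed.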
drinks.mean(axis=1) and drinks.mean(axis=0) both give the same error: TypeError: can only concatenate str (not "int") to str. How can I solve it?
Thanks for the question! In the current version of pandas, if a DataFrame contains non-numeric data and you want to calculate the mean of numeric rows or columns, you have to include the argument numeric_only=True. Hope that helps!
Thanks for the video. You actually made it feel so easy to learn coding hahaha
That's awesome to hear!
@@dataschool that is true, but unfortunately it is not true... LOL
un grand Merci for your knowledge and your diction...
You're very welcome!
The best explanation for axis parameter that I have ever gotten, 0 for moving down and 1 for moving right . But for dropping a column, The axis =1 right? how is that possible?
thanks a lot bro. you are doing social work by educating people in the current trend.
Thanks!
Q&A Series: Video #11 _3(TC01:36)
drinks.drop()
01:12
Each row represents a country and their reported alcohol consumption per adult. To REMOVE a COLUMN (eg continent column), we'd use the DROP method (DataFrame method)
drinks.drop('continent')
Q&A Series: Video #11 _8(TC04:00)
drinks.mean()
.mean()
DOT MEAN()
axis parameter had always confused me, Now i understand how it actually works.Thanks buddy.
If possible please create a playlist for matplotlib.
Glad I could be of help!
Thanks for your video suggestion - I'll consider it for the future.
he's right, I would love to see your explanation of matplotlib...
Q&A Series: Video #11 _5(TC01:56)
drinks.drop('continent', axis=1).head()
As always, thank you for such a valuable video!
You're very welcome!
Thanks a lot Kevin for this great video! The visual explanation works better than Stack Overflow answers. What got me confused with the pandas axis is the pandas.concat function. I couldn't figure out why axis = 0 is vertical concat and axis = 1 horizontal. With the logic illustrated in this video, I guess I should think of concat as a kind of operation where axis = 0 is moving along the index? Thanks again.
axis=0 is the "rows" or "index" axis, meaning concatenate (stack) rows. axis=1 is the "columns" axis, meaning concatenate columns. Hope that helps!
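A tiny sketch of both directions, using two made-up 2x2 DataFrames (top and bottom are just illustrative names):

```python
import pandas as pd

top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# axis=0: grow along the row axis, so the frames are stacked vertically
stacked = pd.concat([top, bottom], axis=0)

# axis=1: grow along the column axis, so the frames sit side by side
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

Note that the original index labels are kept when stacking, so stacked has the labels 0, 1, 0, 1 unless you pass ignore_index=True.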
Q&A Series: Video #11 _12(TC05:24)
drinks.mean(axis=0)
================
Kevin says, "The way I'd like to IMAGINE is THESE FOUR NUMERIC COLUMNS are BEING COLLAPSED DOWN into a SINGLE SET of FOUR NUMBERS that represents the MEAN of EACH COLUMN."
Q&A Series: Video #11 _7(TC02:40)
drinks.drop(2, axis=0).head()
============================
.drop(2, axis = 0)
=======================
Thanks, this was a great video, going to come in handy
Great to hear!
Thank you for teaching us the two different methods and how axis operates. I have no idea why they stuck with two different meanings of axis in two areas: DataFrame operations and mathematical operations. Honestly, this complicates things and creates confusion between the two areas. But again, it is the fault of the founders of pandas for not sticking with the axis convention used by mean, which I think they should have done. 😊
How do you set the DataFrame display to look like a table or Excel sheet, the way your DataFrame output is shown?
Hi axis parameter behaves differently in dropna() method. When my axis=1 the na values in each columns are dropped, whereas in case of 0, the na values of each rows are dropped. Kindly explain this. Thanks
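For what it's worth, dropna seems to follow the same convention as drop: axis names the kind of thing being removed. A minimal sketch on a made-up DataFrame with a single missing value (names and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})

# axis=0: remove every row that contains a missing value
rows_dropped = df.dropna(axis=0)

# axis=1: remove every column that contains a missing value
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

So with axis=0 row 1 disappears, and with axis=1 column x disappears, while the rest of the data is kept.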
Great video! In the specific case of the mean method, I think that a DataFrame of student grades in a semester would be more meaningful in both axis directions. You could get the average class grade for each exam or the average grade for each student.
Nevertheless, I am finding very useful to study pandas through your content! Thank you!
Excellent explanation!!! Thank you so much!
You're welcome!
A really useful video series. Thanks so much.
You're welcome!
Thanks for the series. I have a question for a code thats not quite working as I expected.
I am using your dataset of ufo, so I am trying to drop all rows that "Shape_reported" == Other.
But for some reason it drops the first row which is not an "Other".
Here is the ufo.head(3)
City Shape_Reported State Time
0 Ithaca TRIANGLE NY 6/1/1930 22:00
1 Willingboro OTHER NJ 6/30/1930 20:00
2 Holyoke OVAL CO 2/15/1931 14:00
So if I drop rows with the "Other" Shape_reported I should get indexes 0,2 and so on, but 0 is not appearing in the list:
in[]: ufo.drop(ufo['Shape_Reported']=='OTHER',axis=0).head()
out[]: City Shape_Reported State Time
2 Holyoke OVAL CO 2/15/1931 14:00
3 Abilene DISK KS 6/1/1931 13:00
4 New York Worlds Fair LIGHT NY 4/18/1933 19:00
5 Valley City DISK ND 9/15/1934 15:30
As you can see, Ithaca NY does not appear in the table. Does anyone know why?
I am using Jupyter console, not the notebook
I think you want to be filtering, not using the drop method. I suggest checking out this video: ruclips.net/video/2AFGPdNn4FM/видео.html
Let me know if that helps!
I watched that video before, but it never occurred to me to do:
ufo[ufo["Shape Reported"] != "OTHER"]
Which works nicely, thanks for the help!
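For anyone curious why the drop call misbehaved: drop expects index labels, not a boolean Series. One plausible explanation for the missing Ithaca row (an assumption on my part, and it may depend on the pandas version) is that the False and True values were treated as the labels 0 and 1, since False == 0 and True == 1 in Python. A sketch of two approaches that should work, on a three-row stand-in for the ufo data:

```python
import pandas as pd

# Stand-in for the first three rows of the ufo dataset
ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke"],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL"],
    "State": ["NY", "NJ", "CO"],
})

is_other = ufo["Shape Reported"] == "OTHER"

# Option 1: filter, keeping the rows where the condition is False
filtered = ufo[~is_other]

# Option 2: still use drop, but pass it the index labels of the matching rows
dropped = ufo.drop(ufo[is_other].index, axis=0)

print(filtered)
print(dropped)
```

Both options keep rows 0 and 2 here; filtering is usually the shorter of the two.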
You're very welcome!
Dear Sir, in the 'axis parameter' video (at 2:20) you mention that "I DID NOT use the inplace parameter, so it did not remove the column, this is temporary". What does 'inplace' actually do?
Thanks.
See this video: ruclips.net/video/XaCSdr7pPmY/видео.html
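In short, and as a rough sketch on made-up data: without inplace=True (or an assignment), drop returns a new DataFrame and leaves the original alone. With inplace=True, the original is modified and the call returns None:

```python
import pandas as pd

drinks = pd.DataFrame({
    "country": ["A", "B"],
    "beer_servings": [100, 200],
    "continent": ["Europe", "Asia"],
})

# No inplace, no assignment: the result is a temporary new object
drinks.drop("continent", axis=1)
columns_before = list(drinks.columns)  # 'continent' is still there

# Option 1: keep the result by assigning it to a name
reassigned = drinks.drop("continent", axis=1)

# Option 2: change drinks itself; the method returns None in this case
result = drinks.drop("continent", axis=1, inplace=True)

print(columns_before)
print(list(drinks.columns), result)
```

Either option makes the change stick; assignment has the advantage that the original object can be left untouched.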
.mean() is not working for me
Drop the country and continent columns first. You can't take a sum or mean of strings.
Alternatively, you can include the argument numeric_only=True, for example: drinks.mean(numeric_only=True). That way, you can still perform the mean operation without dropping data that you might want to keep. Hope that helps!
Q&A Series: VIDEO #11 _1 (TC01:05)
import pandas as pd
after dropping a row, how to re-index the table?
i think its DataFrame.reset_index
@@DywanJohnson it didnt work
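A guess at why it might not have worked: reset_index returns a new DataFrame, so the result has to be assigned back (or inplace=True used). A minimal sketch on a made-up one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30]})

# Dropping row 1 leaves a gap in the index: 0, 2
df = df.drop(1, axis=0)
index_after_drop = df.index.tolist()

# drop=True discards the old index instead of adding it as a new column
df = df.reset_index(drop=True)

print(index_after_drop)
print(df.index.tolist())
```

Without drop=True, the old labels are kept in a new column named index, which may or may not be what you want.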
Confusing topic.
When I think of dataframe as the "collections of Series" that share the same "index",
and not like "rows and columns", things become more clear.
The main question for me is:
"Why do we use 'axis=1' when dropping column?"
From my pondering I may only conclude that dropping is vectorized operation (not atomic).
Q&A Series: Video #11 _4(TC01:50)
DataFrame.drop()
00:28
DROPPING ROWS and COLUMNS: RECAP
> parameter
import pandas as pd
drinks = pd.read_csv(' bit.ly/drinksbycountry ') 01:01
While performing the mean operation, I get an error saying that it could not convert the country's name to numeric. What should I do?
You need to include the argument numeric_only=True, for example: drinks.mean(numeric_only=True). This is a new requirement in pandas for cases in which you want to calculate the mean of numeric rows or columns and the DataFrame contains non-numeric data. Hope that helps!
I tried to recap the "the dropping the column concept" and wrote "drinks.drop(columns=['beer_servings'],axis=1,inplace=True) " and it still worked, how is it so?
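As far as I can tell, it works because columns= already says which axis the labels belong to, so the extra axis=1 is simply redundant. A small sketch on made-up data comparing the three spellings:

```python
import pandas as pd

drinks = pd.DataFrame({
    "country": ["A", "B"],
    "beer_servings": [100, 200],
    "wine_servings": [10, 20],
})

# Labels plus axis
a = drinks.drop("beer_servings", axis=1)

# columns= keyword on its own
b = drinks.drop(columns=["beer_servings"])

# columns= keyword with a redundant axis=1
c = drinks.drop(columns=["beer_servings"], axis=1)

print(a.equals(b), b.equals(c))
```

All three return the same DataFrame here, so using columns= alone is probably the clearest choice.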
If some words are repeated in a column, how can we count their occurrences, and how can we filter on them?
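One way to approach this is value_counts for the counting and isin for the filtering. A minimal sketch, assuming a made-up Series of continent names (the data is hypothetical):

```python
import pandas as pd

continents = pd.Series(["Asia", "Europe", "Asia", "Africa", "Asia"])

# How many times each distinct value occurs
counts = continents.value_counts()

# Values that occur more than once
repeated = counts[counts > 1].index

# Keep only the entries whose value is repeated
only_repeated = continents[continents.isin(repeated)]

print(counts)
print(only_repeated)
```

The same pattern works on a DataFrame column, for example df[df["col"].isin(repeated)], where df and col are placeholders for your own names.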
Isn't it a bit a contradiction? for drop, axis =1 means a column, while for mean (of a column) we use axis = 0 ?
When you are removing something, you are specifying the axis from which you want to remove something. To remove a column, the axis from which you want to remove something is axis 1. When you are aggregating something, you are specifying the axis along which you want the aggregation to occur. Thus if you want to aggregate all rows with the mean function, the axis along which you are aggregating is axis 0. The result is that you have a mean of each column, but the key point is that you aggregated all rows, and the row axis is 0. Hope that helps!
@@dataschool It is still somewhat counter-intuitive, but this is the way it is. Anyhow, thank you for your quick answer and for this entire great RUclips channel. You are doing a great work!
Thanks so much for your kind words!
One of the columns of my DataFrame contains integers, floats and missing values (i.e. empty), and its dtype shows 'object'. How do I iterate to get part of the column to analyse? Please help
When you say "get part of the column to analyse", what exactly do you mean? Could you give a specific example? Thanks!
Your presentation is fantastic... How can I watch all of your pandas presentations as a series in one sitting?
Is this what you're looking for? ruclips.net/p/PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y
Thanks for the wonderful explanation. I just have a question: "Does the behaviour of the axis parameter change according to the method?"
Yeah it does. In the beginning when we wanted to drop, there was no "motion", but when we did the mathematical calculations, the axis was used as a "collapsor"
Awesome...Awesome...Superb, and thanks for these videos
You're very welcome!
For example, sometimes we have country names as the index. How do we use boolean masking on such a DataFrame, and how do we get that index value?
I'm sorry, I don't understand your question. Could you clarify? Thanks!
what is Boolean Masking?
A boolean mask is basically how you filter by condition. This video should help: ruclips.net/video/2AFGPdNn4FM/видео.html
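As a quick illustration of "filter by condition" (a sketch with made-up data, not code from the linked video):

```python
import pandas as pd

drinks = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'beer_servings': [10, 200, 300],
})

# The comparison produces a Series of True/False values: the boolean mask.
mask = drinks['beer_servings'] > 100

# Indexing with the mask keeps only the rows where it is True.
heavy = drinks[mask]
print(heavy)
```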
I really enjoy your videos and instruction. you are a really great instructor and I am always looking for your videos first because they make the most sense. Do you teach any courses? That would be great I would take them for sure. Thanks very much, Barbara Dilucchio
Thanks! Here is the one course that I currently teach: www.dataschool.io/learn/
You can hear about future courses by subscribing to my newsletter: www.dataschool.io/subscribe/
Thanks again for another great video.
My question is: are there any axes greater than 1, and if so what are they used for? Regards!
You're welcome! Regarding your question, the only time (I can think of) when axis is greater than 1 is when using the Panel data structure, a container for 3-dimensional data: pandas.pydata.org/pandas-docs/stable/dsintro.html#panel
Awesome tutorial, thank you!
Glad it was helpful to you!
Q&A Series: Video #11 _10(TC04:36)
drinks.mean()
===========
Kevin says, "Why did it give us the mean of each pandas SERIES?" He goes on to say, "The default behavior of mean is axis=0."
Cf. (compare)
drinks.mean()    drinks.mean(axis=0)
===================================
Both produce the same result. Think why?
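A small check of the note above, assuming a made-up numeric DataFrame in place of the real drinks data:

```python
import pandas as pd

drinks = pd.DataFrame({
    'beer_servings': [10, 20, 30],
    'wine_servings': [1, 2, 3],
})

default_means = drinks.mean()          # axis not given
explicit_means = drinks.mean(axis=0)   # axis given explicitly

# Both calls collapse the rows and return one mean per column.
print(default_means.equals(explicit_means))
```

The two results match because axis=0 is the default for mean.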
How do we add columns/rows?
In sklearn.preprocessing, I used normalize(), which has an 'axis=' parameter. I experimented with both axis=0 and axis=1, and the model scores are quite different. axis=1 is the default. In practical work, should I try both axes to capture the best model score?
The axis parameter defines the direction along which the normalization should take place. Depending on your reason for normalization, there is a correct axis to use. So, I would not recommend trying both and picking which one works better. Rather, I would recommend figuring out whether you are trying to normalize samples or features, and then using the axis which does that. Hope that helps!
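To show what "normalize samples or features" means, here is a sketch in plain NumPy rather than scikit-learn itself. It assumes L2 normalization and a made-up matrix in which rows are samples and columns are features:

```python
import numpy as np

# Made-up data: 2 samples (rows) by 2 features (columns).
X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

# axis=1: scale each row (sample) to unit length.
by_sample = X / np.linalg.norm(X, axis=1, keepdims=True)

# axis=0: scale each column (feature) to unit length.
by_feature = X / np.linalg.norm(X, axis=0, keepdims=True)

print(by_sample)
print(by_feature)
```

In this toy case the two samples point in the same direction, so sample-wise normalization makes them identical while feature-wise normalization keeps them distinct. That is why the choice should follow from the goal and not from the score.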
Sir... Love your series.... 👌
Hats off to you
I just have a question..
See, .drop(axis=1) removes a column,
whereas .sum(axis=1) gives the sum of a row. I'm a bit confused. Can you help?
I face a problem when I try to write conditional statements (if, elif, else) while using pandas. The error says "Truth value of a series is ambiguous". Any ideas how to fix it?
I think if you search Stack Overflow with that error message, you will find some answers that explain what you are doing wrong. Good luck!
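The commenter's code is not shown, so the following is only a guess at the cause: a common way to hit this error is to put a whole Series into an `if`. The sketch below reproduces that and shows one element-wise alternative using np.where (the data and labels are made up):

```python
import numpy as np
import pandas as pd

beer = pd.Series([10, 200, 300])

# `if beer > 100:` asks Python for a single True/False, but the comparison
# yields one boolean per element, so pandas raises a ValueError.
try:
    if beer > 100:
        pass
    raised = False
except ValueError:
    raised = True

# An element-wise alternative: np.where picks a label for each element.
labels = np.where(beer > 100, 'high', 'low')
print(raised, labels)
```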
Your tutorials are great! I am just learning Jupyter/pandas and looking forward to learning more. I have a quick question: how do I drop multiple rows? For example, I have a CSV file in which every other row is blank; these rows read as NaN and just clutter the DataFrame. I have not been successful in removing them.
One of these videos might help:
ruclips.net/video/fCMrO_VzeL8/видео.html
ruclips.net/video/2AFGPdNn4FM/видео.html
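The linked videos are not reproduced here. As one possible approach, assuming the blank rows are read in as rows where every value is NaN, dropna with how='all' removes them (made-up data):

```python
import numpy as np
import pandas as pd

# Made-up frame in which every other row is entirely missing.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', np.nan, 'z', np.nan],
})

# how='all' drops a row only if every value in it is missing.
cleaned = df.dropna(how='all')
print(cleaned)
```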
OK, I am confused
print(drinks.mean(axis=1, numeric_only = True).head())
print(drinks.drop(["country","continent"], axis=1).mean())
In the first line I have axis=1 and it gives the mean of each row
In the second line I have axis=1 and it gives the mean of each column
I am assuming pandas axis comes from 2D Numpy array, and axis 0=row or x value and axis 1=column or y value
My confused guess is somehow in the first command we are going down the column and taking the mean of every row, so the one refers to columns we are somehow iterating through
While in the second, we removed two columns that had strings, and then took the mean of each column, which makes more sense to me.
Does someone have a better way to explain this?
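One way to read the two lines in the question: in the first, axis=1 is an argument to mean, while in the second, axis=1 is an argument to drop and the mean that follows runs with its default axis=0. A sketch with made-up values:

```python
import pandas as pd

drinks = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'beer_servings': [10, 20, 30],
    'wine_servings': [2, 4, 6],
    'continent': ['X', 'Y', 'Z'],
})

# mean(axis=1): the column axis is collapsed, leaving one value per row.
row_means = drinks.mean(axis=1, numeric_only=True)

# drop(..., axis=1) removes columns; the following mean() uses its
# default axis=0, so the row axis is collapsed: one value per column.
col_means = drinks.drop(['country', 'continent'], axis=1).mean()

print(row_means.shape, col_means.shape)
```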
how can we drop rows 20 to 30?
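A possible sketch, assuming the DataFrame has the default integer index so that row labels 20 through 30 are the rows in question (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'value': range(40)})

# range(20, 31) covers labels 20 through 30 inclusive.
trimmed = df.drop(list(range(20, 31)), axis=0)
print(len(trimmed))
```

With a non-default index, the labels passed to drop would need to be the actual index labels of those rows.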
Hello, how do i create a web scraping application in python? Thank you!
This should help: ruclips.net/p/PL5-da3qGB5IDbOi0g5WFh1YPDNzXw4LNL
Hi Kevin, thank you for the video.
How can we retrieve a column or row dropped with the inplace=True parameter?
There is no way to retrieve a column or row that has already been dropped. Sorry!
Q&A Series: Video #11 _6(TC02:10)
drinks.drop('continent', axis=1).head()
--------------------------------------------------------------
Kevin says: "I did not use the inplace parameter, so it did NOT actually REMOVE the COLUMN."
Is there a concept of nested dataframes? for example, if a dataframe had 3 columns, can the 3rd column be a dataframe for each row?
I'm not quite sure... you can definitely have a Series that contains Python objects like lists or dictionaries, however. And you can use multi-level indexing.
so clear and so smooth
Thank you!
@dataschool 😄
Hi Kevin! Thank you for the class, it was great!
Interesting... You dropped row 2 (drinks.drop(2, axis=0)). However, when you take the mean by row, you can still see the mean of this row. I thought that once you drop a row you would not be able to return any values for it. Apparently I was wrong, and the drop does not invalidate other functions like .mean().
Kevin, I realized that you did not make an assignment: something like "drinks = drinks.drop(2, axis=0)" instead of just the operation "drinks.drop(2, axis=0)". Thanks again!!
That's correct, I didn't overwrite the original drinks object, or perform the operation "inplace". Thus, the DataFrame didn't change.
Glad you like the video!
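A short sketch of the behavior discussed in this exchange, using a made-up DataFrame:

```python
import pandas as pd

drinks = pd.DataFrame({'beer_servings': [10, 20, 30, 40]})

# Without assignment or inplace=True, drop returns a new DataFrame
# and leaves the original untouched.
dropped = drinks.drop(2, axis=0)
rows_before = len(drinks)

# Overwriting the name makes the change stick.
drinks = drinks.drop(2, axis=0)
rows_after = len(drinks)

print(rows_before, rows_after)
```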
The df.mean() you illustrated was a great example of applying a single function to multiple columns at once. This is quite handy in many operations. On the other hand, is there an efficient way to apply multiple functions to multiple columns? Say, I wanted to do mean of beer_servings, sum of wine_servings, sd of spirit_servings and median of total alcohol?
I think it would be best to apply each of those functions in separate steps.
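The reply suggests separate steps. As one alternative that is not mentioned there, DataFrame.agg accepts a dict that maps each column to an aggregation name. A sketch with made-up data (the column name `total` is a hypothetical stand-in for the total alcohol column):

```python
import pandas as pd

drinks = pd.DataFrame({
    'beer_servings': [10, 20, 30],
    'wine_servings': [1, 2, 3],
    'spirit_servings': [4, 6, 8],
    'total': [1.0, 2.0, 6.0],   # hypothetical column name
})

# Map each column to the name of the aggregation wanted for it.
summary = drinks.agg({
    'beer_servings': 'mean',
    'wine_servings': 'sum',
    'spirit_servings': 'std',
    'total': 'median',
})
print(summary)
```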
What is Boolean masking?
And how do I get an index that is of type string?
I'm sorry, I don't understand your question. Could you clarify? Thanks!
Can there be more than 2 axis (0 and 1) in a Pandas dataframe? I guess more fundamentally can there be more than 2 dimensional datasets and can axis be used to point to it? Maybe I'm talking about a hierarchical dataframe.
Yes, you can have more than 2 axes. Yes, that is a hierarchical index aka MultiIndex. Hope that helps!
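A sketch of what a hierarchical index looks like, with made-up data. Note that in this sketch the extra dimension shows up as a second index level on the row axis, while the DataFrame itself still reports two dimensions:

```python
import pandas as pd

df = pd.DataFrame(
    {'beer_servings': [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [('X', 'A'), ('X', 'B'), ('Y', 'C'), ('Y', 'D')],
        names=['continent', 'country'],
    ),
)

# The DataFrame still has two axes; the row axis carries two index levels.
print(df.ndim, df.index.nlevels)

# Aggregations can target one level of the hierarchical index.
by_continent = df.groupby(level='continent').mean()
print(by_continent)
```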
Great video. By the way, do you have a course about Joining CSV (different structure but common key) ? I'm keen to learn about this
Hi Kevin, thanks for the video. As others have noted: The axis coordinates for drop() and mean() appear to have opposite behaviors.
For drop('name', axis=1) the method scans in row-wise direction until it finds a column named 'name' and then drops all of values in a column-wise direction.
For mean(axis =1) the method scans in a column-wise direction and then calculates the mean of all the values in a row-wise direction.
Why would anyone think it would be a good idea to write methods that have opposite behaviors for referencing and operating? I'm sure there's a reason, but I can't see it.
I don't believe they have opposite behaviors, but I certainly understand that viewpoint. Here is how I think about it:
0 is the row axis, and 1 is the column axis. When you drop with axis=1, that means drop a column. When you take the mean with axis=1, that means the operation should "move across" the column axis, which produces row means.
In other words, I think of a mean as an operation that "scans", and I think of a drop as an operation that "selects". However, if you think of both operations as scanning, then I agree that the behaviors would seem to be opposite.
Hope that helps!
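The "select" versus "scan" reading can be sketched with the string axis names, which some people find easier to follow than 0 and 1 (made-up data):

```python
import pandas as pd

drinks = pd.DataFrame({
    'beer_servings': [10, 20],
    'wine_servings': [30, 40],
})

# "Select": axis='columns' says the label to drop is a column label.
without_wine = drinks.drop('wine_servings', axis='columns')

# "Scan": axis='columns' says to move across the columns,
# which yields one mean per row.
row_means = drinks.mean(axis='columns')
print(row_means)
```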
Hey Kevin, thanks for the video. I have a question, when applying a function to a df,
pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html,
it says:
"axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row",
How should I understand this? When axis=0, shouldn't the function be applied to each row instead of each column?
I know it's confusing! Basically, when you use apply with axis=0, you are saying you want to apply a function along axis 0, which means on each column. I talk about it more in this video: ruclips.net/video/P_q0tkYqvSk/видео.html
Hope that helps!
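A small sketch of the apply behavior described in this exchange, with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: the function receives each column in turn (it runs along the rows).
col_max = df.apply(lambda s: s.max(), axis=0)

# axis=1: the function receives each row in turn (it runs along the columns).
row_max = df.apply(lambda s: s.max(), axis=1)

print(col_max)
print(row_max)
```

The result for axis=0 is indexed by column name, and the result for axis=1 is indexed by row label.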
Finally! Thank you so much
You're welcome!
When I tried to import the CSV with
drinks=pd.read_csv('bit.ly/drinksbycountry')
I am unable to load it into the DataFrame, and it throws the error below in the last line of a very long message.
Any help, please, so that I can use the DataFrame to follow along with the video?
Thank you, you are the best!!!
Thanks!
You are the best
Thanks! :)
Great video. There is always an issue understanding axis when it comes to 3D arrays :(
In some cases we use axis=0 for rows... in other places axis=1 for rows... it's confusing...
0 is always for rows and 1 is always for columns. If you are confused, you can use the alternative like he suggested...use 'index' for rows and 'columns' for columns...
sir the mean function is not working for me
You need to include the argument numeric_only=True, for example: drinks.mean(numeric_only=True). This is a new requirement in pandas for cases in which you want to calculate the mean of numeric rows or columns and the DataFrame contains non-numeric data. Hope that helps!
Can I drop both columns and rows at the same time?
I don't think that is possible, sorry!
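For what it's worth, more recent pandas releases let drop take index= and columns= keyword arguments in the same call. A sketch with made-up data, assuming a pandas version that supports those keywords:

```python
import pandas as pd

drinks = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'beer_servings': [10, 20, 30],
    'continent': ['X', 'Y', 'Z'],
})

# index= lists row labels, columns= lists column labels; both in one call.
smaller = drinks.drop(index=[0, 1], columns=['continent'])
print(smaller)
```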
Great explanation thankyou
You're welcome!
Hey man, great videos. Thanks a lot... I have a question :) How can I compare two DataFrames and highlight the differences?
I can't think of an efficient way to do this, without knowing more specifics about the DataFrames. Do they have the same number of rows? Columns? Same index? Same column names? And so on. The solution would depend on those factors.
The DataFrames have the same columns, but the rows might differ.
I used the code below, which gives me the differences, but it takes a lot of time. If you can suggest a more efficient way than this, it would really help me a lot.
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
# We want to be able to easily tell which rows have changes
def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"
# Read in both excel files
df1 = pd.read_csv(r'FL_insurance_first.csv')#,index_col='policyID', parse_dates=True)
df2 = pd.read_csv(r'FL_insurance_second.csv')
# # Make sure we order by account number so the comparisons work
df1.sort(columns="policyID")
df1=df1.reindex()
df2.sort(columns="policyID")
df2=df2.reindex()
# # Create a panel of the two dataframes
diff_panel = pd.Panel(dict(df1=df1,df2=df2))
# #Apply the diff function
diff_output = diff_panel.apply(report_diff, axis=0)
diff_output.to_csv(r'my-diff-1.csv',index=False)
How about just:
(df1 != df2).sum()
That will tell you how many differences exist in each column, as long as there are the same number of rows and columns in df1 and df2.
If that works, you should be able to use filtering to reveal the rows that are different (I'd have to experiment to figure out the exact code).
Hope that helps!
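A sketch of the counting and filtering idea from the reply, assuming both DataFrames have identical row and column labels (the data is made up; `policyID` is borrowed from the question and `premium` is a hypothetical column):

```python
import pandas as pd

df1 = pd.DataFrame({'policyID': [1, 2, 3], 'premium': [100, 200, 300]})
df2 = pd.DataFrame({'policyID': [1, 2, 3], 'premium': [100, 250, 300]})

# Element-wise inequality: True wherever the two frames disagree.
changed = df1 != df2

# Number of differing cells in each column.
diffs_per_column = changed.sum()

# Rows of df2 in which at least one cell differs.
changed_rows = df2[changed.any(axis=1)]
print(diffs_per_column)
print(changed_rows)
```

One caveat: missing values compare as unequal to each other, so cells that are NaN in both frames would be counted as differences here.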
Sir, How to convert a Excel file into a CSV?
Thanks a lot for your videos helping me a lot during lockdown
Thanks for the great videos. I executed the code df.mean(axis='index'), but the result I got is Series([], dtype: float64). Why did this occur?
That depends on what is contained in your DataFrame. Sorry, that's all I can say!
Q&A Series: Video #11 _9(TC04:25)
drinks.mean()
===========
type(drinks.mean())
((pandas.core.series.Series))
==========================
1) beer_servings: 106.16.....
pandas.Series
2) spirit_servings: 80.99....
pandas.Series
3) wine_servings: 49.45...
pandas.Series
4) total_litres_of_pure_alcohol: 4.71...
pandas.Series
How did you import a CSV file using a bit.ly link like that? Doesn't it need to be a CSV file on the local computer? I tried opening the link in the browser and it takes me to your GitHub file, but the link changes to a longer one compared to what you wrote. How did you do that using bit.ly? It's really awesome :) Can anyone teach me?
read_csv can read from a URL, not just a local file! The bit.ly link simply points to a CSV file that is hosted on GitHub. Does that help?
@dataschool I think I understand. So basically bit.ly is just a shorter version of the link to the file that is hosted on GitHub, right? Thank you for the reply :) I really appreciate it.
Exactly!
Awesome tutorial :) Thanks :)
You're welcome! :)
Q&A Series: Video #11 _11(TC05:19)
drinks.mean(axis=0)
================
Kevin says, "The direction I want the operation to occur is DOWN."
You are the best.Thanks
Thanks! :)