At 16:44 I get the error message "ValueError: Got an unexpected argument: categories" when running "df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)". Please help.
Question: at 5:20, when you coded drinks.memory_usage(deep=True).sum(), it gave '24920L'. What does the 'L' after the figure mean? I think I also saw the 'L' appear when using the '.shape' attribute.
L stands for "long", which I believe refers to the "long integer" type, which is the NumPy data type being used to store that data. In other words, it's an implementation detail that you don't really need to know. Hope that helps!
I am wondering if there is any cryptographic system that can convert strings to integers and then decrypt them back. If yes, then why doesn't pandas implement that in the background to reduce space? Also, does using astype("category") have any effect when we export the dataframe to a CSV or Excel file?
It seems that in pandas 1.1.2, df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True) throws an error about an unexpected 'categories' argument. I guess this should work instead: df['quality'] = pd.Categorical(df.quality, categories=['good', 'very good', 'excellent'], ordered=True)
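On the export part of the question, here is a minimal sketch of a CSV round trip. The column name and values are made up for illustration. It writes a category column to an in-memory buffer and reads it back, so you can inspect what the file holds and which dtype comes back.

```python
import io
import pandas as pd

# Hypothetical data: a single column stored with the category dtype.
df = pd.DataFrame({"continent": pd.Series(["Asia", "Europe", "Asia"], dtype="category")})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# The CSV holds the category labels as plain text, not the integer codes.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The category dtype is not stored in the file, so it must be requested again.
reloaded_as_category = reloaded["continent"].astype("category")
```

If the category dtype matters after reloading, it can also be requested at read time with the dtype argument of read_csv.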
Thank you very much. Amazing tutorial. When trying this line `df['Quality'] = df.Quality.astype('category', categories = ['good', 'very good', 'excellent'], ordered=True)`, I encountered an error `TypeError: astype() got an unexpected keyword argument 'categories'`
I searched and solved it like this: `from pandas.api.types import CategoricalDtype`, then I used the line `df['Quality'] = df['Quality'].astype(CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True))`
Amazing as always! Is it possible to convert this type of data into category while we read the data into Python? Also, there is another data type called datetime. It would be great if you could cover that as well, for datetime manipulation.
You're welcome! And, I will do my best to create one on pivot table. In the meantime, here's a good post on it: pbpython.com/pandas-pivot-table-explained.html
Hi, thanks for sharing the greatest series of videos on pandas! Quick question: is there a way to load a CSV (larger than 2 GB) into a pandas DataFrame on a system with 2 GB of RAM? I am getting a 'memory error' while executing the code. I can't use 'category'; I need the data exactly as it is in the CSV. Thanks!
Thanks for your kind words! One strategy is to read in only some of the rows and columns (only the ones you need), demonstrated here: ruclips.net/video/B-r9VuK80dk/видео.html
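A minimal sketch of that strategy, using a small in-memory CSV as a stand-in for the large file (the column names and values are hypothetical). It uses the usecols, nrows and chunksize arguments of read_csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
csv_text = (
    "country,continent,beer,wine\n"
    "A,Asia,10,1\n"
    "B,Europe,20,2\n"
    "C,Asia,30,3\n"
    "D,Africa,40,4\n"
)

# Read only the columns that are needed, and only the first three rows.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["country", "beer"], nrows=3)

# Alternatively, stream the file in fixed-size chunks and keep a running result,
# so that the whole file never has to sit in memory at once.
total_beer = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["beer"], chunksize=2):
    total_beer += int(chunk["beer"].sum())
```

Chunked reading only helps when the processing can be done piece by piece, as with the running sum here.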
Hello, I am currently using pandas version 1.2.2, and I get an error while running this code: df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True). It says that astype() got an unexpected keyword argument 'categories'. Did they remove those parameters in newer versions of pandas, since this video is a few years old?
first of all thank you for all of your videos! my question would be: in your case the size of the continent category is 488KB but in my case its 744KB. Can you explain the reason behind this difference?
I want to compare two date-and-time columns and store the result as a categorical value in a new column: if both columns have the same date and time I need 1, otherwise 0. How can this be done? Please help me.
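One possible approach, as a minimal sketch: the column names start, end and same are hypothetical, and it assumes both columns can be parsed by pd.to_datetime. An element-wise comparison gives a boolean Series, which can then be cast to 1/0.

```python
import pandas as pd

# Hypothetical data with two date-and-time columns stored as strings.
df = pd.DataFrame({
    "start": ["2021-01-01 10:00", "2021-01-02 11:30", "2021-01-03 09:15"],
    "end":   ["2021-01-01 10:00", "2021-01-02 12:00", "2021-01-03 09:15"],
})

# Convert both columns to datetime so the comparison is on timestamps, not text.
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])

# True where the timestamps match, then cast True/False to 1/0.
df["same"] = (df["start"] == df["end"]).astype(int)
```

If the flag should be stored as a category rather than an integer, df["same"].astype("category") can be applied afterwards.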
For the bonus tutorial i got error as "_astype() got an unexpected keyword argument 'categories' " Has the definition to astype() changed? Appreciate if someone could help.
I had a similar error, I think what you did is you somehow ran the code without the "ordered = True" bit of the code at first or some such partial code and then tried to run it again with all the arguments as shown in the tutorial above, in that case it does show the error you mentioned. Just run the DataFrame creation command; ie, df = pd.DataFrame(...) again and then run the df.quality.astype(...) code, it should work. It did for me anyways. Let me know how it goes. Can anyone explain why it happens though? I am not sure about that.
For my dataset it reduced the size by approximately 50%. What I wanted to ask is: if it has to do a lookup each time, does this increase the time complexity?
Thanks for your great videos. I am really enjoying watching them and learning a lot. But most of these concepts are already addressed in the SQL world. I think that when you teach these topics, you could relate them to the corresponding SQL concepts. IMHO.
SQL and pandas can indeed accomplish many of the same tasks. For SQL users, you are right that SQL comparisons might be helpful. You might like resource #5 here: www.dataschool.io/best-python-pandas-resources/
You used a parameter called categories. This is not in the parameters of the astype method. I think it's in **kwargs. In the docs I found this: "kwargs : keyword arguments to pass on to the constructor." Where is the constructor? I cannot understand this.
Hi there, great videos. When we wrote drinks.continent.cat.codes.head() we got 1 2 0 2 0, but when I ran drinks.head() after that, it displayed Asia, Europe and so on instead of just the numbers, which should point to a lookup table containing the strings. Then I ran drinks.memory_usage(deep=True), which showed a reduced size for continent. How does this work? On one side it is not reflected in the DataFrame, and on the other side it shows a reduced size. Hope you can help me out soon. Thanks a lot for your amazing videos. Please make more videos on Data Science and ML topics.
Great question! The integers are the internal encodings for those categories, and the size is reduced due to those encodings. Does that help? You might like this video series: ruclips.net/p/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
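To make the encoding idea concrete, here is a minimal sketch on a made-up Series (not the drinks dataset from the video). It shows the integer codes, the lookup table they point to, and the memory reported for the string and category versions.

```python
import pandas as pd

# Hypothetical column with many repeats of a few distinct strings.
continents = pd.Series(["Asia", "Europe", "Africa", "Europe", "Africa"] * 1000)
as_category = continents.astype("category")

# Integer code stored for each row (the position of its label in the lookup table).
codes = as_category.cat.codes

# The lookup table itself: the distinct labels, sorted by default.
lookup = as_category.cat.categories

# Displaying the Series still shows the labels; only the storage differs.
object_bytes = continents.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
```

The saving comes from storing one small integer per row plus a single copy of each distinct string, which is why it is largest when a column has few unique values.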
Looks like Pandas ordered categories syntax has changed. Should now be: from pandas.api.types import CategoricalDtype df['quality'] = df.quality.astype(CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True))
Starting in pandas version 0.19, you can create a category column during the file reading process! Learn more here: ruclips.net/video/-NbY7E9hKxk/видео.html
And starting in pandas 0.21, the method for specifying ordered categories has changed. Learn the new method here: ruclips.net/video/te5JrSCW-LY/видео.html
I am recommending your channel to all my friends. You are too good.
Wow, thank you!
Yes, he is too good. Even our professor recommended learn pandas from him. lol.
@dataschool Thanks a lot. Could we have some videos on R Shiny or Python visualisation? I like your teaching style.
Thanks, I reduced my data from 592.4 MB to 195.0 MB using categories
That's amazing!!!
That is awesome!!
Remember, with big data you need pd.eval and df.query for filtering; these functions don't use memory for a temporary boolean Series.
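For readers who have not used df.query, a minimal sketch with made-up column names and values. It only shows that the query string selects the same rows as the usual boolean mask; it makes no measurement of memory use.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"beer": [0, 89, 245, 217], "wine": [0, 54, 14, 312]})

# Usual approach: build a boolean mask, then index with it.
mask_result = df[(df["beer"] > 100) & (df["wine"] < 100)]

# Same filter written as a query string.
query_result = df.query("beer > 100 and wine < 100")
```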
Dude, you are awesome! This is THE best tutorial on Pandas I have come across on the internet. You are really doing the internet a great favor! Thanks a lot!
Wow! Thank you so much for your kind words! :) You are very welcome.
very useful! I was still a bit skeptical but the example with the country series made it all very clear! you are good at giving the best frame to understand things
Excellent! Glad to hear that this video was helpful to you.
Good lord man, this is awesome and your way of teaching is well paced and easy to follow. You're a incredible teacher, keep this way and you will hit the stars!
Thanks so much for your kind words! Much appreciated!
You’re amazing at explaining, thanks for uploading these content
You're very welcome! Thanks for your kind comments :)
Well I must sound like a broken record about how good these videos are but they only get better. I've come close on occasion to manually implementing what the category dtype does, so thanks for that revelation.
Thank you! I'm glad the category tip was helpful to you!
category feature super powerful, glad i learnt this
Great to hear!
You should mention that if you perform df['mycolumn'] = df['mycolumn'].astype('category'), you won't be able to enter arbitrary strings into that column anymore (write operations are limited to the exact categories). This may be an advantage (typo protection) or a disadvantage, depending on the use case! Otherwise, thanks for the concise and clear instructions!
That's a great point, thank you for bringing it up! I really appreciate it.
I understand that a category column only accepts the values it already contains, but what should I do when I need to edit it?
For example, for sex/gender I used to have Male or Female. Now I need to store many other values. How do I edit or extend the category list?
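One way to extend the list is the cat.add_categories method. The sketch below uses made-up values; it first shows that assigning an unknown label is rejected, then adds the new labels and assigns again. The exact exception type is an assumption, so both TypeError and ValueError are caught.

```python
import pandas as pd

# Hypothetical column with two known categories.
sex = pd.Series(["Male", "Female", "Female"]).astype("category")

# Assigning a label that is not in the category list is rejected.
try:
    sex.iloc[0] = "Non-binary"
    assignment_failed = False
except (TypeError, ValueError):
    assignment_failed = True

# Extend the category list; add_categories returns a new Series.
sex = sex.cat.add_categories(["Non-binary", "Prefer not to say"])

# Now the same assignment is allowed.
sex.iloc[0] = "Non-binary"
```

Related methods on the cat accessor include remove_categories and rename_categories.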
I very much congratulate you for sharing code used in video with us. Many thanks for that. It is very much useful to me. My warm regards to you.
You're welcome!
loved ur explanation, great teacher
Thank you! 😃
Thank you a million. I've been struggling with inplace returning a None-type df most of the time.
Very useful tips. You make pandas easy to understand. Thank you!
You're very welcome!
Thank you for an excellent video on writing memory efficient code with categorical data in input. I'm interested in understanding various options to read in large dataframes (other than common pandas and spark methods) containing only numerical data, iterate over its length, create smaller dataframe out of it based on a condition and do some processing, all of which in a faster and memory efficient way. Please cover it if possible.
Thanks for your suggestion!
Amazing explanation along with hands-on examples. I am really stunned by your way of teaching. Thank you very much. Your accent sometimes reminds me of Bruce Lee.
Thank you!
This is one of the coolest tutorials I have watched on pandas. Thanks for making it. I have a question though: would these categories improve the speed of a for loop if I use iterrows() on the DataFrame?
Thanks for your kind words! As for your question, I'm not sure, sorry!
Using iterrows() in pandas is an anti-pattern and should only be done as a last resort. See engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
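As a minimal sketch of what the alternative looks like (made-up column names and values), the loop below and the single vectorized expression compute the same totals. The vectorized form lets pandas do the work on whole columns.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"beer": [0, 89, 245], "wine": [0, 54, 14]})

# Row-by-row loop with iterrows(): each row is built as a separate Series.
loop_totals = []
for _, row in df.iterrows():
    loop_totals.append(row["beer"] + row["wine"])

# Vectorized equivalent: one operation over entire columns.
df["total"] = df["beer"] + df["wine"]
```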
Great explanation! Never knew about "category" before.
Thanks! It's so useful, I knew I had to cover it in the video series!
Really useful tips. Thanks Kevin.
You're very welcome!
Great Videos. Thank you. Would appreciate your advice on the following -
I am attempting to maintain customer-wise product wise monthly sales data. The index would be the product and the columns would be the customer name. Data would have to be captured into the table every month.
1. How would you recommend setting up the structure - As different data frames for each month or as a 3 dimensional array, with the 3rd dimension being the monthly data.
2. How do you set up a blank structure containing all possible products and customers and then populate each data frame with monthly sales data received?
3. Suppose you start dealing with a new customer mid year, how do you populate the entire table with this new customer Series and then start capturing their sales data from the month they start buying?
Thank you in advance, for the answers
I'm sorry, but this is way beyond what I can address in a comment... good luck!
Thanks for ur knowledge sharing. My question is how this category is different from label encoding. They do the same thing?
Great question! When using the category data type, you are defining how pandas stores that column of data. However, you still treat that column as strings when working with it within pandas. With label encoding, your goal is to convert categories to numbers so that you can work with the numbers, not the strings. Does that answer your question?
Amazing as always. This entire playlist is in my favorites bar now! I have a quick question. I tried the bonus tip on the drinks-by-continent DataFrame just to see how it works:
drinks['continent']=drinks.continent.astype('category', categories=['South America', 'Africa', 'North America', 'Europe', 'Asia', 'Oceania'], ordered=True)
and I get this error TypeError: astype() got an unexpected keyword argument 'categories'
Any idea why?
(11:00) That method might be useful for data analysis, but if we apply machine learning algorithms, we HAVE TO use label encoding, one-hot encoding or similar techniques, right? What I actually want to know is how valid it is in ML to convert the attribute to the 'category' type instead of applying encoding techniques.
You are correct that converting to the category type does not prepare it for ML. See this video for more: ruclips.net/video/0w78CHM_ubM/видео.html
Amazing....I enjoy learning from the channel
Thank you!
Yeah, this one was totally awesome. Thanks for making the videos!
Ha! Thank you for the comment! And you are very welcome, I enjoyed making these videos.
If anyone is seeing a FutureWarning error when specifying categories, instead of:
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
use:
quality_dtype = pd.api.types.CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df.quality.astype(quality_dtype)
Right! The API changed in pandas 0.21. More details here: ruclips.net/video/te5JrSCW-LY/видео.html
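Putting the newer syntax together in one runnable sketch: the small DataFrame here is made up for illustration (the ID values are hypothetical). It creates the ordered dtype, applies it, then sorts and filters by the logical order instead of alphabetically.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data for illustration.
df = pd.DataFrame({
    "ID": [100, 101, 102, 103],
    "quality": ["good", "very good", "good", "excellent"],
})

# Define the logical order once, then apply it with astype.
quality_dtype = CategoricalDtype(categories=["good", "very good", "excellent"], ordered=True)
df["quality"] = df["quality"].astype(quality_dtype)

# Sorting follows the declared order: good < very good < excellent.
sorted_df = df.sort_values("quality")

# Comparisons against a label also use the declared order.
better_than_good = df.loc[df["quality"] > "good", :]
```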
Man. You are insanely good.
Thank you! 😊
Can you make a video on how to merge, join and concatenate in python and also differences between these. Nice videos by the way!
Thanks for your suggestion, I'll consider it! :)
Possible new topic: Methods in pandas that are not well known to most users. I've been using pandas for years and didn't know about the `cat`, `str`, and `memory_usage` methods. I'm familiar with `groupby`, `applymap`, `map`, etc. but it would be cool if you could show case some other methods that are less well known to the common users. Thanks
Great suggestion, thanks!
Didn't know about the memory_usage, cat, str, etc. Nice!
Thanks!
You are the best!
I'm feeling Lucky that I found your channel at right time in my learning path ...Thanks a lot!
I have one question here.
Could you please help me understand the general idea behind using 'categories' in the astype method, since it is not a pre-defined parameter in the method documentation (if we press Shift+Tab :) )? I mean, which parameters can we use in place of kwargs in an instance method, just like we used 'categories' here? (All properties/attributes of an object?)
Glad you like the videos! Please consider subscribing to the Data School mailing list: www.dataschool.io/subscribe/
Regarding your question, I don't know how to explain the technical details behind why you can pass the argument 'categories' in this case, other than to say that it's because the pandas code has been written to allow that argument. I'm sorry if that's not what you were looking for!
How to make the output to appear in a tabular form as is shown in your video? This gives the better clarity of data.
The way the output looks is determined by your editor. I'm using the Jupyter notebook, though note that the output varies even across different versions of the notebook.
Great explanation .Thank you.
You are welcome!
I am glad that I came across your videos. It is really helpful for me. However, can we use categorical and numeric features for building decision trees in sklearn? I am getting the following errors:
ValueError: could not convert string to float: 'Zimbabwe'
Thank you very much for your help.
You can use categorical features with any scikit-learn model, however you will need to transform them to numeric values. Here are two videos that may help you:
ruclips.net/video/0s_1IsROgDc/видео.html
ruclips.net/video/ylRlGCtAtiE/видео.html
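As a pandas-only illustration of turning strings into numeric columns, here is a minimal sketch using pd.get_dummies on made-up data. This is one option among several and is not necessarily the approach shown in the linked videos.

```python
import pandas as pd

# Hypothetical data: one string column and one numeric column.
df = pd.DataFrame({"country": ["Zimbabwe", "Albania", "Zimbabwe"], "beer": [64, 89, 64]})

# Replace the string column with one 0/1 indicator column per distinct value.
features = pd.get_dummies(df, columns=["country"], dtype=int)
```

The resulting frame contains only numbers, so a model no longer has to convert 'Zimbabwe' to a float.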
Thanks a lot! These are awesome.
You're very welcome! Glad they were helpful to you :)
custom ordered category is now a bit different:
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df.quality.astype(cat_type)
thanks! It works!
why does .info( ) have parenthesis? Isn't it an attribute of the DataFrame?
Brilliant video! Thx.bonus was awesome
Glad you enjoyed it!
Great video. I have a doubt. Suppose if i have a dataset about computers. I have a column for number of antivirus installed in a computer. I have total 100 observations but only 3 unique values for this column (1, 2 and 3). So should I consider this column as numeric or categorical?
It depends - what are you trying to predict?
Data School I am predicting if a particular machine will be attacked by a malware soon, based on its configurations and a number of other parameters including number of antiviruses installed
You would consider the column numeric.
Data School thanks a lot. May I know the reason please? And why it depends on the predictor?
Thanks. Very useful. Why do you prefer df.loc[df.quality > 'good', :] over df[df.quality > 'good']?
Either is fine. The first is more explicit, whereas the second is more readable, so I go back and forth! :)
awesome tutorial.. you made it so easy
Thanks!
Hi, could you do a lesson on using the pivot function in Pandas? Haven't seen a good example anywhere.
Thanks for the suggestion! Maybe this might be helpful to you? pbpython.com/pandas-pivot-table-explained.html
Thanks! That helps to explain it a bit better.
Cheers
Hi there, thanks for your excellent tutorial. I have a question that I have been unable to find an answer to: can you use these columns (the ones that have been converted into categories) in analysis, specifically in machine learning models? If not, how can one do it without having to use the get_dummies option, since I have a column with about 8,000 unique values?
I recommend scikit-learn's OneHotEncoder for this case. No, you can't directly feed a category column to scikit-learn. Hope that helps!
You are great Kevin
Thanks! You are great Serdar!
Great tutorial, thank you
You are welcome!
Hi Kevin,
Does this mean we can throw this category-converted variable into a machine learning model like logistic regression in sklearn or statsmodels?
No, that's not how it works, sorry!
Thanks for the very informative video. I have one question. How do we convert multiple columns to 'category' data type at once? In my data set, I have 25 categorical columns and 6 integer columns. So is there an efficient way of converting these 25 columns to categorical while importing the data set or after importing?
Thanks.
Great question! There might be an easy way to do this, perhaps with the apply function, but I'm not sure at the moment. Let me know if you figured out an efficient method!
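Two possibilities, sketched on a small in-memory CSV with hypothetical column names: passing a dtype mapping to read_csv, or calling astype on a list of columns after reading.

```python
import io
import pandas as pd

# Hypothetical CSV content: three categorical columns and one integer column.
csv_text = "city,color,size,count\nParis,red,S,1\nRome,blue,M,2\nParis,red,L,3\n"
categorical_cols = ["city", "color", "size"]

# Option 1: request the category dtype for the listed columns at read time.
at_read = pd.read_csv(
    io.StringIO(csv_text),
    dtype={col: "category" for col in categorical_cols},
)

# Option 2: read normally, then convert the listed columns in one step.
after_read = pd.read_csv(io.StringIO(csv_text))
after_read[categorical_cols] = after_read[categorical_cols].astype("category")
```

With 25 categorical columns, the list could also be built programmatically, for example from after_read.select_dtypes(include="object").columns, if every string column should become a category.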
I usually have to work with very large data samples, even for simple analysis.
The main issue I face is that pandas takes more time to read/store the DataFrame than to work on it.
Sadly, it is quicker and easier to just run some extractions using SQL, as it runs on the database server, than to import the data to my local machine.
Thank you! Is it possible to create multiple dataframes based on the categories I have in my dataset?
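One way to do this, as a minimal sketch with made-up data: iterate over a groupby and collect each group into a dictionary keyed by the category value.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"continent": ["Asia", "Europe", "Asia"], "beer": [0, 89, 25]})

# Each iteration yields the group label and the sub-DataFrame for that label.
frames = {name: group for name, group in df.groupby("continent")}

asia = frames["Asia"]
```

Keeping the pieces in a dictionary avoids creating a separate variable name for every category.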
My question is about using categorical variables to build a logistic regression model using statsmodels. I had some 0-1 integer variables that I wanted to use as some of the predictor variables to build a logistic regression model, but converted them to categorical thinking this would avoid being treated as numerical. However, I got a ValueError: unrecognized data structures: / . Do you understand why? I can take this to a different forum if that would be better..
My video coming out on July 12 will answer that question! I'll let you know when it's posted.
Check out my latest video, and see if it answers your question: ruclips.net/video/0s_1IsROgDc/видео.html
Hope that helps!
Nice video. My question goes a bit further. Suppose you wanted to use your k-1 dummy variables in a statsmodels or scikit-learn logistic regression. Would you leave them as type integer or convert them to type categorical?
You would leave them as type integer. Good luck!
There is no 'categories' or 'ordered' parameters in the astype() method
I use pandas version 0.25.1
So, how do I set a priority in this version?
Oh you did explain in your message
Thank you
This should help: nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas_changes.ipynb
Why did sort_values() method not work in line 9 and instead you used sorted()?
Hi
Thanks for the nice videos!
df[df.quality >'good'] also works
Is there any reason you use df.loc[df.quality > 'good'] in the last part of this video?
Under what conditions you use df[ condition] vs df.loc[condition]?
In this case, I use loc to be more explicit. In general, I use loc whenever its flexibility is required.
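A minimal sketch of that flexibility, on made-up data: for plain row filtering the two forms return the same rows, while loc can also pick columns in the same step and assign to the selection.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"ID": [100, 101, 102], "score": [3, 9, 7]})

# Row filtering only: both forms give the same result.
rows_only = df[df["score"] > 5]
rows_loc = df.loc[df["score"] > 5]

# loc can select rows and a column at the same time.
ids = df.loc[df["score"] > 5, "ID"]

# loc can also assign to the selected cells in a single step.
df.loc[df["score"] > 5, "score"] = 10
```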
loved this part
Thanks!
The tutorials are super nice and helpful, but I just got a slight problem that the 'categories' and 'ordered' arguments are not working in python 3.9 and pandas version 1.2.2
See here: ruclips.net/video/te5JrSCW-LY/видео.html
That's too good. Can you please come up with tutorial videos on Matplotlib?
Thanks for the suggestion! :)
Great Video Sir
Thanks!
Thanks for another fantastic video! I tried the tip at the end, and got a warning message: "FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead." I checked the pandas documentation and substituted CategoricalDtype, e.g. "cat_type = CategoricalDtype(categories=["good", "very good", "excellent"], ordered=True)" followed by "df['quality'].astype(cat_type)", but that didn't really work the way I was expecting either. Is there a newer way of accomplishing this?
Thanks for your kind words! Regarding your question, you are correct that this has changed in the latest versions of pandas. However, your proposed code looks exactly correct to me. What exactly are you expecting that you are not seeing?
Just to be clear, you do need to overwrite the existing 'quality' column if you want there to be a permanent change:
df['quality'] = df['quality'].astype(cat_type)
I discuss the new syntax for specifying categories in my latest video, "5 new changes in pandas you need to know about": ruclips.net/video/te5JrSCW-LY/видео.html
Hope that helps!
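A minimal sketch of the point about overwriting the column, using a small made-up DataFrame (the data is an assumption, not the video's file). Calling astype on its own returns a new Series and leaves df untouched, so the result has to be assigned back before sorting follows the category order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data, assumed for illustration only
df = pd.DataFrame({'ID': [100, 101, 102, 103],
                   'quality': ['good', 'very good', 'good', 'excellent']})

cat_type = CategoricalDtype(categories=['good', 'very good', 'excellent'],
                            ordered=True)

# Without assignment: the converted Series is discarded, df is unchanged
df['quality'].astype(cat_type)
unchanged = not isinstance(df['quality'].dtype, CategoricalDtype)

# With assignment: the column is now an ordered category
df['quality'] = df['quality'].astype(cat_type)

# Sorting now follows the logical order, not alphabetical order
sorted_df = df.sort_values('quality')
```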
brilliantly explained.
Thank you!
Great job Kevin!
Thanks! :)
I've been following along on your examples and they've all been incredible, but I encountered an error I can't seem to get around on this one. At about 16:45, the command
df['quality'] = df.quality.astype('category', categories=['good','very good','excellent'], ordered=True)
is given and whenever I try and submit that line to the compiler I get the error
ValueError: Got an unexpected argument: categories
Was there an update to Pandas that may have changed this function or is there some kind of error I'm not aware I'm making?
I had tried going to your github and copying the line you used from there, but I was getting the same error
The pandas API has changed, please see this video: ruclips.net/video/te5JrSCW-LY/видео.html
As usual, great lesson. Many thanks!
Thank you!
these videos are excellent!
Thanks!
at 16:44 I get the error message "ValueError: Got an unexpected argument: categories" for running "df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered =True)" . please help
The pandas API has changed. See this video: ruclips.net/video/te5JrSCW-LY/видео.html
Excellent video!! thank you!
You're very welcome!
I should have found your channel earlier! Thanks for sharing a great video
😄
question: at 5:20, when you coded drinks.memory_usage(deep=True).sum(), it gave '24920L'. What does the 'L' mean after the figure?
I think I seemed to see the 'L' thing appears when using the '.shape' function. what does that 'L' mean?
L stands for "long", which I believe refers to the "long integer" type, which is the NumPy data type being used to store that data. In other words, it's an implementation detail that you don't really need to know. Hope that helps!
I am wondering if there is any cryptographic system that can convert strings to integers and then decrypt them back.
If yes, then why doesn't pandas implement that in the background to reduce space?
Also, does using astype("category") have any effect when we export the dataframe to a CSV or Excel file?
Question 1 - I'm not sure. Question 2 - no effect. Hope that helps!
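A rough sketch of what "no effect" looks like for a CSV export, using made-up data. The written text is the same whether or not the column is a category. The one thing to be aware of is that the category dtype is not stored in the file, so it has to be requested again when reading the file back.

```python
import io
import pandas as pd

# Hypothetical data, assumed for illustration only
df = pd.DataFrame({'continent': ['Asia', 'Europe', 'Africa', 'Europe']})
as_category = df.assign(continent=df['continent'].astype('category'))

# to_csv without a path returns the CSV text as a string
csv_plain = df.to_csv(index=False)
csv_category = as_category.to_csv(index=False)

# Reading the text back gives an ordinary string column, not a category
reloaded = pd.read_csv(io.StringIO(csv_category))
```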
Seems like in the latest pandas 1.1.2 version
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True) this throws an error saying unexpected categories argument.
I guess this should work.
df['quality'] = pd.Categorical(df.quality, categories=['good', 'very good', 'excellent'], ordered=True)
Thanks for sharing! Yes, the pandas API for ordered categories has changed since I recorded this video.
Amazing tip, thank you again!
You're very welcome!
Very useful!
Agreed! It's surprising that it's not more widely known! I'm trying to change that :)
How many repeated (non-unique) values does a column need before it is worth converting to a category?
Hi, why do you have to put memory_usage='deep' and not only memory_usage?
That's how you specify the parameter
@dataschool Thank you very much!!! I never thought you would answer, and thank you very much in general for your content, you have taught me so much!!!
Thank you very much. Amazing tutorial. When trying this line `df['Quality'] = df.Quality.astype('category', categories = ['good', 'very good', 'excellent'], ordered=True)`, I encountered an error `TypeError: astype() got an unexpected keyword argument 'categories'`
I searched and solved it like this: `from pandas.api.types import CategoricalDtype`, then I used the line like this: `df['Quality'] = df['Quality'].astype(CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True))`
cool. Very nice examples.
Thanks! Glad it was helpful to you!
Amazing always !!! Is it possible to convert these type of data into category while we read the data into python?
Also, There is another datatype called datetime. I think it would be great if you may enlighten us with that as well for the purpose of datetime manipulation in future.
Thanks! Regarding your first question, I haven't figured out a way to do it. Regarding datetimes, I will cover that in an upcoming video :)
My latest video on the datetime format has been released: ruclips.net/video/yCgJGsg0Xa4/видео.html
Hope that helps!
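On the earlier question about converting to category while reading the data: as the note at the top of this thread says, pandas 0.19 and later can do this during file reading. A minimal sketch, assuming a tiny in-memory CSV with made-up rows standing in for the real file:

```python
import io
import pandas as pd

# Hypothetical CSV text, assumed for illustration only
csv_text = (
    "country,continent\n"
    "Afghanistan,Asia\n"
    "Albania,Europe\n"
    "Algeria,Africa\n"
)

# The dtype mapping asks read_csv to build the category column directly
drinks = pd.read_csv(io.StringIO(csv_text), dtype={'continent': 'category'})
```

With a real file, the StringIO object would simply be replaced by the file path or URL.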
Thanks for sharing such great videos.
Can you create one video in explaining pivot table in pandas. That would be really helpful.
Regards
Rahul
You're welcome! And, I will do my best to create one on pivot table. In the meantime, here's a good post on it: pbpython.com/pandas-pivot-table-explained.html
+Data School thank you kevin
Hi... Thanks for sharing the Greatest series of videos on Pandas...!!!
Quick question: Is there a way to convert a csv (size more than 2 GB) to a pandas data frame in the system where the RAM is 2 GB. I am getting 'memory error', while executing the code. I cant use 'category', I need the data as same as in the csv. Thanks...!!!
Thanks for your kind words! One strategy is to read in only some of the rows and columns (only the ones you need), demonstrated here: ruclips.net/video/B-r9VuK80dk/видео.html
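A short sketch of the "read only what you need" strategy, with a tiny in-memory CSV standing in for the large file (the column names and values are made up). The chunksize loop at the end is an additional option not mentioned in the reply above: it processes the file piece by piece so that only one chunk is in memory at a time.

```python
import io
import pandas as pd

# Hypothetical CSV text, assumed for illustration only
csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n"

# Read only two of the columns and only the first two rows
subset = pd.read_csv(io.StringIO(csv_text), usecols=['a', 'c'], nrows=2)

# Alternative: iterate over the file in chunks of two rows each
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=['a'], chunksize=2):
    total += chunk['a'].sum()
```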
This was awesome!
Thanks!
Hi! The data file url doesn't seem to be working all of a sudden. Could you look it up please?
You can get the datasets from here if needed: github.com/justmarkham/pandas-videos
Hello, I am currently using pandas version 1.2.2, and
I get an error while running this code:
df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
It says that astype() got an unexpected keyword argument 'categories'.
Did they remove those parameters in newer versions of pandas, since this video is a few years old?
See this video: ruclips.net/video/te5JrSCW-LY/видео.html
first of all thank you for all of your videos!
my question would be:
in your case the size of the continent category is 488KB but in my case its 744KB. Can you explain the reason behind this difference?
Glad you like the videos! Regarding your question, it's probably due to the version of pandas or Python.
Hello Kevin!!
How can I rename my columns which I changed to categorical data to the original names of the columns?
You can use the DataFrame method 'rename', which I talk about in this video: ruclips.net/video/0uBirYFhizE/видео.html
Hi, Your videos are superb. Learnt a lot.Could you please explain me about pivot and pivot_table?
Thanks! I will consider that for future videos.
I want to compare two date and time columns and produce a categorical value in a new column: if both columns have the same date and time I need 1, else 0. How can it be done? Please help me.
df['new'] = (df['first'] == df['second']).astype(int)
I would be grateful if you made some tutorials on big data analytics, thanks
Thanks for your suggestion!
Data School, I hope we will see a great tutorial series from you about big data soon. 😊
Hey python shows an error whenever I type categories in astype, saying: astype got an unexpected keyword argument 'categories'. Can you please help.
the syntax got updated, you better check out the first comment he pinned on top
Ann Gu Thanks
I need something like categories for an age range, for example 0-10, 0-20... Is it possible?
Sure!
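One possible way to do this is pd.cut, which bins numbers into an ordered category. This is only a sketch: the ages, bin edges, and labels below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages, assumed for illustration only
ages = pd.Series([3, 10, 15, 27, 40])

# Bins are right-inclusive by default: (0, 10], (10, 20], (20, 30], (30, 40]
# include_lowest=True also lets an age of exactly 0 fall into the first bin
age_group = pd.cut(ages,
                   bins=[0, 10, 20, 30, 40],
                   labels=['0-10', '11-20', '21-30', '31-40'],
                   include_lowest=True)
```

The result is an ordered category, so comparisons and sorting follow the age ranges rather than alphabetical order.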
What about cols = ['col1', 'col2']; df[cols].apply(lambda x: x.astype('category'))
That seems like it would work!
For the bonus tutorial i got error as "_astype() got an unexpected keyword argument 'categories' "
Has the definition to astype() changed? Appreciate if someone could help.
I had a similar error, I think what you did is you somehow ran the code without the "ordered = True" bit of the code at first or some such partial code and then tried to run it again with all the arguments as shown in the tutorial above, in that case it does show the error you mentioned.
Just run the DataFrame creation command; ie, df = pd.DataFrame(...) again and then run the df.quality.astype(...) code, it should work. It did for me anyways. Let me know how it goes.
Can anyone explain why it happens though? I am not sure about that.
What version of pandas are you running?
Thanks, re-running the df creation worked. Here is my pandas version info from conda:
pandas 0.19.2 np112py36_1
-------------------------
file name : pandas-0.19.2-np112py36_1.tar.bz2
name : pandas
version : 0.19.2
build string: np112py36_1
build number: 1
channel : defaults
size : 8.4 MB
arch : x86_64
date : 2017-02-04
license : BSD
md5 : 5ce048ed69412b7bec27989c5c963678
noarch : None
platform : darwin
url : repo.continuum.io/pkgs/free/osx-64/pandas-0.19.2-np112py36_1.tar.bz2
dependencies:
numpy 1.12*
python 3.6*
python-dateutil
pytz
Excellent Article
Thanks!
For my dataset it reduced the size by approximately 50%. What I wanted to ask is: if it has to do a lookup each time, does this increase the time complexity?
No, the lookup shouldn't take a meaningful amount of time.
How to autoupdate the ID column
Thanks for your great videos, I am really enjoying watching them and learning a lot. But most of these concepts are already addressed in the SQL world. I think when you teach in the videos, you could relate these subjects to their SQL counterparts. IMHO.
SQL and pandas can indeed accomplish many of the same tasks. For SQL users, you are right that SQL comparisons might be helpful. You might like resource #5 here: www.dataschool.io/best-python-pandas-resources/
df['Quality'] = df.Quality.astype("category", categories=["good", "very good", "excellent"],ordered=True)
any idea when I run this I get this
See this video for details: ruclips.net/video/te5JrSCW-LY/видео.html
You used a parameter called categories. This is not among the parameters of the astype method.
I think it's in **kwargs. In the docs I found this: kwargs : keyword arguments to pass on to the constructor.
Where is the constructor? I cannot understand this.
Sorry, I don't know how to answer your question!
Hi there, great videos. When we wrote drinks.continent.cat.codes.head() we got 1 2 0 2 0, but when I ran drinks.head() after that, it still displayed Asia, Europe and so on, instead of just numbers pointing to a lookup table containing strings.
Then I ran drinks.memory_usage(deep=True), which showed a reduced size for continent.
How does this work? On one hand the change is not reflected in the DataFrame, and on the other hand the size is reduced.
Hope you can help me out soon.
Thanks a lot for your amazing videos.
Please make more videos on Data Science ML topics .
Great question! The integers are the internal encodings for those categories, and the size is reduced due to those encodings. Does that help?
You might like this video series: ruclips.net/p/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
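To illustrate the point about internal encodings, here is a minimal sketch with a made-up Series (not the drinks dataset itself). The displayed values stay as strings, while the integer codes and the lookup table are available through the .cat accessor.

```python
import pandas as pd

# Hypothetical data, assumed for illustration only
continent = pd.Series(['Asia', 'Europe', 'Africa', 'Europe', 'Africa'])
continent = continent.astype('category')

# Integer codes that are actually stored for each row
codes = continent.cat.codes

# Lookup table: position i holds the string for code i
categories = continent.cat.categories

# Mapping the codes back through the lookup table recovers the strings
decoded = categories[codes.to_numpy()]
```

Printing the Series still shows Asia, Europe and so on, because pandas performs this lookup whenever it displays the column.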
Looks like Pandas ordered categories syntax has changed. Should now be:
from pandas.api.types import CategoricalDtype
df['quality'] = df.quality.astype(CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True))
What worked for me is:
df['quality'] = pd.Categorical(df.quality, categories=['good', 'very good','excellent'], ordered=True)
@pdileepan thanks
Thanks for sharing! I have more details here: ruclips.net/video/te5JrSCW-LY/видео.html
Over 9000!!!!!
😄
before the conversion it was OVER 9000 !!!! @10:21
Pretty cool, right? :)
I was looking through the comments for this comment! xD
does anyone know why I get NoneType when I do df.info()? Thanks in advance.
Are you sure that the 'df' object is a pandas DataFrame?
Thanks, it's working now. Awesome tutorials btw.
Great to hear... thanks for your kind comment!
Thanks for making these videos.