23:39 that duplication was because of the header rows in each of the files. I've dealt with this a lot. You would have had had to have excluded those header rows on each file before you concatenated all of them together to resolve this. Great video course man, thank you for making all of that content
When passing a function to apply, you could have just passed the function name, there's no need to do apply(lambda x:get_city(x)). This is just enough and better => apply(get_city)
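A tiny sketch of the point above, on made-up addresses and a hypothetical get_city helper: passing the function object directly to .apply() behaves the same as wrapping it in a lambda.

```python
import pandas as pd

# get_city and the addresses below are made up for illustration.
def get_city(address):
    return address.split(",")[1].strip()

addresses = pd.Series([
    "917 1st St, Dallas, TX 75001",
    "682 Chestnut St, Boston, MA 02215",
])
cities = addresses.apply(get_city)             # function reference: shorter
same = addresses.apply(lambda x: get_city(x))  # lambda wrapper: same result
```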
Congrats oludamire, I'm guessing you're Nigerian. I'm Nigerian too and recently got into exploratory data analysis through the Udacity Nanodegree program. I'm currently on my second project, an investigation of the WeRateDogs Twitter dataset. I think I have learnt a thing or two so far. Do you think I'm ready for Turin?.. I hear it's like going to the big leagues lol.
Man, I really like your style. Firstly, because you take real-world problems and not primitive stuff like some other bloggers; secondly, because you encourage your viewers to search for solutions themselves; and thirdly, because you show how to find a solution to a given problem on the Internet. Please keep making similar videos! With best wishes and sincere appreciation from Ukraine.
Best data analysis video I have watched so far! I also love how most people in the comment sections have outlined alternative ways of approaching some of the tasks.
This is the most informative video I've ever seen on what data science actually is! I keep looking for actual applications and I loved seeing your thought process, comments, and method of asking and answering questions.
I would greatly appreciate another similar video with a new project using some newer functions and features: maybe understanding heatmaps, creating more complex functions, etc. Thanks again for this.
The dataset contains January data for both 2019 and 2020, so grouping by month alone doesn't work, because you only look at the month, not the year. Stopgap solution: also slice the year off the date string. Proper solution: convert the date string to an actual datetime, then group by month with pd.Grouper. I suggest putting a card or a note there so others aren't confused. Thanks for the video though!
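A hedged sketch of that "proper solution" on made-up data that mixes January 2019 and January 2020: grouping on a real datetime with pd.Grouper keeps the two Januaries in separate buckets.

```python
import pandas as pd

# Tiny made-up orders: two Januaries a year apart, plus one April order.
all_data = pd.DataFrame({
    "Order Date": ["01/05/19 10:00", "01/07/20 11:00", "04/02/19 09:30"],
    "Sales": [100.0, 50.0, 75.0],
})
all_data["Order Date"] = pd.to_datetime(all_data["Order Date"],
                                        format="%m/%d/%y %H:%M")
# Group on the datetime itself: Jan 2019 and Jan 2020 stay separate,
# instead of both collapsing into a single "Month 1".
monthly = all_data.groupby(pd.Grouper(key="Order Date", freq="M"))["Sales"].sum()
```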
Really good job; it gives us a real daily problem-solving experience. I'm sure most of us solve problems exactly this way: googling, trial and error. I don't understand why in interviews they expect you to know everything about the language, algorithms, and syntax.
Great video. Just a few suggestions: At 4:25, os.listdir('./') already returns a list, so [file for file in os.listdir(...)] is redundant unless you also want to filter. At 40:50 you don't need the lambda, even if you want to access the cell content; if you simply pass a reference to the function, the cell value will be passed as its argument. Example: def modify(a): return 'CHANGED ' + a + ' CHANGED' then df['Column'].apply(modify); modify without parentheses is the reference to the function.
Great tutorial, I've learned a lot! a suggestion for you first question for the best month for sales: Instead of creating the extra cols of 'month' and 'sales' we can use the pandas "resample" method which does the group by month for us, and just like in the groupby method we close it with the "sum" and we get the same table! all_data.resample('M', on='Order Date').sum().sort_values(by='Price Each', ascending=False)
But here's the catch: 'Order Date' is not a datetime type, so you have to convert it first. all_data["Order Date"] = pd.to_datetime(all_data["Order Date"], format="%m/%d/%y %H:%M")
I'm launching a data analytics bootcamp! goto.masterschool.com/5wn3sw Some highlights of the program: - Fully remote (with flexible working hours) - No tuition fees until after you land a job in tech - Open to applicants anywhere in the world! This is a 7-month long program kicking off in June. To learn more and get your application started, click the link above ⬆
One of the easiest, simplest, and best approaches to data analysis. This is my first time watching you, sir, and I'm already a sincere subscriber. while(True): watch, learn, and grow under your guidance. You are awesome.
@@stevejuso Really late reply, but just in case it helps someone: you can tell read_csv to read columns as dates by passing parse_dates=['col1', 'col2'] for any number of columns. You can tell it to use European format with dayfirst=True, and if you need a specific format you can supply your own parser via date_parser. So in my case it was: df = pd.read_csv('filepath', parse_dates=[datecols], dayfirst=True) to get the columns I needed into European date format. One key thing: it converts the dates to pandas Timestamps, but those are interchangeable with Python datetimes almost all of the time, and can be converted with .apply(lambda x: x.to_pydatetime()) if you need to.
Man, I don't know how to compliment you, but you teach and explain super well. I have learned Python pandas from you and have used it for other projects of mine. Keep up the good content, very valuable.
You get those headers multiple times because when you concatenated the files from the different months, the header row from each file was also included in the concatenation!
Best thing I have done today is finding this man. I was eating and chilling, saw the thumbnail of this video with pandas in the name, and thought let's see 5 minutes of what he has to say. But believe me guys, I have already watched up to 1:02:57, and it gets more interesting toward the end. Kudos to his technique ❤❤❤
At 32:15 you created the months list to pass to plt.bar() out of thin air. In the current scenario our data happens to come out sorted by month, so no issue appears; otherwise it would have plotted sales against the wrong month. Instead I tried this, please let me know if I'm wrong about it: all_data.groupby('Month')['Daily Sale'].sum().plot(kind='bar') then plt.show()
Amazing video! All the mistakes and the searching process make beginners in data science realize that it's possible to do a lot of things right from the start of the journey. Thanks.
I just wanted to see how DS works, and after searching and watching a lot of videos, this one is very understandable and very real. Thanks, author! Great job!
At 51:16 you could have simply passed results['City'] as the x-axis argument. Thank you so much. Looking forward to more real-time analysis like this one. It was a really cool hands-on exercise.
I enjoyed working through this real world data analysis problem with you. I look forward to more, please do more problems like this. It helps me to work out problems in Python.
That tutorial was really helpful for getting a first grasp of DS applications... please make more such "real-world DS solutions": airline data, travel data, company profit-with-strategy data, hotel service data, salary vs. domain-age datasets, etc.
Posted a new "Solving real world data science tasks" video! Check it out here: ruclips.net/video/Ewgy-G9cmbg/видео.html
This is awesome. Learning Python is so much easier when there's something tangible and grounded to work towards.
Hi Keith!!! I am getting an error after this line.
CODE: for file in files:
current_data = pd.read_csv(path + "/" + file)
ERROR: ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
Please can you help me solve this error... I tried to find a solution online but didn't get anywhere.
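In case it helps: this ParserError often comes from a non-CSV file sitting in the same folder (a hidden .DS_Store, or a merged output file written earlier). A sketch that filters to .csv files before reading; the folder layout and columns here are hypothetical:

```python
import os
import pandas as pd

def load_sales_csvs(path):
    """Concatenate every .csv in `path` into one DataFrame.

    Skipping non-.csv entries avoids the "Expected N fields" ParserError
    that a stray file with a different layout can trigger.
    """
    files = [f for f in os.listdir(path) if f.endswith(".csv")]
    frames = [pd.read_csv(os.path.join(path, f)) for f in files]
    return pd.concat(frames, ignore_index=True)
```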
@@colorways518 Just thinking out loud: aren't we able to find the below kind of info from Jungle Scout, Helium10, or Sellics? We are Amazon sellers; do we also need to go through Python and data science for Amazon? There are 3rd-party SaaS plug-ins to answer these questions. Correct me if I am wrong:
- What was the best month for sales? How much was earned that month?
For the problem of getting the city with the highest sales, we ran into an ordering problem while plotting the cities. I think we can also use result.index as our xticks.
That way it takes the values straight from the DataFrame in the right order, rather than using df.unique and rearranging.
This red warning appears because you didn't make a copy of the original dataframe; take a .copy() and the warning goes away.
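A minimal sketch of that fix on made-up data: take an explicit .copy() after filtering, then assign into the copy, and the red SettingWithCopyWarning no longer fires.

```python
import pandas as pd

df = pd.DataFrame({"Sales": [1.0, None, 3.0]})  # made-up data

# dropna() can hand back a view-like object; assigning into it triggers
# the warning. An explicit copy makes the assignment unambiguous.
clean = df.dropna().copy()
clean["Doubled"] = clean["Sales"] * 2  # assignment on the copy, no warning
```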
As a programmer/data analyst/systems administrator I can safely say that this is exactly how we solve problems in real life. Good job!
you wouldnt have watched this video if you were
@Pasha people who think they know it all are a bore. 🙄 You could always learn something new from other people, it never hurts to learn new perspectives. Good luck with that mindset. I learn everyday. 😌
@@justapugontheinternet Love your mindset 🎉🎉🎉🎉
the best part part was watching some one google the answer an seeing how they implement the solution instead of just acting like they know everything. man your tutorials are the best an down to earth
hahahahaaha you think this kid knows what he is doing? For your information, we all google, no matter what position we hold. 🤣 We built websites for a reason: to always look back at them when needed. Google just provides faster search than going to the source and digging through it. Get your mind straight about Google 🤣 This kid is clearly looking around for code he has already written, and you're assuming googling makes someone a bad programmer 😂 which tells me you were expecting movie-style hackers hahahaahahaha. Come to reality.
It honestly makes it feel more real, like, I am studying data science now and I google stuff all the time, the fact that even someone well versed in data science still googles stuff constantly is reassuring.
@@dragonmateX people who work in google google stuff 😂 get back to reality to why google is meant for🤣
@@Amir-tv4nn and? what the fuck is your problem? so far you didn't write anything valuable here
@@Amir-tv4nn Come to reality. Man, come to reality. Could you please come to reality? Btw you should come to reality
As a business major with very limited internship experience, I am teaching myself python and data analytics from scratch. This video is literal gold to me because this is one of the few that actually shows the entire wrangling process! Thanks for the great vid!
If I use only fd=pd.read_csv("./Sales_Data/Sales_April_2019.csv") I get a file-not-found error... I have to use the whole path starting from the C drive. How does he not get an error?
@@vilw4739 He is using a Jupyter notebook, where files are stored in the notebook's own directory; you can upload files into that directory and import them by simply running fd=pd.read_csv("./Sales_Data/Sales_April_2019.csv")
If you're using a local Python IDE like PyCharm or VS Code, the relative path is resolved against the working directory, so you may need to specify the whole path, like fd=pd.read_csv("C:/Data Science/Sales_Data/Sales_April_2019.csv"), to import it.
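A quick way to debug that file-not-found: print the absolute path Python is actually searching. "Sales_Data" here stands in for the video's folder name.

```python
from pathlib import Path

# Resolve the data folder relative to the current working directory.
data_dir = Path.cwd() / "Sales_Data"
print(data_dir)           # the absolute location being searched
print(data_dir.exists())  # False means the working directory isn't what you think
```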
@@ashiksrinivas thank you
@@vilw4739 did you ever figure it out? getting the same error
@@muhsintabatabayee8592 They should be in the same folder. Otherwise you need to put in the whole path.
This is the most practical Python tutorial video I've ever watched.
Watching this 4 years after you published it, and you're still a legend ! Thank you !!!
Thank you for watching and the kind words!!
This situation is so realistic. The mistakes, the solving... great video!
Yes liked it ..it was so realistic
is this sarcasm?
Юрій Черній pretty sure no it's not
Not only does he teach us about pandas, he also gives us the confidence that "if this guy can be so successful in data science, then why shouldn't I?"
@@ЧернійЮрійМиколайович no
Hi Keith, I feel obligated to personally thank everyone that helps in pursuing my data career and of course, you included. I've used your project (and learned a LOT) and modify/add codes here and there with my own styling for my online portfolio. Moreover, you're a fantastic teacher and you deserve all the credits you should get for helping others like me. Thank you for doing this, may God return the favor and always bless you. Rock on Keith!
Thank you so much for the kind words! :)
Video Timeline!
0:00 - Intro
1:22 - Downloading the Data
2:57 - Getting started with the code (Jupyter Notebook)
Task #1: Merging 12 csvs into a single dataframe (3:35)
4:25 - Read single CSV file
5:44 - List all files in a directory
7:06 - Concatenating files
11:00 - Reading in Updated dataframe
Task #2: Add a Month column (12:48)
14:12 - Parse string in Pandas cell (.str)
Cleaning our data!
17:31 - Drop NaN values from df
21:25 - Remove rows based on condition
Task #3: Add a sales column (24:58)
25:58 - Another way to convert a column to numeric (ints & floats)
Question #1: What was the best month for sales? (29:20)
30:35 - Visualizing our results with bar chart in matplotlib
Question #2: What city sold the most product? (34:17)
35:32 - Add a city column
36:10 - Using the .apply() method (super useful!!)
40:35 - Why do we use the lambda x ?
40:57 - Dropping a column
46:45 - Answering the question (using groupby)
47:34 - Plotting our results
Question #3: What time should we display advertisements to maximize the likelihood of purchases? (52:13)
53:16 - Using to_datetime() method
56:01 - Creating hour & minute columns
58:17 - Matplotlib line graph to plot our results
1:00:15 - Interpreting our results
Question #4: What products are most often sold together? (1:02:17)
1:03:31 - Finding duplicate values in our DataFrame
1:05:43 - Use transform() method to join values from two rows into a single row
1:08:00 - Dropping rows with duplicate values
1:09:39 - Counting pairs of products (itertools, collections)
Question #5: What product sold the most? Why do you think it did? (1:14:04)
1:15:28 - Graphing data
1:18:41 - Overlaying a second Y-axis on existing chart
1:23:41 - Interpreting our results
Thanks for watching! If you enjoyed, please consider subscribing :).
Heyy, machine learning would be awesome
I have very big data in xlsx format. read_excel takes like forever...
I am on holiday and have started data science for fun, to see what the buzz is all about. I have to say I love it, and I would appreciate it if you'd upload more videos like this. I have learnt a TON.
Hey man, are you gonna do more such videos anytime soon?
Thank you so much, it is very useful to me
Great tutorial!
55:00 When parsing a column into datetime, specifying the format manually will decrease the execution time significantly:
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'], format='%m/%d/%y %H:%M')
On Google Colab it was like 30 sec vs 2 sec. Great tip!
Love how this cool dude researches solutions on the fly and explains things as he goes even when he commits minor unforced errors. He is so relatable. His other tutorials on Pandas, Numpy, Matplotlib, etc. are equally helpful. I wish him all the success and hope that he continues to share his knowledge for decades to come.
He's such a GREAT tutor!!!
Agreed, totally relatable and helpful videos for beginners, giving them a chance to see what errors can result from what mistakes. Thanks for the informative guide.
I get the feeling in this video that you know more than you're letting on but you're just trying to make things as basic as possible and I love it. I hope to teach others in this same manner. God bless you
"I dont know how to do it, but i know how to google it." this guys knows how things going in real world haha
Googling is, indeed, one of the most important skills for coding.
Hahaha! We invite you to take a look at our videos which deal with the same topics :)
He's very fast too; like, I would need to know it already, because once I go to google I'm there for 4 hours :/
I did the exact same process be it R, Matlab or Py
@@carlurbananimals that's coz your question isn't exactly right ;)
This is the best data science class on the net (that I have seen, of course). We are solving real problems, using google, and working with datasets that require a lot of preprocessing. Perfect.
At 50:10, for anyone who wants to use .unique(): when you calculate the sales for each city, make sure to throw in a .reset_index(); it resets the indices and your bar chart comes out right.
cityy = all_data.groupby("City").sum().reset_index()
Then you do the rest like him. You can also sort in descending order; just follow the rest of his instructions.
cityy = all_data.groupby("City").sum().reset_index().sort_values("Sales", ascending=False)
xxx = cityy["City"].unique()
plt.bar(xxx, cityy["Sales"])
plt.ylabel("$$$")
plt.xlabel("Cities")
plt.xticks(xxx, rotation='vertical', size=8)
plt.show()
thanks a lot
Unfortunately, I am getting a ValueError. Any idea how I can solve this?
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (10,) and arg 1 with shape (12,).
I haven't got any proper answer from Google, or maybe I'm not expert enough to understand :p
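That shape-mismatch ValueError usually means the x labels and the bar heights came from two differently sized objects (10 cities vs 12 values). A sketch on tiny made-up data: take both from the same groupby result and they line up by construction.

```python
import pandas as pd

# Made-up sales data for illustration.
all_data = pd.DataFrame({"City": ["A", "B", "A", "C"],
                         "Sales": [10, 20, 5, 7]})
results = all_data.groupby("City")["Sales"].sum()
xs, heights = results.index, results.values  # equal length by construction
```

plt.bar(xs, heights) then works without the broadcast error.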
Dude, literally I have never seen anyone solving real-world problems on YouTube. Your way of teaching is quite impressive. Many YouTubers just showcase basic problems. But hats off to you!!!
Love how realistic and down to earth all your videos are! Makes data analysis way more approachable. What a guy!
Didn't watch more than a few minutes since I already know how to do most of this stuff but loved how the dude straight up tells us to google it. SO TRUE!!! I've had professors who tell me the same thing. Thumbs up.
I just entered the data analysis area, and it's amazing that this video was made 4 years ago! Thanks for making this; I learned from your skills and problem solving. Appreciated!
Just to add to what most people are saying: this is, in my opinion, the best way to do a tutorial. You showed me that even though I'm a super beginner, not long out of learning basic Python, I'm able to pick something up really easily, while realising that I don't have to feel bad thinking everyone else is better than me; even experienced programmers google stuff and are not gods sitting on pedestals acting like they're better than us haha. Great work.
As a new learner of Python, I found this to be one of the best videos on YouTube for beginners: how he managed to deal with the problems and solve them on the go (not knowing it all, but knowing how to consult Google for the right answer). Way to go! Loved the approach and how easy you made it look.
I write from Denmark, but I'm Chilean. I followed all the steps and really everything is very clear. I loved your explanations of each task and each question.
Content of this quality deserves far more recognition. Thank you!
It's been three years since the video was posted. For anyone watching now, as I am: one way of getting the names of the months from Order Date is to convert Order Date with to_datetime and use dt.month_name() for the Month column. One other thing to remember is to clean the data before starting the analysis.
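A sketch of that suggestion on made-up order dates: convert with to_datetime, then dt.month_name() gives the month as a name.

```python
import pandas as pd

# Two made-up timestamps in the video's mm/dd/yy HH:MM style.
all_data = pd.DataFrame({"Order Date": ["04/19/19 08:46", "12/30/19 00:01"]})
all_data["Order Date"] = pd.to_datetime(all_data["Order Date"],
                                        format="%m/%d/%y %H:%M")
all_data["Month"] = all_data["Order Date"].dt.month_name()
```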
Keith, you're literally the most underrated and one of the best teachers on YouTube. This exercise cleared most of my doubts about data science, and I fell in love with it because of you. Thank you so much for this, you're the best!
Great video! At the beginning it is much more concise to concatenate all the CSV files into one like this (best to put the CSV files in the same directory as the IPython notebook, and then):
files = [f for f in os.listdir("./") if f.endswith('.csv')]
df = pd.concat(pd.read_csv(i) for i in files)
THAT'S IT!
Thats better thanks
monthly_dataframes = [pd.read_csv(file) for file in glob.glob(filePath + "*.csv")]  # needs import glob
merged_dataframe = pd.concat(monthly_dataframes)
Thank you so much, I have been battling "No such file or directory" all morning.
Also consider adding a condition to skip the first row of each subsequent file - to avoid duplicate headers.
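A sketch of that duplicate-header cleanup with two tiny in-memory "files": the second carries a stray repeated header row, which shows up as a data row after concatenation, and dropping rows whose Order ID equals the literal column name cleans it up. The column names mirror the video's dataset; the data itself is made up.

```python
import pandas as pd
from io import StringIO

csv_a = "Order ID,Product\n1,Cable\n"
csv_b = "Order ID,Product\nOrder ID,Product\n2,Phone\n"  # stray header row
all_data = pd.concat([pd.read_csv(StringIO(csv_a)),
                      pd.read_csv(StringIO(csv_b))], ignore_index=True)
# Drop rows that are really repeated headers, not orders.
all_data = all_data[all_data["Order ID"] != "Order ID"]
```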
I'm new to data analysis. My instructor always tells us to search our questions on Google and get help from Stack Overflow. I didn't understand it until now, when I got stuck on my second project, a sales analysis. This helped me big time!!! I'm so thankful to you for showing all those shortcuts. The datetime split had such long, tricky code online.
Your assignments are harder than Coursera's. I'm actually learning something. Major thanks all the way from Holland! 🙏
omg!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! i searched so much for "a day in the life of a data scientist" thinking they would show a glimpse of reality, and this is the best portrayal AND simultaneously one of the best tutorial videos. YOU ARE A LEGEND!!!!
Great tutorial! thank you for sharing
At 50:26 for cities: you can always use the index values from the 'results' DF:
cities = results.index.values
instead of a for loop
if every human being on earth had the will and disposition to teach like Keith... the world would be a 99% better place
He is like my friend who teaches me the day before exams. 😂😅
you're the best. not only are you teaching people how to use python Pandas lib, but you're also teaching the type of hat you should be wearing when solving real world problems! kudos x 10000
Dude, this is by far one of the best real-life tutorials on YT. Subbed for more like this!
I was absolutely blown away by the fantastic lectures. The best teacher I've ever had!
34:34
Pro tip: go to command mode (press Esc) and press 'b' to make cells below current cell or 'a' to make cells above
Thanks for the tips! Love when people comment helpful stuff like this :). Just started using command mode to easily switch cells from code to markdown, will have to add these two commands to the arsenal as well!
k and j will move the focus to the cell above or below, and you can extend the selection with shift, then press 'm' to merge the highlighted cells. So shift+j then m will merge the current cell with the one below it. 'dd' will delete a cell too! (these bindings are very vim-like)
Think that's reversed. Use 'b' to make cells above and 'a' to make cells below.
Hey Keith , thank you so much for this video
concerning the 4th question 'What products are most often sold together?'
i had a similar approach and got the same ordering of grouped products when I counted the values using .value_counts(). However, the values themselves were different!
here is my approach
order_grouped = months_purchase.groupby('Order ID')

def concatenate_strings(x):
    return x.str.cat(sep=',')

products = pd.DataFrame(order_grouped['Product'].agg(concatenate_strings))
combined_items = products[products['Product'].str.contains(',')]
combined_items.value_counts().head(10)
23:39 that duplication was because of the header rows in each of the files. I've dealt with this a lot. You would have had to exclude those header rows from each file before concatenating them all together to resolve this.
Great video course man, thank you for making all of that content
I just did what he did and all I am getting is the header rows, what's the solution?
@@vertik3895 load the first df as normal and each subsequent df as pd.read_csv('file2.csv', skiprows=1, header=None, names=df.columns) before concat (with skiprows=1 alone, pandas would treat the first data row as the header)
@@vertik3895 The solution is call the method read_csv(..., header=None) for each iteration
As a beginner in Data science with Python, I find you as the best youtuber in this field.
Good Job!
1:09:28 you can use df = df.groupby('Order ID')['Product'].apply(','.join) instead of those three lines. Thanks for this video, it was great for me.
When passing a function to apply, you could have just passed the function name, there's no need to do apply(lambda x:get_city(x)). This is just enough and better => apply(get_city)
Came here to make sure someone said this! As long as the function you pass only takes a single argument. Otherwise lambda x: my_func(x, other_arg)
Thank you, there are tons of brilliant programmers on youtube but only a few programmers who are good communicators and teachers.
I love how this guy is explaining, I really enjoyed learning from you.
mate, you're a legend! not only did I learn matplotlib and pandas but now I know my Pokémon too, tip of the hat!
Your courses are very great as you delve into practical content. Your course helped me to pass data analysis test in Turing. Thank you so much
Congrats oludamire, I'm guessing you're a Nigerian. I'm a Nigerian too and recently got into Exploratory Data Analysis through the Udacity Nanodegree program. I'm currently on my second project, an investigation of the WeRateDogs Twitter dataset. I think I have learnt a thing or two so far. Do you think I'm ready for Turing? I hear it's like going to the big leagues lol.
Man, I really like your style. Firstly, because you take on real-world problems and not the primitive stuff some other bloggers use; secondly, because you encourage your viewers to search for solutions themselves; and thirdly, because you show how to find a solution to a given problem on the Internet. Please keep making similar videos! Best wishes and sincere appreciation from Ukraine.
50:47 cities = result.Sales.keys() works as expected. great tutorial, tks!
Best data analysis video I have watched so far! I also love how most people in the comment sections have outlined alternative ways of approaching some of the tasks.
This is the most informative video I've ever seen on what data science actually is! I keep looking for actual applications and I loved seeing your thought process, comments, and method of asking and answering questions.
The first time I let the ads on a youtube video, because I wanted to watch every second of it. Many thanks Keith, you' re just amazing !
I appreciate the kind words! Glad you enjoyed :)
I just finished your two videos demonstrating numpy and pandas, finally feeling a good grasp of python basics (y)
Thank you for everything you do!
Dude!! You are awesome at teaching data science. You make the world better
Thanks mclovin!
Hands down one of the most useful videos I've seen. Insights galore. Thank you!
i would greatly appreciate another simillar video with a new project with some newer formulas and features, maybe understanding heatmaps, creating more complex functions etc. Thanks again for this.
The best graph type for correlation is the scatter plot; it looks like a constellation. Great video Keith. Thanks.
This is the most practical Python tutorial video I've ever watched. Thanks for sharing!
You are awesome! Thanks for patiently explaining everything, also teaching how to google what you want! Thanks man!
The dataset contains January-data for both 2019 and 2020, so the grouping by month doesn't work because you only look at the month, not the year.
Stopgap solution: also slice the year off the date string
Proper solution: convert date string to an actual datetime, then groupby month with pd.Grouper.
I suggest putting a card or a note there so others aren't confused.
Thanks for the video though!
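A minimal sketch of that "proper solution" (the sample rows, column names, and date format are assumptions based on the video's dataset):

```python
import pandas as pd

# Hypothetical rows spanning January 2019 and January 2020
df = pd.DataFrame({
    "Order Date": ["01/15/19 10:00", "01/20/20 11:00", "02/05/19 09:30"],
    "Sales": [100.0, 200.0, 50.0],
})
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%m/%d/%y %H:%M")

# pd.Grouper(freq="M") buckets by month *and* year, so January 2019 and
# January 2020 stay in separate buckets instead of merging into one "month 1"
monthly = df.groupby(pd.Grouper(key="Order Date", freq="M"))["Sales"].sum()
print(int((monthly > 0).sum()))  # → 3 non-empty month buckets
```

Note that grouping on just the month number would have collapsed the two January rows into a single bucket.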
this video was amazing, I can't believe I actually sat through the whole thing past my bedtime
If you are a coder, there is no such thing as "bedtime". Just, awake, and not awake.
Really good job, this really shows us real daily problem solving. I'm sure most of us solve problems this way: googling, trial and error. I don't understand why in interviews they expect you to know everything about the language, algorithms and syntax.
so nice, I was searching for this kind of tutorial; it has real mistakes and solutions. I hope you make these kinds of videos regularly
i would give this guy a 10/10...truly understood everything
Hey Keith, you're great! Thanks to you we get introduced to a hell of a lot of useful pandas tools! Keep up the good work!
I even liked the name of the video. Straight to the point. I said "YESS IVE BEEN LOOKING FOR THIS" perfect. Thanks.
Great video. Just a few suggestions:
At 4:25, when using os.listdir('./'): this already returns a list, so using [file for file in os.listdir(...)] is redundant.
At 40:50 you don't need the lambda function, even if you want to access the cell content. If you simply pass a reference to the function, each cell value is passed to it as the argument. Example:
def modify(a):
    return 'CHANGED ' + a + ' CHANGED'

df['Column'].apply(modify)  # modify without parentheses is a reference to the function
could you please help: why am I getting a path error when I try to use os.listdir, but not when I open a specific file to read?
@@mahermonirify hello, I'm getting a path error too, can you please tell me how to resolve it?
Hi Keith, Even after three years, this video is very useful. You are very good at explaining the concepts. Thank you very much
Really interesting to go through the entire process, including looking up solutions and solving errors!
Honestly, one of the best videos I have seen. From mistakes, how to look for answers and little tips & tricks.
You have got new subscriber in me.
Great tutorial, I've learned a lot!
a suggestion for your first question about the best month for sales:
Instead of creating the extra 'Month' and 'Sales' columns, we can use the pandas resample method, which does the group-by-month for us; just like with groupby, we close it with sum and get the same table!
But here's the catch: Order Date is not a datetime type, so you have to convert it first.
all_data["Order Date"] = pd.to_datetime(all_data["Order Date"], format="%m/%d/%y %H:%M")
all_data.resample('M', on='Order Date').sum().sort_values(by='Price Each', ascending=False)
You are great Keith. You are doing it in a manner that most students can understand better.
I'm launching a data analytics bootcamp!
goto.masterschool.com/5wn3sw
Some highlights of the program:
- Fully remote (with flexible working hours)
- No tuition fees until after you land a job in tech
- Open to applicants anywhere in the world!
This is a 7-month long program kicking off in June. To learn more and get your application started, click the link above ⬆
cool, greetings to you from Lima/Perú
Very practical analysis on real data. As a beginner, I have to pause, research and learn each question separately.
You are always so passionate and enthusiastic even if there're errors haha :) Love your positive attitude! Look forward to more great videos!! :)
I get tense as hell..
he purposely introduced those errors for us to have real-life problem-solving experience :)
One of the easy simple and best ways to approach data analysis
This is my first time watching you, sir, and I'm already a sincere subscriber. while(True): watch, learn and grow under your guidance
You are Awesome
22:00 I think it's more reliable to parse the date column as a datetime type to avoid all these problems
pd.to_datetime did not work for me on this data. How did you use it? I get an error
@@stevejuso Really late reply, but just incase it helps someone.
You can tell the read_csv function to read a column as a date by passing in parse_dates=['col1', 'col2'] for any amount of columns.
You can tell it to use European format with dayfirst=True
And if you need a specific format you can use date_parser to give your own parser for a specific format.
So in my case it was:
df = pd.read_csv('filepath', parse_dates=[datecols], dayfirst=True) to get the cols I needed into European date format.
One key thing is that it converts the dates to pandas Timestamps. But they are interchangeable with Python datetimes almost all of the time. They can also be converted with an .apply(lambda x: x.to_pydatetime()) if you need.
All the errors that were driving nuts are resurfacing here and being handled nicely! Such a treat:)!
Keith: I am gonna snatch the first two digits and make it the month.
The data: Hold my NaNs !
i don't know, man. i think this is one of the very best channels on any platform (not only youtube)
great job Keith!, keep up with the walk-through-style tutorials, hands on is the best and even better when you have the feedback.
Man I don't know how to compliment you but, you teach, explain super well
I have learned python pandas from you, and have used it for other projects of mine
Keep up the good content, very valuable
I love how he freaks out whenever there is a small warning lol
you get those headers multiple times because when you concatenated the files from the different months, the headers from each file were also included in the concatenation!
Checking the length of the dataframe helps, instead of storing it in a csv file and verifying.
I feel like I struck gold with this video. It's helping me learn a lot quicker than online tutorials. Thank you!
50:00, use result.index as x values and x ticks.
yes that would be easier.
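A minimal sketch of that idea (the city names and sales figures below are made up; only the result.index trick comes from the comments above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, just for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-city sales totals, as they'd come out of a groupby
result = pd.DataFrame({"Sales": [500.0, 300.0]}, index=["Austin", "Boston"])

# result.index is already in the same order as the bars, so there is
# no need to rebuild the city list with .unique() and re-sort it
plt.bar(result.index, result["Sales"])
plt.xticks(result.index, rotation="vertical")
```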
Sooooo fantastic!!!
This is definitely the best Data Project video I've seen on RUclips!
this video was super interesting. I can certainly watch 10 more of these!
Best thing I have done today is finding this man. I was eating and chilling, saw the thumbnail of this video with pandas in the name, and thought I'd give it 5 minutes to see what he had to say. But believe me guys, I have already watched up to 1:02:57 of the video and it gets more interesting as it goes toward the end. Kudos to his technique ❤❤❤
This channel is the best thing I've encountered in a while. Thank you for helping the desperate ;-; Would do 5 likes if I could
This video is really good, not only the solutions but the process of getting to the solutions shown is what makes it so good...!
32:15 you created the months list to pass to plt.bar() out of thin air. In the current scenario our data comes out sorted by month, so no issue arises, but otherwise it would have plotted Sales against the wrong month. Instead I tried this, please let me know if I'm wrong about it:
all_data.groupby('Month')['Daily Sale'].sum().plot(kind='bar')
plt.show()
The groupby function sorts by month I think, so that will be 1 to 12, same as the new months variable:
Months = [month for month, df in All_Data.groupby('Month')]
Amazing video ! All the mistakes and the searching process make the beginners in data science realize that it's possible to do a lot of things since the start of the journey. Thanks
58:22 I heard that
LOOOOLLLL
I just wanted to see how DS works and after searching and watching a lot of videos this one is very understandable and very real one. Thanks author! Great job!
I like the way, with every mistake, you say "AAAAh what did I do" lol :D xD
hahaa it made me laugh because i do the exact same thing
Thanks for data science tutorials helps me alot in my labs couldn't have done without you
I definitely prefer to watch your tutorials instead of netflix...I love this format, thanks man 😊
At 51:16 you could have simply passed results['City'] as the x-axis argument. Thank you so much. Looking forward to more real-time analysis like this one. It was a really cool hands-on exercise.
I enjoyed working through this real world data analysis problem with you. I look forward to more, please do more problems like this. It helps me to work out problems in Python.
That tutorial was really helpful for getting a first grasp of DS applications... please make more such "real world DS solutions": airline data, travel data, company profit-and-strategy data, hotel service data, salary vs domain experience datasets, etc.