I think your solution might be skipping over some good built-in functionality:
1. Create a function which incorporates your str.contains() to either return the name, or return None.
- You can make this dynamic by using split(":") to split those values on the colon, rather than string slicing, which requires hard-coding the offsets.
2. Apply the function to create new columns for each of first/last/date, this will have NaN where the column didn't contain your string.
3. Now use df.fillna(method="ffill") on those new columns to forward fill the values (basically the same as your iteration, and avoids creating multiple DFs and merging)
4. Drop the rows which have NaN values in one of the other columns (e.g. Speed1)
5. Drop the rows which have column headers in the values (e.g. df["Speed1"] == "Speed1")
and you're done!
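Here's a minimal pandas sketch of the steps above, assuming the column names from the video ("Row Type", "Iter Number", "Power1", "Speed1"); the toy values are made up:

import numpy as np
import pandas as pd

# Toy stand-in for the raw export; column names follow the video, values are made up
df = pd.DataFrame({
    "Row Type": ["first name: John", "iter 1", "iter 2", "first name: Jane", "iter 1"],
    "Iter Number": ["last name: Doe", "1", "2", "last name: Roe", "1"],
    "Power1": ["date: 2021-01-01", "100", "90", "date: 2021-01-02", "95"],
    "Speed1": [np.nan, "10", "12", np.nan, "11"],
})

def extract(value, label):
    # Step 1: return the text after ": " when the cell carries the given label, else NaN
    if isinstance(value, str) and label in value:
        return value.split(": ", 1)[-1]
    return np.nan

# Step 2: one new column per field, NaN wherever the label wasn't found
df["First Name"] = df["Row Type"].apply(lambda v: extract(v, "first name"))
df["Last Name"] = df["Iter Number"].apply(lambda v: extract(v, "last name"))
df["Date"] = df["Power1"].apply(lambda v: extract(v, "date"))

# Step 3: forward fill the new columns down through each person's block
df[["First Name", "Last Name", "Date"]] = df[["First Name", "Last Name", "Date"]].ffill()

# Steps 4-5: drop the metadata rows and any repeated header rows
df = df.dropna(subset=["Speed1"])
df = df[df["Speed1"] != "Speed1"]
print(df)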
This is a very elegant solution!
Indeed, this avoids the loops and having to go through several sub-DataFrames.
Clever idea!!
This was very enlightening thank you so much for taking the time to write this out!
@Adam jones how about automated data cleanup in Python?
I must say, this channel is one of the best channels to learn how to think and act as a data analyst.
The way you are explaining and demonstrating is far better than most of the channels I've stumbled upon.
Definitely a subscriber from now on.
40-50 percent of my project bandwidth is usually consumed by data cleaning. This walkthrough is great for practice and helps with new approaches, keep them coming!!
You're providing a good platform to learn for those who cannot afford one, without any investment. Great job.
We absolutely need more real hand on projects like this one!! Thanks for the content!!
Finally, the first guy who actually shows what data scientists do, instead of just random English gibberish. Thank you
And again, these types of videos are GOLD to any data analyst. We can see how you do your work and share tips and tricks to deal with data. Thanks Shashank! I honestly learned a lot more :)
1. You're dropping records with a null value purely because they don't allow your code to run; that will make your reports incorrect if you validate against the source system.
2. For loops are incredibly slow.
3. Python is suitable for smaller datasets (100m rows, thereabouts?); for larger datasets you'll want to compress and physically sort the rows in a database first, process all these changes there, and pass the result into Python, usually for the modelling libraries. The largest datasets will need to be parallelised over worker nodes.
4. Notebooks are not commonly used in production due to UI resource overheads. For small datasets, just run the code as a .py. For larger datasets, submit a nohup job.
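On point 2, a hedged sketch you can run to see the loop-vs-vectorized gap yourself (the DataFrame and the doubling operation are made up):

import time
import pandas as pd

df = pd.DataFrame({"x": range(1_000_000)})

start = time.perf_counter()
total = 0
for value in df["x"]:                        # row-by-row Python loop
    total += value * 2
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
total_vectorized = int((df["x"] * 2).sum())  # vectorized: the work happens in C inside pandas/numpy
vector_seconds = time.perf_counter() - start

print(loop_seconds, vector_seconds)          # the vectorized version is typically orders of magnitude faster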
Love the detail in this comment! Thanks for the tips and insight
Can you explain what you mean by "physically sort the rows" ? Thanks!
@@YourMakingMeNervous I think what they meant was sorting with SQL queries, i.e. ORDER BY, or filtering by using WHERE "column name" = whatever
@@Bozk97 understood, thanks!
Alternative to for loops?
You mean just filter the large dataset using queries and then parallelise the output?
That's a very smooth way of cleaning the data. Thank you for bringing this out and explaining it in such a nice way. Request you to bring more videos in future.
Will do!
Incredible video, Shashank! Really appreciate these types of videos. Small projects really help because they are easy to tackle on the weekend, and really put into practice the practical skills needed as a data analyst.
2:37 in and I knew this guy was super cool! I watched the entire video; please do upload more data cleaning challenges like this.
Love seeing the ‘real’ data analyst work! It’s not all sexy ML and modelling. More like this please :)
You really understand freshers' pain. Students actually want to see real-life challenges, the kind we actually encounter and need to solve.
Man, I love your videos! One of the best channels related to data science/analytics on YT.
This was great. Really shows the process of breaking the big task down to bite size pieces.
I've looked for a channel like this for well over a month now. Thank you very much, you've gained a new sub 🤜🏼🤛🏼
1. Remove extra rows and empty rows.
df = df[(df['Row Type'] != 'Row Type') & (df['Row Type'].notnull())].reset_index(drop=True)
2. Create three new columns, setting the cell to null when it doesn't contain first name, etc. (the isinstance guard keeps NaN cells in the other columns from raising a TypeError).
import numpy as np
df['first_name'] = df['Row Type'].apply(lambda x: x.split(": ")[-1] if isinstance(x, str) and 'first name' in x else np.nan)
df['last_name'] = df['Iter Number'].apply(lambda x: x.split(": ")[-1] if isinstance(x, str) and 'last name' in x else np.nan)
df['date'] = df['Power1'].apply(lambda x: x.split(": ")[-1] if isinstance(x, str) and 'date' in x else np.nan)
3. Fill the three new columns with the values above them using ffill (restricted to those columns so other NaNs aren't touched).
df[['first_name', 'last_name', 'date']] = df[['first_name', 'last_name', 'date']].ffill()
4. Remove rows that have 'first name' in the 'Row Type' column, which removes all the extra rows.
df = df[~df['Row Type'].str.contains('first name')].reset_index(drop=True)
Thank you for the challenge..
I've done the transformations in PowerQuery (I'm sure the same can easily be done in Python) in a faster and I believe more efficient way.
1. Add a column called "First Name", the column will check if the first column in the dataset starts with "first name". If yes, then it'll copy the contents of the first column, otherwise it'll be blank.
2. Repeat Step 1 for columns "Last Name" & "Date", checking the second & third columns respectively.
3. The result will be 3 columns with First Name, Last Name & Date data. But it'll be mostly blanks.
4. Extract First Name, Last Name & Date data by getting whatever comes after the colon (": ").
5. Fill down the three new columns to populate the blank cells with First Name, Last Name & Date data.
6. Filter the data to exclude any rows that contain blanks or the headers.
The advantage of this approach is that it doesn't require any merge/join which can get exponentially expensive as the data grows.
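For anyone following along in pandas instead of Power Query, steps 4 and 5 map roughly onto two lines like these (a sketch with made-up data):

import pandas as pd

df = pd.DataFrame({"First Name": ["first name: John", None, "first name: Jane", None]})
df["First Name"] = df["First Name"].str.split(": ", n=1).str[1]  # step 4: keep what follows the colon
df["First Name"] = df["First Name"].ffill()                      # step 5: fill down into the blank rows
print(df)  # John, John, Jane, Jane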
Please do bring videos like these more often or at least throw some mini-challenges like these and also would be awesome if you could show such data manipulation and cleaning on excel too!!! Thanks for such amazing videos!
Getting into programming for data science this year, and this is the first time I've seen the inverse operator (~) used in pandas. Thanks for the video.
Very organic and high quality video. Someone found their path in life from this content. Great job sir.
Another way of creating the "iteration" column, using standard (vectorized) pandas (+ range) instead of for loop and if-statement. Note that this also requires NaN/None values being dropped before running.
is_new_person = df['Row Type'].str.contains('first name')               # True on every "first name:" row
df.loc[is_new_person, 'person_id'] = range(df[is_new_person].shape[0])  # number those rows 0, 1, 2, ...
df['person_id'].fillna(method='ffill', inplace=True)                    # carry each id down through its block
10:01 I didn't understand how the iterations got created; in my case, when I create a new column called iterations, it says "Length of values (5994) does not match length of index (58397)". Overall a great learning experience.
I kind of understand the logic of how you did all this, but I think I am very new to all these commands and their specific functionality. I will be watching all your Python videos on how to clean data to get the hang of this.
Just graduated with an MIS degree and I am pursuing a Data Analyst position. Your videos give me a grasp of what this role is all about! Thank you!
Great video bro! Keep it up! I'm just starting to learn Python, and watching your videos is a really cool way to see what I can still learn.
I haven't gone much in depth with python yet, but watching this video sure inspires me to push forward into learning more python for data analytics! Great video!
This is gold, Please have Series on this.
Great video Shashank! I was able to restructure this data very quickly just using a few excel formulas but I definitely see how this would run into issues with larger datasets
Love to hear it! Thanks for trying the exercise!
Sir, as someone who is new to python, this one video saves HOURS of googling. This channel is going to be huge if you keep putting out videos like this. Please keep them coming and I am sure you will see the views and subscribers!
One suggestion to make this video even better -- paste some of the key lines of code into the description with a short comment on each. It should take only a couple of minutes, and I, at least, am consistently practicing by watching the video and typing out many of the lines.
@@jamesm2892 I agree. Perhaps even sharing a notebook with the comments would be amazing for us.
I love your channel and your tutorials are incredible! gives great insights into what kind of tasks I can expect in a data analyst role, would love to have more such challenges that are related to the work you do at your workplace thanks a lot! :)
counter++ is not possible in Python: integers are immutable objects, not raw memory you can increment, and pointer arithmetic is not allowed either. Use counter += 1 instead.
Excel can do it in 3 minutes.
Fantastic video. Really like how you think out loud and explain it clearly. Subscribed and liked!
Nice video...
I actually used a different approach for this case: pandas fillna with the ffill (forward fill) method.
Same idea as what you did: first I cleaned every row containing column headers or all NAs, then removed the extra columns.
After that, I copied the first three columns, but only on rows that contain "first name:" in the first column. That gives three new columns containing first name, last name and date, with NAs on all other rows.
Then I used fillna with the ffill method to fill every NA row of those three columns, so every NA gets filled from the previous non-NA row (the first name, last name and date).
And that's it; after that I only needed to remove the rows containing "first name:" in the first column.
Love this method! If you don’t mind me asking, about how many lines of code did this take?
@@ShashankData
3 lines for cleaning the extra columns, NA rows and header-name rows
3 lines for copying the first name, last name and date values into new columns (I sliced the str values in these lines too)
3 lines for the ffill method, one per new column
1 line for removing the rows with "first name:" values
1 line for ordering the columns so that the result looks the same as yours
I added more lines at the end because my result didn't have the "Iteration" column, but I realized I don't need it for my approach since it's only used as a merging key. A rough reconstruction is sketched below.
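Here's my hedged guess at what those ~11 lines might look like in pandas (not the commenter's actual code; "data.csv" is a hypothetical file name and the column names are taken from the video):

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file name, loaded as in the video
# 3 lines: drop all-NA columns, all-NA rows, and repeated header rows
df = df.dropna(axis=1, how="all")
df = df.dropna(how="all")
df = df[df["Row Type"] != "Row Type"]
# 3 lines: copy first/last/date into new columns, keeping only what follows ": "
mask = df["Row Type"].str.contains("first name", na=False)
df.loc[mask, "First Name"] = df.loc[mask, "Row Type"].str.split(": ", n=1).str[1]
df.loc[mask, "Last Name"] = df.loc[mask, "Iter Number"].str.split(": ", n=1).str[1]
df.loc[mask, "Date"] = df.loc[mask, "Power1"].str.split(": ", n=1).str[1]
# 3 lines: forward fill each new column down through its block
df["First Name"] = df["First Name"].ffill()
df["Last Name"] = df["Last Name"].ffill()
df["Date"] = df["Date"].ffill()
# 1 line: drop the "first name:" marker rows
df = df[~mask]
# 1 line: reorder the columns
df = df[["First Name", "Last Name", "Date"] + [c for c in df.columns if c not in ("First Name", "Last Name", "Date")]]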
have you tried using the recycle bin?
Your tutorials are really great!
When we merge, it will merge on each iteration, right? In that case, won't it increase the data by merging all the repeated values? Kindly let me know if I am wrong.
This is a little late, but here's the function in R for steps 1:3 using the pinned comment solution:
fill
These walkthroughs are super important! Super helpful. Great content!
😂That iteration column rings a bell, I use such all the time; dummy_col_1, dummy_col_2.......dummy_col_n . Glad to know I'm not the only one.😂😂
It ain't sexy, it keeps us busy! Would love more of data cleaning projects. TQVM!!
This is a very easy task; you could use a VBA macro (since you're using MS Excel).
Each block of data has almost the same format (some have more rows for 'Iter'), so there isn't much high-level logic involved.
1) Track the first and last row as well as the columns (begin and end)
2) Loop through the entire dataset and consolidate it into the desired output format (perhaps in a different sheet)
Done.
Hey man great video. Really enjoyed this style of walkthrough! Keep it coming!
You have solved so many of my problems - you have no idea.
Thanks a lot for sharing real data analyst work. Do you use Tableau Prep for data preparation? And how much is it used in the industry?
Awesome video mate. Big shout out from Brazil
Hey Shashank, great video; it really felt like a good example of a typical real-world cleaning exercise. One thing I like to do instead of string slicing, in your case, would be to search for the ":" character and take only what comes after it. Either way accomplishes the same thing, nice work!
Hey Shashank, I had a job interview recently and got some data to work on. I thought I was well prepared in terms of working with pandas, Excel etc., but the problem I had to work on was something new for me and, in reality, quite a common example.
The task was just dates, sales value, quantity, ID and category for two groups, test and control. Based on this I had to calculate ROI and the overall effectiveness of a marketing campaign, including its cost, knowing when the campaign started.
It would be really nice if you prepared a video on something similar and taught us a little about the financial metrics commonly used by data analysts.
Could you find a way to do it in R tidyverse? It is a very good video. More please.
Sir, how about automated data cleansing in Python?
So I tried solving this problem by myself, and now I am watching your video. You dropped an extra column: the "Unnamed: 9" column actually has values if you dig deep. It is a "Notes" column. I hope your client didn't have some important notes in there, Shashank ;)
Regardless, good solution!
Great video, but one thing you do makes me really nervous. You drop entire columns because they appear to be empty. What if they're not?
It seems safer to use df.dropna(how='all', axis=1); that deletes an entire column only if it truly is NaN all the way down.
And of course, run it once before adding inplace=True.
Thanks for making this video. As a fairly new data analyst and pandas enthusiast following along with you gave me that "hey, I really do know what I'm doing!" feeling. : ]
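A tiny demonstration of the difference, with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [np.nan, np.nan], "c": [np.nan, 3]})
print(df.dropna(how="all", axis=1))  # drops only "b" (NaN all the way down); "c" survives because it has one value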
Great call out! I'm going to add your method to my workflow; I never knew about that argument in the dropna method.
Thanks Shashank this was really useful. Keep it up!
Sir please bring more of these hands on data cleaning videos.
Bro, your videos are extremly valuable. Thanks for sharing your knowledge.
This was fun. Bring more challenging videos and keep up the good work.
Subscribed! Your videos are just beyond perfect, and have helped me a lot. I'm about to quit my current job and start putting time into learning. I really enjoy data analysis overall, and hope I can land my next job in this area. Again, thank you!
That’s amazing to hear Jin! Stay tuned for more videos and good luck!
Can someone please explain how he removed the extra row at 12:30? I tried on my own, but I'm not able to understand it.
How about cleaning memory on an iPhone? Seems like the cloud is getting expensive.
What do you do if you can't find a solution to these problems?
Stack Overflow is your best friend😃
First, sorry for my poor English.
Since I've worked with pandas in Python for a while: if your CSV has the same number of lines for each "set", I would use the chunking option in pandas.
When you read X rows at a time, you can simply skip X rows, append the data to an array, and when you reach the end of the file simply create a DataFrame from the array, give your columns their header names, and it's all done.
About using str[n:]: if the data is all the same, I use replace instead. Other than that, good job :D
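A rough sketch of the chunked-read idea, assuming pandas' chunksize option ("data.csv" is a hypothetical file name):

import pandas as pd

chunks = []
# read the CSV in fixed-size pieces instead of all at once
for chunk in pd.read_csv("data.csv", chunksize=10_000):
    chunks.append(chunk)  # in practice: filter/transform each chunk here before keeping it
df = pd.concat(chunks, ignore_index=True)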
Great video, really enjoyed watching it.
Thanks for this type of video ❤❤, make more such videos
That's awesome. Can you make more such videos on data transformation and cleaning using PySpark or Python, with more functionality? Thanks.
Can't I just clean it using Power Query in Excel? Just asking to know if it's possible. Thanks.
I have a doubt: while merging the data based on iteration, both DataFrames we are merging have many repetitions of each iteration value (many 1s, 2s, 3s).
Why not in Power Query?
Hi Shashank,
@12:08 is where you're deleting one of the rows that contain unnecessary headers. I'm just curious why you don't use the DataFrame.drop() function. I don't quite understand the statement iter_cols[iter_cols["Row Type"] != "Row Type"] and how it removes that specific row. Is iter_cols[] a function? Or something else?
Thanks in advance!
Hey Swivel, thank you for watching my video! So, the .drop method would work just fine. What I'm doing is filtering the iter_cols DataFrame on the iter_cols["Row Type"] column, keeping only the rows where the value is not "Row Type".
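For anyone else wondering: iter_cols[...] isn't a function, it's boolean indexing. A small made-up example:

import pandas as pd

iter_cols = pd.DataFrame({"Row Type": ["Row Type", "iter 1", "iter 2"]})
mask = iter_cols["Row Type"] != "Row Type"  # a Series of True/False, one per row
print(iter_cols[mask])                      # indexing with the mask keeps only the True rows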
It was an awesome session, Shashank.
Also, can you mention some portals/sites to practice this kind of challenge?
Man your videos are so helpful and practical!
How could I do the same in R?
Hey guys, please tell me where I can get the raw data.
Hey Shas! Thank you for this awesome video!
Great video! While I love pandas, for this particular example I think there are more optimal ways of doing the cleaning, especially if the data is huge. The data has a set format, repeating per first name, last name and date. For a lot of data, i.e. 1 billion+ rows, here's how I would approach it:
Preprocess one level of the CSV with bash constructs; bash can process this far faster. Resume the rest, i.e. bringing in a tabular structure, column nomenclature etc., in Python.
OR
Read the data in Python using the file.readlines construct rather than pandas. You can then call custom functions in conjunction with list comprehensions, and finally convert the results to a pandas df. This way we keep all memory-intensive operations outside pandas. I know pandas is optimized to a certain extent here, but for a lot of data pandas still fails.
OR
If one is using pandas, then based on the limits of the numerical fields one can change to an appropriate dtype. For example, a number ranging from 0 to 5 needn't be stored as int64, and strings with a set number of categories needn't be stored as object rather than category. More on this here:
vincentteyssier.medium.com/optimizing-the-size-of-a-pandas-dataframe-for-low-memory-environment-5f07db3d72e
OR
Read the data using a distributed framework like Spark
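A quick sketch of the dtype idea from the third option (column names and values are made up):

import pandas as pd

df = pd.DataFrame({"rating": [0, 3, 5, 1], "city": ["NY", "LA", "NY", "LA"]})
df["rating"] = df["rating"].astype("int8")   # 0-5 fits in one byte instead of int64's eight
df["city"] = df["city"].astype("category")   # repeated strings stored once, rows become small integer codes
print(df.memory_usage(deep=True))            # compare against the original dtypes to see the savings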
Is the Excel file uploaded? Can't get it from the Drive
This was one of the first of your videos that I watched, and I didn't have much knowledge of SQL. However, after studying SQL for a bit, this video makes a lot more sense.
Your autocompletes are so fast! What language server are you using? Or why is it so fast? Mine takes one or two seconds to show.
You and Keith Galli and Alex Freiberg are the only YouTubers who actually deal with real-world problems...
Thanks, You are doing absolutely amazing work.
Great video. If you want to see your results in Excel and analyze them, you just have to create a new DataFrame with the merge result and then use final_dataframe.to_excel('Data.xlsx', index=False). index=False keeps the default index out of the export. If you want to convert to CSV, use final_dataframe.to_csv('Datos.csv', index=False). You can also add the argument sep=";" to separate by semicolons instead of commas.
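Cleaned up as a runnable snippet (the DataFrame here is a stand-in for the merge result):

import pandas as pd

final_dataframe = pd.DataFrame({"a": [1, 2]})              # stand-in for the merge result
final_dataframe.to_excel("Data.xlsx", index=False)          # index=False drops the default index; needs openpyxl installed
final_dataframe.to_csv("Datos.csv", index=False, sep=";")   # sep=";" writes semicolon-separated values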
Thanks for these hands-on data challenges. One comment, or maybe a question: we could have also used the split function to get the desired values in name_dataframe, right?
Something like below? Is the split approach more time-consuming, and is that why you preferred to slice based on the number of characters?
name_dataframe = name_dataframe.loc[:, ["Row Type", "Iter Number", "Power1", "Iteration"]]
name_dataframe.rename(columns={"Row Type": "First Name", "Iter Number": "Last Name", "Power1": "Date"}, inplace=True)
#name_dataframe["First Name"] = name_dataframe["First Name"].str[12:]
name_dataframe["First Name"] = name_dataframe["First Name"].str.split(": ", n=1).str[1]
#name_dataframe["Last Name"] = name_dataframe["Last Name"].str[10:]
name_dataframe["Last Name"] = name_dataframe["Last Name"].str.split(": ", n=1).str[1]
#name_dataframe["Date"] = name_dataframe["Date"].str[5:]
name_dataframe["Date"] = name_dataframe["Date"].str.split(": ", n=1).str[1]
name_dataframe
Top notch content 👌
Hi, which IDE are you using?
Please bring more videos like this.. thanks for this video & your efforts for us .. ❤️
One more question: what happens if somebody updates the original CSV file? Do the Python code's outputs and the final CSV file get updated as well? Is it all automated?
I can do this with PHP
I'm more of a traditional software engineer but I did dip my toes into data science/analytics at university so this is an interesting insight.
Nice! Hope to see you in future videos!
Great challenge! I managed with a similar approach in R
I'm on a Mac and do not have Power Query in my version of Excel. Sure, I could do some VBA; however, this (Python) has considerably more power than that.
Thank you Shashank (Subscribed, Liked, here is my comment, and I set notifications to All; next is Patreon)
Cheers,
Dane
Did you keep the Average, Maximum, Std. Dev. and Totals rows/values?
Why do you ignore SettingWithCopyWarning?
Thanks for this content !!😍
Really interesting watch. Would have done it a little differently, it’s great to see someone else’s interpretation. Would you use something like pyodbc to load the data frame into sql? What are some of the benefits of loading it into sql from here? I guess it would depend on the use case? Would be awesome to see a whole project that implements that. Subscribed! A note: when you only need a few columns, I love just setting a new data frame with those columns instead of using drop: new_df =og_df[[“columns”, “we”, “want”]]
Hey this was super insightful! please make more videos 😁
COOL Shashank !
Just great! Thank you very much!
Awesome. Thanks so much
I managed to do it using pandas methods alone, without the 'for' loop, which I suspect makes things quicker.
How long did it take you to learn the things you do in the day-in-the-life videos? And are you a senior analyst?
I'm a senior analyst. It takes a while to pick up the skills at a basic level, but less time to pick them up as you go on.
Nice vid. A small tip:
you can say:
for i, row in enumerate(data):
    pass
so that enumerate supplies the index and you can drop the manual counter.
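A tiny illustration with a made-up list:

items = ["alpha", "beta", "gamma"]
for i, item in enumerate(items):
    print(i, item)  # prints 0 alpha, 1 beta, 2 gamma with no manual i += 1 bookkeeping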