I think your solution might be skipping over some good built-in functionality:
1. Create a function which incorporates your str.contains() to either return the name, or return None.
- You can make this dynamic by using split(":") to split those values on the colon, rather than string slicing, which requires hard-coding the offsets.
2. Apply the function to create new columns for each of first/last/date, this will have NaN where the column didn't contain your string.
3. Now use df.fillna(method="ffill") on those new columns to forward fill the values (basically the same as your iteration, and avoids creating multiple DFs and merging)
4. Drop the rows which have NaN values in one of the other columns (e.g. Speed1)
5. Drop the rows which have column headers in the values (e.g. df["Speed1"] == "Speed1")
and you're done!
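Here's a minimal pandas sketch of the steps above, assuming the column names from the video ("Row Type", "Iter Number", "Power1", "Speed1"); the toy values are made up:

import numpy as np
import pandas as pd

# Toy stand-in for the raw export; column names follow the video, values are made up
df = pd.DataFrame({
    "Row Type": ["first name: John", "iter 1", "iter 2", "first name: Jane", "iter 1"],
    "Iter Number": ["last name: Doe", "1", "2", "last name: Roe", "1"],
    "Power1": ["date: 2021-01-01", "100", "90", "date: 2021-01-02", "95"],
    "Speed1": [np.nan, "10", "12", np.nan, "11"],
})

def extract(value, label):
    # Step 1: return the text after ": " when the cell carries the given label, else NaN
    if isinstance(value, str) and label in value:
        return value.split(": ", 1)[-1]
    return np.nan

# Step 2: one new column per field, NaN wherever the label wasn't found
df["First Name"] = df["Row Type"].apply(lambda v: extract(v, "first name"))
df["Last Name"] = df["Iter Number"].apply(lambda v: extract(v, "last name"))
df["Date"] = df["Power1"].apply(lambda v: extract(v, "date"))

# Step 3: forward fill the new columns down through each person's block
df[["First Name", "Last Name", "Date"]] = df[["First Name", "Last Name", "Date"]].ffill()

# Steps 4-5: drop the metadata rows and any repeated header rows
df = df.dropna(subset=["Speed1"])
df = df[df["Speed1"] != "Speed1"]
print(df)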
This is a very elegant solution!
Indeed, this avoids the loops and having to go through several sub-DataFrames.
Clever idea!!
This was very enlightening thank you so much for taking the time to write this out!
@Adam jones how about automated data cleanup in Python?
I must say, this channel is one of the best channels to learn how to think and act as a data analyst.
The way you are explaining and demonstrating is far better than most of the channels I've stumbled upon.
Definitely a subscriber from now on.
40-50 percent of my project bandwidth is usually consumed by data cleaning. This walkthrough is great for practice and helps with new approaches, keep them coming!!
You're providing a good platform to learn for those who cannot afford one, without any investment. Great job.
We absolutely need more real hand on projects like this one!! Thanks for the content!!
Finally, the first guy who actually shows what data scientists do, instead of just random English gibberish. Thank you
And again, these types of videos are GOLD to any data analyst. We can see how you do your work and share tips and tricks to deal with data. Thanks Shashank! I honestly learned a lot more :)
1. You're dropping records with a null value purely because they don't allow your code to run; that will make your reports incorrect if you validate against the source system.
2. For loops are incredibly slow.
3. Python is suitable for smaller datasets (100m rows, thereabouts?); for larger datasets you'll want to compress and physically sort the rows in a database first, process all these changes there, and pass the result into Python, usually for the modelling libraries. The largest datasets will need to be parallelised over worker nodes.
4. Notebooks are not commonly used in production due to UI resource overheads. For small datasets, just run the code as a .py. For larger datasets, submit a nohup job.
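On point 2, a hedged sketch you can run to see the loop-vs-vectorized gap yourself (the DataFrame and the doubling operation are made up):

import time
import pandas as pd

df = pd.DataFrame({"x": range(1_000_000)})

start = time.perf_counter()
total = 0
for value in df["x"]:                        # row-by-row Python loop
    total += value * 2
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
total_vectorized = int((df["x"] * 2).sum())  # vectorized: the work happens in C inside pandas/numpy
vector_seconds = time.perf_counter() - start

print(loop_seconds, vector_seconds)          # the vectorized version is typically orders of magnitude faster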
Love the detail in this comment! Thanks for the tips and insight
Can you explain what you mean by "physically sort the rows" ? Thanks!
@@YourMakingMeNervous I think what they meant was sorting with SQL queries, i.e. ORDER BY, or filtering by using WHERE "column name" = whatever
@@Bozk97 understood, thanks!
Alternative to for loops?
You mean just filter the large dataset using queries and then parallelise the output?
That's a very smooth way of cleaning the data. Thank you for bringing this out and explaining it in such a nice way. Request you to bring more videos in future.
Will do!
Incredible video, Shashank! Really appreciate these types of videos. Small projects really help because they are easy to tackle on the weekend, and really put into practice the practical skills needed as a data analyst.
2:37 in and I knew this guy was super cool! I watched the entire video; please do upload more data cleaning challenges like this.
Love seeing the ‘real’ data analyst work! It’s not all sexy ML and modelling. More like this please :)
You really understand freshers' pain. Students actually want to see real-life challenges, the kind we actually encounter and need to solve.
Man, I love your videos! One of the best channels related to data science/analytics on YT.
This was great. Really shows the process of breaking the big task down to bite size pieces.
I've looked for a channel like this for well over a month now. Thank you very much, you've gained a new sub 🤜🏼🤛🏼
1. Remove extra rows and empty rows.
df = df[(df['Row Type'] != 'Row Type') & (df['Row Type'].notnull())].reset_index(drop=True)
2. Create three new columns, setting the cell to null when it doesn't contain first name, etc. (the isinstance guard keeps NaN cells in the other columns from raising a TypeError).
import numpy as np
df['first_name'] = df['Row Type'].apply(lambda x: x.split(": ")[-1] if isinstance(x, str) and 'first name' in x else np.nan)
df['last_name'] = df['Iter Number'].apply(lambda x: x.split(": ")[-1] if isinstance(x, str) and 'last name' in x else np.nan)
df['date'] = df['Power1'].apply(lambda x: x.split(": ")[-1] if isinstance(x, str) and 'date' in x else np.nan)
3. Fill the three new columns with the values above them using ffill (restricted to those columns so other NaNs aren't touched).
df[['first_name', 'last_name', 'date']] = df[['first_name', 'last_name', 'date']].ffill()
4. Remove rows that have 'first name' in the 'Row Type' column, which removes all the extra rows.
df = df[~df['Row Type'].str.contains('first name')].reset_index(drop=True)
Thank you for the challenge..
I've done the transformations in PowerQuery (I'm sure the same can easily be done in Python) in a faster and I believe more efficient way.
1. Add a column called "First Name", the column will check if the first column in the dataset starts with "first name". If yes, then it'll copy the contents of the first column, otherwise it'll be blank.
2. Repeat Step 1 for columns "Last Name" & "Date", checking the second & third columns respectively.
3. The result will be 3 columns with First Name, Last Name & Date data. But it'll be mostly blanks.
4. Extract First Name, Last Name & Date data by getting whatever comes after the colon (": ").
5. Fill down the three new columns to populate the blank cells with First Name, Last Name & Date data.
6. Filter the data to exclude any rows that contain blanks or the headers.
The advantage of this approach is that it doesn't require any merge/join which can get exponentially expensive as the data grows.
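For anyone following along in pandas instead of Power Query, steps 4 and 5 map roughly onto two lines like these (a sketch with made-up data):

import pandas as pd

df = pd.DataFrame({"First Name": ["first name: John", None, "first name: Jane", None]})
df["First Name"] = df["First Name"].str.split(": ", n=1).str[1]  # step 4: keep what follows the colon
df["First Name"] = df["First Name"].ffill()                      # step 5: fill down into the blank rows
print(df)  # John, John, Jane, Jane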
Please do bring videos like these more often or at least throw some mini-challenges like these and also would be awesome if you could show such data manipulation and cleaning on excel too!!! Thanks for such amazing videos!
Getting into programming for data science this year, and this is the first time I've seen the inverse operator (~) used in pandas. Thanks for the video.
Very organic and high quality video. Someone found their path in life from this content. Great job sir.
Another way of creating the "iteration" column, using standard (vectorized) pandas (+ range) instead of for loop and if-statement. Note that this also requires NaN/None values being dropped before running.
is_new_person = df['Row Type'].str.contains('first name')               # True on every "first name:" row
df.loc[is_new_person, 'person_id'] = range(df[is_new_person].shape[0])  # number those rows 0, 1, 2, ...
df['person_id'].fillna(method='ffill', inplace=True)                    # carry each id down through its block
10:01 I didn't understand how the iterations got created; in my case, when I create a new column called iterations, it says "Length of values (5994) does not match length of index (58397)". Overall a great learning experience.
I kind of understand the logic of how you did all this, but I think I am very new to all these commands and their specific functionality. I will be watching all your Python videos on how to clean data to get the hang of this.
Just graduated with an MIS degree and I am pursuing a Data Analyst position. Your videos give me a grasp of what this role is all about! Thank you!
Great video bro! Keep it up! I'm just starting to learn Python, and watching your videos is a really cool way to see what I can still learn.
I haven't gone much in depth with python yet, but watching this video sure inspires me to push forward into learning more python for data analytics! Great video!
This is gold, Please have Series on this.
Great video Shashank! I was able to restructure this data very quickly just using a few excel formulas but I definitely see how this would run into issues with larger datasets
Love to hear it! Thanks for trying the exercise!
Sir, as someone who is new to python, this one video saves HOURS of googling. This channel is going to be huge if you keep putting out videos like this. Please keep them coming and I am sure you will see the views and subscribers!
One suggestion to make this video even better -- paste some of the key lines of code into the description with a short comment on each. It should take only a couple of minutes, and I, at least, am consistently practicing by watching the video and typing out many of the lines.
@@jamesm2892 I agree. Perhaps even sharing a notebook with the comments would be amazing for us.
I love your channel and your tutorials are incredible! gives great insights into what kind of tasks I can expect in a data analyst role, would love to have more such challenges that are related to the work you do at your workplace thanks a lot! :)
counter++ is not possible in Python: integers are immutable objects, not raw memory you can increment, and pointer arithmetic is not allowed either. Use counter += 1 instead.
Excel can do it in 3 minutes.
Fantastic video. Really like how you think out loud and explain it clearly. Subscribed and liked!
Nice video...
I actually used a different approach for this case: pandas fillna with the ffill (forward fill) method.
Same idea as what you did: first I cleaned every row containing column headers or all NAs, then removed the extra columns.
After that, I copied the first three columns, but only on rows that contain "first name:" in the first column. That gives three new columns containing first name, last name and date, with NAs on all other rows.
Then I used fillna with the ffill method to fill every NA row of those three columns, so every NA gets filled from the previous non-NA row (the first name, last name and date).
And that's it; after that I only needed to remove the rows containing "first name:" in the first column.
Love this method! If you don’t mind me asking, about how many lines of code did this take?
@@ShashankData
3 lines for cleaning the extra columns, NA rows and header-name rows
3 lines for copying the first name, last name and date values into new columns (I sliced the str values in these lines too)
3 lines for the ffill method, one per new column
1 line for removing the rows with "first name:" values
1 line for ordering the columns so that the result looks the same as yours
I added more lines at the end because my result didn't have the "Iteration" column, but I realized I don't need it for my approach since it's only used as a merging key. A rough reconstruction is sketched below.
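Here's my hedged guess at what those ~11 lines might look like in pandas (not the commenter's actual code; "data.csv" is a hypothetical file name and the column names are taken from the video):

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file name, loaded as in the video
# 3 lines: drop all-NA columns, all-NA rows, and repeated header rows
df = df.dropna(axis=1, how="all")
df = df.dropna(how="all")
df = df[df["Row Type"] != "Row Type"]
# 3 lines: copy first/last/date into new columns, keeping only what follows ": "
mask = df["Row Type"].str.contains("first name", na=False)
df.loc[mask, "First Name"] = df.loc[mask, "Row Type"].str.split(": ", n=1).str[1]
df.loc[mask, "Last Name"] = df.loc[mask, "Iter Number"].str.split(": ", n=1).str[1]
df.loc[mask, "Date"] = df.loc[mask, "Power1"].str.split(": ", n=1).str[1]
# 3 lines: forward fill each new column down through its block
df["First Name"] = df["First Name"].ffill()
df["Last Name"] = df["Last Name"].ffill()
df["Date"] = df["Date"].ffill()
# 1 line: drop the "first name:" marker rows
df = df[~mask]
# 1 line: reorder the columns
df = df[["First Name", "Last Name", "Date"] + [c for c in df.columns if c not in ("First Name", "Last Name", "Date")]]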
have you tried using the recycle bin?
Your tutorials are really great!
When we merge, it will merge on each iteration, right? In that case, won't it increase the data by merging all the repeated values? Kindly let me know if I am wrong.
This is a little late, but here's the function in R for steps 1:3 using the pinned comment solution:
fill
These walkthroughs are super important! Super helpful. Great content!
😂That iteration column rings a bell, I use such all the time; dummy_col_1, dummy_col_2.......dummy_col_n . Glad to know I'm not the only one.😂😂
It ain't sexy, it keeps us busy! Would love more of data cleaning projects. TQVM!!
This is a very easy task; you could use a VBA macro (since you're using MS Excel).
Each block of data has almost the same format (some have more rows for 'Iter'), so there isn't much high-level logic involved.
1) Track the first and last row as well as the columns (begin and end)
2) Loop through the entire dataset and consolidate it into the desired output format (perhaps in a different sheet)
Done.
Hey man great video. Really enjoyed this style of walkthrough! Keep it coming!
You have solved so many of my problems - you have no idea.
Thanks a lot for sharing real data analyst work. Do you use Tableau Prep for data preparation? And how much is it used in the industry?
Awesome video mate. Big shout out from Brazil
Hey Shashank, great video; it really felt like a good example of a typical real-world cleaning exercise. One thing I like to do instead of string slicing, in your case, would be to search for the ":" character and take only what comes after it. Either way accomplishes the same thing, nice work!
Hey Shashank, I had a job interview recently and got some data to work on. I thought I was well prepared in terms of working with pandas, Excel etc., but the problem I had to work on was something new for me and, in reality, quite a common example.
The task was just dates, sales value, quantity, ID and category for two groups, test and control. Based on this I had to calculate ROI and the overall effectiveness of a marketing campaign, including its cost, knowing when the campaign started.
It would be really nice if you prepared a video on something similar and taught us a little about the financial metrics commonly used by data analysts.
Could you find a way to do it in R tidyverse? It is a very good video. More please.
Sir, how about automated data cleansing in Python?
So I tried solving this problem by myself, and now I am watching your video. You dropped an extra column: the "Unnamed: 9" column actually has values if you dig deep. It is a "Notes" column. I hope your client didn't have some important notes in there, Shashank ;)
Regardless, good solution!
Great video, but one thing you do makes me really nervous. You drop entire columns because they appear to be empty. What if they're not?
It seems safer to use df.dropna(how='all', axis=1); that deletes an entire column only if it truly is NaN all the way down.
And of course, run it once before adding inplace=True.
Thanks for making this video. As a fairly new data analyst and pandas enthusiast following along with you gave me that "hey, I really do know what I'm doing!" feeling. : ]
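A tiny demonstration of the difference, with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [np.nan, np.nan], "c": [np.nan, 3]})
print(df.dropna(how="all", axis=1))  # drops only "b" (NaN all the way down); "c" survives because it has one value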
Great call out! I'm going to add your method to my workflow; I never knew about that argument in the dropna method.
Thanks Shashank this was really useful. Keep it up!
Sir please bring more of these hands on data cleaning videos.
Bro, your videos are extremly valuable. Thanks for sharing your knowledge.
This was fun. Bring more challenging videos and keep up the good work.
Subscribed! Your videos are just beyond perfect, and have helped me a lot. I'm about to quit my current job and start putting time into learning. I really enjoy data analysis overall, and hope I can land my next job in this area. Again, thank you!
That’s amazing to hear Jin! Stay tuned for more videos and good luck!
Can someone please explain how he removed the extra row at 12:30? I tried on my own, but I'm not able to understand it.
How about cleaning memory on an iPhone? Seems like the cloud is getting expensive.
What do you do if you can't find a solution to these problems?
Stack Overflow is your best friend😃
First, sorry for my poor English.
Since I've worked with pandas in Python for a while: if your CSV has the same number of lines for each "set", I would use the chunking option in pandas.
When you read X rows at a time, you can simply skip X rows, append the data to an array, and when you reach the end of the file simply create a DataFrame from the array, give your columns their header names, and it's all done.
About using str[n:]: if the data is all the same, I use replace instead. Other than that, good job :D
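A rough sketch of the chunked-read idea, assuming pandas' chunksize option ("data.csv" is a hypothetical file name):

import pandas as pd

chunks = []
# read the CSV in fixed-size pieces instead of all at once
for chunk in pd.read_csv("data.csv", chunksize=10_000):
    chunks.append(chunk)  # in practice: filter/transform each chunk here before keeping it
df = pd.concat(chunks, ignore_index=True)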
Great video, really enjoyed watching it.
Thanks for this type of video ❤❤, make more such videos
That's awesome. Can you make more such videos on data transformation and cleaning using PySpark or Python, with more functionality? Thanks.
Can't I just clean it using Power Query in Excel? Just asking to know if it's possible. Thanks.
I have a doubt: while merging the data based on iteration, both DataFrames we are merging have many repetitions of each iteration value (many 1s, 2s, 3s).
Why not in Power Query?
Hi Shashank,
@12:08 is where you're deleting one of the rows that contain unnecessary headers. I'm just curious why you don't use the DataFrame.drop() function. I don't quite understand the statement iter_cols[iter_cols["Row Type"] != "Row Type"] and how it removes that specific row. Is iter_cols[] a function? Or something else?
Thanks in advance!
Hey Swivel, thank you for watching my video! So, the .drop method would work just fine. What I'm doing is filtering the iter_cols DataFrame on the iter_cols["Row Type"] column, keeping only the rows where the value is not "Row Type".
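For anyone else wondering: iter_cols[...] isn't a function, it's boolean indexing. A small made-up example:

import pandas as pd

iter_cols = pd.DataFrame({"Row Type": ["Row Type", "iter 1", "iter 2"]})
mask = iter_cols["Row Type"] != "Row Type"  # a Series of True/False, one per row
print(iter_cols[mask])                      # indexing with the mask keeps only the True rows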
It was an awesome session, Shashank.
Also, can you mention some portals/sites to practice this kind of challenge?
Man your videos are so helpful and practical!
How could I do the same in R?
Hey guys, please tell me where I can get the raw data.
Hey Shas! Thank you for this awesome video!
Great video! While I love pandas, for this particular example I think there are more optimal ways of doing the cleaning, especially if the data is huge. The data has a set format, repeating per first name, last name and date. For a lot of data, i.e. 1 billion+ rows, here's how I would approach it:
Preprocess one level of the CSV with bash constructs; bash can process this far faster. Resume the rest, i.e. bringing in a tabular structure, column nomenclature etc., in Python.
OR
Read the data in Python using the file.readlines construct rather than pandas. You can then call custom functions in conjunction with list comprehensions, and finally convert the results to a pandas df. This way we keep all memory-intensive operations outside pandas. I know pandas is optimized to a certain extent here, but for a lot of data pandas still fails.
OR
If one is using pandas, then based on the limits of the numerical fields one can change to an appropriate dtype. For example, a number ranging from 0 to 5 needn't be stored as int64, and strings with a set number of categories needn't be stored as object rather than category. More on this here:
vincentteyssier.medium.com/optimizing-the-size-of-a-pandas-dataframe-for-low-memory-environment-5f07db3d72e
OR
Read the data using a distributed framework like Spark
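A quick sketch of the dtype idea from the third option (column names and values are made up):

import pandas as pd

df = pd.DataFrame({"rating": [0, 3, 5, 1], "city": ["NY", "LA", "NY", "LA"]})
df["rating"] = df["rating"].astype("int8")   # 0-5 fits in one byte instead of int64's eight
df["city"] = df["city"].astype("category")   # repeated strings stored once, rows become small integer codes
print(df.memory_usage(deep=True))            # compare against the original dtypes to see the savings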
Is the Excel file uploaded? Can't get it from the Drive
This was one of the first of your videos that I watched, and I didn't have much knowledge of SQL. However, after studying SQL for a bit, this video makes a lot more sense.
Your autocompletes are so fast! What language server are you using? Or why is it so fast? Mine takes one or two seconds to show.
You and Keith Galli and Alex Freiberg are the only YouTubers who actually deal with real-world problems...
Thanks, You are doing absolutely amazing work.
Great video. If you want to see your results in Excel and analyze them, you just have to create a new DataFrame with the merge result and then use final_dataframe.to_excel('Data.xlsx', index=False). index=False keeps the default index out of the export. If you want to convert to CSV, use final_dataframe.to_csv('Datos.csv', index=False). You can also add the argument sep=";" to separate by semicolons instead of commas.
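Cleaned up as a runnable snippet (the DataFrame here is a stand-in for the merge result):

import pandas as pd

final_dataframe = pd.DataFrame({"a": [1, 2]})              # stand-in for the merge result
final_dataframe.to_excel("Data.xlsx", index=False)          # index=False drops the default index; needs openpyxl installed
final_dataframe.to_csv("Datos.csv", index=False, sep=";")   # sep=";" writes semicolon-separated values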
Thanks for these hands-on data challenges. One comment, or maybe a question: we could have also used the split function to get the desired values in name_dataframe, right?
Something like below? Is the split approach more time-consuming, and is that why you preferred to slice based on the number of characters?
name_dataframe = name_dataframe.loc[:, ["Row Type", "Iter Number", "Power1", "Iteration"]]
name_dataframe.rename(columns={"Row Type": "First Name", "Iter Number": "Last Name", "Power1": "Date"}, inplace=True)
#name_dataframe["First Name"] = name_dataframe["First Name"].str[12:]
name_dataframe["First Name"] = name_dataframe["First Name"].str.split(": ", n=1).str[1]
#name_dataframe["Last Name"] = name_dataframe["Last Name"].str[10:]
name_dataframe["Last Name"] = name_dataframe["Last Name"].str.split(": ", n=1).str[1]
#name_dataframe["Date"] = name_dataframe["Date"].str[5:]
name_dataframe["Date"] = name_dataframe["Date"].str.split(": ", n=1).str[1]
name_dataframe
Top notch content 👌
Hi, which IDE are you using?
Please bring more videos like this.. thanks for this video & your efforts for us .. ❤️
One more question: what happens if somebody updates the original CSV file? Do the Python code's outputs and the final CSV file get updated as well? Is it all automated?
I can do this with PHP
I'm more of a traditional software engineer but I did dip my toes into data science/analytics at university so this is an interesting insight.
Nice! Hope to see you in future videos!
Great challenge! I managed with a similar approach in R
I'm on a Mac and do not have Power Query in my version of Excel. Sure, I could do some VBA; however, this (Python) has considerably more power than that.
Thank you Shashank (Subscribed, Liked, here is my comment, and I set notifications to All; next is Patreon)
Cheers,
Dane
Did you keep the Average, Maximum, Std. Dev. and Totals rows/values?
Why do you ignore SettingWithCopyWarning?
Thanks for this content !!😍
Really interesting watch. Would have done it a little differently, it’s great to see someone else’s interpretation. Would you use something like pyodbc to load the data frame into sql? What are some of the benefits of loading it into sql from here? I guess it would depend on the use case? Would be awesome to see a whole project that implements that. Subscribed! A note: when you only need a few columns, I love just setting a new data frame with those columns instead of using drop: new_df =og_df[[“columns”, “we”, “want”]]
Hey this was super insightful! please make more videos 😁
COOL Shashank !
Just great! Thank you very much!
Awesome. Thanks so much
I managed to do it using pandas methods alone, without the 'for' loop, which I suspect makes things quicker.
How long did it take you to learn the things you do in the day-in-the-life videos? And are you a senior analyst?
I'm a senior analyst. It takes a while to pick up the skills at a basic level, but less time to pick them up as you go on.
Nice vid. A small tip:
you can say:
for i, row in enumerate(data):
    pass
so that enumerate supplies the index and you can drop the manual counter.
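A tiny illustration with a made-up list:

items = ["alpha", "beta", "gamma"]
for i, item in enumerate(items):
    print(i, item)  # prints 0 alpha, 1 beta, 2 gamma with no manual i += 1 bookkeeping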