Dude, I cannot explain how helpful this was, man! Seriously, you literally allowed me to pick up any dataset I download and immediately gave me the practical guidelines to clean/analyze it. Thank you!!
You're very welcome!😎
Great video!! I like the way that you explain your thought process behind every line of code.
This is the 1st video I've watched where I'm actually seeing the Python libraries in action.
Thank you for this.
You're very welcome! I'm excited to hear about what you will build with them 🙂
Very helpful and super simple explanation. Looking forward to your next advanced pandas videos with larger datasets. Thank you for this video.
Thank you so much, I've definitely found the best data science teacher.
Thank you so much for the support! 🙏
Fantastic, easy-to-follow data cleaning video. I also appreciate you saying outright that yes, it's only 50 or so rows here, but it could be 10k and the same techniques would apply.
Excellent, very helpful learnings!
I know everyone else is saying the same thing, but it's for a reason; this video is great! It perfectly explained the hows and the whys, and I have learnt so much! +1 for the beautiful hair. Thank you!
Thank you for the feedback 🙌 and much appreciated 🪮 🤣
Wow, you are the real GOAT.
The best video so far.
Please make more videos like this.
Keep doing the good work. Big help.
I appreciate the kind words 🙏 thanks for the support!
Actual, actionable real-life skills: not fluffy fun Python skills, but actually valuable stuff we need to know!
Thanks!
Thank you for the Super Thanks 🙏
I'm glad you enjoyed!😎
Excellent video❤, thanks for sharing
Thanks for watching!
This is great. ❤❤🎉
nice tutorial
Thank you so much for the amazing video!
For the cleaning part, in order not to copy and paste the symbols manually, it's possible to save them in a string variable and then pass that into the str.strip() method:
# keep only the unwanted symbols by removing letters, digits and spaces, then take the unique results
symbols_list = cleaned_df['Client'].str.replace('[a-zA-Z0-9 ]', '', regex=True).unique()
# join the unique symbol strings into one string of characters for str.strip()
symbols_str = ' '.join(symbols_list)
# strip any of those characters from the start and end of each value
cleaned_df['Client'] = cleaned_df['Client'].str.strip(symbols_str)
Thanks for your comment :)!
That is a good option, but do remember that .strip() operates on the left and right of the string, but not the middle of the string - so if the chars are in the middle of the string they will be missed!
What are some data cleaning techniques that you have used? 🤔
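As a quick illustration (just a sketch with made-up client names), a regex replace will catch unwanted characters anywhere in the string, not only at the ends:
import pandas as pd

# made-up example values with symbols at the ends and in the middle
cleaned_df = pd.DataFrame({'Client': ['/Acme Corp...', '___Beta&Co', 'Ga_mma Ltd|']})

# str.strip() would leave the '&' and '_' in the middle untouched,
# but a regex replace removes every character that is not a letter, digit or space
cleaned_df['Client'] = cleaned_df['Client'].str.replace('[^a-zA-Z0-9 ]', '', regex=True)

print(cleaned_df['Client'].tolist())  # ['Acme Corp', 'BetaCo', 'Gamma Ltd']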
Cool, thanks. Is Polars making much of an impact in your world? I've used it a bit and I think I prefer the more explicit syntax - besides the potential for enormous performance gains it brings.
Hi tmb8807 :) I have followed a couple of tutorials on Polars, but never used it on anything in a professional setting as of yet 🤔
I'll test it out more extensively.
Any good tutorials you'd recommend?
Typically, when I've worked on projects that needed high performance I've used Apache Spark - but Polars could be a nice in-between option between pandas and Spark?
Thanks for the support!
@@trentdoesmath thanks for the reply. There are a few tutorials on YouTube; the one from Rob Mulla is what got me onto it.
Because Polars can work with larger-than-memory data via the streaming API I’ve seen it suggested it could replace Spark on a single node for some jobs, although I’ve not done that first hand! But it could potentially expand the 'in-between' area, as you say.
Main reason I like it is that I just find the syntax much more consistent and readable (and easier to write as a result). Your mileage may vary on that, though, especially if you're extremely comfortable with Pandas (it's a bit less "Pythonic", with more explicit methods for everything).
Lazy evaluation and the query optimisation engine are a big selling point of it as well - can greatly improve memory usage.
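Just to give a flavour, here is a rough sketch of the lazy/streaming style (the file name and columns are made up, and the exact streaming option depends on your Polars version):
import polars as pl

# lazy query: nothing runs yet, Polars just builds and optimises the plan
lazy = (
    pl.scan_csv("transactions.csv")  # made-up file name
      .filter(pl.col("amount") > 0)
      .group_by("client_id")
      .agg(pl.col("amount").sum().alias("total_spent"))
)

# collect() executes the optimised plan; streaming=True processes the data
# in chunks so it can handle larger-than-memory files
result = lazy.collect(streaming=True)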
Awesome! I'll check out the Rob Mulla stuff, thanks for the recommendation👍
For sure! It actually reminds me a bit of Scala 🤔... Very 'to the point'.
Not sure if you have tried out Dask before, but it's yet another performance option out there.
You misspelled Tidyverse 😮
🤣
Hi! Thanks for this video. I wanted to know, as a data scientist/analyst, why did you choose to use Jupyter and a .ipynb cleaning file? Why not use PyCharm and a .py file, for example? Is it just a matter of personal preference? Sorry, I am new to Python - proficient in Stata but trying to make the shift.
Hi @kikiboy2545 🙂 thank you for your question.
TL;DR - I chose to use Jupyter as it is easier for me to demo and record the video with.
To your point on creating a .py file - I would recommend this if you are creating cleaning logic that is going to be re-used and shipped to 'production' as it is easier to test and maintain a straight Python script IMO.
That being said, there is increasing support for the use of notebooks as the preferred environment - as examples, Snowflake, Databricks, Azure Synapse and more all support the use of re-usable notebooks to contain all of your logic. I've worked in teams where notebooks are preferred for all data pipeline code due to how intuitive and approachable they are - but as I say, my personal preference is: use notebooks for exploration, and .py scripts for your production code 🙂
No need to apologize! I am glad to be part of your learning journey - keep pushing man! 😎
Just like Thor said: "Another"