Dude, I cannot explain how helpful this was, man! Seriously, you literally allowed me to pick up any dataset I download and immediately gave me the practical guidelines to clean/analyze it. Thank you!!
You're very welcome!😎
Great video!! I like the way that you explain your thought process behind every line of code.
This is the 1st video I've watched where I'm actually seeing the Python libraries in action.
Thank you for this.
You're very welcome! I'm excited to hear about what you will build with them 🙂
Very helpful and super simple explanation. Looking forward to your next advanced pandas videos with larger datasets. Thank you for this video.
Thank you so much, I've definitely found the best data science teacher.
Thank you so much for the support! 🙏
Fantastic, easy-to-follow data cleaning video. I also appreciate you saying outright that yes, it's only 50 or so rows here, but it could be 10k and the same techniques would apply.
Excellent, very helpful learnings!
I know everyone else is saying the same thing, but it's for a reason; this video is great! It perfectly explained the hows and the whys, and I have learnt so much! +1 for the beautiful hair. Thank you!
Thank you for the feedback 🙌 and much appreciated 🪮 🤣
Wow, you are the real GOAT.
The best video so far.
Please make more videos like this.
Keep doing the good work. Big help.
I appreciate the kind words 🙏 thanks for the support!
Actual, actionable real-life skills: not fluffy fun Python skills, but actually valuable stuff we need to know!
Thanks!
Thank you for the Super Thanks 🙏
I'm glad you enjoyed!😎
Excellent video❤, thanks for sharing
Thanks for watching!
This is great. ❤❤🎉
nice tutorial
Thank you so much for the amazing video!
For the cleaning part, in order not to copy and paste the symbols manually, it's possible to save them in a string variable and then pass that into the str.strip() method:
# keep only the unwanted symbols by removing letters, digits and spaces, then take the unique results
symbols_list = cleaned_df['Client'].str.replace('[a-zA-Z0-9 ]', '', regex=True).unique()
# join the unique symbol strings into one string of characters for str.strip()
symbols_str = ' '.join(symbols_list)
# strip any of those characters from the start and end of each value
cleaned_df['Client'] = cleaned_df['Client'].str.strip(symbols_str)
Thanks for your comment :)!
That is a good option, but do remember that .strip() operates on the left and right of the string, but not the middle of the string - so if the chars are in the middle of the string they will be missed!
What are some data cleaning techniques that you have used? 🤔
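As a quick illustration (just a sketch with made-up client names), a regex replace will catch unwanted characters anywhere in the string, not only at the ends:
import pandas as pd

# made-up example values with symbols at the ends and in the middle
cleaned_df = pd.DataFrame({'Client': ['/Acme Corp...', '___Beta&Co', 'Ga_mma Ltd|']})

# str.strip() would leave the '&' and '_' in the middle untouched,
# but a regex replace removes every character that is not a letter, digit or space
cleaned_df['Client'] = cleaned_df['Client'].str.replace('[^a-zA-Z0-9 ]', '', regex=True)

print(cleaned_df['Client'].tolist())  # ['Acme Corp', 'BetaCo', 'Gamma Ltd']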
Cool, thanks. Is Polars making much of an impact in your world? I've used it a bit and I think I prefer the more explicit syntax - besides the potential for enormous performance gains it brings.
Hi tmb8807 :) I have followed a couple of tutorials on Polars, but never used it on anything in a professional setting as of yet 🤔
I'll test it out more extensively.
Any good tutorials you'd recommend?
Typically, when I've worked on projects that needed high performance I've used Apache Spark - but Polars could be a nice in-between option between pandas and Spark?
Thanks for the support!
@@trentdoesmath thanks for the reply. There are a few tutorials on YouTube; the one from Rob Mulla is what got me onto it.
Because Polars can work with larger-than-memory data via the streaming API I’ve seen it suggested it could replace Spark on a single node for some jobs, although I’ve not done that first hand! But it could potentially expand the 'in-between' area, as you say.
Main reason I like it is that I just find the syntax much more consistent and readable (and easier to write as a result). Your mileage may vary on that, though, especially if you're extremely comfortable with Pandas (it's a bit less "Pythonic", with more explicit methods for everything).
Lazy evaluation and the query optimisation engine are a big selling point of it as well - can greatly improve memory usage.
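Just to give a flavour, here is a rough sketch of the lazy/streaming style (the file name and columns are made up, and the exact streaming option depends on your Polars version):
import polars as pl

# lazy query: nothing runs yet, Polars just builds and optimises the plan
lazy = (
    pl.scan_csv("transactions.csv")  # made-up file name
      .filter(pl.col("amount") > 0)
      .group_by("client_id")
      .agg(pl.col("amount").sum().alias("total_spent"))
)

# collect() executes the optimised plan; streaming=True processes the data
# in chunks so it can handle larger-than-memory files
result = lazy.collect(streaming=True)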
Awesome! I'll check out the Rob Mulla stuff, thanks for the recommendation👍
For sure! It actually reminds me a bit of Scala 🤔... Very 'to the point'.
Not sure if you have tried out Dask before, but it's yet another performance option out there.
You misspelled Tidyverse 😮
🤣
Hi! Thanks for this video. I wanted to know, as a data scientist/analyst, why did you choose to use Jupyter and a .ipynb cleaning file? Why not use PyCharm and a .py file, for example? Is it just a matter of personal preference? Sorry, I am new to Python - proficient in Stata but trying to make the shift.
Hi @kikiboy2545 🙂 thank you for your question.
TL;DR - I chose to use Jupyter as it is easier for me to demo and record the video with.
To your point on creating a .py file - I would recommend this if you are creating cleaning logic that is going to be re-used and shipped to 'production' as it is easier to test and maintain a straight Python script IMO.
That being said, there is increasing support for the use of notebooks as the preferred environment - as examples, Snowflake, Databricks, Azure Synapse and more all support the use of re-usable notebooks to contain all of your logic. I've worked in teams where notebooks are preferred for all data pipeline code due to how intuitive and approachable they are - but as I say, my personal preference is: use notebooks for exploration, and .py scripts for your production code 🙂
No need to apologize! I am glad to be part of your learning journey - keep pushing man! 😎
Just like Thor said: "Another"