Data Analyst Portfolio Project (Exploratory Data Analysis With Python Pandas)

  • Published: Sep 20, 2024
  • In this video, we take a look at an Exploratory Data Analysis (EDA) portfolio project within Python Pandas. Everything is coded within Jupyter Notebook and the data is sourced from Kaggle.
    Python Libraries needed: Pandas, Seaborn
    Kaggle Data: www.kaggle.com...
    Interested in discussing a Data or AI project? Feel free to reach out via email or simply complete the contact form on my website.
    📧 Email: ryannolandata@gmail.com
    🌐 Website & Blog: ryannolandata....
    🍿 WATCH NEXT
    Python for Data Analyst and Scientists Playlist: • Python Tutorials For D...
    Python Data Cleaning: • Real World Data Cleani...
    Python Groupby: • The Complete Guide to ...
    Vid 3:
    MY OTHER SOCIALS:
    👨‍💻 LinkedIn: / ryan-p-nolan
    🐦 Twitter: / ryannolan_
    ⚙️ GitHub: github.com/Rya...
    🖥️ Discord: / discord
    📚 *Data and AI Courses: datacamp.pxf.i...
    📚 *Practice SQL & Python Interview Questions: stratascratch....
    WHO AM I?
    As a full-time data analyst/scientist at a fintech company specializing in combating fraud within underwriting and risk, I've transitioned from my background in Electrical Engineering to pursue my true passion: data. In this dynamic field, I've discovered a profound interest in leveraging data analytics to address complex challenges in the financial sector.
    This YouTube channel serves as both a platform for sharing knowledge and a personal journey of continuous learning. With a commitment to growth, I aim to expand my skill set by publishing 2 to 3 new videos each week, delving into various aspects of data analytics/science and Artificial Intelligence. Join me on this exciting journey as we explore the endless possibilities of data together.
    *This is an affiliate program. I may receive a small portion of the final sale at no extra cost to you.

Comments • 107

  • @RyanAndMattDataScience
    @RyanAndMattDataScience  1 month ago

    Thanks for checking out this video.
    Join our Data Science Discord Here: discord.com/invite/F7dxbvHUhg
    If you want to watch a full course on Python Pandas check out Datacamp: datacamp.pxf.io/XYD7Qg
    Want to solve Python data interview questions: stratascratch.com/?via=ryan
    I'm also open to freelance data projects. Hit me up at ryannolandata@gmail.com
    *Both Datacamp and Stratascratch are affiliate links.

  • @WarbossPepe
    @WarbossPepe 10 months ago +10

    You're a good man Ryan. Hope the run went well

  • @idreeskhan5129
    @idreeskhan5129 6 months ago +3

    Great work, Ryan. Thank you

  • @234bellamkonda
    @234bellamkonda 1 month ago

    Awesome video, finished it in a day. Planning to do 1 project a day following videos till I get comfortable doing things on my own. Very easy to follow, thank you so much 😊

  • @arun_jakhmola
    @arun_jakhmola 4 months ago

    Hey Ryan, Greetings from India
    I shadowed you for 3 days and completed the project in bits but glad I finished the whole video.
    Loved the project and the way you taught it.
    (Just a suggestion - Please go by the agenda for the project, so that we can have an outline in our minds of the key things that we as data analysts need to extract from the data.)

  • @emastehr
    @emastehr 1 year ago +4

    Great project. Could you develop a full project? Something that includes SQL, Python and then a visualization tool. That would be amazing

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  1 year ago

      Yes I’ll be working on one in the future. Focus atm is more models like the one I uploaded today

  • @kokowin5851
    @kokowin5851 3 months ago +8

    This is an easier way to remove (USA) from the event name: df2["Event name"] = df2["Event name"].str.replace("(USA)", " ")

    • @sandydalhousie
      @sandydalhousie 1 month ago

      Yes, this is better, I agree. I also tried the split method Ryan used, but all my entries in "Event name" got replaced with "None" somehow! I don't understand.
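The one-line replacement suggested in this thread can be sketched as follows. The event names are invented for illustration, and `regex=False` plus the trailing `.str.strip()` are additions that are not in the comment: `regex=False` asks pandas to treat "(USA)" as literal text, because older pandas versions default to regex matching, where the parentheses would be read as a group. This is a minimal sketch, not the method used in the video.

```python
import pandas as pd

# Hypothetical event names, invented for illustration only.
df2 = pd.DataFrame(
    {"Event name": ["Big Elk 50k (USA)", "Everglades 50 Mile Ultra Run (USA)"]}
)

# regex=False: match the literal text "(USA)" on any pandas version.
df2["Event name"] = (
    df2["Event name"]
    .str.replace("(USA)", "", regex=False)
    .str.strip()  # drop the space that sat in front of the removed suffix
)
```

Replacing with an empty string and stripping avoids the trailing space that a `" "` replacement would leave behind.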

  • @shailendra_kunwar
    @shailendra_kunwar 4 months ago

    Awesome work Ryan 🔥🔥🔥🔥
    I have just watched it and I appreciate the effort that you put in for the video. I will be using this as my portfolio project.

  • @nlnl72
    @nlnl72 6 months ago +1

    Thanks for the video! really helpful.
    Do you think you can do a Data Scientist Portfolio Project(s) series? I'm sure you'll find a lot of people interested in that (including me haha)!

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  6 months ago +1

      hey I have 2 out so far! and I just published another data analyst project last week

    • @nlnl72
      @nlnl72 6 months ago

      @@RyanAndMattDataScience Okay thanks, I'll definitely check them out!

  • @navid7467
    @navid7467 3 days ago

    New subscriber here! Thank you for your good work. Just a quick question. To extract events held in USA, since we know we are looking for the 3 letters between the 5th last and last as USA, couldn't we use this condition: (df['Event name'].str[-4:-1]=='USA')? I used it but my dataframe returns 26524 rows which I thought might be due to difference in the version of dataset.
    I also tried (df['Event name'].str.endswith("(USA)")) and got the same number of rows.

  • @jkzhakom
    @jkzhakom 4 months ago

    Fantastic video, Ryan. Thanks for sharing your knowledge with us.

  • @Nighthunterm
    @Nighthunterm 5 months ago

    Was just doing some Python learning to get some more knowledge and I just found your channel. I heard you say you ran your marathon around UCF. I'm a fellow alum from there as well haha. Go Knights!

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  5 months ago

      Haha charge on and 25 loops around campus. I’ll never run there again lol

  • @thekendev
    @thekendev 5 months ago +1

    Hey Ryan,
    Just watching this and following along.
    I've got a question, please:
    At the 17:30 mark I noticed that the split you did seemed a bit overwhelming. As a novice in data science, I couldn't help but notice something interesting in the data. There were event names labeled inconsistently for the USA, some as "usaaaaA" and others as "usaaa". So I used a simple str.contains() call with case sensitivity turned off to standardize it, resulting in 1.7 million rows. Wanted to hear your thoughts on this approach.
    I know it might be labeled a lazy and easy approach, but I found it catches more rows effectively. Please give me your views (I'm still learning)

    • @thekendev
      @thekendev 5 months ago

      So my .shape shows 30120 rows, not 26090
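One possible reason a case-insensitive substring match returns more rows is that "usa" can appear inside unrelated words. The sketch below uses invented event names (they are not taken from the Kaggle file) to compare a loose `str.contains` match with a check for a literal "(USA)" suffix. Whether this explains the row counts in this thread is an assumption.

```python
import pandas as pd

# Hypothetical event names, invented for illustration only.
names = pd.Series([
    "Big Elk 50k (USA)",
    "Lausanne Marathon (SUI)",  # "Lausanne" contains the letters "usa"
    "Comrades Marathon (RSA)",
])

# Case-insensitive substring match: also hits "Lausanne".
loose = names.str.contains("usa", case=False, regex=False)

# Literal "(USA)" at the very end of the name only.
strict = names.str.endswith("(USA)")
```

Here `loose` flags the first two names while `strict` flags only the first, so the loose filter over-counts in this toy data.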

  • @rahulpal_dsml
    @rahulpal_dsml 10 months ago +2

    Not subscribing to you would be a sin after going through this beautiful and informative video! Keep going!

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  10 months ago +1

      I appreciate it! Working on a big class video next! Followed by PyTorch

    • @rahulpal_dsml
      @rahulpal_dsml 10 months ago

      @@RyanAndMattDataScience Would love it. I am not sure whether you already do this, as I just came across your video today, but do try posting (community post) some time before the videos; I would not want to miss them.
      I appreciate your valuable input, really impressed by a tutor's ability to convey after more than a decade!!

    • @rahulpal_dsml
      @rahulpal_dsml 10 months ago

      Hey Ryan, I am getting this error when combining all the filters together. Could you please guide me on how to sort this out?
      MemoryError: Unable to allocate 75.9 TiB for an array with shape (7461195, 1398540) and data type float64
      I have an 8th gen CPU (i5-8350U), 24 GB RAM, 500 GB SSD (Crucial MX500), and am using Jupyter Notebook in an Anaconda env.

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  10 months ago

      @@rahulpal_dsml can you try running it in Kaggle or Google Colab?

    • @rahulpal_dsml
      @rahulpal_dsml 10 months ago

      @@RyanAndMattDataScience Hi, yes, it did run on Google Colab, thanks a lot
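For readers who hit the same error locally, one thing worth checking is how the filters are combined. The sketch below builds every condition as a boolean Series on the same DataFrame, combines them with `&`, and indexes once. The column names "Year of event" and "Event distance/length" and all the values are assumptions made for illustration, and this is not confirmed to be the cause of the MemoryError quoted above.

```python
import pandas as pd

# Tiny stand-in data; column names are assumptions about the Kaggle file.
df = pd.DataFrame({
    "Year of event": [2020, 2020, 2019, 2020],
    "Event distance/length": ["50km", "50mi", "50km", "100km"],
    "Event name": ["A (USA)", "B (USA)", "C (USA)", "D (FRA)"],
})

# Each condition is a boolean Series that shares df's index.
is_2020 = df["Year of event"] == 2020
is_50 = df["Event distance/length"].isin(["50km", "50mi"])
is_usa = df["Event name"].str.endswith("(USA)")

# Combine the masks first, then index df exactly once.
df2 = df[is_2020 & is_50 & is_usa]
```

Because every mask has one value per row of `df`, the combined result stays one-dimensional and needs only as much memory as a single boolean column.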

  • @shayanakhavan6002
    @shayanakhavan6002 5 months ago

    Great video, Ryan!

  • @7a30adnanbin5
    @7a30adnanbin5 4 months ago +1

    Great Vid mahn .. really helpful

  • @takashiiexe
    @takashiiexe 6 months ago

    Thanks Ryan! Great Project.

  • @lujingyan6853
    @lujingyan6853 6 months ago +1

    Thank you for sharing. But when you use (df["Event name"].str.split("(").str.get(1).str.split(")").str.get(0) == "USA") to select all the USA races, it ignores events that contain more than one () in their name, such as Palisades Ultra Trail Series (PUTS) - Big Elk 50k (USA). It might be better to use df["Event name"].str.contains(r"\(USA\)").

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  6 months ago

      Ah didn’t realize when doing this project. Great catch and thanks for commenting
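The difference described in this thread can be reproduced on a few names. In the sketch below, the first name comes from the comment above and the other two are invented for illustration; the two expressions are the ones quoted in the thread.

```python
import pandas as pd

names = pd.Series([
    "Palisades Ultra Trail Series (PUTS) - Big Elk 50k (USA)",  # from the comment
    "Hypothetical 50 Mile Run (USA)",                           # invented
    "Hypothetical 50 km (FRA)",                                 # invented
])

# Split approach: keeps the text inside the FIRST pair of parentheses.
first_paren = names.str.split("(").str.get(1).str.split(")").str.get(0)
by_split = first_paren == "USA"

# Regex approach: matches a literal "(USA)" anywhere in the name.
by_regex = names.str.contains(r"\(USA\)")
```

For the first name, `first_paren` is "PUTS", so the split filter misses that race while the regex filter keeps it.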

  • @RoleJohn
    @RoleJohn 11 months ago

    great great content ! i am subscribing only on the condition you upload more and more in depth analysis using Python. Keep it up

  • @RRangel7b
    @RRangel7b 2 months ago +1

    Hello
    First of all, thank you!!
    And how about:
    df = pd.DataFrame(data)
    usa_events = df[df['Event name'].str.contains('USA')]
    print(usa_events)

  • @athayaazaria1825
    @athayaazaria1825 6 months ago +1

    Hi, can I get the full syntax at 49:07? I can't see the continuation. I need it for my current school assignment, and this will help me a lot😊😊😊

  • @alexrosen8762
    @alexrosen8762 1 year ago

    Really useful project for learning, especially since the data sample is included. Thanks a lot 🙏

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  1 year ago +1

      Glad it was helpful! Currently in the initial stages of next month's project

    • @alexrosen8762
      @alexrosen8762 1 year ago

      @@RyanAndMattDataScience Great! Looking forward to that👌

  • @ayantikaC03
    @ayantikaC03 9 months ago

    Great video Ryan!

  • @akshatalanjewar3056
    @akshatalanjewar3056 6 months ago

    It's simply amazing... I like the way you teach, and the video is informative

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  6 months ago

      Thank you

    • @akshatalanjewar3056
      @akshatalanjewar3056 6 months ago

      @@RyanAndMattDataScience ...need one question answered: according to the job market, which Python libraries should I know for a data analyst profile?

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  6 months ago

      @@akshatalanjewar3056 start with pandas and scikit-learn

    • @akshatalanjewar3056
      @akshatalanjewar3056 6 months ago

      @@RyanAndMattDataScience Well, I know Python libraries like pandas, NumPy, seaborn and Matplotlib, plus SQL and Power BI. Is this sufficient to get a job?

  • @AmbarGharat
    @AmbarGharat 6 months ago +2

    Hi Ryan, Instead of df['Event name'].str.split('(').str.get(1).str.split(')').str.get(0) == 'USA' can we use df['Event name'].str[-5:] == '(USA)'?

    • @shailendra_kunwar
      @shailendra_kunwar 4 months ago

      Yes, this somehow gives 1408416 rows, while the method Ryan uses in the video gives 1398540 rows.
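When two filters return different row counts, one way to investigate is to list the rows where they disagree. The sketch below compares the split expression and the slice expression from this thread with an element-wise XOR. The sample names are partly invented, and the real cause of the 1408416 versus 1398540 gap is not verified here.

```python
import pandas as pd

names = pd.Series([
    "Palisades Ultra Trail Series (PUTS) - Big Elk 50k (USA)",
    "Hypothetical 50 Mile Run (USA)",  # invented
    "Hypothetical 50 km (FRA)",        # invented
])

by_split = names.str.split("(").str.get(1).str.split(")").str.get(0) == "USA"
by_slice = names.str[-5:] == "(USA)"

# XOR keeps the rows where exactly one of the two filters matched.
disagree = names[by_split ^ by_slice]
```

On the full dataset, inspecting `disagree` with `.head()` or `.value_counts()` would show which kinds of event names account for the gap.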

  • @charlieadleydog
    @charlieadleydog 3 months ago

    Hey Ryan, great video. Just wanted to ask how much RAM you suggest for these projects to be able to run quickly?

  • @dj-mt1pz
    @dj-mt1pz 6 months ago +2

    My kernel keeps dying whenever I combine all the filters of the df to create df2. Does anyone know how to resolve this issue? Otherwise I can't progress :(

    • @linda_erose
      @linda_erose 2 months ago

      same, did u figure it out?

  • @maxnicolasnavarro4017
    @maxnicolasnavarro4017 27 days ago

    Thank you so much for bringing back my love for this field.
    I needed this so much...

  • @michaelshepherdmunemo4414
    @michaelshepherdmunemo4414 3 months ago

    Great work. Thank you, I was following along hands-on. # Subscribed_and_Liked

  • @binarify4364
    @binarify4364 6 months ago

    Brilliant Project !

  • @stallonengobua8820
    @stallonengobua8820 3 months ago

    Thank you very much Ryan

  • @everywoman2774
    @everywoman2774 9 months ago

    subscribed! great video. Thank you for this

  • @chalamohamed2013
    @chalamohamed2013 11 months ago

    Hello Ryan,
    Thanks for sharing your skills.
    I would like to understand why you dropped Athlete Club and Country.
    I think it would be better to drop the rows that have an empty value; then you could change the column's type.

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  11 months ago

      There are a lot of ways you could look at a dataset. I did this a long time ago, so I can't remember exactly why I did it, but for what I was working on I don't believe it mattered

  • @geoffreycg5650
    @geoffreycg5650 7 months ago

    Great video!

  • @michaelg9359
    @michaelg9359 7 months ago

    thanks for the vid -- very good - your camera view cuts off far right side of visual, though

  • @MiguelGracia-g2d
    @MiguelGracia-g2d 2 months ago

    Hi, had a quick question!
    at 17:27 would there be any downside to me using something like df[df['Event name'].str.contains('USA')] instead?
    Thanks!

  • @tarekhusam
    @tarekhusam 1 year ago

    You are amazing, keep it bro

  • @Al-Ahdal
    @Al-Ahdal 4 months ago

    In the event_len column there are many rows with km, mi, h... How can we check all of these to get the correct count, and how do we extract the numbers only? Should we be using regex for that?
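A regex is one reasonable option for the question above. The sketch below assumes the values look like a number followed by a unit (for example "50km" or "24h"); the sample values are invented, and real rows that do not fit the pattern would come back as NaN and need separate handling. `str.extract` with named groups splits each value into a numeric part and a unit.

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
event_len = pd.Series(["50km", "50mi", "24h", "100km", "6h"])

# Named groups: digits (optionally with a decimal part), then the unit letters.
parts = event_len.str.extract(r"^(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)$")
parts["value"] = pd.to_numeric(parts["value"])

# How many rows use each unit.
unit_counts = parts["unit"].value_counts()
```

`unit_counts` gives the per-unit row counts, and `parts["value"]` holds the numbers only, ready for filtering or conversion.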

  • @dominiktokarski8054
    @dominiktokarski8054 1 year ago

    Liked, subscribed and commented for stats. Keep going :)

  • @katehudson7405
    @katehudson7405 6 months ago

    Is it okay if I add this project to my portfolio after completing it? Great video!

  • @onurdatascience
    @onurdatascience 1 year ago

    Great project!

  • @jonathangarcia8124
    @jonathangarcia8124 4 months ago

    Is this lesson possible in VS Code, or would I need to learn to use Jupyter Notebook?

  • @rishidixit7939
    @rishidixit7939 3 months ago

    Between Matplotlib and Seaborn, which one should be used, or should both be used?

  • @aliomar9594
    @aliomar9594 1 year ago +1

    Great

  • @iniuntukutube
    @iniuntukutube 11 months ago

    Hello, Ryan... can I ask something? Are there any other tools (software/application/website) that can be used for Python? I'm so sorry for the question, please don't laugh at me, hehehe... I'm a complete beginner learning data analysis... I have a dream of becoming a business analyst... do you have any suggestions for me, please?

  • @mikefranko2832
    @mikefranko2832 11 months ago

    What is the reason behind cleaning up NaN values?

  • @JC_333
    @JC_333 1 year ago

    Subscribed!

  • @dennisbunarta1190
    @dennisbunarta1190 5 months ago

    I can't find the year 2020 in the events. Any solution?

    • @J4vierC
      @J4vierC 2 months ago

      Same problem here. I did it with .contains() and I don't know why it doesn't return the 2020 rows

  • @mikefranko2832
    @mikefranko2832 11 months ago

    What is the reason behind dropping columns?

  • @SriramKoyalkar
    @SriramKoyalkar 4 months ago

    Where do I find this project source code?

    • @RyanAndMattDataScience
      @RyanAndMattDataScience  4 months ago

      I plan on putting all the code from videos on my website, but I need to scale up a bit; I don't have the resources atm

  • @GreyHatGenX
    @GreyHatGenX 2 months ago

    comment

  • @tosinwilliams9343
    @tosinwilliams9343 8 months ago

    Thanks Ryan