Exploratory Data Analysis (EDA) using Apache Spark and Python

  • Published: 7 Nov 2024

Comments • 43

  • @AIEngineeringLife 3 years ago

    Many have asked for the file I used in this video. You can download it here:
    drive.google.com/file/d/1e6phh7Df8mzYoE-sBXPVJklnSt_wHwkq/view?usp=sharing
    Remove the last 2 lines from the CSV file.
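
The clean-up step mentioned above (removing the last 2 lines of the CSV) could be scripted roughly like this. It is only a sketch: the function name `drop_last_lines` and the file names in the usage comment are hypothetical, and it assumes the unwanted footer occupies exactly the last two physical lines of a UTF-8 file.

```python
from collections import deque


def drop_last_lines(src: str, dst: str, n: int = 2) -> int:
    """Copy src to dst without its last n lines; return the number of lines written."""
    written = 0
    held_back = deque()  # holds the most recent n lines, which may be the footer
    # newline="" keeps the original line endings untouched on read and write.
    with open(src, "r", encoding="utf-8", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            held_back.append(line)
            if len(held_back) > n:
                # The oldest held-back line cannot be among the last n, so emit it.
                fout.write(held_back.popleft())
                written += 1
    return written


# Hypothetical usage; replace the names with your downloaded file:
# drop_last_lines("lending_club.csv", "lending_club_clean.csv")
```

The file is streamed line by line, so a large download does not have to fit in memory.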

  • @agammishra9674 4 years ago +3

    I would like to congratulate you, sir! I really like your passion for educating people about these cutting-edge technologies, and for giving the whole picture rather than sticking to solving just one problem. I loved your work, and you inspired me to always try to give others what I have, as it will only come back to me. Once again, kudos to your passion and humanity.

  • @pariksheetde4304 4 years ago +2

    Hats off to you, sir, :) for providing such a wonderful explanation and detailed analysis. Thank you once again.

  • @ijeffking 4 years ago +1

    Very nice and useful pointers. Thank you very much.

  • @sachinsarathe846 4 years ago +1

    Sir, you are providing us great content, and free of cost too. I sincerely want to thank you for all the hard work that you are putting in.
    Also, sir, could you please suggest some personal projects that we could take up to apply this knowledge?

  • @yasoram8007 3 years ago +1

    Each piece of code and each scenario is well explained. Thanks a lot.

  • @seemunyum832 3 years ago

    Thank you so much, sir. I really respect what you are doing to help people who want to learn and make a career in data.

  • @chetanmundhe7899 2 years ago

    Very, very useful playlist, thanks for sharing in-depth knowledge. I have a question: how do I use Spark with Snowflake, and how do I connect them?

  • @priyalarunnile7981 4 years ago +1

    Great videos! Looking forward to more videos. Keep up the good work! :)

  • @infinioda108 3 years ago

    DAMN, no DISLIKES, that's so cool, sirji

  • @sankarshkadambari2742 4 years ago +1

    Hello Sir,
    Just wanted to confirm: Spark is a framework that works on the principle of distributed datasets, and here we are using the PySpark library in a Databricks notebook to perform the EDA and data cleaning. Right?

    • @AIEngineeringLife 4 years ago +1

      Yes, right. This is a Databricks-specific feature. If you are running outside of Databricks, you can convert the Spark DataFrame to pandas and do the EDA there. In that case all the data is brought to the driver node and is no longer distributed. I think I covered this towards the end of the video.
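
The conversion described in the reply above might look like the sketch below. It assumes `spark_df` is an existing PySpark DataFrame; the helper name `to_pandas_capped` and the `max_rows` default are illustrative and not taken from the video. Because `toPandas()` collects everything to the driver node, the sketch caps the row count first.

```python
def to_pandas_capped(spark_df, max_rows: int = 10_000):
    """Cap the row count on the cluster, then collect the result to the driver as pandas.

    Assumption: spark_df is a PySpark DataFrame. limit() takes the first rows
    Spark returns, not a random sample, so use it for a quick look only.
    """
    # limit() is applied by Spark before any data leaves the cluster;
    # toPandas() then brings the remaining rows to the driver node.
    return spark_df.limit(max_rows).toPandas()


# Hypothetical usage inside a notebook where lc_df is a Spark DataFrame:
# pd_df = to_pandas_capped(lc_df, max_rows=50_000)
# pd_df.describe()
```

Once the data is a pandas DataFrame it is no longer distributed, so the cap should reflect what the driver's memory can hold.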

  • @AkshayKumar-xo2sk 3 years ago

    @AIEngineering - I would like to learn Spark, so I am following your "Mastering APACHE Spark" playlist. Am I right that the 30 videos in the playlist are in the proper order? As the playlist progresses, I see some MLOps videos as well, so I just want your help in understanding whether the order is correct. Thanks for your help with these tutorials.

    • @AIEngineeringLife 3 years ago

      Akshay, the first 8 videos are key if you are focusing on data engineering and big data. From the ML videos onward, the content is for understanding Spark for machine learning. You can pick the topics you want to learn from there.

  • @dineshvarma6733 4 years ago +1

    Hi Sir,
    Thanks for your time. At 18:32, when you talk about creating the "exposure" column: what is revol_bal (revolving balance)?
    Is it (rev_util)% * Loan_amnt?
    I ask because the statement below throws an error.
    lc_df = lc_df.withColumn("exposure", when(lc_df.bad_loan == "No", col("revol_bal")).otherwise(-10 * col("revol_bal")))
    display(lc_df)
    Error:
    org.apache.spark.sql.AnalysisException: cannot resolve '`revol_bal`' given input columns:[.....................]
    Please correct me if I'm wrong.

    • @AIEngineeringLife 4 years ago

      revol_bal is a column in the input dataset. Can you check whether that column is present in your data?

    • @dineshvarma6733 4 years ago

      @AIEngineeringLife Got it, sir. I was able to include it in 'df.select()' and continue with the analysis. Thank you.
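
The fix in this thread (the column had been left out of an earlier `select()`) can be caught before the `AnalysisException` with a quick check. This is a sketch only: `missing_columns` and `add_exposure` are hypothetical helper names, and the exposure expression is copied from the comment above rather than verified against the video.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, keeping their order."""
    present = set(available)
    return [name for name in required if name not in present]


def add_exposure(lc_df):
    """Add the 'exposure' column from the thread, failing early if inputs are missing.

    Assumption: lc_df is a PySpark DataFrame with bad_loan and revol_bal columns.
    """
    # Imported lazily so missing_columns() is usable without PySpark installed.
    from pyspark.sql.functions import col, when

    missing = missing_columns(lc_df.columns, ["bad_loan", "revol_bal"])
    if missing:
        raise ValueError(f"Add these columns to the earlier select(): {missing}")
    return lc_df.withColumn(
        "exposure",
        when(col("bad_loan") == "No", col("revol_bal")).otherwise(-10 * col("revol_bal")),
    )


# The check itself is plain Python:
print(missing_columns(["loan_amnt", "bad_loan"], ["bad_loan", "revol_bal"]))
```

Printing `lc_df.columns` or calling `lc_df.printSchema()` gives the same information interactively.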

  • @ankushojha5089 4 years ago

    Hello Sir,
    It would be very helpful if you could make a dedicated video on how to prepare for a data engineering interview, along with topic and subtopic details.
    I am a beginner and I want to move into the data engineering field. I have working experience with SQL and Python.
    Sorry to trouble you with my silly question.
    Thanks in advance.

  • @royxss 4 years ago +1

    Can you please explain your choice of Databricks Community Edition? Can we use it completely for free, just like Colab?

    • @AIEngineeringLife 4 years ago +1

      royxss, yes, you can use Colab and run PySpark. I wanted to show the Databricks EDA features and, in future videos, Scala Spark. I doubt we can run Scala on Colab. But to your point, PySpark can be run in Colab.

  • @Azureandfabricmastery 3 years ago

    Thanks. Helpful.

  • @ankushojha5089 4 years ago +1

    Hi Sir,
    If possible please explain the usage of " -,-" from line "sns.countplot(pd_df.loc[pd_df['total_acc']

    • @AIEngineeringLife 4 years ago +1

      Did you do it as in my notebook here? github.com/srivatsan88/Mastering-Apache-Spark/blob/master/Lending%20Club%20EDA.ipynb
      You can also leave it out, as it is just there to set the label and text location.

  • @shaikrasool1316 4 years ago +1

    Sir, have you shared these notebooks anywhere, like GitHub?
    How can I access them?

    • @AIEngineeringLife 4 years ago +1

      Check this repo out - github.com/srivatsan88/Mastering-Apache-Spark

  • @karndeepsingh 4 years ago +1

    Sir, can you rearrange the video sequence in the Mastering Apache Spark playlist? It would be good to get every video one after the other. Thank you.

    • @AIEngineeringLife 4 years ago +1

      Karndeep, I have arranged it in this order: data analysis/cleaning, EDA, data engineering, and then ML.
      The only thing I did was move transformation part 2 to the end, as it is optional. Do you suggest any change?

    • @karndeepsingh 4 years ago

      @AIEngineeringLife No sir, no changes needed. I got your point. 🙂

  • @karndeepsingh 4 years ago +1

    Sir, can you recommend the best course to learn and apply all of these different stages of data science using Apache Spark? That would be a great help!
    Thank you!

    • @AIEngineeringLife 4 years ago +2

      Karndeep, if you go for a course, it will likely cover toy datasets and just explain the functions. There are plenty online; some good ones are on the MapR website, but they are outdated. My plan when I started doing Spark videos was to cover the functions and apply them to real-world datasets like Lending Club and a churn dataset. I have a lot of videos on Spark in my course below, and it is free:
      ruclips.net/p/PL3N9eeOlCrP7_vt6jq7GdJz6CSFmtTBpI
      I will be adding to it. If you feel I am missing something, and I think it might help, I can add it. Also, I cover all of Spark, not only data engineering. I have ML, and graph is coming in the future.

    • @karndeepsingh 4 years ago

      @AIEngineeringLife Thank you, sir. I am going through all your videos and they are helping me a lot.

    • @chwaleedsial 4 years ago

      Along with the videos, I would suggest keeping the Spark book open for reference. I find the Spark documentation a bit difficult to wade through.

  • @chetanmundhe7899 2 years ago

    Can you make a video on how to use Snowpark?

  • @christineeee96 4 years ago

    If I'm preparing for a DE interview, should I use Spark or pandas/matplotlib for data cleaning? Which one do you suggest?

    • @AIEngineeringLife 4 years ago +2

      You can use pandas if your data is not large and fits easily on your system. If you are dealing with a very large dataset that needs distributed computing, that's where Spark comes into play.

    • @christineeee96 4 years ago

      AIEngineering tq!!

  • @muangeally5463 3 years ago

    Hello sir, where can I find the dataset you used in this video?

    • @AIEngineeringLife 3 years ago

      It is over here:
      drive.google.com/file/d/1e6phh7Df8mzYoE-sBXPVJklnSt_wHwkq/view?usp=sharing
      Remove the last 2 lines from the CSV file.

    • @muangeally5463 3 years ago

      Thank you, you're such a life saver. Your explanations in the vlogs answered all my questions.

  •  4 years ago +1

    Where can I find the notebooks and code?

    • @AIEngineeringLife 4 years ago +2

      Lorenco, you can find the notebook here:
      github.com/srivatsan88/YouTubeLI

    •  4 years ago

      @AIEngineeringLife Thank you, sir. Cheers! Keep up the good content!

  • @Cricketpracticevideoarchive 4 years ago

    Day 2 Crash Course: databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6601281371951374/1205982057485879/6899823951304183/latest.html