End to End Machine Learning pipeline using Apache Spark - Hands On

  • Published: 15 Sep 2024
  • #machinelearning #apachespark #end-to-end
    In this video we will see how to apply Spark machine learning to a churn prediction problem. This is an end-to-end Spark ML video in which I will cover:
    - Data Analysis
    - Exploratory Analysis
    - Model Transformers and Estimators
    - Spark Machine Learning Pipeline
    - ML Algorithm
    - Model Evaluation and Metrics
    - Building Own Metrics
    We will work through each of these components to the required depth. For details on other transformers and estimators, you can refer to the Apache Spark website.
    To get a quick overview of Apache Spark ML and why Spark, you can check my earlier video - • Machine Learning using...
    If you need a quick overview of Databricks, you can check my video - • Databricks for Apache ...
    #sparkml #featureengineering

Comments • 114

  • @pranavjayakumar2239 4 years ago +8

    This is really good stuff. It was high time I transitioned from pandas and scikit-learn to something more industry relevant.

    • @RD-yv4cc 4 years ago +1

      My thoughts exactly mate. How's your journey been thus far?

    • @danialmalik80 4 years ago +1

      Exactly the same thought, mate

  • @nikhildmehta3448 4 years ago +3

    Thank you so much for this wonderful video and the crisp, simple explanation. I was getting too impatient waiting for the course, so I decided to type the code line by line. It took a long time but was definitely worth the effort :) Can't wait for more from you on SparkML.

    • @AIEngineeringLife 4 years ago +2

      Nikhil.. This video will be part of the course as well, and I will be adding more Spark videos. How do you feel about working on Spark now? :)

    • @nikhildmehta3448 4 years ago +2

      AIEngineering I’m truly enjoying working on Spark and can’t wait for the next batch of videos! Would love to get my hands dirty working on this😀 Thanks again for all the efforts!

  • @umeshjadhav1586 3 years ago +2

    Sir, you are great, i have no words to say beyond Thank You.

  • @justmesherin 4 years ago +1

    You are ahhhhmazing! I am not sure why I was lurking on your LinkedIn and not here :) My 2020 is now set! Excellent content.

  • @AbdulHadi-yj7fl 3 years ago +1

    this is quite useful sir...this is helping me with my ug project...thanks a lot

  • @hishailesh77 4 years ago +1

    Excellent video, I have been searching for this kind of video for a long time.

  • @teachingmachine 4 years ago +1

    You are really awesome, it's a neat and crystal clear tutorial

  • @EugeneKingsley 3 years ago +1

    Wow ! Thanks for this beautiful video. Keep doing !

  • @gururajangovindan7766 4 years ago +4

    It would be great if you can publish your notebook link..the video was very useful!!!

    • @AIEngineeringLife 4 years ago +2

      @gururajan.. I will publish it in my GitHub repo in a couple of weeks, as with the other videos

  • @akashprabhakar6353 2 years ago

    Nice explanation!

  • @vishal6361 4 years ago +1

    very well explained and clear cut pipelining concepts, it was really a great learning experience, thank you for this content.
    Would you be sharing industry level codes that can be deployed, through your videos?

    • @AIEngineeringLife 4 years ago

      Thanks Vishal. Could you please elaborate on "industry level codes"? I did not quite get it

  • @ZainAhmed-ho5sf 4 years ago +4

    Thank you for these amazing hands-on PySpark tutorials. Is there a way to add custom functions to the pipeline, or have you covered that anywhere in the next few lectures?

    • @AIEngineeringLife 4 years ago

      Hi Zain.. if you are looking for custom transformers, I have that in my plan for later this year. I am currently taking a Spark break as I did too many videos on it back to back :)

    • @ZainAhmed-ho5sf 4 years ago

      @AIEngineeringLife Great! Looking forward to it.

  • @ashirbaddas2573 4 years ago +1

    Very neatly explained..:)

  • @yoyovatsa2179 4 years ago +2

    Now that I have binged the SparkML series, it seems very similar to sklearn. Some functions are different, so I have to go to the documentation first. Awesome video as always. I had a doubt though: as you used SQL for most of the EDA, is EDA in Spark limited to SQL, or is there another way to do it, such as pandas-like functions?
    Anyway, thank you for the introduction, I am going to try and build my first pipeline now. Very much appreciate your content, your channel is so underrated.

    • @AIEngineeringLife 4 years ago

      Mohneesh.. Spark ML is based on the scikit-learn pipeline concept, so you will find a lot of similarity. Instead of Spark SQL you can use Spark DataFrame functions, which again are pandas-like. For EDA on Databricks you can use DataFrame functions as well

    • @RD-yv4cc 4 years ago

      @AIEngineeringLife Does it have the same great time series EDA functions as pandas?

  • @joshuathomas2660 2 years ago +1

    Hello sir, Can you send me the exact dataset? it will be easy for me to follow your tutorials. Thanks in advance.

  • @vinodhkumarbaskaran228 4 years ago +1

    Thanks Sir!!! Amazing video

  • @orc475 4 years ago +2

    excellent Hands-on ! thank you so much !
    hope you can fix the voice on the next video

  • @sujeeshsvalath 4 years ago +1

    Thank you sir. Great video

  • @maheshkumarsomalinga1455 3 years ago

    Only recently, I have started looking into your videos...Definitely, one of the best collections available. Can you advise some real-life projects involving Spark for reference? Or if you already have videos on that, could you share the links? I could not find any. Thank you.

  • @mohamedhanifansari9224 4 years ago +1

    Thanks Srivatsan for this wonderful video, I've learnt a lot. I have one question: doesn't fitting the preprocessing on the training data and then transforming the test data with that same fit cause data leakage? Just wanted to make sure I understand the concept clearly.

    • @AIEngineeringLife 4 years ago +1

      Mohammed. Not in the case I showed, as I am just applying the transformation pipeline separately to the data. It might cause data leakage if I used the test set as the validation dataset during training and then used the same set to predict.
      This is the reason we always have in-time and out-of-time datasets in the real world. My actual training in this case has not seen the test data. This was for a demo, though; I would recommend a separate validation set as well in a real-world pipeline.
      Did I answer your question, or was the intent of your question something else?
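
To make the fit-on-train, transform-on-test point concrete, here is a minimal plain-Python sketch. It is not the notebook's code: the `fit_scaler` and `transform` helpers and the sample values are hypothetical. The statistics are learned from the training split only and then reused unchanged on the test split, which is the same pattern as calling `pipeline.fit(train)` once and then `transform` on both splits in Spark ML.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn scaling statistics from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 for a constant column
    return mu, sigma

def transform(values, params):
    """Apply previously learned statistics; nothing is re-estimated here."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values, for illustration only.
train = [10.0, 20.0, 30.0]
test = [40.0]

params = fit_scaler(train)              # the test rows are never seen here
train_scaled = transform(train, params)
test_scaled = transform(test, params)   # reuses the train-only statistics
```

Leakage would appear if `fit_scaler` were called on `train + test`, or if the test rows were used to pick hyperparameters.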

  • @apremgeorge 4 years ago +1

    Thanks Srivatsan, excellent video.
    If possible, could you please add some extensions on how to make predictions and do data cleaning on a single record? Multithreading options would be helpful as well.
    Also how to use some other Python libraries in PySpark, like LIME for explanations.
    Thanks again

    • @AIEngineeringLife 4 years ago +1

      I will add the deployment side in future, Prem. It is on my list. I would not recommend multithreading within Spark, although you can do it. But I do have the information you asked for on regular Python at this time.
      You can watch video 7 and above in this playlist
      ruclips.net/p/PL3N9eeOlCrP5PlN1jwOB3jVZE6nYTVswk

    • @apremgeorge 4 years ago

      @AIEngineeringLife Thanks Srivatsan, waiting for that. One more question please: instead of converting all my pandas code to PySpark, after loading the data as a PySpark DataFrame I convert it to a pandas DataFrame, do all the modelling in pandas, and save the model. Is it possible to use the same for deployment in Databricks?

    • @AIEngineeringLife 4 years ago +1

      @apremgeorge Yes you can.. Databricks supports all of Python, and you can install any packages with it. You can also look at Koalas if interested
      ruclips.net/video/kOtAMiMe1JY/видео.html

  • @krishnabisen2666 3 years ago

    Great video sir, idk why YouTube hides such gems from us. Is it possible to get the link for the notebook?

  • @Azureandfabricmastery 3 years ago +1

    Thanks Srivatsan for the detailed E2E video on Spark ML. Helpful. Can't we generate confusion matrix visuals in Spark ML in Databricks, and also a classification report like in sklearn?

    • @AIEngineeringLife 3 years ago +1

      I have not tried the latest version of Databricks yet, but in earlier versions it was not there. With the increased focus on MLOps in Databricks these days, maybe there is a better way to do this, and also MLflow integration. I will try and let you know

    • @Azureandfabricmastery 3 years ago

      @AIEngineeringLife ok thanks
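
Independently of what Databricks offers, a confusion matrix and a small sklearn-style report can be derived from grouped counts. The sketch below is one possible approach, not something shown in the video: `confusion_counts` and `classification_report` are hypothetical helpers, and the `label` / `prediction` column names are assumed to match the default Spark ML output columns.

```python
def confusion_counts(predictions, label_col="label", prediction_col="prediction"):
    """Collect (label, prediction) -> row count from a Spark DataFrame of predictions."""
    rows = predictions.groupBy(label_col, prediction_col).count().collect()
    return {(r[label_col], r[prediction_col]): r["count"] for r in rows}

def classification_report(counts, positive=1.0):
    """Precision, recall and F1 for the positive class from confusion counts."""
    tp = counts.get((positive, positive), 0)
    fp = sum(n for (lab, pred), n in counts.items() if pred == positive and lab != positive)
    fn = sum(n for (lab, pred), n in counts.items() if lab == positive and pred != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, standing in for confusion_counts(model.transform(test_df)).
example_counts = {(1.0, 1.0): 30, (0.0, 1.0): 10, (1.0, 0.0): 20, (0.0, 0.0): 140}
report = classification_report(example_counts)
```

Only the four aggregated counts leave the cluster, so this stays cheap even on a large test set.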

  • @IsaiahShadE 3 years ago

    I'm having an issue over here:
    stages = []
    for catCol in catColumns:
        stringIndexer = StringIndexer(inputCol=catCol, outputCol=catCol + "Index")
        encoder = OneHotEncoder(inputCols=[stringIndexer.getOuputCol()], outputCols=[catCol + "catVec"])
        stages = stages + [stringIndexer, encoder]
    AttributeError: 'StringIndexer' object has no attribute 'getOuputCol'
    @AIEngineering help me with this??
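
The error message points at a misspelled method name: the getter on `StringIndexer` is `getOutputCol`, not `getOuputCol`. A corrected version of the loop could look like the sketch below. The `build_categorical_stages` wrapper is hypothetical, and the `inputCols` / `outputCols` form of `OneHotEncoder` assumes Spark 3.x, as in the comment above.

```python
def build_categorical_stages(cat_columns):
    """Return StringIndexer + OneHotEncoder stages for each categorical column.

    Call this after a SparkSession exists; the imports are kept local so the
    function can be defined without Spark being installed.
    """
    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    stages = []
    for cat_col in cat_columns:
        indexer = StringIndexer(inputCol=cat_col, outputCol=cat_col + "Index")
        encoder = OneHotEncoder(
            inputCols=[indexer.getOutputCol()],  # note the spelling: getOutputCol
            outputCols=[cat_col + "catVec"],
        )
        stages += [indexer, encoder]
    return stages
```

The returned list can be passed to `Pipeline(stages=...)` together with the assembler and classifier stages.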

  • @rob42897 4 years ago +1

    Thanks for the video. It has been helpful. You mentioned performing feature engineering a few times, and I had some thoughts on it. How would I go about correcting the imbalance in the churn label? I know in sklearn I can use SMOTE() for oversampling the minority class. Can I do something like that in PySpark? I was trying to run StandardScaler() but could not figure out when to scale the data. I assume I would scale just the numeric data. Do I scale the categorical data as well? Do I put the scaling in the pipeline, or scale the training data and not the validation data? What are some other ways I can do feature engineering? I enjoyed your video.

    • @AIEngineeringLife 4 years ago +1

      Hi Rob.. Spark does not have SMOTE natively, but there are some GitHub repos that implement SMOTE on Spark. I think you can use SMOTE as a Spark UDF as well, but I have not tried it. In my case we mostly undersample the majority class and scale the positive weights for the imbalanced class.
      You can scale data when you have a lot of features and are using non-tree-based algorithms, so that they converge faster. For tree-based algorithms it is typically OK to leave features as they are. You scale as part of the pipeline, for both training and validation. There are a lot of feature transformers; depending on the data you can use them, or write a custom transformer for a specific need. I will try to cover Spark feature engineering in a future video

    • @rob42897 4 years ago

      @AIEngineeringLife Thanks for getting back to me. I will check into the GitHub repo for the imbalance.

    • @rob42897 4 years ago

      @AIEngineeringLife I found the function SMOTE for Spark. Here is a link: github.com/Angkirat/Smote-for-Spark.
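
The undersampling route mentioned in this thread can be sketched with `DataFrame.sampleBy`, which draws an approximate stratified sample without replacement. This is an illustrative assumption rather than the author's code: the helper names, the `Churn` label column and the example counts are hypothetical.

```python
def undersample_fractions(class_counts):
    """Per-class sampling fractions that shrink every class to the minority size."""
    smallest = min(class_counts.values())
    return {label: smallest / n for label, n in class_counts.items()}

def undersample(df, label_col="Churn", seed=42):
    """Undersample the larger classes of a Spark DataFrame (training split only)."""
    counts = {r[label_col]: r["count"] for r in df.groupBy(label_col).count().collect()}
    return df.sampleBy(label_col, fractions=undersample_fractions(counts), seed=seed)

# Hypothetical class counts, for illustration only.
fractions = undersample_fractions({"No": 5000, "Yes": 2000})
```

Because `sampleBy` samples each row independently, the resulting class sizes are close to equal but not exact. Apply it to the training data only, so the validation and test sets keep the real class distribution.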

  • @mukeshkund4465 4 years ago +1

    Excellent Video Sir. Kindly if we can get this dataset then it will be good to practice more on it..

    • @AIEngineeringLife 4 years ago +1

      Thank you.. Should be in my dataset folder in github - github.com/srivatsan88/RUclipsLI/tree/master/dataset

    • @mukeshkund4465 4 years ago

      @AIEngineeringLife Thank you for your reply. I have got it from another video of yours..

  • @rahulbhatia5657 4 years ago +1

    Hi,
    Your content is really helpful and unique, thanks for this! One question, I have a deep learning model trained using Keras and want to use it for making inferences on a Spark dataframe, can you suggest some options or is it possible to do this, or do I need to rebuild the models in some spark library for deep learning?

    • @AIEngineeringLife 4 years ago

      Rahul.. Have you tried the Spark TensorFlow package to see if it supports this? If not, you can load the TF model in a Python UDF and broadcast the model file. You can then use it for inference in Spark

  • @chanchalshukla683 4 years ago +1

    Hi Sir,
    Here, the imbalanced data has not been handled, although it was highlighted.
    Should the approach be to assign weights to the minority class, or to use oversampling?

    • @AIEngineeringLife 4 years ago

      I typically prefer undersampling or class weights rather than oversampling. So for this I would first try class weights, or perform additional feature engineering, and see if it makes a difference. Sorry, I could not recollect what dataset I used for this video, but in general the above is how I approach it
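
As a rough illustration of the class-weight approach, the sketch below computes inverse-frequency weights and attaches them as a column. It is an assumption about how this could be wired up, not the video's code: the helper names, the `classWeight` column name and the example counts are hypothetical. Spark ML estimators such as `LogisticRegression` accept the column through their `weightCol` parameter.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * n) for label, n in class_counts.items()}

def add_weight_column(df, weights, label_col="label", weight_col="classWeight"):
    """Add a per-row weight column to a Spark DataFrame based on its label."""
    from pyspark.sql import functions as F  # local import: needs PySpark at call time

    expr = None
    for label, weight in weights.items():
        condition = F.col(label_col) == label
        expr = F.when(condition, F.lit(weight)) if expr is None else expr.when(condition, F.lit(weight))
    return df.withColumn(weight_col, expr)

# Hypothetical label counts: 80 non-churners, 20 churners.
weights = balanced_class_weights({0: 80, 1: 20})
```

With these counts each churn row carries four times the weight of a non-churn row, so both classes contribute equally to the loss in total.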

  • @DevanshKhandekar 4 years ago +1

    Excellent tutorial. Is the notebook for this available ?

    • @AIEngineeringLife 4 years ago +1

      Check the code in my git repo here - github.com/srivatsan88/Mastering-Apache-Spark/blob/master/Churn-Analysis.ipynb

  • @mohammadmuneer6463 3 years ago +1

    content is very good, but at few places your voice is echoing. Awesome content and info flow.

    • @AIEngineeringLife 3 years ago

      Thanks, and sorry for the inconvenience. Over time I have upgraded my recording setup, but the first few videos like this one had some echo

    • @mohammadmuneer6463 3 years ago

      @AIEngineeringLife no worries buddy....your content is too good :) keep doing the great work.

  • @siddhantsapte 4 years ago +1

    Hello,
    Is there any git repo for the code? it would be of great help!
    Thank you!

  • @umeshjadhav1586 3 years ago +1

    Do you have a Spark Streaming session available? I searched your channel and was not able to find any

    • @AIEngineeringLife 3 years ago

      Spark streaming not yet but have it in plans next year

  • @lazzybirdflying3225 2 years ago

    Hi , Do you offer any personal training???

  • @sumitbhalla2321 3 years ago

    Are there any API or code snippets to enable model serving? I want to automate enabling model serving in Databricks/MLflow. Please help. Thanks

    • @AIEngineeringLife 3 years ago +1

      Sumit. Nope, I do not have one on Spark yet, but I have it in general on Python

  • @snehagrandhe1418 3 years ago

    Can I make it a real time project to add in resume sir?

  • @1981Praveer 1 year ago

    #AIEngineering
    Where can I find this dataset to practice ?

  • @deonwagner2643 3 years ago

    Very nice. Do you have a link to your GitHub for the notebook in this video? I would like to apply some of this code to my dataset

    • @AIEngineeringLife 3 years ago

      All of my code is in my repo here - github.com/srivatsan88
      Spark has a separate repo where you can find the above code

  • @jeharulhussain9344 3 years ago

    @AI Engineering : What kind of Spark platform is most widely used by the clients you have come across? Databricks or something else?

    • @AIEngineeringLife 3 years ago +1

      It mostly depends on where they are. If on-prem, which many are, then it is Cloudera. For those on cloud, I have seen EMR in many cases and Databricks or others in some

  • @kousseilarekkam9511 4 years ago +1

    I tried to fit a random forest classifier in PySpark. This is my code:
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    rf = RandomForestClassifier(labelCol="label", featuresCol="features")
    paramGrid = (ParamGridBuilder()
                 .addGrid(rf.numTrees, [100])
                 .build())
    crossval = CrossValidator(estimator=rf,
                              estimatorParamMaps=paramGrid,
                              evaluator=BinaryClassificationEvaluator(),
                              numFolds=10)
    cvModel = crossval.fit(trainingData)
    # transform with the fitted model; the CrossValidator estimator itself has no transform()
    predictions = cvModel.transform(testData)
    predictions.printSchema()
    but I'm getting this error:
    Py4JJavaError: An error occurred while calling o767.fit.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 853, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
    Can you help me please?

    • @AIEngineeringLife 4 years ago +1

      Based on the error, it looks like either your system does not have enough memory to handle the data, or you have allocated less memory to your executor than this job needs. Go to the Spark UI and check memory allocation while the job is running, or increase the memory and try again

    • @kousseilarekkam9511 4 years ago

      @AIEngineeringLife I'm using Google Colab. How can I check the Spark UI? Can you help me more please?

    • @AIEngineeringLife 4 years ago +1

      Then you cannot check the Spark UI. You can set the executor memory where you create the Spark context and try again. Monitor the CPU from the link at the top of Colab or in the Manage sessions menu

    • @kousseilarekkam9511 4 years ago

      @AIEngineeringLife Thank you very much for your help, but I still have the problem. Can you please show me an example of how to set the SparkSession, configurations and SparkContext?
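
A minimal sketch of setting memory when the session is created is shown below. The `SPARK_CONF` values, the app name and the `build_spark_session` helper are illustrative assumptions, not settings from the video. In local mode (the usual Colab setup) the executor runs inside the driver JVM, so `spark.driver.memory` is the setting that matters, and it only takes effect if it is set before the first session starts; restart the runtime if a session already exists.

```python
# Hypothetical settings; pick a value that fits within the RAM the runtime actually has.
SPARK_CONF = {
    "spark.driver.memory": "8g",
}

def build_spark_session(conf=SPARK_CONF, app_name="churn-random-forest"):
    """Create (or reuse) a local SparkSession with the given configuration."""
    from pyspark.sql import SparkSession  # local import: needs PySpark at call time

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Usage, once PySpark is installed:
# spark = build_spark_session()
# sc = spark.sparkContext  # the SparkContext comes from the session
```

If the job still runs out of heap, reducing `numFolds` or `numTrees` in the cross-validation lowers how much work each fit has to hold in memory.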

  • @thotarakesh2689 3 years ago

    Sir, when I run printSchema it shows TotalCharges as string (nullable = true) instead of double. Could you please explain why, and how I can convert it to double?

    • @AIEngineeringLife 3 years ago +1

      Thota.. Have you checked my repo notebook and checked for differences? - github.com/srivatsan88/Mastering-Apache-Spark
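
One likely cause, stated here as an assumption since the thread does not say what the actual fix was: if some TotalCharges cells contain only a blank, schema inference falls back to string for the whole column. The read options the author posts elsewhere in this thread (`nullValue` set to a single space) address that at load time. An alternative is to clean and cast the column after loading, as in this sketch; `to_double` and `cast_total_charges` are hypothetical helpers.

```python
def to_double(text):
    """Plain-Python mirror of the per-value logic: blank becomes None, else a float."""
    stripped = text.strip()
    return float(stripped) if stripped else None

def cast_total_charges(df, column="TotalCharges"):
    """Turn blank strings into nulls, then cast the column to double."""
    from pyspark.sql import functions as F  # local import: needs PySpark at call time

    trimmed = F.trim(F.col(column))
    cleaned = F.when(trimmed == "", F.lit(None)).otherwise(trimmed)
    return df.withColumn(column, cleaned.cast("double"))

# Usage on a loaded DataFrame:
# df = cast_total_charges(df)
# df.printSchema()
```

The resulting nulls still need handling (dropping or imputing) before the column goes into a `VectorAssembler`.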

    • @thotarakesh2689 3 years ago

      @AIEngineeringLife Thank you Sir, the above one got fixed. Now I am getting the error below while doing fit and transform in the pipeline. Could you please advise?
      pipeline = Pipeline().setStages(stages)
      pipelineModel = pipeline.fit(train_data)
      error: IllegalArgumentException: requirement failed: Output column label already exists.
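
This message generally means the DataFrame passed to `fit` already has a column with the same name as a stage's output column, for example because the pipeline is being fitted on data that an earlier run already transformed. That diagnosis is an assumption, as the thread has no reply. One way to check and work around it is sketched below with hypothetical helpers and example column names.

```python
def clashing_columns(existing_columns, stage_output_columns):
    """Output column names that are already present in the input DataFrame."""
    existing = set(existing_columns)
    return [c for c in stage_output_columns if c in existing]

def drop_clashing(df, stage_output_columns):
    """Drop pre-existing output columns from a Spark DataFrame before pipeline.fit."""
    clash = clashing_columns(df.columns, stage_output_columns)
    return df.drop(*clash) if clash else df

# Hypothetical column names, for illustration only.
clash = clashing_columns(["customerID", "Churn", "label"], ["label", "features"])
```

Re-reading the raw data and splitting it again before calling `pipeline.fit` avoids the clash without dropping anything.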

  • @imransharief2891 4 years ago +1

    Hi sir if it is possible please share the csv file so that we can practice with the help of that file in databricks

    • @AIEngineeringLife 4 years ago +1

      Imran.. It is in my github repo. I will add it to video description as well
      github.com/srivatsan88/RUclipsLI/tree/master/dataset

    • @imransharief2891 4 years ago +1

      Thank u sir ur doing a great job teaching us beyond ml

  • @raghurilokesh3270 3 years ago

    As we already have Jupyter notebooks, why are we going for this?

  • @arslanjutt4282 1 year ago

    Where is the dataset?

  • @kanizfatma1128 3 years ago +1

    Cannot import name onehotencoderestimator
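
This import error is what you see on Spark 3.x, where `OneHotEncoderEstimator` was renamed to `OneHotEncoder`. The sketch below picks the class name from the running Spark version; `encoder_class_name` and `load_encoder` are hypothetical helpers, and the 2.x branch assumes Spark 2.3 or 2.4, where the estimator class exists under the old name.

```python
def encoder_class_name(spark_version):
    """Name of the multi-column one-hot encoder class for a given Spark version."""
    major = int(spark_version.split(".")[0])
    return "OneHotEncoder" if major >= 3 else "OneHotEncoderEstimator"

def load_encoder(spark_version):
    """Import and return the encoder class; requires PySpark at call time."""
    import importlib

    feature = importlib.import_module("pyspark.ml.feature")
    return getattr(feature, encoder_class_name(spark_version))

# Usage with an active session:
# Encoder = load_encoder(spark.version)
# encoder = Encoder(inputCols=["genderIndex"], outputCols=["gendercatVec"])
```

On a cluster that is known to run Spark 3.x, simply importing `OneHotEncoder` from `pyspark.ml.feature` is enough.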

  • @dipanjanghosh6862 3 years ago

    sir i am trying to run the basic command df.groupBy('Churn').count().show() but an error is coming up that says
    "Py4JJavaError Traceback (most recent call last)
    /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    62 try:
    ---> 63 return f(*a, **kw)
    64 except py4j.protocol.Py4JJavaError as e:"
    I have followed all the codes before this. Not sure where i am going wrong. can you please help?

    • @dipanjanghosh6862 3 years ago

      sir none of the codes are getting executed after select * from churn_analysis. i have loaded the data correctly...so frustrating this is

    • @AIEngineeringLife 3 years ago

      Hi Dipanjan.. I just tried on Databricks and it seems to work fine. Which cluster have you created? I tried with Databricks 7.5 ML, Spark 3.0.1. Can you check and try again? What you are getting looks to me like a Databricks environment issue

    • @dipanjanghosh6862 3 years ago

      @AIEngineeringLife sir I have tried everything. I am still getting this error while trying to run df.groupBy('Churn').count().show()
      AnalysisException: cannot resolve '`Churn`' given input columns: [_c0, _c1, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19, _c2, _c20, _c3, _c4, _c5, _c6, _c7, _c8, _c9];;
      sir my cluster is running smoothly. Can you please help me out with this?

    • @AIEngineeringLife 3 years ago

      @dipanjanghosh6862 I think it has not picked up the column names from the header, which is why you are getting _c0, _c1 and so on. Check that you have set the header option to true, as below:
      # file_type and file_location are assumed to be defined earlier in the notebook
      infer_schema = "true"
      first_row_is_header = "true"
      delimiter = ","
      df = spark.read.format(file_type) \
          .option("inferSchema", infer_schema) \
          .option("header", first_row_is_header) \
          .option("sep", delimiter) \
          .option('nanValue', ' ') \
          .option('nullValue', ' ') \
          .load(file_location)

    • @dipanjanghosh6862 3 years ago

      sir you are awesome. thanks

  • @varungondu7053 4 years ago

    Is Databricks not offering a free edition these days?

    • @AIEngineeringLife 4 years ago

      I see they still do.. Are you not able to create an account? - databricks.com/try-databricks

    • @varungondu7053 4 years ago

      @AIEngineeringLife No, after the login it asks me to select a plan. There are three plans in total and all of them are priced

    • @varungondu7053 4 years ago

      Initially it asks for community edition or free edition, but even after choosing the free edition it asks me to select one of the above three plans, which are priced

    • @AIEngineeringLife 4 years ago

      Oh I get it, I gave you the try-for-free link.. let me check this evening. Not sure if they removed it, but I doubt they would

  • @ashwinkumar5223 1 year ago

    Pls share code

  • @gasmikaouther6887 3 years ago

    Can you plz share the dataset

    • @AIEngineeringLife 3 years ago

      Most of it must be here - github.com/srivatsan88/RUclipsLI/tree/master/dataset

  • @arpanghosh3801 4 years ago +1

    Can you share the code

    • @AIEngineeringLife 4 years ago

      it is in my git repo here - github.com/srivatsan88/Mastering-Apache-Spark