End to End Machine Learning pipeline using Apache Spark - Hands On
- Published: 15 Sep 2024
- #machinelearning #apachespark #end-to-end
In this video we will see how to apply Spark machine learning to a churn prediction problem. This is an end-to-end Spark ML video where I will be covering:
- Data Analysis
- Exploratory Analysis
- Model Transformers and Estimators
- Spark Machine Learning Pipeline
- ML Algorithm
- Model Evaluation and Metrics
- Building Own Metrics
We will work through each of these components as needed. For details on other transformers and estimators, you can refer to the Apache Spark website.
To get a quick overview of Apache Spark ML and why Spark, you can check my earlier video - • Machine Learning using...
If you need a quick overview of Databricks, you can check my video - • Databricks for Apache ...
#sparkml #featureengineering
This is really good stuff. It was high time I transitioned from pandas and scikit-learn to something more industry relevant.
My thoughts exactly mate. How's your journey been thus far?
Exactly the same thought, mate
Thank you so much for this wonderful video and crisp, simple explanation. I was getting too impatient waiting for the course, so I decided to type the code line by line. Took a long time but definitely worth the effort :) Can't wait for more from you on SparkML.
Nikhil.. This video will be part of the course as well, and I will be adding more Spark videos. How does working on Spark feel now :)
AIEngineering I’m truly enjoying working on Spark and can’t wait for the next batch of videos! Would love to get my hands dirty working on this😀 Thanks again for all the efforts!
Sir, you are great, i have no words to say beyond Thank You.
You are ahhhhmazing! I am not sure why I was lurking on your linked in and not here :) my 2020 is now set! Excellent content.
Thank Sherin.. More to come :)
this is quite useful sir...this is helping me with my ug project...thanks a lot
Excellent video, I have been searching for this kind of video for a long time.
You are really awesome, it's a neat and crystal-clear tutorial.
Wow! Thanks for this beautiful video. Keep it up!
It would be great if you could publish your notebook link.. the video was very useful!!!
@gururajan.. I will publish it in my GitHub repo in a couple of weeks, like the other videos.
Nice explanation!
very well explained and clear cut pipelining concepts, it was really a great learning experience, thank you for this content.
Would you be sharing industry level codes that can be deployed, through your videos?
Thanks Vishal. Could you please elaborate on "industry level codes"? I did not quite get it.
Thank you for this amazing hands on pyspark tutorials.Is there a way or have you covered adding custom functions to the pipeline anywhere in the next few lectures?
Hi Zain.. if you are looking for custom transformers, I have them in my plan later this year. Currently taking a Spark break, as I did too many videos back to back on it :)
@@AIEngineeringLife Great!Looking forward to it.
Very neatly explained..:)
Now that I have binged the SparkML series, it seems very similar to sklearn; some functions are different, so I'll have to go to the documentation first. Awesome video as always. I had a doubt though: since you used SQL for most of the EDA, is it limited to SQL in Spark, or is there another way to do it, like using pandas-like functions?
Anyway, thank you for the introduction, I am going to try and build my first pipeline now. I very much appreciate your content; your channel is so underrated.
Mohneesh.. Spark ML is based on the scikit-learn pipeline concept, so you will find a lot of similarity. Instead of Spark SQL you can use Spark DataFrame functions, which again are pandas-like. For EDA on Databricks you can use DataFrame functions as well.
@@AIEngineeringLife Does it have the same great time series EDA functions as pandas?
Hello sir, Can you send me the exact dataset? it will be easy for me to follow your tutorials. Thanks in advance.
Thanks Sir!!! Amazing video
excellent Hands-on ! thank you so much !
hope you can fix the voice on the next video
Thank you sir. Great video
Only recently have I started looking into your videos... Definitely one of the best collections available. Can you suggest some real-life projects involving Spark for reference? Or if you already have videos on that, could you share the links? I could not find any. Thank you.
Thanks Srivatsan for this wonderful video, I've learnt a lot. I have one question - doesn't fitting on the training data (during pre-processing) and transforming the test data using the same fit cause data leakage? Just wanted to make sure I understand the concept clearly.
Mohammed. Not in the case I showed, as I am just applying the transformation pipeline separately to the data. It might cause data leakage if I used the test set as the validation dataset during training and then used the same set to predict.
That is the reason we always have in-time and out-of-time datasets in the real world. My actual training in this case has not seen the test data. This was a demo, though; I would recommend a separate validation set as well in a real-world pipeline.
Did I answer your question here, or was your intent something else?
Thanks Srivatsan Excellent Video
If possible, would you please add some extensions on how to make predictions and do data cleaning on a single record? Multithreading options could be helpful as well.
Also, how to use some other Python libraries in PySpark, like LIME for explanations.
Thanks Again
I will add the deployment side in future, Prem. It is on my list. I would not recommend multithreading within Spark, though you can do it. But I do have the information you asked about for regular Python at this time.
You can watch video 7 and above in this playlist
ruclips.net/p/PL3N9eeOlCrP5PlN1jwOB3jVZE6nYTVswk
@@AIEngineeringLife Thanks Srivatsan, waiting for that. One more question, please: instead of converting all my pandas code to PySpark, after loading data as a PySpark dataframe I convert it to a pandas dataframe, do all the modelling in pandas and save the model. Is it possible to use the same for deployment in Databricks?
@@apremgeorge Yes you can.. Databricks supports all of Python, and you can install any packages with it. You can also look at Koalas if interested.
ruclips.net/video/kOtAMiMe1JY/видео.html
Great video sir, idk why RUclips hides such gems from us. Is it possible to get the link for the notebook?
Thanks Srivatsan for the detailed E2E video on Spark ML. Helpful. Can't we generate confusion matrix visuals in Spark ML on Databricks, and also a classification report like in sklearn?
I have not tried the latest version of Databricks yet, but in earlier versions it was not there. With the increased focus on MLOps in Databricks these days, maybe there is a better way to do this, and also MLflow integration. I will try and let you know.
@@AIEngineeringLife ok thanks
I'm having an issue over here:
stages = []
for catCol in catColumns:
    stringIndexer = StringIndexer(inputCol=catCol, outputCol=catCol + "Index")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOuputCol()], outputCols=[catCol + "catVec"])
    stages = stages + [stringIndexer, encoder]
AttributeError: 'StringIndexer' object has no attribute 'getOuputCol'
@AIEngineering help me with this??
Thanks for the video, it has been helpful. You mentioned performing feature engineering a few times, and I had some thoughts on it. How would I go about correcting the imbalance in the churn classes? I know in sklearn I can use SMOTE() for oversampling the minority class; can I do something like that in PySpark? I was trying to run StandardScaler() but could not figure out when to scale the data. I assume I would scale just the numeric data; do I scale the categorical data as well? Do I put the scaling in the pipeline, or scale the training data and not the validation data? What are some other ways I can do feature engineering? I enjoyed your video.
Hi Rob.. Spark natively does not have SMOTE, but there are some GitHub repos with implementations of SMOTE on Spark. I think you can use SMOTE as a Spark UDF as well, but I have not tried it. In my case we mostly undersample the majority class and scale positive weights for imbalanced classes.
You scale data when you have a lot of features and are using non-tree-based algorithms, to converge faster. For tree-based algorithms it is typically fine to leave features as they are. You scale as part of the pipeline, for both training and validation. There are a lot of feature transformers, and depending on the data you can use them or write a custom transformer for a specific need. I will try to cover a video on Spark feature engineering in the future.
@@AIEngineeringLife Thanks for getting back to me. I will check the GitHub repos for the imbalance.
@@AIEngineeringLife I found the function SMOTE for Spark. Here is a link: github.com/Angkirat/Smote-for-Spark.
Excellent video, Sir. If we can get this dataset, it will be good to practice more on it.
Thank you.. Should be in my dataset folder in github - github.com/srivatsan88/RUclipsLI/tree/master/dataset
@@AIEngineeringLife Thank you for your reply. I have got it from another video of yours..
Hi,
Your content is really helpful and unique, thanks for this! One question: I have a deep learning model trained using Keras and want to use it for making inferences on a Spark dataframe. Can you suggest some options? Is it possible to do this, or do I need to rebuild the model in some Spark library for deep learning?
Rahul.. Have you tried the Spark TensorFlow package, and does it support this? If not, you can load the TF model in a Python UDF and broadcast the model file. You can then use it for inference in Spark.
Hi Sir,
Here the imbalanced data has not been handled, although it was highlighted.
Should the approach be to assign weights to the minority class, or another approach such as oversampling?
I typically prefer undersampling or class weights rather than oversampling. So for this I would first try class weights, or perform additional feature engineering and see if it makes a difference. Sorry, I could not recollect what dataset I used for this video, but in general the above is how I approach it.
Excellent tutorial. Is the notebook for this available ?
Check the code in my git repo here - github.com/srivatsan88/Mastering-Apache-Spark/blob/master/Churn-Analysis.ipynb
The content is very good, but in a few places your voice echoes. Awesome content and info flow.
Thanks, and sorry for the inconvenience in between. Over time I have upgraded my recording setup, but a few initial videos like this one had some echo.
@@AIEngineeringLife no worries buddy....your content is too good :) keep doing the great work.
Hello,
Is there any git repo for the code? it would be of great help!
Thank you!
Do you have a Spark Streaming session available? I searched your channel and was not able to find any.
Spark Streaming not yet, but I have it in my plans for next year.
Hi , Do you offer any personal training???
Are there any APIs/code snippets to enable model serving? I want to automate enabling model serving in Databricks/MLflow. Please help, thanks.
Sumit. Nope, I do not have it on Spark yet, but I have it in general for Python.
Can I make it a real time project to add in resume sir?
#AIEngineering
Where can I find this dataset to practice ?
Very nice. Do you have a link to your GitHub for the notebook in this video? I would like to apply some of this code to my dataset.
All of my code is in my repo here - github.com/srivatsan88
Spark has a separate repo where you can find the above code.
@AI Engineering : What kind of spark application is widely used by most of the clients you have come across. Data Bricks or something else?
It mostly depends on where they are. If on-prem, which many are, then it is Cloudera. For those on cloud, I have seen EMR in many cases, and Databricks or others in some.
I tried to fit a random forest classifier in PySpark; this is my code:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [100])
             .build())
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=10)
cvModel = crossval.fit(trainingData)
predictions = cvModel.transform(testData)
predictions.printSchema()
but I'm getting this error:
Py4JJavaError: An error occurred while calling o767.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 853, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
Can you help me please?
Based on the error, it looks like either your system does not have enough memory to handle the data, or you have allocated less memory to your executor than this job needs. Go to the Spark UI and check memory allocation as the job is running, or increase memory and try.
@@AIEngineeringLife I'm using Google Colab; how can I check the Spark UI? Can you help me more please?
Then you cannot check the Spark UI. You can set executor memory where you get the Spark context and try. Monitor the CPU in the top Colab bar or in the manage sessions menu.
@@AIEngineeringLife Thank you very much for your help, but I still have the problem. Can you please show me an example of how to set the SparkSession, configurations and SparkContext?
Sir, when I run printSchema it shows TotalCharges as string (nullable = true) instead of double. Could you please explain why, and how can I convert it to double?
Thota.. Have you checked my repo notebook for differences? - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife Thank you Sir, the above one got fixed. Now I am getting the below error while doing fit & transform in the pipeline. Could you please advise?
pipeline = Pipeline().setStages(stages)
pipelineModel = pipeline.fit(train_data)
error: IllegalArgumentException: requirement failed: Output column label already exists.
Hi sir, if possible please share the CSV file so that we can practice with it in Databricks.
Imran.. It is in my GitHub repo. I will add it to the video description as well.
github.com/srivatsan88/RUclipsLI/tree/master/dataset
Thank you sir, you are doing a great job teaching us beyond ML.
As we already have Jupyter notebooks, why are we going for this?
Can you please elaborate? I did not get the question here.
Where is the dataset?
Cannot import name OneHotEncoderEstimator
Replied on LI. Change it to OneHotEncoder. Spark 3.0 changed the package
Thanks
Sir, I am trying to run the basic command df.groupBy('Churn').count().show() but an error comes up that says:
"Py4JJavaError Traceback (most recent call last)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:"
I have followed all the code before this. Not sure where I am going wrong. Can you please help?
Sir, none of the code executes after select * from churn_analysis. I have loaded the data correctly... this is so frustrating.
Hi Dipanjan.. I just tried on Databricks and it seems to work fine. Which cluster have you created? I tried with Databricks 7.5 ML, Spark 3.0.1. Can you check and try again? What you are getting seems to me like a Databricks environment issue.
@@AIEngineeringLife Sir, I have tried everything and am still getting this error while trying to run df.groupBy('Churn').count().show():
AnalysisException: cannot resolve '`Churn`' given input columns: [_c0, _c1, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19, _c2, _c20, _c3, _c4, _c5, _c6, _c7, _c8, _c9];;
Sir, my cluster is running smoothly. Can you please help me out with this?
@@dipanjanghosh6862 I think it has not inferred column names from the header, which is why you are getting _c0, _c1 and so on. Check whether you have set header to true as below:
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.option('nanValue', ' ')\
.option('nullValue', ' ')\
.load(file_location)
sir you are awesome. thanks
Is Databricks not offering a free edition these days?
I see they still are.. are you not able to create an account? - databricks.com/try-databricks
@@AIEngineeringLife No, after login it asks to select a plan; there are three plans in total, and all of them are priced.
Initially it asks for community edition or free edition, but even after choosing the free edition it asks to select one of the three priced plans.
Oh, I get it. Let me check this evening; I am not sure if they removed it, but I doubt they would.
Please share the code.
Can you please share the dataset?
Most of it should be here - github.com/srivatsan88/RUclipsLI/tree/master/dataset
Can you share the code
It is in my git repo here - github.com/srivatsan88/Mastering-Apache-Spark