Thanks Prof for this series on the Bioinformatics project. I was having trouble working with the sample dataset, and with your lectures I'm comfortable now.
Thanks Michael, Glad the contents are helpful 😊
I really appreciate you taking the time to explain the Lipinski's Rule of 5.
My pleasure, thanks for watching.
I have waited so long for this :D
Thanks so much Thomas for the wait 😃
The rule of 5 sounds so useful!
Thanks for watching Ken, it is indeed tremendously useful for rapidly assessing the drug-like properties of a new molecule. Data-driven insight in action 😃
That's really a good demonstration of the application of ML in drug discovery. But how can we test the built model on real-world data? What type of data can we input into the model for predictions, and how do medical professionals benefit from applying this model to real-world data? Please clarify, I am confused. Thanks
Dear Professor, thank you!! These videos are excellent and helped me in writing my research paper on the same topic.
Glad to hear, that’s awesome!
Thanks Prof. this series helped me a lot.
This series has been truly awesome!! Can't wait for the next part!
Thanks Edward for the encouragement 😃
Wow, that's really amazing, thanks a lot! Please answer these questions: 1. Can you explain how this model is used to predict the pIC50 value? The last scatter plot shows experimental pIC50 vs predicted pIC50, but what I expected was all molecules on the x-axis and predicted pIC50 on the y-axis, so I could see which molecules have higher or lower pIC50 values. 2. How can I find the most important features of a molecule to better design a drug? And one more thing: 3. How can I actually design a drug out of this (let's say for coronavirus)? Should I try some chemical compounds (which contain many molecules) that have most of their molecules active/inactive?
Whooaaa, I am here again. Thanks, Data Professor.
Welcome back! This video has been long overdue.
@@DataProfessor I was eagerly waiting for this.
Thank you 😃
Hello Prof, I am getting an error with the scatter plot... please help
ax = sns.regplot(Y_test, Y_pred, scatter_kws={'alpha':0.4})
TypeError: regplot() takes from 0 to 1 positional arguments but 2 positional arguments (and 1 keyword-only argument) were given
A bit late but if you change it to
ax = sns.regplot(x=Y_test, y=Y_pred, scatter_kws={'alpha':0.4})
it will work!
@@caramel2250 Thank you 🙏🏼
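For anyone hitting the same TypeError on a newer seaborn release, here is a minimal self-contained sketch (assuming Y_test and Y_pred are the experimental and predicted pIC50 arrays from the notebook; the axis labels are just for illustration):

import seaborn as sns
import matplotlib.pyplot as plt

# Newer seaborn versions require x and y to be passed as keyword arguments.
sns.set_style('white')
ax = sns.regplot(x=Y_test, y=Y_pred, scatter_kws={'alpha': 0.4})
ax.set_xlabel('Experimental pIC50', fontsize='large', fontweight='bold')
ax.set_ylabel('Predicted pIC50', fontsize='large', fontweight='bold')
plt.show()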
Dear Professor, thank you so much for the wonderful explanation. Why are we choosing random forest for model building here?
Dear Professor,
Thanks for this informative video. I understood the concept of using variance threshold for fingerprints.
I have the following doubts:
1. What is a suitable technique for removing useless features in the case of chemical descriptors?
2. What approach should be followed when using both fingerprints and chemical descriptors?
Hi, I normally use the variance threshold for removing constant values (SD = 0) and near-constant values (SD approaching 0), and this is applicable to both fingerprints and molecular descriptors. Beyond this, we will have to do feature selection, and for this there are several approaches to try, mostly based on evolutionary algorithms such as the genetic algorithm, particle swarm optimization, etc.
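For reference, a minimal sketch of that low-variance filtering step (assuming X is the fingerprint dataframe produced by the PaDEL step; the threshold value is the one used in the video):

from sklearn.feature_selection import VarianceThreshold

# Drop fingerprint bits that are constant or near-constant across compounds.
selection = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
X_selected = selection.fit_transform(X)
print(X.shape, '->', X_selected.shape)  # how many columns survive the filter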
Thanks for your explanation and support. I have one thing I'm confused about:
the model R² value is around 0.51... is that the best it can be?
How should I approach improving it? Anyone with this skill, please respond to me.
Data Professor,
1) In Part 2 the dataset did not have any fingerprint features; it just had chembl_id, canonical_smiles, standard_value and class.
In Part 3 we computed PaDEL fingerprints. This series targeted the coronavirus; can we use PaDEL for other targets too?
2) Thank you for such a good series
Hi, yes, PaDEL is descriptor calculation software that can be used for other molecules.
Hello Chanin,
I have been following your work for a while now. You are really doing a good job.
I am currently working on a similar project for my masters now. I have a couple of questions:
1. Once the output has been predicted, how do we know which compound works the best for the disease?
( I am new to machine learning )
2. Do you have any published papers?
Hi,
1. Typically the best compound is judged by several parameters: 1) how well it binds to the target protein, 2) its pharmacokinetic and pharmacodynamic properties, and 3) once both have been established, the compound would undergo clinical trials.
2. My publications can be found from this link orcid.org/0000-0003-1040-663X
Dear Professor, Looking forward to your update part 5. Thanks
Thanks!
Hey Professor,
Thank you very much for this wonderful explanation.
Since you have used PaDEL fingerprints, can we also use PaDEL descriptors such as molecular weight or number of rings, etc., to check whether these help in increasing the R² score?
And what are the other feature selection techniques that could be used for fingerprints?
Yes, absolutely, this is a great question that we normally pursue in our research. PaDEL provides 12 fingerprint types (11 of which I haven't used here). Please see Table 1 of this paper of ours, which lists the 12 fingerprints.
peerj.com/articles/2322/
As for feature selection techniques, one could use PCA to compress the information into the PC coefficients and then use those as the X variables. Alternatively, one could use an evolutionary algorithm (e.g. genetic algorithm, particle swarm optimization, ant colony optimization, etc.) coupled to a learning algorithm to select subsets of features for model building.
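As a rough illustration of the PCA route mentioned above, a sketch under the assumption that X and Y are the fingerprint matrix and pIC50 values already prepared in the notebook (the number of components and the seed are arbitrary choices):

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Compress fingerprint bits into principal components and use the PC scores as X variables.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = make_pipeline(PCA(n_components=50), RandomForestRegressor(n_estimators=500, random_state=42))
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))  # R2 on the held-out set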
I have a question about removing low variance columns. What exactly is a low variance column when it is part of the molecule fingerprint? Is this for columns that are almost 0's all the time?
Yes, the low variance columns are those that have near constant values (can be either 0 or 1). Because they provide very little variance we deem them to be less useful for the model building and so we remove them.
Hi Dr. Sorry to keep bombarding you with questions. I ran my model (another SARS single protein) and got an R-squared value around 0.15 with the random forest. Even with an SVR regressor I'm getting around that same value. What constitutes a "decent" R-squared value for these models in this field? Or does it not matter that much here? Would a 0.15 R-squared model have a fit that isn't strong enough to predict with?
Also,
is there a way to get the canonical SMILES output for a predicted plot point, so I could use Chem.MolFromSmiles to draw a picture of the molecule?
Thanks for your time!!!
FYI, I'm a pharmacist finishing up a Master's in Data Science. Your line of work is quite interesting to me as I would like to work in industry at some point.
Hi Robert, certainly nice to hear of your data science journey. A decent R-squared value in the field of QSAR (quantitative structure-activity relationship) is at least 0.5 for the training set and at least 0.6 for the testing and cross-validation sets. This is according to the best practices in the field as proposed by Tropsha (further information in this article onlinelibrary.wiley.com/doi/abs/10.1002/minf.201000061).
Normally, I would make a master table (a dataframe) that contains all the essential columns, including the ChEMBL ID, SMILES notation and pIC50 values. After descriptor calculation I would keep at least the ChEMBL ID in the same dataframe as the descriptors; then, just before model building, I would drop the ChEMBL ID column (which keeps sklearn happy while still allowing us to trace back the compound's identity). After predictions are made, I would add the predicted values (and their actual values) into the master table mentioned above that contains the ChEMBL ID and SMILES notation. That way we can trace back any data point in the scatter plot of predicted versus experimental values.
Hope this helps 😃
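A rough pandas sketch of that master-table idea (df, fingerprints_df and model are assumed to exist from earlier steps in the notebook, and the column names are assumptions; adjust them to your own data):

import pandas as pd

# Master table keeps compound identity and activity together throughout the workflow.
master = df[['molecule_chembl_id', 'canonical_smiles', 'pIC50']].copy()

# Keep the ChEMBL ID alongside the PaDEL fingerprints ...
descriptors = fingerprints_df.copy()
descriptors['molecule_chembl_id'] = master['molecule_chembl_id'].values

# ... and drop it only at model-building time so sklearn sees numbers only.
X = descriptors.drop(columns=['molecule_chembl_id'])

# After prediction, attach the predicted values back onto the master table
# so every point in the scatter plot can be traced to a compound.
master['pIC50_predicted'] = model.predict(X)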
Is there a reason to use the PaDEL fingerprint over something like SMILES? I would guess that PaDEL may have clearer functional group information, but I have read ML papers that have used SMILES to great success.
Thanks for the question, let me make a video to explain the answer to this.
Dear Professor,
Thanks for such informative content on this topic. I am confused about how you decided on the threshold for the variance. Can you please elaborate?
Thanks
Hi, the threshold is an arbitrary number, so it can definitely be optimized to suit different datasets.
Hey Data professor,
Can you explain in a bit more detail the choice of algorithm, feature engineering and hyperparameter tuning?
Thanks Neel for the great question, it’s like you could read my mind, these are exactly the topics for the future episodes. Amazing 👍
Isn't the result from the random forest model (0.512) quite low? Or does it not matter?
The tutorial was a quick demo of the methodological workflow. Yes, absolutely agree with you that the performance can be improved. In the QSAR field of study, R^2 of 0.5 for the test set is considered to be robust, which is according to the recommended guidelines of Golbraikh and Tropsha (2002), more details at www.sciencedirect.com/science/article/abs/pii/S1093326301001231
Thank you very much Professor. Since the train and test sets will be different for each data split, if we run this code repeatedly (especially the data split), the R² value will differ too. Am I right?
Yes, that is correct. To remedy that, we can set the seed number so that the results return the same values when the code is run repeatedly.
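For reference, a minimal sketch of pinning the seed (assuming X and Y from the notebook; 42 and 100 are just example values):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Fixing random_state in both the split and the model makes repeated runs reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))  # same R2 on every run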
Hello Professor,
This video series has been very helpful, thank you very much.
I have a concern though. Is it good practice to perform feature selection before the data split?
Great video tutorials! I am a little confused on the final plot of experimental vs predicted pIC50. Could you explain the final plot about experimental and predicted pIC50 and what you are looking for in that plot for a good molecule? I'm a little lost on that part.
That's a great question Robert. The scatter plot of experimental versus predicted pIC50 allows us to visually see the correlation between the 2 variables. Ideally, a perfect prediction with 100% accuracy would have all data points fall on the trend line. In a practical setting, the residuals or errors (subtracting the predicted values from the experimental or actual values gives the residual or error values) cause data points to fall either above or below the trend line (essentially the variance).
Hope this helps 😃
A good molecule would have a very high pIC50 value (normally higher than 6 is considered good). It should be noted that "good" in this context refers to strong inhibition of the target protein of interest. In a practical setting, a good molecule, in addition to strongly inhibiting the target protein of interest, should also possess a favorable "pharmacokinetic profile" (the various molecular properties pertaining to the absorption, distribution, metabolism, excretion and toxicity of a drug). Thus, an ideal drug would have strong inhibition and a good pharmacokinetic profile. In practice, this is essentially an optimization problem and it is rather challenging to find such a molecule. That is why, in the real-world setting, we have drugs that are potent but also lead to side effects in patients.
Should the variance threshold vary based on the number of molecules or will (.8 * (1 - .8)) always suffice?
Hi, actually this parameter can be adjusted, but it should be a good starting point.
Awesome explanation! What does R^2 = 0.51 tell us in this example?
Thanks for watching @Import Data. The R2 tells us the Goodness of Fit of the model. It is essentially the squared value of the Pearson's Correlation Coefficient.
Dear Sir, first of all thanks for producing such a wonderful series. I am working with Part4, and using my own data file.
When I want to train the regression model with random forest, I receive a ValueError: Input y contains NaN.
Any suggestions to solve this error?
yea it gives me an error as well, and I don't think he knows how to fix it because even if you run his code on his data you get the exact same error so something isn't right
@Travis Hall I have fixed my error, let me share it with you. Open your CSV file; there will be certain rows with empty pIC50 values. Remove them and start again from the beginning with a fresh session.
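The same fix in code, as a sketch (the filename is hypothetical; the pIC50 column name follows the earlier parts of the series):

import pandas as pd

df = pd.read_csv('bioactivity_data_pIC50_pubchem_fp.csv')  # hypothetical filename

# Rows with an empty pIC50 cell become NaN and trigger
# "ValueError: Input y contains NaN" during model fitting, so drop them first.
df = df.dropna(subset=['pIC50']).reset_index(drop=True)
X = df.drop(columns=['pIC50'])
Y = df['pIC50']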
Hello, great video as always! I was wondering if we could have used something like PCA instead of this VarianceThreshold module, or is it not desirable?
Thanks for watching, great question! PCA is used to reduce the dimension of the feature space, while the variance threshold is a prerequisite step that we would normally perform prior to multivariate analysis (PCA included).
Can we make a database to store these predictions?
Yes, we can also do that but since the dataset is relatively small, saving to a CSV file should be sufficient.
@@DataProfessor Thank you!
Hi Professor. When I tried using VarianceThreshold to remove low-variance features, there was an error saying it cannot convert strings into float. How do I fix this error? Thank you very much
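For reference, that error typically means a non-numeric column (for example a molecule name or a SMILES string) is still present in the data passed to VarianceThreshold. A sketch of one way to guard against it (assuming X is your descriptor dataframe):

from sklearn.feature_selection import VarianceThreshold

# Keep only numeric columns; text columns such as names or SMILES strings
# cannot be converted to float by VarianceThreshold.
X_numeric = X.select_dtypes(include='number')
selection = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
X_selected = selection.fit_transform(X_numeric)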
Hi, I have a question. How did you define the variance threshold? Why is it 0.8*(1-0.8)? ... Thank you
Hi, here's a good explanation here www.datasciencesmachinelearning.com/2019/10/feature-selection-using-sklearn.html
@@DataProfessor Thank you so much! You are the best! ❤
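For anyone wondering about the arithmetic behind that link: for a binary fingerprint bit where a fraction p of compounds have the value 1, the variance is p*(1-p), so a threshold of 0.8*(1-0.8) = 0.16 removes any bit that takes the same value in more than 80% of compounds. A tiny check:

import numpy as np

bit_80_20 = np.array([1] * 80 + [0] * 20)  # 80% ones -> variance right at the cut-off
bit_95_05 = np.array([1] * 95 + [0] * 5)   # 95% ones -> near-constant

print(bit_80_20.var())  # 0.16
print(bit_95_05.var())  # 0.0475, well below 0.16, so this bit would be removed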
Dear Professor, I am getting an error in the random forest model: "Input contains NaN, infinity or a value too large for dtype('float64')." How do I resolve that?
I tried removing the rows that contain extremely large values, but that is pushing the R² value below 0.5.
The adjusted R² value was also going considerably lower.
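For reference, a sketch of one way to inspect and clean such values before fitting (assuming X is the descriptor dataframe and Y the pIC50 series):

import numpy as np

# Replace infinities with NaN, then drop rows where X or Y has any NaN;
# these are what trigger "Input contains NaN, infinity or a value too large".
X = X.replace([np.inf, -np.inf], np.nan)
mask = X.notna().all(axis=1) & Y.notna()
X_clean, Y_clean = X[mask], Y[mask]
print('Dropped', int((~mask).sum()), 'problematic rows')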
Thanks for the video :)
Thanks for watching 😃
Nice video series! I just want to comment that instead of the Lipinski Rule there are many people using the ADMET properties (c.f. J. Chem. Inf. Model. 2012, 52, 7, 1713-1721). Maybe it is useful for some people ;-)
Thanks for watching and for sharing the paper 👍 Interesting paper indeed. I’ve read it, and the descriptors used to produce the ADMET score are based on the same 4 descriptors used in the Lipinski rule of five, plus some additional descriptors such as LogS and TPSA. The end outcome of both approaches is to find drug-like properties (having favorable ADMET/pharmacokinetic characteristics). Another interesting point is that the authors attempted to optimize both potency and pharmacokinetic properties via multi-objective optimization. This is an active field of research where attempts are made to assess drug-like characteristics. However, the paper does not provide a tool or equation for calculating the ADMET score described in the paper, which limits its utility for potential users. Perhaps some of us are interested in doing a project on this? 🤔
@@DataProfessor Most ADMET property prediction tools are proprietary to industry, for example ADMET Predictor, and unfortunately are not freely available to academics. I just wanted to point this out as an alternative to the Lipinski rule, since a reviewer of one of our papers brought it to our attention. For a quick overview, the Lipinski rule is always sufficient.
Thanks
Dear Data Professor, is there a next section? Thanks
Yes, there is. It is currently in the planning phase.
Amazing!
Thanks Bala!
Hello dear Professor,
Undoubtedly this is one of the best teachings about 3D-QSAR that I have ever seen.
I am working on quantum mechanics, molecular dynamics and docking.
It would be a great pleasure for me if you gave me your email address so I can send you a mail to collaborate.
With respect
Hi, I am getting an error, please help me execute the last set of code
ax = sns.regplot(Y_test, Y_pred, scatter_kws={'alpha':0.4})
TypeError: regplot() takes from 0 to 1 positional arguments but 2 positional arguments (and 1 keyword-only argument) were given
Same here
Please let me know how you solved it... that is, if you have.
A bit late but if you change it to
ax = sns.regplot(x=Y_test, y=Y_pred, scatter_kws={'alpha':0.4})
it will work!
I can't understand one thing from this: what do we actually get from this model, and what is its benefit?
Thanks for the question, Salik. The regression model allows us to predict the pIC50 value, which is the degree to which a molecule can inhibit the target protein of interest (if it can inhibit potently, then it can be a good drug candidate). Afterwards, candidate molecules will need to be subjected to further scrutiny, such as their pharmacokinetic profiles (ADMET properties encompassing the properties of molecules pertaining to Absorption, Distribution, Metabolism, Excretion and Toxicity).
@@DataProfessor got it Thank you...
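For context, pIC50 is the negative log10 of the IC50 in molar units, so the number the model predicts maps directly to potency. A small sketch of the conversion (assuming IC50 values in nM, as in the ChEMBL standard_value column used earlier in the series):

import numpy as np

def ic50_nM_to_pIC50(ic50_nM):
    # pIC50 = -log10(IC50 in molar); 1 nM = 1e-9 M
    return -np.log10(ic50_nM * 1e-9)

print(ic50_nM_to_pIC50(1000))  # IC50 of 1000 nM (1 uM) -> pIC50 of 6.0
print(ic50_nM_to_pIC50(10))    # IC50 of 10 nM -> pIC50 of 8.0 (more potent)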
Are bioinformatics and computational biology related?
Yes they are related. I've distinguished the 2 terms in the 2-part videos ruclips.net/video/p5iZxIT16KQ/видео.html
@@DataProfessor thanks!
A pleasure 😃
Can we generate a QSAR equation?
You can if you use linear regression (LinearRegression function in scikit-learn) scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
@@DataProfessor 🙏
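A rough sketch of what that looks like (assuming X is a dataframe of descriptors and Y the pIC50 values; the coefficients then read off as a linear QSAR equation):

from sklearn.linear_model import LinearRegression

# The intercept and coefficients form the QSAR equation:
#   pIC50 = intercept + c1*descriptor1 + c2*descriptor2 + ...
model = LinearRegression()
model.fit(X, Y)
terms = ' + '.join(f'({coef:.3f} * {name})' for coef, name in zip(model.coef_, X.columns))
print(f'pIC50 = {model.intercept_:.3f} + {terms}')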
Do you know any open-source 3D-QSAR software?
There are 2 that come to mind:
1. Open3DQSAR open3dqsar.sourceforge.net/
dx.doi.org/10.1007/s00894-010-0684-x
2. Cloud 3D-QSAR
chemyang.ccnu.edu.cn/ccb/server/cloud3dQSAR/
doi.org/10.1093/bib/bbaa276
@@DataProfessor thank you sir that was really helpful
When will you release #5?
Thanks for asking Patricia. Hopefully soon, I'll bump the 5th episode up my priority list.
In the meantime, please check out other related videos:
Data Science for Computational Drug Discovery using Python (Part 1)
ruclips.net/video/VXFFHHoE1wk/видео.html
Data Science for Computational Drug Discovery using Python (Part 2)
ruclips.net/video/RGfeGRt32Dk/видео.html
@@DataProfessor Thank you so much! I am doing a capstone in data science and your videos have been so instrumental in my success! I look forward to the next and completing the project.
@@patriciamason3760 Glad to hear that, will prioritize this series for future release.
thank you!
A pleasure, Thank you for watching!
Excellent videos by the Data Professor. Feel free to read the following blog post on the Medium website, “Apply Machine Learning Algorithms for Genomics Data Classification”. This will help you to understand how to apply machine learning algorithms for genomic data classification. The post covers the latest ML/AI technologies applied to human genomic data classification today.
Good topic
Thanks for watching Madara 😃
Why did you switch to python? No more R?
Right, the recent videos are mostly Python. I have already filmed a video on deploying an R Shiny web app to Heroku and am currently editing it; it will be released soon.
@@DataProfessor Cool! I am learning data analysis using R and am always discouraged to see that Python has so much popularity. I've been learning R for a couple of months now, so I don't want to just switch lol. I might just learn both one day. My goal is to reach deep reinforcement learning. I'm going to take it one step at a time. I'll be checking out your videos. Thanks!!
Great, you’re doing just fine, will definitely continue to push out video content in both R and Python.
Hi Professor, what is the interpretation of the predicted value? In my case it is 0.594801672508279. Secondly, what is the interpretation of the scatter plot of experimental vs predicted pIC50 values? Thank you.
I do have the same question! :(
nice
Thanks
You're welcome!
Please also teach us Streamlit or Heroku :((
Thanks for watching, yes, definitely in a future episode of this series.
ruclips.net/video/ZZ4B0QUHuNc/видео.html
@@CatBlack01 Thanks 😃
The playlist of 5 tutorial videos on Streamlit is here ruclips.net/video/zK4Ch6e1zq8/видео.html
The deployment to Heroku video is here ruclips.net/video/zK4Ch6e1zq8/видео.html
Deployment of this Bioinformatics series to Heroku is in the plan, please stay tuned for that one 😃
Wohooooooo!!! Here we go boyzz 😭😎😎
Waited a lot...
Thanks for the wait 😃
"We're Secretly... " what?? what's next? :)
Yeyyy!! :)))
This episode is long overdue, thanks for the wait 😃
The code for creating the scatter plot no longer works; you might have to take down the video and make another one.
Libraries change over time; just use GPT to understand what has changed and alter the code accordingly.