Thanks Prof for this series on the Bioinformatics project. I was having trouble working with the sample dataset, and with your lectures I'm comfortable now.
Thanks Michael, Glad the contents are helpful 😊
I really appreciate you taking the time to explain the Lipinski's Rule of 5.
My pleasure, thanks for watching.
I have waited so long for this :D
Thanks so much Thomas for the wait 😃
The rule of 5 sounds so useful!
Thanks for watching Ken, it is indeed tremendously useful for rapidly assessing the drug-like properties of a new molecule. Data-driven insight in action 😃
That's really a good demonstration of the application of ML in drug discovery. But how can we test the built model on real-world data? What type of data can we input into the model for predictions, and how do medical professionals benefit from applying this model to real-world data? Please clarify, I am confused. Thanks
Dear Professor, thank you!! These videos are excellent and helped me in writing my research paper on the same topic.
Glad to hear, that’s awesome!
Thanks Prof. this series helped me a lot.
This series has been truly awesome!! Can't wait for the next part!
Thanks Edward for the encouragement 😃
Wow, that's really amazing, thanks a lot! Please answer these questions: 1. Can you explain how this model is used to predict the pIC50 value? The last scatter plot shows experimental pIC50 vs predicted pIC50, but what I expected was all molecules on the x-axis and predicted pIC50 on the y-axis, so I could see which molecules have higher or lower pIC50 values. 2. How can I find the most important features of a molecule to better design a drug? And one more thing: 3. How can I actually design a drug out of this (let's say for coronavirus)? Should I try some chemical compounds (which contain many molecules) that have most of their molecules active/inactive?
Whooaaa, I am here again. Thanks, Data Professor.
Welcome back! This video has been long overdue.
@@DataProfessor I was eagerly waiting for this.
Thank you 😃
Hello Prof, I am getting an error with the scatter plot... please help
ax = sns.regplot(Y_test, Y_pred, scatter_kws={'alpha':0.4})
TypeError: regplot() takes from 0 to 1 positional arguments but 2 positional arguments (and 1 keyword-only argument) were given
A bit late but if you change it to
ax = sns.regplot(x=Y_test, y=Y_pred, scatter_kws={'alpha':0.4})
it will work!
@@caramel2250 Thank you 🙏🏼
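For anyone hitting the same TypeError on a newer seaborn release, here is a minimal self-contained sketch (assuming Y_test and Y_pred are the experimental and predicted pIC50 arrays from the notebook; the axis labels are just for illustration):

import seaborn as sns
import matplotlib.pyplot as plt

# Newer seaborn versions require x and y to be passed as keyword arguments.
sns.set_style('white')
ax = sns.regplot(x=Y_test, y=Y_pred, scatter_kws={'alpha': 0.4})
ax.set_xlabel('Experimental pIC50', fontsize='large', fontweight='bold')
ax.set_ylabel('Predicted pIC50', fontsize='large', fontweight='bold')
plt.show()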
Dear Professor, thank you so much for the wonderful explanation. Why are we choosing random forest for model building here?
Dear Professor,
Thanks for this informative video. I understood the concept of using variance threshold for fingerprints.
I have the following doubts:
1. What is a suitable technique for removing useless features in the case of chemical descriptors?
2. What approach should be followed when using both fingerprints and chemical descriptors?
Hi, I normally use the variance threshold for removing constant values (SD = 0) and near-constant values (SD approaching 0), and this is applicable to both fingerprints and molecular descriptors. Beyond this, we will have to do feature selection, and for this there are several approaches to try, mostly based on evolutionary algorithms such as the genetic algorithm, particle swarm optimization, etc.
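For reference, a minimal sketch of that low-variance filtering step (assuming X is the fingerprint dataframe produced by the PaDEL step; the threshold value is the one used in the video):

from sklearn.feature_selection import VarianceThreshold

# Drop fingerprint bits that are constant or near-constant across compounds.
selection = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
X_selected = selection.fit_transform(X)
print(X.shape, '->', X_selected.shape)  # how many columns survive the filter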
Thanks for your explanation and support. I have one thing I'm confused about:
the model R² value is around 0.51... is that the best it can be?
How should I approach improving it? Anyone with this skill, please respond to me.
Data Professor,
1) In Part 2 the dataset did not have any fingerprint features; it just had chembl_id, canonical_smiles, standard_value and class.
In Part 3 we computed PaDEL fingerprints. This series targeted the coronavirus; can we use PaDEL for other targets too?
2) Thank you for such a good series
Hi, yes, PaDEL is descriptor calculation software that can be used for other molecules.
Hello Chanin,
I have been following your work for a while now. You are really doing a good job.
I am currently working on a similar project for my masters now. I have a couple of questions:
1. Once the output has been predicted, how do we know which compound works the best for the disease?
( I am new to machine learning )
2. Do you have any published papers?
Hi,
1. Typically the best compound is judged by several parameters: 1) how well it binds to the target protein, 2) its pharmacokinetic and pharmacodynamic properties, and 3) once both have been established, the compound would undergo clinical trials.
2. My publications can be found from this link orcid.org/0000-0003-1040-663X
Dear Professor, Looking forward to your update part 5. Thanks
Thanks!
Hey Professor,
Thank you very much for this wonderful explanation.
Since you have used PaDEL fingerprints, can we also use PaDEL descriptors such as molecular weight or number of rings, etc., to check whether these help in increasing the R² score?
And what are the other feature selection techniques that could be used for fingerprints?
Yes, absolutely, this is a great question that we normally pursue in our research. PaDEL provides 12 fingerprint types (11 of which I haven't used here). Please see Table 1 of this paper of ours, which lists the 12 fingerprints.
peerj.com/articles/2322/
As for feature selection techniques, one could use PCA to compress the information into the PC coefficients and then use those as the X variables. Alternatively, one could use an evolutionary algorithm (e.g. genetic algorithm, particle swarm optimization, ant colony optimization, etc.) coupled to a learning algorithm to select subsets of features for model building.
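As a rough illustration of the PCA route mentioned above, a sketch under the assumption that X and Y are the fingerprint matrix and pIC50 values already prepared in the notebook (the number of components and the seed are arbitrary choices):

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Compress fingerprint bits into principal components and use the PC scores as X variables.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = make_pipeline(PCA(n_components=50), RandomForestRegressor(n_estimators=500, random_state=42))
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))  # R2 on the held-out set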
I have a question about removing low variance columns. What exactly is a low variance column when it is part of the molecule fingerprint? Is this for columns that are almost 0's all the time?
Yes, the low variance columns are those that have near constant values (can be either 0 or 1). Because they provide very little variance we deem them to be less useful for the model building and so we remove them.
Hi Dr. Sorry to keep bombarding you with questions. I ran my model (another SARS single protein) and got an R-squared value around 0.15 with the random forest. Even with an SVR regressor I'm getting around that same value. What constitutes a "decent" R-squared value for these models in this field? Or does it not matter that much here? Would a 0.15 R-squared model have a fit that isn't strong enough to predict with?
Also,
is there a way to get the canonical SMILES output for a predicted plot point, so I could use Chem.MolFromSmiles to draw a picture of the molecule?
Thanks for your time!!!
FYI, I'm a pharmacist finishing up a Master's in Data Science. Your line of work is quite interesting to me as I would like to work in industry at some point.
Hi Robert, certainly nice to hear of your data science journey. A decent R-squared value in the field of QSAR (quantitative structure-activity relationship) is at least 0.5 for the training set and at least 0.6 for the testing and cross-validation sets. This is according to the best practices in the field as proposed by Tropsha (further information in this article onlinelibrary.wiley.com/doi/abs/10.1002/minf.201000061).
Normally, I would make a master table (a dataframe) that contains all the essential columns, including the ChEMBL ID, SMILES notation and pIC50 values. After descriptor calculation I would keep at least the ChEMBL ID in the same dataframe as the descriptors; then, just before model building, I would drop the ChEMBL ID column (which keeps sklearn happy while still allowing us to trace back the compound's identity). After predictions are made, I would add the predicted values (and their actual values) into the master table mentioned above that contains the ChEMBL ID and SMILES notation. That way we can trace back any data point in the scatter plot of predicted versus experimental values.
Hope this helps 😃
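A rough pandas sketch of that master-table idea (df, fingerprints_df and model are assumed to exist from earlier steps in the notebook, and the column names are assumptions; adjust them to your own data):

import pandas as pd

# Master table keeps compound identity and activity together throughout the workflow.
master = df[['molecule_chembl_id', 'canonical_smiles', 'pIC50']].copy()

# Keep the ChEMBL ID alongside the PaDEL fingerprints ...
descriptors = fingerprints_df.copy()
descriptors['molecule_chembl_id'] = master['molecule_chembl_id'].values

# ... and drop it only at model-building time so sklearn sees numbers only.
X = descriptors.drop(columns=['molecule_chembl_id'])

# After prediction, attach the predicted values back onto the master table
# so every point in the scatter plot can be traced to a compound.
master['pIC50_predicted'] = model.predict(X)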
Is there a reason to use the PaDEL fingerprint over something like SMILES? I would guess that PaDEL may have clearer functional group information, but I have read ML papers that have used SMILES to great success.
Thanks for the question, let me make a video to explain the answer to this.
Dear Professor,
Thanks for such informative content on this topic. I am confused about how you decided on the threshold for the variance. Can you please elaborate?
Thanks
Hi, the threshold is an arbitrary number, so it can definitely be optimized to suit different datasets.
Hey Data professor,
Can you explain in a bit more detail the choice of algorithm, feature engineering and hyperparameter tuning?
Thanks Neel for the great question, it’s like you could read my mind, these are exactly the topics for the future episodes. Amazing 👍
Isn't the result from the random forest model (0.512) quite low? Or does it not matter?
The tutorial was a quick demo of the methodological workflow. Yes, absolutely agree with you that the performance can be improved. In the QSAR field of study, R^2 of 0.5 for the test set is considered to be robust, which is according to the recommended guidelines of Golbraikh and Tropsha (2002), more details at www.sciencedirect.com/science/article/abs/pii/S1093326301001231
Thank you very much Professor. Since the train and test sets will be different for each data split, if we run this code repeatedly (especially the data split), the R² value will differ too. Am I right?
Yes, that is correct. To remedy that, we can set the seed number so that the results return the same values when the code is run repeatedly.
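For reference, a minimal sketch of pinning the seed (assuming X and Y from the notebook; 42 and 100 are just example values):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Fixing random_state in both the split and the model makes repeated runs reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))  # same R2 on every run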
Hello Professor,
This video series has been very helpful, thank you very much.
I have a concern though. Is it good practice to perform feature selection before the data split?
Great video tutorials! I am a little confused on the final plot of experimental vs predicted pIC50. Could you explain the final plot about experimental and predicted pIC50 and what you are looking for in that plot for a good molecule? I'm a little lost on that part.
That's a great question Robert. The scatter plot of experimental versus predicted pIC50 allows us to visually see the correlation between the 2 variables. Ideally, a perfect prediction with 100% accuracy would have all data points fall on the trend line. In a practical setting, the residuals or errors (subtracting the predicted values from the experimental or actual values gives the residual or error values) cause data points to fall either above or below the trend line (essentially the variance).
Hope this helps 😃
A good molecule would have a very high pIC50 value (normally higher than 6 is considered good). It should be noted that "good" in this context refers to strong inhibition of the target protein of interest. In a practical setting, a good molecule, in addition to strongly inhibiting the target protein of interest, should also possess a favorable "pharmacokinetic profile" (the various molecular properties pertaining to the absorption, distribution, metabolism, excretion and toxicity of a drug). Thus, an ideal drug would have strong inhibition and a good pharmacokinetic profile. In practice, this is essentially an optimization problem and it is rather challenging to find such a molecule. That is why, in the real-world setting, we have drugs that are potent but also lead to side effects in patients.
Should the variance threshold vary based on the number of molecules or will (.8 * (1 - .8)) always suffice?
Hi, actually this parameter can be adjusted, but it should be a good starting point.
Awesome explanation! What does R^2 = 0.51 tell us in this example?
Thanks for watching @Import Data. The R2 tells us the Goodness of Fit of the model. It is essentially the squared value of the Pearson's Correlation Coefficient.
Dear Sir, first of all thanks for producing such a wonderful series. I am working with Part4, and using my own data file.
When I want to train the regression model with random forest, I receive a ValueError: Input y contains NaN.
Any suggestions to solve this error?
yea it gives me an error as well, and I don't think he knows how to fix it because even if you run his code on his data you get the exact same error so something isn't right
@Travis Hall I have fixed my error, let me share it with you. Open your CSV file; there will be certain rows with empty pIC50 values. Remove them and start again from the beginning with a fresh session.
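The same fix in code, as a sketch (the filename is hypothetical; the pIC50 column name follows the earlier parts of the series):

import pandas as pd

df = pd.read_csv('bioactivity_data_pIC50_pubchem_fp.csv')  # hypothetical filename

# Rows with an empty pIC50 cell become NaN and trigger
# "ValueError: Input y contains NaN" during model fitting, so drop them first.
df = df.dropna(subset=['pIC50']).reset_index(drop=True)
X = df.drop(columns=['pIC50'])
Y = df['pIC50']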
Hello, great video as always! I was wondering if we could have used something like PCA instead of this VarianceThreshold module, or is it not desirable?
Thanks for watching, great question! PCA is used to reduce the dimension of the feature space, while the variance threshold is a prerequisite step that we would normally perform prior to multivariate analysis (PCA included).
Can we make a database to store these predictions?
Yes, we can also do that but since the dataset is relatively small, saving to a CSV file should be sufficient.
@@DataProfessor Thank you!
Hi Professor. When I tried using VarianceThreshold to remove low-variance features, there was an error saying it cannot convert strings into float. How do I fix this error? Thank you very much
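For reference, that error typically means a non-numeric column (for example a molecule name or a SMILES string) is still present in the data passed to VarianceThreshold. A sketch of one way to guard against it (assuming X is your descriptor dataframe):

from sklearn.feature_selection import VarianceThreshold

# Keep only numeric columns; text columns such as names or SMILES strings
# cannot be converted to float by VarianceThreshold.
X_numeric = X.select_dtypes(include='number')
selection = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
X_selected = selection.fit_transform(X_numeric)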
Hi, I have a question. How did you define the variance threshold? Why is it 0.8*(1-0.8)? ... Thank you
Hi, here's a good explanation here www.datasciencesmachinelearning.com/2019/10/feature-selection-using-sklearn.html
@@DataProfessor Thank you so much! You are the best! ❤
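For anyone wondering about the arithmetic behind that link: for a binary fingerprint bit where a fraction p of compounds have the value 1, the variance is p*(1-p), so a threshold of 0.8*(1-0.8) = 0.16 removes any bit that takes the same value in more than 80% of compounds. A tiny check:

import numpy as np

bit_80_20 = np.array([1] * 80 + [0] * 20)  # 80% ones -> variance right at the cut-off
bit_95_05 = np.array([1] * 95 + [0] * 5)   # 95% ones -> near-constant

print(bit_80_20.var())  # 0.16
print(bit_95_05.var())  # 0.0475, well below 0.16, so this bit would be removed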
Dear Professor, I am getting an error in the random forest model: "Input contains NaN, infinity or a value too large for dtype('float64')." How do I resolve that?
I tried removing the rows that contain extremely large values, but that is pushing the R² value below 0.5.
The adjusted R² value was also going considerably lower.
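For reference, a sketch of one way to inspect and clean such values before fitting (assuming X is the descriptor dataframe and Y the pIC50 series):

import numpy as np

# Replace infinities with NaN, then drop rows where X or Y has any NaN;
# these are what trigger "Input contains NaN, infinity or a value too large".
X = X.replace([np.inf, -np.inf], np.nan)
mask = X.notna().all(axis=1) & Y.notna()
X_clean, Y_clean = X[mask], Y[mask]
print('Dropped', int((~mask).sum()), 'problematic rows')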
Thanks for the video :)
Thanks for watching 😃
Nice video series! I just want to comment that instead of the Lipinski Rule there are many people using the ADMET properties (c.f. J. Chem. Inf. Model. 2012, 52, 7, 1713-1721). Maybe it is useful for some people ;-)
Thanks for watching and for sharing the paper 👍 Interesting paper indeed. I’ve read it, and the descriptors used to produce the ADMET score are based on the same 4 descriptors used in the Lipinski rule of five, plus some additional descriptors such as LogS and TPSA. The end outcome of both approaches is to find drug-like properties (having favorable ADMET/pharmacokinetic characteristics). Another interesting point is that the authors attempted to optimize both potency and pharmacokinetic properties via multi-objective optimization. This is an active field of research where attempts are made to assess drug-like characteristics. However, the paper does not provide a tool or equation for calculating the ADMET score described in the paper, which limits its utility for potential users. Perhaps some of us are interested in doing a project on this? 🤔
@@DataProfessor Most ADMET property prediction tools are proprietary to industry, for example ADMET Predictor, and unfortunately are not freely available to academics. I just wanted to point this out as an alternative to the Lipinski rule, since a reviewer of one of our papers brought it to our attention. For a quick overview, the Lipinski rule is always sufficient.
Thanks
Dear Data Professor, is there a next section? Thanks
Yes, there is. It is currently in the planning phase.
Amazing!
Thanks Bala!
Hello dear Professor,
Undoubtedly this is one of the best teachings about 3D-QSAR that I have ever seen.
I am working on quantum mechanics, molecular dynamics and docking.
It would be a great pleasure for me if you gave me your email address so I can send you a mail to collaborate.
With respect
Hi, I am getting an error, please help me execute the last set of code
ax = sns.regplot(Y_test, Y_pred, scatter_kws={'alpha':0.4})
TypeError: regplot() takes from 0 to 1 positional arguments but 2 positional arguments (and 1 keyword-only argument) were given
Same here
Please let me know how you solved it... that is, if you have.
A bit late but if you change it to
ax = sns.regplot(x=Y_test, y=Y_pred, scatter_kws={'alpha':0.4})
it will work!
I can't understand one thing from this: what do we actually get from this model, and what is its benefit?
Thanks for the question, Salik. The regression model allows us to predict the pIC50 value, which is the degree to which a molecule can inhibit the target protein of interest (if it can inhibit potently, then it can be a good drug candidate). Afterwards, candidate molecules will need to be subjected to further scrutiny, such as their pharmacokinetic profiles (ADMET properties encompassing the properties of molecules pertaining to Absorption, Distribution, Metabolism, Excretion and Toxicity).
@@DataProfessor got it Thank you...
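For context, pIC50 is the negative log10 of the IC50 in molar units, so the number the model predicts maps directly to potency. A small sketch of the conversion (assuming IC50 values in nM, as in the ChEMBL standard_value column used earlier in the series):

import numpy as np

def ic50_nM_to_pIC50(ic50_nM):
    # pIC50 = -log10(IC50 in molar); 1 nM = 1e-9 M
    return -np.log10(ic50_nM * 1e-9)

print(ic50_nM_to_pIC50(1000))  # IC50 of 1000 nM (1 uM) -> pIC50 of 6.0
print(ic50_nM_to_pIC50(10))    # IC50 of 10 nM -> pIC50 of 8.0 (more potent)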
Are bioinformatics and computational biology related?
Yes they are related. I've distinguished the 2 terms in the 2-part videos ruclips.net/video/p5iZxIT16KQ/видео.html
@@DataProfessor thanks!
A pleasure 😃
Can we generate a QSAR equation?
You can if you use linear regression (LinearRegression function in scikit-learn) scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
@@DataProfessor 🙏
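A rough sketch of what that looks like (assuming X is a dataframe of descriptors and Y the pIC50 values; the coefficients then read off as a linear QSAR equation):

from sklearn.linear_model import LinearRegression

# The intercept and coefficients form the QSAR equation:
#   pIC50 = intercept + c1*descriptor1 + c2*descriptor2 + ...
model = LinearRegression()
model.fit(X, Y)
terms = ' + '.join(f'({coef:.3f} * {name})' for coef, name in zip(model.coef_, X.columns))
print(f'pIC50 = {model.intercept_:.3f} + {terms}')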
Do you know any open-source 3D-QSAR software?
There are 2 that come to mind:
1. Open3DQSAR open3dqsar.sourceforge.net/
dx.doi.org/10.1007/s00894-010-0684-x
2. Cloud 3D-QSAR
chemyang.ccnu.edu.cn/ccb/server/cloud3dQSAR/
doi.org/10.1093/bib/bbaa276
@@DataProfessor thank you sir that was really helpful
When will you release #5?
Thanks for asking Patricia. Hopefully soon, I'll bump the 5th episode up my priority list.
In the meantime, please check out other related videos:
Data Science for Computational Drug Discovery using Python (Part 1)
ruclips.net/video/VXFFHHoE1wk/видео.html
Data Science for Computational Drug Discovery using Python (Part 2)
ruclips.net/video/RGfeGRt32Dk/видео.html
@@DataProfessor Thank you so much! I am doing a capstone in data science and your videos have been so instrumental in my success! I look forward to the next and completing the project.
@@patriciamason3760 Glad to hear that, will prioritize this series for future release.
thank you!
A pleasure, Thank you for watching!
Excellent videos by the Data Professor. Feel free to read the following blog post on the Medium website, “Apply Machine Learning Algorithms for Genomics Data Classification”. This will help you to understand how to apply machine learning algorithms for genomic data classification. The post covers the latest ML/AI technologies applied to human genomic data classification today.
Good topic
Thanks for watching Madara 😃
Why did you switch to python? No more R?
Right, the recent videos are mostly Python. I have already filmed a video on deploying an R Shiny web app to Heroku and am currently editing it; it will be released soon.
@@DataProfessor Cool! I am learning data analysis using R and am always discouraged to see that Python has so much popularity. I've been learning R for a couple of months now, so I don't want to just switch lol. I might just learn both one day. My goal is to reach deep reinforcement learning. I'm going to take it one step at a time. I'll be checking out your videos. Thanks!!
Great, you’re doing just fine, will definitely continue to push out video content in both R and Python.
Hi Professor, what is the interpretation of the predicted value? In my case it is 0.594801672508279. Secondly, what is the interpretation of the scatter plot of experimental vs predicted pIC50 values? Thank you.
I do have the same question! :(
nice
Thanks
You're welcome!
Please also teach us Streamlit or Heroku :((
Thanks for watching, yes, definitely in a future episode of this series.
ruclips.net/video/ZZ4B0QUHuNc/видео.html
@@CatBlack01 Thanks 😃
The playlist of 5 tutorial videos on Streamlit is here ruclips.net/video/zK4Ch6e1zq8/видео.html
The deployment to Heroku video is here ruclips.net/video/zK4Ch6e1zq8/видео.html
Deployment of this Bioinformatics series to Heroku is in the plan, please stay tuned for that one 😃
Wohooooooo!!! Here we go boyzz 😭😎😎
Waited a lot...
Thanks for the wait 😃
"We're Secretly... " what?? what's next? :)
Yeyyy!! :)))
This episode is long overdue, thanks for the wait 😃
The code for creating the scatter plot no longer works; you might have to take down the video and make another one.
Libraries change over time; just use GPT to understand what has changed and alter the code accordingly.