No words how can I praise your work..!! Please don't ever stop making these videos.. Such a great work you are doing..
Thank you for the kind motivational words, greatly appreciated it 😃
Super human with the amalgamation of bio and data science skills in a parallel universe! Thank you Data Professor. We are here to see more and more content on cheminformatics series.
Awesome, glad you’re enjoying the series
You're an absolute star. Wish I could be as good a lecturer as you. Looking forward to Part 4 of this series. Absolutely amazing!
Thank you for the kind words. I’m working on the Part 4, please stay tuned 😀
you are lifesaver. I think you have helped to train more data scientists than many universities by yourself.
Wow, appreciate the kind words!
I haven't been this excited for a series as Neural Networks From Scratch by Sentdex
Thanks Thomas
Good stuff! Exceptionally good stuff! Followed to the end.
These courses, especially the bioinformatics courses, are fantastic. Your work is excellent, bro.
Glad to hear!
I am halfway through. I am very excited to be working on this project.
Wonderful!
Great video!! From my part, let's move to the implementation. Thank you Data Prof!!
You are welcome! Glad it is helpful!
Right now Data Professor is equivalent to God for me. We need such more professors who help to understand bioinformatics with data science better.
Thanks Shweta for the kind praise and encouragement. I am glad you are finding value in the contents of this channel. I am thinking of publishing a paper about this experience as an academic YouTuber 😃
@@DataProfessor By the way I always wanted to ask. Are you related to Joma Tech? You both look alike. I get confused sometimes on the thumbnails :D
Thank you so much professor, grateful to be learning from you
Sir, what do the PubChem fingerprints describe? I mean, what do 0 and 1 represent in the CSV file?
What is the significance of using the acetylcholinesterase protein? I'm a little confused: is this target protein present in COVID-19?
If we are using external software to calculate descriptors, how can we be selective about the descriptors required for correlating with inhibitory concentration? Descriptor calculators come with thousands of descriptor options, and I heard that it's not good for the model to use every descriptor.
Hi, that's a great question! There's an important aspect of the model building process known as feature selection (also known as variable selection). It is the process of selecting an important subset of features from an initially large collection of features. The reason why I like random forest is that it has a built-in measure of feature importance (the Gini importance, i.e. the mean decrease in impurity), where features with high values are deemed to be important.
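As a rough illustration of the reply above, here is a minimal sketch of ranking features with a random forest, assuming scikit-learn and NumPy are installed. The data is synthetic and the function name `top_features` is made up for this example; it is not taken from the video or the notebook. scikit-learn exposes the impurity-based importances as `feature_importances_` (for a regressor the impurity is variance rather than the Gini index, but the idea is the same).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def top_features(X, y, k=2, seed=0):
    """Return the indices of the k columns the forest ranks as most important."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X, y)
    # feature_importances_ holds one impurity-based importance per column
    order = np.argsort(model.feature_importances_)[::-1]
    return [int(i) for i in order[:k]]


# Synthetic stand-in for a fingerprint table: 200 molecules x 20 binary bits
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20))
# Hypothetical activity that depends only on bits 0 and 1, plus a little noise
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

print(top_features(X, y, k=2))
```

With real descriptors, the ranked columns could then be used to keep only a subset of the fingerprint table before model building.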
Hello Sir
First of all, thank you so much for such great videos!!
I am building my research on this. I have a doubt regarding the input for this part: to get the CSV file for a different target, what would have to be done? Do I have to make the file manually? I checked in ChEMBL and there were only 4 drugs for my target. What do you think can be done in such a situation?
Thanks in advance
I am unable to execute any bash commands in my Jupyter notebook running on a Windows machine. I have followed your tutorial listed in this section without success.
It appears that I will need access to bash for executing CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb. I have tried restarting my kernel but there appears to be only one selection. Any advice would be helpful. Thanks
clap-clap for me, I listened till the end!
Hi, how is ! bash padel.sh applied to molecule.smi if there are more .smi files in the current working directory?
Good question, make sure that there is only 1 .smi file in the current working directory. Thanks, I forgot to mention this important point. 😊
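One way to guard against this before running the script is to check the directory from Python. The following is only a sketch using the standard library; the helper name `find_single_smi` is hypothetical and is not part of the notebook.

```python
from pathlib import Path


def find_single_smi(directory="."):
    """Return the only .smi file in `directory`; raise if there is not exactly one."""
    smi_files = sorted(Path(directory).glob("*.smi"))
    if len(smi_files) != 1:
        names = [p.name for p in smi_files]
        raise RuntimeError(
            f"Expected exactly one .smi file, found {len(smi_files)}: {names}"
        )
    return smi_files[0]


# Typical use in a notebook cell, just before `! bash padel.sh`:
# print(find_single_smi())
```

If the check raises, move the extra .smi files to another folder and run it again.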
@@DataProfessor thank you so much for your kind response.
A few more doubts:
1. Can't we use logP, molecular weight, rotatable bonds and aromatic proportion along with the fingerprints, and does it make sense to combine those four features with the fingerprint features?
2. Extraction of fingerprint features is very, very slow. How can we speed it up? Does one need to go for parallel computing like Spark, or is there any way in pandas to speed up the process?
Hello Sir, you are doing a great job. Your videos are the best source information related to data science.
Thanks so much Varun for the kind words, I am flattered 😊
I am a computational chemistry researcher. Great videos, thank you. How can we generate other fingerprint descriptors using PaDEL?
Hello, when I tried running !bash padel.sh, It said line 1: java no command found. How to fix this?
Hi, it means that you don’t have Java installed, please install it and try again.
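A quick way to see which command-line tools are missing is to look them up on the PATH from Python. This is a small sketch using only the standard library; the function name `missing_tools` is made up for illustration, and the list of commands is an assumption based on the ones mentioned in this thread.

```python
import shutil


def missing_tools(names):
    """Return the commands from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# Commands the shell cells in this thread rely on (java is needed by padel.sh)
print(missing_tools(["java", "bash", "wget"]))
```

An empty list means all three were found; anything listed needs to be installed (or added to the PATH) before the corresponding cell will work.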
Hey Chanin,
Can you tell me why we calculated the Lipinski descriptors in the previous video if we are just going to use the PaDEL descriptors for ML? Are we going to combine both descriptor sets for the model, and what kind of regression model are we going to use to get the top 10 molecules from this?
Hi Neel, that's a great question. The two descriptor types serve different purposes, although both describe the molecular properties of a molecule. Firstly, the Lipinski descriptors describe the general properties of a molecule at the macro level, since they account for the whole molecule: its molecular weight, the total count of hydrogen bond donors and acceptors, and its lipophilicity (LogP, which relates to solubility). Secondly, the PubChem fingerprints from PaDEL describe the local properties of a molecule at the micro level, since they break the molecule down into its constituent parts: each of the 881 descriptors is a binary value indicating whether the molecule has (1) or does not have (0) a particular substructure such as a functional group (a small collection of bonded atoms). Compared to Lego building blocks, 1 Lego block would correspond to 1 atom, 50 blocks would correspond to the whole molecule, and N blocks (e.g. 3, 4, 5, 6 or 7) would correspond to a functional group. The order and connectivity of these Lego blocks matter and influence the molecular properties, which is what allows researchers to design new drugs. This is the primary expertise of medicinal chemists. Hope this helps 😃
You restore my hopes for becoming a researcher ;)))
Hi, glad to hear that!
Sir,
Can PubChem fingerprints be used directly as input to the discriminator of a GAN for a drug discovery model,
or is it necessary to convert the canonical SMILES to a one-hot encoded format?
Great, I came across your channel in recent days
Thank you and welcome to the channel 😃
really wonderful videos, I hope you may make a video for ligand-protein interactions representation
Dear Data Professor, looking forward to the fourth part. Thanks
Thanks for the support, it is in the making 😃
I have a question: what if I wanted to look at a different protein? How would I go about doing that? I've completed parts 1 and 2 looking at a protein associated with Alzheimer's and would like to use it for parts 3 and 4.
Hi, you’ll have to redo the steps but change the protein name to a different one
Wait, so why was the SARS coronavirus 3C-like proteinase target abandoned in this? Did you try to see if you can build a model and did not work?
Hi, the dataset was quite small, so owing to the scarce data I shifted to a target with a larger dataset.
Waiting to see the model building episode!
It’s live and available here ruclips.net/video/wGaGm0sj04M/видео.html
So I am confused as to why Part 1 creates the "bioactivity file associated with coronavirus" and then Part 2 uses this file. Part 3 then references a file that was never created. I looked in the GitHub and saw another codeset showing how to get the 2nd file. I wish this was explained better at the beginning of this video.
I am looking for a protein sequence feature extraction (descriptor) tool in Python. Can you please suggest one?
Hi, there are a couple, some are Python libraries such as PyBioMed, iFeature, pydpi while some are online webservers such as PROFEAT.
Thank you for this amazing project.
Thank you for the kind words 😃
Great! thanks for the video!
My pleasure!
Good job prof. Please, why do you use only the fingerprint descriptors instead of calculating all the descriptors that PaDEL can calculate?
Also, will the fingerprint descriptors describe the compounds better?
In the descriptors_output.csv file, how did you get the labels as PubChemFP0, PubChemFP1, PubChemFP2...?
When I converted the molecules.smi file, I'm getting the labels as nHBAcc nHBAcc2 nHBAcc3 nHBAcc_Lipinski nHBDon....
I got 'em 😅😁. I just had to choose the fingerprints checkbox in the General section of the PaDEL interface.
Yes they’re different descriptors :)
Hi Prof. Could you please share the papers published from this research activity? I would like to read the interpretation of some of the results. Thank you sir
Thanks a lot for these videos. I recently tried to move into cheminformatics, without success. I know we have to learn Python, R, and other programming tools; I'm not sure if it is mandatory, but I guess so, so I'm starting with Python. That's why I was able to understand your Drug Discovery ('coronavirus') project. Finally, a very clear tutorial to begin with. I'm an associate professor in organic chemistry and I will be following you whenever you upload videos. Thanks a lot again. Right now, I'm trying to write a project for my new Ph.D., this time in cheminformatics. I hope I will be able to be in contact with you, and that you will be willing to clear up the several questions that I'm sure I will have. Here is my first one: as an organic chemistry specialist, and given your experience, which subject could I develop in cheminformatics?
Thanks for sharing your journey. Glad you are finding tutorials helpful. With your expertise in organic chemistry, an area that you could look into could be predicting the synthetic yield or the synthesis outcome on the basis of the reactants. Another area could be quantitative structure-activity relationship that would allow the correlation of the chemical structures with their physicochemical properties or biological activity. In one paper earlier in my career I predicted the bond dissociation energy of antioxidants in order to understand the structure-activity relationship that could help in the discovery of novel antioxidants. My other work is into predicting the enzyme inhibition potential of compounds (predicting the pIC50 values).
@@DataProfessor Thank you very much for your suggestions, I'm going to take them into account. I have been teaching chemistry in a postgraduate programme in food biotechnology. And in reality, I am thinking more about foodinformatics and chemistry. If you have another suggestion about foods and chemistry, I will appreciate it. Thank you very much in advance.
I can't run the !bash (and !wget too) commands in my Jupyter notebook. I can't find any solution that works for me on Stack Overflow either. Has anyone ever had this problem? In the meantime I just downloaded descriptor_output from GitHub.
What's the purpose of the Ro5 (rule of five)? Because in this experiment you used just the SMILES notation to construct the feature table?
Thx for the content 😍
When can we see the next part on model building?
Working on it, hoping to push it out real soon.
Professor. I am having trouble using ! cat in jupyter notebooks as well, like one of the other posters. I'm not familiar with ! cat and ! bash commands or what prerequisites are required to use these commands in jupyter on windows. Are there other commands that can be used in place of this?
Thanks for the awesome videos!
Hi Robert. I made a video dedicated to Bash command lines, please check it out at ruclips.net/video/SZj55PihnBs/видео.html
After watching that video, I think you will understand how to handle files and navigate through directories in the command line using Bash. The exclamation mark in Jupyter is a way of telling Jupyter that what follows is a shell command (Bash on Linux and macOS); otherwise it will assume that it is Python code.
! cat
needs to be
! cat file.csv
which will display the contents of a file called file.csv
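For anyone on Windows where `cat` is not available, the same thing can be done in plain Python. This is a minimal sketch with the standard library only; `show_file` is a hypothetical helper name, and file.csv is just the example file name from the reply above.

```python
from pathlib import Path


def show_file(path, max_lines=None):
    """Return the file's text, like `cat`; keep only the first `max_lines` lines if given."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if max_lines is not None:
        lines = lines[:max_lines]
    return "\n".join(lines)


# Rough equivalents of `! cat file.csv` and `! head -5 file.csv`:
# print(show_file("file.csv"))
# print(show_file("file.csv", max_lines=5))
```

Because this runs in the Python kernel itself, it should behave the same on Windows, macOS and Linux.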
Great video Professor.
I have one doubt: can we use the descriptors, fingerprints and pIC50 values and pose this as a classification problem by predicting the bioactivity labels?
pIC50 values can be used as Y in regression models, while the binned version (class labels of active and inactive) could be used for classification models.
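To make the regression versus classification distinction concrete, here is a small sketch of converting IC50 to pIC50 and binning it into labels. The cut-offs used (pIC50 of 6 or more as active, 5 or less as inactive, anything in between as intermediate, i.e. IC50 of 1000 nM and 10000 nM) are assumptions for this example, and the function names are made up.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    # -log10(x * 1e-9) is the same as 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


def bioactivity_class(pic50, active_at=6.0, inactive_at=5.0):
    """Bin a pIC50 value into a class label (the cut-offs are assumptions)."""
    if pic50 >= active_at:
        return "active"
    if pic50 <= inactive_at:
        return "inactive"
    return "intermediate"


for ic50_nm in (50, 5000, 50000):
    value = pic50_from_nm(ic50_nm)
    print(ic50_nm, round(value, 2), bioactivity_class(value))
```

The continuous pIC50 column would be the Y for a regression model, while the label column would be the Y for a classifier.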
@@DataProfessor 🙌🏼
Thanks! Amazing series :)
Thanks Juan for the kind words 😃
Awesome tutorial series!
Thank you @Import Data 😃
Hi, @Data Professor Can you post a blog/video on 3D descriptor calculations using rdkit? Many thanks.
Thanks for the suggestion, I'll keep this in mind for the future videos.
@@DataProfessor Thank you, Sir!
Thanks Professor. Finally I found this video. Your video is very good and easy to understand.
I have a plan to do a virtual screening to look for lead compounds. I want to ask something, can we save the data in sdf (structure-data file)?
Yes, you can do that with RDKit, where mols is a list of RDKit molecule objects:
>>> from rdkit import Chem
>>> w = Chem.SDWriter('data/foo.sdf')
>>> for m in mols: w.write(m)
>>> w.close()
Source: www.rdkit.org/docs/GettingStartedInPython.html
@@DataProfessor Thank you for the reply. Can't wait for the fourth part
When I use the -wget command I get the error "Unable to establish SSL connection." Can anyone help me with this?
Hi, can you change -wget to ! wget in a code cell.
@@DataProfessor I'm still getting the same error, is there any alternative way to get the padel descriptor without using !wget?
Hi, PaDEL descriptors can be calculated using the PaDEL-Descriptor software available at www.yapcwsoft.com/dd/padeldescriptor/PaDEL-Descriptor.zip
The video shows how to compute them in the Jupyter notebook, while the above ZIP file can also be used to calculate them manually using point and click, which I show in the following video ruclips.net/video/yf3N0nnAFDk/видео.html
Back to the wget problem: putting ! in front of wget will work inside a Jupyter notebook. If you are typing into a command line terminal then you don’t need the ! symbol and can just go ahead and use wget (works in Ubuntu Linux).
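If wget is not available at all, a download can also be attempted from Python itself. The following is only a sketch using the standard library; `fetch_and_unzip` is a hypothetical helper, and the URL in the usage comment is a placeholder, not a verified download link. If the SSL error comes from the network or the server's certificate, this may fail in the same way, in which case downloading the ZIP in a browser is the fallback.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_unzip(url, zip_path, extract_dir):
    """Download `url` to `zip_path`, extract it into `extract_dir`, and list the result."""
    zip_path = Path(zip_path)
    # Stream the response to disk instead of holding it all in memory
    with urllib.request.urlopen(url) as response, open(zip_path, "wb") as out:
        shutil.copyfileobj(response, out)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
    return sorted(p.name for p in Path(extract_dir).iterdir())


# Placeholder URL for illustration only; substitute the real ZIP location:
# fetch_and_unzip("https://example.org/PaDEL-Descriptor.zip",
#                 "PaDEL-Descriptor.zip", "PaDEL-Descriptor")
```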
Most interesting explanation
Well done, man
Thank you for watching!
great work, i have a question about Drug Target Interactions. Can you make a series on drug target interactions
Hi, thanks. Actually, this series takes a look at predicting interactions of compounds with drug targets. Please make sure to finish the Bioinformatics playlist (ruclips.net/video/plVLRashaA8/видео.html), there are more of these types of videos.
@@DataProfessor ok sir thank you
Hello, I have a doubt which is not regarding this video. If it's okay to comment here-- Can you explain how to draw a learning curve for multi-class classification problems (one vs all method). Is it possible? Thank you for the great content!
Hi Siddhi, your answer is in this webpage, scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#plot-roc-curves-for-the-multilabel-problem
Looking forward to your update, thank you.
The next part of this series is in the making, I will update soon on its release. Please stay tuned and hit the notification bell 😃
Thanks Data Professor. It's the best news for me, thanks.
@@pcliang2693 😃
Good day Prof, I just want to say a big thank you to you for the knowledge you are dishing out for free. I am very grateful and you've been a source of motivation for me.
Just a quick one, how do I install and run PaDEL descriptors in Pandas? I have been using pandas for this tutorial thus far till I got stuck with the PaDEL descriptors. Kindly assist with this.
Thank you so much, Prof.
Thanks Jacob for the kind words and encouragement. On running PaDEL: it is a Java JAR file and I've put the specific commands to get it working inside a Jupyter notebook. You can simply run the code cells from the notebook on the Data Professor GitHub (make sure to run them in sequential order, and note that the PaDEL calculation will take a long time if the number of compounds is large): github.com/dataprofessor/code/blob/master/python/CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
Sir kindly tell ... are u associated to INeuron team ?? with Krish sir, code basics and ken jee.. if u are a part of that then i am interested
I am not associated with iNeuron, but yes, I am friends with Ken, Krish and Dhaval; we have done 2 prior podcast videos with all 3.
@@DataProfessor thanks sir., one more question.. don't u have any courses regarding data science or ML .. (including those 3 tutors also)
@@NikhilSharma-ey3ml I might be creating some courses on the popular learning platform in the future but currently I don't have any online courses. So far, the 4 of us are creating online tutorial videos on our respective YouTube channels.
You are awesome!
Thanks!
Is there any way we can plot canonical SMILES to get more meaningful data?
We can display it as a 2D or 3D structure via rdkit or nglview
@@DataProfessor any github or youtube link????
@@muditarora9860 Sure here they are
- rdkit www.rdkit.org/
- nglview github.com/nglviewer/nglview
I can make a video about these libraries if interested.
@@DataProfessor yes sir, it would be beneficial. I saw your Parts 1 and 2, and if we could plot these canonical SMILES, it could give more meaningful insights from the data.
Thanks professor, it's very cool. Could I use your approach in my research? It's my pleasure to know you, thanks.
Of course, what are you studying and researching about?
Dear professor, I'm a master's student at Shanghai University of TCM, China. My research direction is drug screening and drug-loaded implants in orthopedics.
Thanks for sharing pc liang
Excellent videos by the Data Professor. Feel free to read the following blog paper on Medium website “Apply Machine Learning
Algorithms for Genomics Data Classification”. This will help you to understand how to apply Machine Learning algorithms for
genomic data classification. This blog paper contains the latest ML/AI technologies applied to human genomic data classification today.