Bioinformatics Project from Scratch - Drug Discovery Part 3 (Dataset Preparation)

Data Professor

21K views

Published: 29 Jan 2025

Comments • 120

@muhammadjamalahmed8664 4 years ago ⁺¹⁹
I have no words to praise your work! Please don't ever stop making these videos. Such great work you are doing.
@DataProfessor 4 years ago ⁺³
Thank you for the kind, motivational words, greatly appreciated 😃
@traveldiaries347 4 years ago ⁺²
Superhuman, with an amalgamation of bio and data science skills from a parallel universe! Thank you Data Professor. We are here to see more and more content in the cheminformatics series.
@DataProfessor 4 years ago ⁺¹
Awesome, glad you’re enjoying the series
@xavierchee 4 years ago ⁺⁵
You're an absolute star. I wish I could be as good a lecturer as you. Looking forward to Part 4 of this series. Absolutely amazing!
@DataProfessor 4 years ago
Thank you for the kind words. I’m working on Part 4, please stay tuned 😀
@eyupbilgi3191 3 years ago ⁺¹
You are a lifesaver. I think you have helped train more data scientists than many universities, all by yourself.
@DataProfessor 3 years ago ⁺¹
Wow, appreciate the kind words!
@oludhe7 4 years ago ⁺¹
I haven't been this excited for a series since Neural Networks From Scratch by Sentdex
@DataProfessor 4 years ago ⁺¹
Thanks Thomas
@fjg9657 2 years ago ⁺¹
Good stuff! Exceptionally good stuff! Followed to the end.
@davudrabiei6371 3 years ago ⁺¹
These courses, especially the bioinformatics courses, are fantastic. Your work is excellent, bro
@DataProfessor 3 years ago
Glad to hear!
@hyrunnisa997 3 years ago ⁺¹
I am halfway through. I am very excited to be working on this project.
@DataProfessor 3 years ago
Wonderful!
@musicalrea5433 4 years ago ⁺²
Great video!! For my part, let's move to the implementation. Thank you Data Prof!!
@DataProfessor 4 years ago ⁺¹
You are welcome! Glad it is helpful!
@shwetaredkar734 4 years ago ⁺¹
Right now Data Professor is equivalent to God for me. We need more such professors who help us understand bioinformatics with data science better.
@DataProfessor 4 years ago ⁺¹
Thanks Shweta for the kind praise and encouragement. I am glad you are finding value in the contents of this channel. I am thinking of publishing a paper about this experience as an academic YouTuber 😃
@shwetaredkar734 4 years ago
@@DataProfessor By the way, I always wanted to ask: are you related to Joma Tech? You both look alike. I get confused sometimes by the thumbnails :D
@somanathshyamsundar685 1 year ago ⁺¹
Thank you so much professor, grateful to learn from you
@surendrasukesheducationist8055 3 years ago
Sir, what do the PubChem fingerprints describe? That is, what do 0 and 1 represent in the CSV file?
@manu93ize 1 year ago
What is the significance of using the acetylcholinesterase protein? I'm a little confused: is this target protein present in COVID-19?
@misganamengistu5503 4 years ago ⁺¹
If we are using external software to calculate descriptors, how can we be selective about the descriptors required for correlating with inhibitory concentration? Descriptor calculators come with thousands of descriptor options, and I heard that it's not good for the model to use every descriptor.
@DataProfessor 4 years ago
Hi, that's a great question! There's an important aspect of the model building process known as feature selection (also known as variable selection). It is the process of selecting an important subset of features from an initially large collection of features. The reason why I like random forest is that it has built-in feature selection, where features with high Gini importance (a large mean decrease in Gini impurity) are deemed to be important.
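To make the idea in the reply above concrete, here is a minimal sketch, using only the Python standard library, of the Gini impurity decrease that random forest importances are built from. It scores each binary fingerprint bit for a single split against active/inactive labels. It is not a random forest, and the toy data and the names `gini` and `gini_decrease` are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(bits, labels):
    """Impurity decrease obtained by splitting the samples on one binary feature."""
    left = [y for b, y in zip(bits, labels) if b == 0]
    right = [y for b, y in zip(bits, labels) if b == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Toy data (hypothetical): 3 fingerprint bits for 6 compounds.
fingerprints = {
    "bit_a": [1, 1, 1, 0, 0, 0],  # separates the two classes perfectly
    "bit_b": [1, 0, 1, 0, 1, 0],  # only partly informative
    "bit_c": [1, 1, 1, 1, 1, 1],  # constant, carries no information
}
labels = ["active", "active", "active", "inactive", "inactive", "inactive"]

scores = {name: gini_decrease(bits, labels) for name, bits in fingerprints.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A random forest averages decreases like these over many trees and many splits, which is why a constant or near-constant bit ends up with an importance close to zero.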
@sandrajose7452 3 years ago ⁺²
Hello Sir
First of all, thank you so much for such great videos!!
I am building my research on this. I have a question regarding the input for this part: to get the CSV file for a different target, what would have to be done? Do I have to make the file manually? I checked in ChEMBL and there were only 4 drugs for my target. What do you think can be done in such a situation?
Thanks in advance
@davidcovell-rn9sw 1 month ago
I am unable to execute any bash commands in my Jupyter notebook running on a Windows machine. I have followed your tutorial listed in this section without success.
It appears that I will need access to bash for executing CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb. I have tried restarting my kernel but there appears to be only one selection. Any advice would be helpful. Thanks
@jw4659 2 years ago ⁺¹
clap-clap for me, I listened till the end!
@VLM234 4 years ago ⁺¹
Hi, how is ! bash padel.sh applied to molecule.smi if there are more .smi files in the current working directory?
@DataProfessor 4 years ago ⁺¹
Good question, make sure that there is only 1 .smi file in the current working directory. Thanks, I forgot to mention this important point. 😊
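The check described in the reply can be automated before running the PaDEL cell. This is a minimal sketch using only the standard library; the helper name `find_single_smi` is hypothetical, and it assumes, as the reply suggests, that the script works on whatever .smi files sit in the working directory.

```python
import tempfile
from pathlib import Path


def find_single_smi(directory="."):
    """Return the only .smi file in `directory`; raise if there is not exactly one."""
    smi_files = sorted(Path(directory).glob("*.smi"))
    if len(smi_files) != 1:
        names = [p.name for p in smi_files]
        raise RuntimeError(
            f"Expected exactly one .smi file, found {len(smi_files)}: {names}"
        )
    return smi_files[0]


# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "molecule.smi").write_text("CCO\tethanol\n")
    only_file = find_single_smi(tmp).name

    # A second .smi file makes the check fail, which is the case to catch.
    (Path(tmp) / "other.smi").write_text("CCN\tethylamine\n")
    try:
        find_single_smi(tmp)
        ambiguous = False
    except RuntimeError:
        ambiguous = True

print(only_file, ambiguous)
```

In a notebook, calling `find_single_smi()` with no argument in the cell before `! bash padel.sh` would stop early if stray .smi files are present.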
@VLM234 4 years ago ⁺¹
@@DataProfessor thank you so much for your kind response.
A few more questions:
1. Can't we use logP, molecular weight, rotatable bonds and aromatic proportion along with the fingerprints, and does it make sense to combine those four features with the fingerprint features?
2. Extraction of fingerprint features is very, very slow. How can we speed it up? Does one need to go for parallel computing like Spark, or is there any way in pandas to speed up the process?
@varunkaushik5051 4 years ago ⁺¹
Hello Sir, you are doing a great job. Your videos are the best source of information related to data science.
@DataProfessor 4 years ago
Thanks so much Varun for the kind words, I am flattered 😊
@1234567890000918 1 year ago ⁺¹
I am a computational chemistry researcher. Great videos, thank you. How can we generate other fingerprint descriptors using PaDEL?
@nguyenkaitlyn6364 3 years ago ⁺¹
Hello, when I tried running !bash padel.sh, it said "line 1: java: command not found". How do I fix this?
@DataProfessor 3 years ago
Hi, it means that you don’t have Java installed, please install it and try again.
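Whether Java is visible to the notebook can be checked from Python before running the PaDEL cell. This is a minimal sketch using only the standard library; the helper name `java_available` is hypothetical, and it only tests whether a `java` executable is on the PATH, not whether its version suits PaDEL.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    # `java -version` writes its version information to stderr.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print(result.stderr.strip() or result.stdout.strip())
else:
    print("java not found on PATH; install a Java runtime, then rerun the PaDEL cell")
```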
@neelshah6970 4 years ago ⁺¹
Hey Chanin,
Can you tell me why we calculated the Lipinski descriptors in the previous video if we are just going to use the PaDEL descriptors for ML? Are we going to combine both descriptors for the model, and what kind of regression model are we going to use to get the top 10 molecules from this?
@DataProfessor 4 years ago ⁺³
Hi Neel, that's a great question. Each descriptor type is used for a different purpose, although both describe the molecular properties of a molecule. Firstly, the Lipinski descriptors describe the general properties of a molecule at the macro level, since they account for the molecule's entire molecular weight, the total count of hydrogen bond donors/acceptors and the molecule's solubility (LogP). Secondly, the PubChem descriptors from PaDEL account for the local properties of a molecule at the micro level, since they break the molecule down into its atomic constituents: each of the 881 descriptors is a binary fingerprint of whether the molecule has or does not have a particular functional group (a small collection of bonded atoms). To compare with Lego building blocks: 1 Lego would correspond to 1 atom, 50 Lego would correspond to the molecule, whereas N (e.g. 3, 4, 5, 6 or 7, etc.) would correspond to a functional group. The order and connectivity of these Lego blocks matter and influence the molecular property, thus allowing researchers to design new drugs. This is the primary expertise of medicinal chemists. Hope this helps 😃
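The binary nature of the fingerprints can be seen by reading the descriptor file row by row. This is a minimal sketch using only the standard library and a toy stand-in for descriptors_output.csv: the values are invented, only 4 bits are shown instead of 881, and the `Name` column is an assumption about the file layout. A value of 1 means the substructure behind that bit is present in the molecule, 0 means it is absent.

```python
import csv
import io

# Toy stand-in for descriptors_output.csv (hypothetical values, 4 bits instead of 881).
toy_csv = io.StringIO(
    "Name,PubChemFP0,PubChemFP1,PubChemFP2,PubChemFP3\n"
    "mol_1,1,0,1,1\n"
    "mol_2,0,0,1,0\n"
)


def set_bits(row):
    """Return the names of the fingerprint columns whose value is 1."""
    return [
        column
        for column, value in row.items()
        if column.startswith("PubChemFP") and value == "1"
    ]


# Map each molecule name to the list of substructure bits that are switched on.
present = {row["Name"]: set_bits(row) for row in csv.DictReader(toy_csv)}
print(present)
```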
@fernandomartinez9976 3 years ago ⁺²
You restore my hopes for becoming a researcher ;)))
@DataProfessor 3 years ago
Hi, glad to hear that!
@hritikkumar3403 4 years ago
Sir,
Can PubChem fingerprints be used directly as input to the discriminator of a GAN for a drug discovery model,
or is it necessary to convert canonical SMILES to a one-hot encoded format?
@manojsailankepalle2399 4 years ago ⁺²
Great, I came across your channel in recent days
@DataProfessor 4 years ago
Thank you and welcome to the channel 😃
@amaransi4900 4 years ago ⁺¹
Really wonderful videos, I hope you may make a video on representing ligand-protein interactions
@pcliang2693 4 years ago ⁺¹
Dear Data Professor, looking forward to the fourth part. Thanks
@DataProfessor 4 years ago
Thanks for the support, it is in the making 😃
@seandollinger8021 1 year ago
I have a question: what if I wanted to look at a different protein? How would I go about doing that? I've completed Parts 1 and 2 looking at a protein associated with Alzheimer's and would like to continue with it in Parts 3 and 4.
@DataProfessor 1 year ago
Hi, you’ll have to redo the steps but change the protein name to a different one
@matasvitkauskas7541 3 years ago ⁺¹
Wait, so why was the SARS coronavirus 3C-like proteinase target abandoned in this? Did you try to see if you could build a model and it did not work?
@DataProfessor 3 years ago ⁺¹
Hi, the dataset was quite small, and owing to the scarce data the shift was made to a larger dataset
@minicorefacility 4 years ago ⁺¹
Waiting to see the model building episode!
@DataProfessor 4 years ago
It’s live and available here ruclips.net/video/wGaGm0sj04M/видео.html
@danielfaber4020 2 years ago ⁺¹
So I am confused as to why Part 1 creates the "bioactivity file associated with coronavirus" and then Part 2 uses this file. Part 3 then references a file that was never created. I looked in the GitHub and saw another codeset showing how to get the 2nd file. Wish this was explained better at the beginning of this video.
@arjunpuri1104 4 years ago ⁺¹
I am looking for a protein sequence feature extraction descriptor in Python. Can you please suggest one?
@DataProfessor 4 years ago
Hi, there are a couple: some are Python libraries such as PyBioMed, iFeature and pydpi, while some are online web servers such as PROFEAT.
@otmanjai9724 4 years ago ⁺³
Thank you for this amazing project.
@DataProfessor 4 years ago ⁺¹
Thank you for the kind words 😃
@chyokomizo 4 years ago ⁺³
Great! thanks for the video!
@DataProfessor 4 years ago ⁺¹
My pleasure!
@beatricenkiruka 2 years ago
Good job prof. Please, why do you use only the fingerprint descriptors instead of calculating all the descriptors that PaDEL can calculate?
Also, will the fingerprint descriptors describe the compounds better?
@tinaparker9974 1 year ago
In the descriptors_output.csv file, how did you get the labels as PubChemFP0,PubChemFP1,PubChemFP2...
When I converted the molecules.smi file I'm getting the labels as nHBAcc nHBAcc2 nHBAcc3 nHBAcc_Lipinski nHBDon....
@tinaparker9974 1 year ago
I got 'em 😅😁. I just had to choose the fingerprints checkbox in the General section of the PaDEL interface
@DataProfessor 1 year ago
Yes they’re different descriptors :)
@EChannelVentures 2 years ago
Hi Prof. Could you please share the papers published from this research activity? I would like to read the interpretation of some of the results. Thank you sir
@LGLVELL 4 years ago ⁺¹
Thanks a lot for these videos. I recently tried to move into cheminformatics without success. I know we have to learn Python, R and other programming tools; I'm not sure if it is mandatory, but I guess so. Anyway, I'm starting with Python, and that's why I was able to understand your Drug Discovery ('coronavirus') project. Finally, a very clear tutorial to begin with. I'm an associate professor in organic chemistry and I will be following you whenever you upload videos. Thanks a lot again. Right now, I'm trying to write a project for my new Ph.D., this time in cheminformatics. I hope I will be able to be in contact with you, and that you will be willing to clear up the several questions that I'm sure I will have. Here is my first question: for me, as an organic chemistry specialist, and from your experience, which subject could I develop in cheminformatics?
@DataProfessor 4 years ago
Thanks for sharing your journey. Glad you are finding the tutorials helpful. With your expertise in organic chemistry, an area that you could look into could be predicting the synthetic yield or the synthesis outcome on the basis of the reactants. Another area could be quantitative structure-activity relationships, which would allow the correlation of chemical structures with their physicochemical properties or biological activity. In one paper earlier in my career I predicted the bond dissociation energy of antioxidants in order to understand the structure-activity relationship that could help in the discovery of novel antioxidants. My other work is on predicting the enzyme inhibition potential of compounds (predicting the pIC50 values).
@LGLVELL 4 years ago
@@DataProfessor Thank you very much for your suggestions, I'm going to take them into account. I have been teaching chemistry in a postgraduate program in food biotechnology. And in reality, I am thinking more about foodinformatics and chemistry. If you have another suggestion about foods and chemistry, I will appreciate it. Thank you very much in advance.
@alfonlinata6797 4 years ago ⁺¹
I can't run the !bash (and !wget) commands in my Jupyter notebook. I haven't found any solution that works for me on Stack Overflow either. Has anyone else had this problem? In the meantime I just downloaded descriptor_output from GitHub.
@souldiezcamp2380 3 years ago
What's the purpose of the Ro5? Because in this experiment you used just the SMILES notations for the construction of the features table?
Thx for the content 😍
@TechwithmeTamil 4 years ago ⁺²
When can we see the next part on model building?
@DataProfessor 4 years ago ⁺²
Working on it, hoping to push it out real soon.
@roberthodges508 4 years ago
Professor, I am having trouble using ! cat in Jupyter notebooks as well, like one of the other posters. I'm not familiar with the ! cat and ! bash commands or what prerequisites are required to use these commands in Jupyter on Windows. Are there other commands that can be used in place of these?
Thanks for the awesome videos!
@DataProfessor 4 years ago
Hi Robert. I made a video dedicated to Bash command lines, please check it out at ruclips.net/video/SZj55PihnBs/видео.html
After watching that video, I think you will understand how to handle files and navigate through directories in the command line using Bash. The exclamation mark in Jupyter is a way of telling Jupyter that what follows is a shell command (Bash, in this tutorial); otherwise it will assume that it is Python code.
! cat
needs to be
! cat file.csv
which will display the contents of a file called file.csv
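Where `! cat` is not available (for example in a notebook on Windows without Bash), the same result can be had in plain Python. This is a minimal sketch using only the standard library; the helper name `show_file` is hypothetical, and the demo writes its own throwaway file.csv so it runs anywhere.

```python
import tempfile
from pathlib import Path


def show_file(path, max_lines=None):
    """Return the text of `path`, optionally limited to the first `max_lines` lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if max_lines is not None:
        lines = lines[:max_lines]
    return "\n".join(lines)


# Demo with a throwaway file so the sketch does not depend on the tutorial data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "file.csv"
    demo.write_text("a,b\n1,2\n3,4\n", encoding="utf-8")
    full_text = show_file(demo)
    first_line = show_file(demo, max_lines=1)

print(full_text)
```

In the notebook, `print(show_file("file.csv"))` plays the role of `! cat file.csv`, and `max_lines` gives a rough equivalent of `head`.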
@aks2045 4 years ago ⁺¹
Great video Professor.
I have one question: can we use the descriptors, fingerprints and pIC50 values and pose this as a classification problem by predicting the bioactivity labels?
@DataProfessor 4 years ago ⁺¹
pIC50 values can be used as Y in regression models, while the binned version (class labels of active and inactive) could be used for classification models.
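The two kinds of target in the reply can be derived from the same IC50 column. This is a minimal sketch using only the standard library. The pIC50 formula is the standard definition; the cut-offs (active at 1000 nM or below, inactive at 10000 nM or above, intermediate in between) are an assumption not stated in this thread and should be adjusted to the dataset, and both function names are hypothetical.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50, i.e. -log10 of the molar IC50."""
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


def bioactivity_class(ic50_nm, active_max_nm=1000.0, inactive_min_nm=10000.0):
    """Bin an IC50 into a class label using the assumed cut-offs."""
    if ic50_nm <= active_max_nm:
        return "active"
    if ic50_nm >= inactive_min_nm:
        return "inactive"
    return "intermediate"


# Hypothetical IC50 values in nM: the number feeds a regression model,
# the label feeds a classification model.
for ic50 in (50.0, 5000.0, 20000.0):
    print(ic50, round(pic50_from_nm(ic50), 2), bioactivity_class(ic50))
```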
@aks2045 4 years ago
@@DataProfessor 🙌🏼
@optimizacioneningenieria3385 4 years ago ⁺²
Thanks! Amazing series :)
@DataProfessor 4 years ago
Thanks Juan for the kind words 😃
@ImportData1 4 years ago ⁺¹
Awesome tutorial series!
@DataProfessor 4 years ago ⁺¹
Thank you @Import Data 😃
@afreens.k.8088 4 years ago ⁺¹
Hi, @Data Professor Can you post a blog/video on 3D descriptor calculations using RDKit? Many thanks.
@DataProfessor 4 years ago ⁺¹
Thanks for the suggestion, I'll keep this in mind for future videos.
@afreens.k.8088 4 years ago
@@DataProfessor Thank you, Sir!
@vanpacoel663 4 years ago ⁺²
Thanks Professor. Finally I found this video. Your video is very good and easy to understand.
I plan to do a virtual screening to look for lead compounds. I want to ask something: can we save the data in SDF (structure-data file) format?
@DataProfessor 4 years ago
Yes, you can do that with RDKit (here mols is a list of RDKit molecules) using
>>> from rdkit import Chem
>>> w = Chem.SDWriter('data/foo.sdf')
>>> for m in mols: w.write(m)
>>> w.close()
Source: www.rdkit.org/docs/GettingStartedInPython.html
@vanpacoel663 4 years ago
@@DataProfessor Thank you for the reply. Can't wait for the fourth part
@djk5867 4 years ago ⁺¹
When I use the -wget command I get the error "Unable to establish SSL connection." Can anyone help me with this?
@DataProfessor 4 years ago ⁺¹
Hi, can you change -wget to ! wget in a code cell?
@djk5867 4 years ago ⁺¹
@@DataProfessor I'm still getting the same error. Is there any alternative way to get the PaDEL descriptors without using !wget?
@DataProfessor 4 years ago ⁺¹
Hi, PaDEL descriptors can be calculated using the PaDEL-Descriptor software available at www.yapcwsoft.com/dd/padeldescriptor/PaDEL-Descriptor.zip
The video shows how to compute them in the Jupyter notebook, while the above ZIP file can also be used to calculate them manually by point and click, which I show in the following video ruclips.net/video/yf3N0nnAFDk/видео.html
Back to the wget problem: note that using ! in front of wget works inside a Jupyter notebook; if you are typing into a command line terminal then you don’t need the ! symbol and can just go ahead and use wget (works in Ubuntu Linux)
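For notebooks where wget is not installed at all, the download can be done from Python instead. This is a minimal sketch using only the standard library; the helper name `download` is hypothetical, and the demo copies a local file through a file:// URL so that it runs offline. It removes the dependency on wget but does not by itself address an SSL error, which may come from the network or the certificate setup.

```python
import tempfile
import urllib.request
from pathlib import Path


def download(url, destination):
    """Download `url` to `destination` and return the destination path."""
    destination = Path(destination)
    urllib.request.urlretrieve(url, str(destination))
    return destination


# Offline demo: a local file addressed through a file:// URL stands in for the ZIP.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_text("hello", encoding="utf-8")
    copy = download(source.as_uri(), Path(tmp) / "copy.txt")
    copied_text = copy.read_text(encoding="utf-8")

print(copied_text)
```

In the notebook the call would look like `download(url, "PaDEL-Descriptor.zip")`, with `url` set to the full address of the ZIP file mentioned in the reply above.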
@neelcoc6749 4 years ago ⁺¹
Most interesting explanation
@bikonkumardas 4 years ago ⁺¹
Well done, man
@DataProfessor 4 years ago
Thank you for watching!
@BilalAhmad-pi6lp 4 years ago ⁺¹
Great work. I have a question about drug-target interactions: can you make a series on drug-target interactions?
@DataProfessor 4 years ago
Hi, thanks. Actually, this series takes a look at predicting interactions of compounds with drug targets. Please make sure to finish the Bioinformatics playlist (ruclips.net/video/plVLRashaA8/видео.html), there are more of these types of videos.
@BilalAhmad-pi6lp 4 years ago ⁺¹
@@DataProfessor ok sir thank you
@siddhipandare3882 4 years ago ⁺¹
Hello, I have a question that is not about this video, if it's okay to comment here: can you explain how to draw a learning curve for multi-class classification problems (one-vs-all method)? Is it possible? Thank you for the great content!
@DataProfessor 4 years ago
Hi Siddhi, your answer is in this webpage, scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#plot-roc-curves-for-the-multilabel-problem
@pcliang2693 4 years ago
Looking forward to your update, thank you.
@DataProfessor 4 years ago
The next part of this series is in the making, will update soon on its release. Please stay tuned and hit the notification bell 😃
@pcliang2693 4 years ago ⁺¹
Thanks Data Professor. It's the best news for me, thanks.
@DataProfessor 4 years ago
@@pcliang2693 😃
@jacobukokobili6457 4 years ago ⁺¹
Good day Prof, I just want to say a big thank you for the knowledge you are dishing out for free. I am very grateful and you've been a source of motivation for me.
Just a quick one: how do I install and run PaDEL descriptors in pandas? I have been using pandas for this tutorial thus far, until I got stuck with the PaDEL descriptors. Kindly assist with this.
Thank you so much, Prof.
@DataProfessor 4 years ago
Thanks Jacob for the kind words and encouragement. On running PaDEL: it is a Java JAR file and I've put the specific commands to get it working inside a Jupyter notebook. You can simply run the code cells from within the Jupyter notebook from the Data Professor GitHub (make sure to run them in sequential order; it is also of note that the PaDEL calculation will take a long time if the number of compounds is large), github.com/dataprofessor/code/blob/master/python/CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
@NikhilSharma-ey3ml 4 years ago ⁺¹
Sir, kindly tell me: are you associated with the iNeuron team, with Krish sir, codebasics and Ken Jee? If you are a part of that then I am interested
@DataProfessor 4 years ago
I am not associated with iNeuron. Yes, I am friends with Ken, Krish and Dhaval; we have done 2 prior podcast videos with all 3.
@NikhilSharma-ey3ml 4 years ago ⁺¹
@@DataProfessor Thanks sir. One more question: don't you have any courses on data science or ML (including those 3 tutors also)?
@DataProfessor 4 years ago
@@NikhilSharma-ey3ml I might be creating some courses on the popular learning platforms in the future but currently I don't have any online courses. So far, the 4 of us are creating online tutorial videos on our respective YouTube channels.
@gemechisdegaga1754 4 years ago ⁺¹
You are awesome!
@DataProfessor 4 years ago
Thanks!
@muditarora9860 3 years ago ⁺¹
Is there any way we can plot canonical SMILES to get more meaningful data?
@DataProfessor 3 years ago
We can display it as a 2D or 3D structure via rdkit or nglview
@muditarora9860 3 years ago
@@DataProfessor Any GitHub or YouTube link?
@DataProfessor 3 years ago
@@muditarora9860 Sure here they are
- rdkit www.rdkit.org/
- nglview github.com/nglviewer/nglview
I can make a video about these libraries if interested.
@muditarora9860 3 years ago ⁺¹
@@DataProfessor Yes sir, it would be beneficial. I saw your Parts 1 and 2, and if we could plot these canonical SMILES, it could give more meaningful insights from the data.
@pcliang2693 4 years ago ⁺²
Thanks professor, it's very cool. Could I use your approach in my research? It's my pleasure to know you, thanks.
@DataProfessor 4 years ago
Of course, what are you studying and researching?
@pcliang2693 4 years ago ⁺¹
Dear professor, I'm a master's student at Shanghai University of TCM, China. My research direction is drug screening and drug-loaded implants in orthopedics.
@DataProfessor 4 years ago
Thanks for sharing pc liang
@ernestbonat2440 3 years ago
Excellent videos by the Data Professor. Feel free to read the following blog post on the Medium website: “Apply Machine Learning Algorithms for Genomics Data Classification”. This will help you to understand how to apply Machine Learning algorithms for genomic data classification. This blog post covers the latest ML/AI technologies applied to human genomic data classification today.
