👉Watch this video next (How to learn data science in 2021) ruclips.net/video/oR670Txwh88/видео.html Support this Channel 👇👇👇 🌟 Buy me a coffee www.buymeacoffee.com/dataprofessor 🌟 Download Kite for FREE www.kite.com/get-kite/? 👉 Subscribe to this RUclips channel ruclips.net/user/dataprofessor 👉 Join the Newsletter of Data Professor newsletter.dataprofessor.org 👉 Blogs on Medium medium.dataprofessor.org/
I have two questions about this process. First, if no "active" compounds exist for a given protein, can these statistical tests be performed between the intermediate and inactive compounds instead? Secondly, if we "fail to reject H0" on all of the Mann-Whitney tests, should we discard the data and choose a different protein instead of progressing to the next steps? Thanks.
Thank you, Professor, for this great lecture. I have a question about the interpretation of the Lipinski descriptors. After running the analysis myself, I found that none of the descriptors shows a significant difference. What does that mean? Thank you in advance, Prof.
Hi, that's a great question! The Lipinski descriptors usually don't contribute to highly predictive models of biological activity. For that, there are several other descriptor types, such as the PubChem fingerprint (computed via the PaDEL-Descriptor software). The Lipinski descriptors are used to evaluate the drug-likeness of compounds: the 4 descriptors provide a rough idea as to whether compounds are "drug-like" or not.
Please help I am getting an error TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
Right, we probably have to check that once we get 119 rows after removing the intermediate class, we also assign the result to a variable that we can use in subsequent steps.
Can you please answer my question? In the step that generates the box plot for pIC50, there wasn't a threshold value between the active and inactive classes, so when I then applied the Mann-Whitney U test it showed "same distribution" for all the Lipinski descriptors. Did I miss something, or is it normal to get this result? Thank you ✨
I need some help: one line of your code is not working. It shows an error saying that no registered C++ converter could handle the SMILES. Only this line fails, and it's an important single line of code. Please reply.
Can you please explain why you wrote the three lines of code before '! conda install -c rdkit rdkit -y'? Also, I am using a Mac, so which lines should I use since mine is not Linux? My code shows that the 'rdkit' module is not found when I try to import it. Please help!
Hi those 3 lines are for installing conda in Linux (Ubuntu) for the Google Colab since the notebook resets every logout. If you already have conda installed, you can ignore those 3 lines. If you have installed rdkit on your mac, then you can also delete that line as well.
No worries, if all reported compounds are active that is great news for you, you have a set of potent compounds. In this case, I would recommend to build regression models.
While importing rdkit, I am getting the error /usr/local/lib/python3.7/site-packages/rdkit/__init__.py import error undefined symbol: py__open. Any idea where the path of rdkit is defined? I am using Colab and have already installed conda and rdkit as per the instructions. I also installed miniconda3 in my home directory (but I'm not sure how to change the path so that the program looks for rdkit in my home directory instead of python3.7/site-packages).
When I run the Lipinski descriptors function on the dataset (df_lipinski), I'm getting this error: TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float Can someone help me solve this? I can't understand what the problem is.
Thanks Mandar for the question. I would say that data science can be used in all branches of science, healthcare, engineering and businesses. The reason for saying this is because all of these generate data in one form or another (text, numbers, images, videos, audio, etc.). As patterns are inherently presented in these big data it follows that data science has great potential in making sense of these data. Hope this helps.
Hi Professor, I've tried changing my target protein but in part 2 I'm getting an error and got stuck there. It will be helpful if you kindly help me with the problem : df_lipinski = lipinski(df_clean_smiles.canonical_smiles) in this cell I'm getting error which says as follows: ArgumentError Traceback (most recent call last) in () ----> 1 df_lipinski = lipinski(df_clean_smiles.canonical_smiles) ArgumentError: Python argument types in rdkit.Chem.rdMolDescriptors._CalcMolWt(NoneType) did not match C++ signature: _CalcMolWt(RDKit::ROMol mol, bool onlyHeavy=False)
I had the same error. In my case, the problem was NaN values in the canonical_smiles column. I solved it by removing the rows with NaN values in the canonical_smiles column with something like df3 = df2[df2.canonical_smiles.notna()]. I hope it helps you.
Dear sir, How to change and to install the another source-types or libraries of database? Thank you. For example, $ ! pip install PubChemPy $ import pandas as pd $ from pubchempy import * Is this correct?
This is literally awesome and a very very useful tutorial. Could you please make some tutorials on various packages such as protr, PyBiomed, propy, iFeature? Thanks would be helpful.
Hi Professor, I'm trying to do a project on this topic. Everything was going well until I stumbled on df_norm = norm_value(df_combined) df_norm. This code does not run; please help me fix this problem. Thank you.
Please check the df_combined variable to see whether the dataframe is as expected. Also go through the norm_value function line by line to check that it is generating the expected data transformation.
Hi, these codes can be used from within Python environment which can be used on any OS (Windows, Linux, Mac OSX). You can install Python via anaconda, mini-conda or directly from python.org. On top of Python, you can code in a Jupyter notebook (by first installing the jupyter library) or in an IDE (Spyder, PyCharm). Alternatively you can also work on the cloud using Google Colab (the Jupyter notebook on the cloud).
Amazing series on Drug discovery using ML. I have a question on installing rdkit in my virtual environment. Despite successful installation, import rdkit throws an error. Could you please, help me with it?
@@DataProfessor Yes, I did. I'm trying to install it on Ubuntu 18.04. I have set the python version of my virtual environment as 2.7.18 and then tried installing using conda.
Error in Calculate descriptors step: TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float How to solve?
Are you running from the Google Colab? Try running again in sequential order, if the error is still there, can you paste the entire error message here?
@@DataProfessor I tried running it again in sequential order but still getting traceback. TypeError Traceback (most recent call last) in () ----> 1 df_lipinski = lipinski(df.canonical_smiles) in lipinski(smiles, verbose) 5 moldata= [] 6 for elem in smiles: ----> 7 mol=Chem.MolFromSmiles(elem) 8 moldata.append(mol) 9 TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
@@maithilipokle4437 Hi, there are 2 parts to this tutorial. Here’s what you need to do: 1. Run Part 1 and save the generated bioactivity data CSV file to your local computer. 2. Run Part 2 and make sure to upload the CSV file from Part 1 to it. Make sure to do the upload before running the cell that reads the bioactivity data file. Hopefully this should work.
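A note on the "No registered converter ... Python object of type float" error reported in several comments here: it typically means that some entries in the canonical_smiles column are missing. pandas stores a missing value as NaN, which is a float, so Chem.MolFromSmiles receives a float instead of a string. A minimal sketch of removing those rows before calling lipinski(), assuming a DataFrame with a canonical_smiles column; the toy data and the helper name drop_missing_smiles are made up for illustration:

```python
import pandas as pd

def drop_missing_smiles(df, column="canonical_smiles"):
    """Return a copy of df without rows whose SMILES entry is missing or not a string."""
    is_text = df[column].apply(lambda value: isinstance(value, str))
    return df[is_text].reset_index(drop=True)

# Toy data (made up): the second row has no SMILES, so pandas holds it as NaN (a float)
df = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C"],
    "canonical_smiles": ["CCO", float("nan"), "c1ccccc1"],
})

df_clean_smiles = drop_missing_smiles(df)
print(len(df), "rows ->", len(df_clean_smiles), "rows")
```

Resetting the index keeps the rows aligned if the cleaned frame is later joined column-wise with the computed descriptors.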
Hello! Thank you for this amazing tutorial! I just have a small problem that I would like some help with. Whenever I run the norm_value() function I keep getting the error below. Does anyone know how I can solve this? Cell In[53], line 1 ----> 1 df_norm = norm_value(df_combined) 2 df_norm Cell In[49], line 10, in norm_value(input) 7 norm.append(i) 9 input['standard_value_norm'] = norm ---> 10 x = input.drop('standard_value', 1) 12 return x TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given
Dear Data Professor, Thank you for your great lecture. I need your help. Once I run the following input: df_norm = norm_value(df_combined) df_norm The output Is the following: TypeError Traceback (most recent call last) in () ----> 1 df_norm = norm_value(df_combined) 2 df_norm in norm_value(input) 3 4 for i in input['standard_value']: ----> 5 if i > 100000000: 6 i = 100000000 7 norm.append(i) TypeError: '>' not supported between instances of 'str' and 'int' Thank you in advance for your help
Python is confused when the input which is a string data type is used with the > operator. What you need to do is to first convert the input to an integer then try again. Hint: int(i)
@@DataProfessor thank you very much!! I solved that problem typing the following input: df_combined.standard_value.describe() pd.to_numeric(df_combined.standard_value, errors='coerce').fillna(0, downcast='infer'). Now when I run: df_norm.standard_value_norm.describe() the output is: count 1.320000e+02 mean 2.126145e+07 std 4.101124e+07 min 5.000000e+01 25% 1.070000e+04 50% 2.415000e+04 75% 3.000000e+05 max 1.000000e+08 Name: standard_value_norm, dtype: float64 Is different from your lecture because I used another dataset but now it is working well!! Many thanks. May I suggest some new lectures in chemoinformatics, in particular in protein-ligand interaction or protein structure prediction. This topic is very hot!! Have a nice weekend and thank you again!!
@@gabrieleblanca4702 Glad it's working out for you. Thanks for the suggestion. Protein-ligand interaction by means of molecular docking, I presume. I will definitely try to package the topic so that it is of wide interest and relevant to data science, and will consider it for future videos.
@@DataProfessor absolutely yes! Molecular docking should be of great relevance! I can't wait to watch it!! Thank you again for your availability. Have a nice day💪
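For the "'>' not supported between instances of 'str' and 'int'" error in this thread, one possible approach is to convert the whole standard_value column to numbers once, before norm_value runs. The sketch below is an illustration rather than the notebook's own code: the toy values are made up, and it drops unparsable rows instead of filling them with 0, since an IC50 of 0 has no defined logarithm in the later pIC50 step.

```python
import pandas as pd

# Toy data (made up): standard_value was read in as text, with one unparsable entry
df_combined = pd.DataFrame(
    {"standard_value": ["750.0", "100000", "not reported", "2.5e9"]}
)

# Convert text to numbers; anything that cannot be parsed becomes NaN
df_combined["standard_value"] = pd.to_numeric(
    df_combined["standard_value"], errors="coerce"
)

# Drop the unparsable rows rather than replacing them with 0
df_combined = df_combined.dropna(subset=["standard_value"]).reset_index(drop=True)

# Cap at 100,000,000, the same limit used in the norm_value traceback
df_combined["standard_value_norm"] = df_combined["standard_value"].clip(upper=100_000_000)
print(df_combined)
```

Converting with float(i) inside the loop would also work for individual values; int(i) fails on text such as "750.0".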
This is so great. I juz started learning data science and ML and these r the projects i am hoping to do. ALso, I reached out to u in instagram. Plz kindly check it out, I have some questions regarding it. Thank u so much.
Running 'CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb' on a Windows machine and following 'df_norm = norm_value(df_combined)', I am getting --------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[49], line 1 ----> 1 df_norm = norm_value(df_combined) 2 df_norm Cell In[47], line 10, in norm_value(input) 7 norm.append(i) 9 input['standard_value_norm'] = norm ---> 10 x = input.drop('standard_value', 1) 12 return x TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given This also occurs with the coronavirus example in this same set. Any help/solution would be appreciated. Thanks
professor you really saved my life. I am biotech and I was desperate about my M.Sc. dissertation since all labs are still closed due to covid. Bless you
Glad the contents are helpful in your Masters thesis 😊
For anyone getting an error when importing rdkit: just install it manually by running !pip install rdkit in a new cell, then run the original cell again to get Chem.
Thank you for this, you saved me
Me too you saved me
Thanks
Bro giving master thesis project in a golden plate, bless you prof
If it were possible to like this video a thousand times. I would. Thank you so much, Data Professor.
Finally part 2! Great stuff. Loved the EDA!
Thanks Ken for the encouragement!
@@DataProfessor Hi Professor. I really appreciate all that you do to teach us in data science and Bioinformatics. I just wish I learned of your channel a year ago.
I was going through the code posted for Part 2 and noticed that when I ran it, the part under "Removing the 'intermediate' bioactivity class" returned the intermediate values as NaN. I got 133 rows instead of 119 rows; the intermediates with NaN values were not removed as the code says. I then replaced the command "df_2class = df_final[df_final.bioactivity_class != "intermediate"]" with "df_2class = df_final[df_final.bioactivity_class.notna()]".
Using the latter removed all the NaN values, which I assumed to be the intermediates, and I then saw 119 rows instead of 133. I also noticed that my scatter plot has fewer active dots, and the pIC50 values for the active and inactive classes are different from your box plot. My p-value is also significantly different from yours: 0.061356. I have no idea what I did wrong. My GitHub is cnewton1428/Bioinformatics-python "Copy_of_CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb". Any help would be very much appreciated!
@@jesussaves5556 In short, the else statement of the classification is commented out with #, so if you delete the # to turn it back into code and rerun, you will get the right output.
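To illustrate the reply above: if the labels are assigned without the else branch, rows that are neither active nor inactive get no label and show up as NaN, so a filter on != "intermediate" has nothing to match. A minimal sketch with the else branch in place follows. The thresholds (1,000 nM and 10,000 nM) and the toy values are assumptions for illustration, not taken from the comments, and classify is a hypothetical helper name.

```python
import pandas as pd

def classify(ic50_nM, active_max=1000.0, inactive_min=10000.0):
    """Label one IC50 value (in nM). Both thresholds are illustrative assumptions."""
    if ic50_nM >= inactive_min:
        return "inactive"
    elif ic50_nM <= active_max:
        return "active"
    else:
        # Without this branch, in-between compounds would receive no label at all
        return "intermediate"

# Toy data (made up)
df_final = pd.DataFrame({"standard_value": [50.0, 5000.0, 20000.0, 900.0]})
df_final["bioactivity_class"] = df_final["standard_value"].apply(classify)

# Every row now has a label, so the original filter works as intended
df_2class = df_final[df_final.bioactivity_class != "intermediate"]
print(df_2class)
```

Filtering with notna() hides the symptom, but if the labels were collected in a shorter list they may also sit on the wrong rows, which could explain plots and p-values that differ from the video.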
Best hands-on bioinformatics YT tutorial, really! Thank you very much! :)
Wow, thanks for the kind comments! Thank you.
Thank you for making such an effort to create content like this. It means a lot! Love from India
Thank you for a wonderful walk-through and great tips on how to use RDKit! Looking forward to similar educational videos in the realm of drug discovery!
Thanks Ebrahim for watching, more drug discovery videos are in the pipeline 😊
The Difficult Concepts Made Easy by DataProfessor... Thank you, Sir.
My pleasure, thank you!
Professor, I love your lecture. Thank you much from a Bangladeshi learner.
Thanks for the encouraging words!
Thank you, Professor, for making this excellent video to start to learn Drug Discovery.
I had a few questions regarding the "Converting IC50 to pIC50" part. Could you point me in a direction for further study?
1. You cap the value of IC50 before converting to pIC50. Will that affect the analysis results? Or, once the value is large enough, can we treat all such values as the same thing?
2. Why would we want to avoid negative values? How does it affect the analysis?
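A small worked example may help with both questions. It assumes that standard_value is an IC50 in nM (an assumption; the unit is not stated in these comments) and uses the 100,000,000 cap that appears in the norm_value tracebacks. pIC50 is the negative base-10 logarithm of the IC50 in molar, so any IC50 above 1 M would give a negative pIC50; with the cap at 0.1 M the lowest possible pIC50 is 1. The function name pic50_from_nM is hypothetical.

```python
import math

def pic50_from_nM(ic50_nM, cap_nM=100_000_000):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in molar), capping large values first."""
    capped = min(ic50_nM, cap_nM)
    molar = capped * 1e-9  # nM -> M
    return -math.log10(molar)

for value in (10, 1_000, 100_000_000, 5_000_000_000):
    print(value, "nM ->", round(pic50_from_nM(value), 2))

# For comparison: the uncapped value of the last entry (5 M) would be negative
uncapped = -math.log10(5_000_000_000 * 1e-9)
print("uncapped:", round(uncapped, 2))
```

Values that reach the cap are already far weaker than any plausible activity threshold, so capping should not move a compound between classes; it mainly keeps the pIC50 scale positive and easier to read.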
Good evening Data Professor, I am currently a student in Machine learning in Cameroon and as part of a project I would like to set up a model capable of predicting the behavior of proteins expressed by cancer cells when they are subjected to certain drugs. But I am a bit lost on the approach to adopt, I would like to please have some advice (What is the dataset to use?, the most appropriate models?, how to manage negative examples? etc)
Thank you professor for your videos ..absolute blessing in resource limited setups where i work
You are very welcome
Absolutely love the video! Apologies for posting on such an old video, but I have one question, I'm getting an error when compiling df_norm around 13:15, saying that it's giving too many arguments for the function but I have not added any arguments to the code. This is the exact error: "TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given" but I'm unsure where the extra argument could have come from
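On where the "extra" argument comes from: Python counts the DataFrame itself as the first positional argument, so the call drop('standard_value', 1) seen in the tracebacks in these comments passes three (the frame, the label and the 1). In pandas 2.0 and later, the axis can no longer be given positionally, which is why code that ran when the video was recorded now fails. The smallest fix is to write input.drop('standard_value', axis=1). The sketch below shows one way norm_value could look; it swaps the original loop for clip(), which is my substitution rather than the notebook's code, and the toy data is made up.

```python
import pandas as pd

def norm_value(input_df, cap=100_000_000):
    """Cap standard_value at `cap` and return the frame without the original column."""
    out = input_df.copy()
    out["standard_value_norm"] = out["standard_value"].clip(upper=cap)
    # Name the axis explicitly: newer pandas rejects drop('standard_value', 1)
    return out.drop(columns="standard_value")

# Toy data (made up)
df_combined = pd.DataFrame({"standard_value": [750.0, 3.0e9]})
df_norm = norm_value(df_combined)
print(df_norm)
```

Working on a copy leaves df_combined unchanged, so the cell can be rerun without side effects.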
Dear Professor,
I have one query regarding the statistical significance shown when checking the LogP values (Mann-Whitney test). If there is no difference between active and inactive compounds, what does that mean?
We are testing whether there is a statistically significant difference between the actives and inactives for a descriptor of interest. If there is a difference, it means that this particular descriptor is crucial for a compound being active or inactive. In a nutshell, we are figuring out which descriptors are important for a compound being active (or inactive).
@@DataProfessor Thank you so much for your time.
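As a concrete illustration of the test discussed in this thread, here is a minimal sketch using scipy.stats.mannwhitneyu on made-up LogP values. The numbers and the 0.05 significance level are assumptions for illustration only. A p-value above the level means the data give no evidence that the descriptor differs between the two classes; it does not prove that the distributions are identical.

```python
from scipy.stats import mannwhitneyu

# Toy LogP values (made up) for the two bioactivity classes
active_logp = [2.1, 2.8, 3.0, 3.3, 3.9, 4.2]
inactive_logp = [2.0, 2.9, 3.1, 3.4, 3.8, 4.3]

alpha = 0.05  # assumed significance level
statistic, p_value = mannwhitneyu(active_logp, inactive_logp, alternative="two-sided")

if p_value > alpha:
    verdict = "Same distribution (fail to reject H0)"
else:
    verdict = "Different distribution (reject H0)"

print(f"U = {statistic}, p = {p_value:.3f}: {verdict}")
```

With interleaved values like these the test fails to reject H0, which is the situation the question describes.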
From my experience, if you don't apply the log transformation, the scatter plot looks very skewed, which makes it hard to see the pattern! So we apply the log transformation so that the scatter plot is interpretable!
Thanks for sharing your answer 😃 Right, the distribution becomes more even (less skewed) after log transformation.
@@DataProfessor 😬👍
Thank you, professor! This is exactly what I need for now!
Great video, thanks for publishing this quality content! A question - I was curious as to whether this project would be considered chemoinformatics more than bioinformatics due to the focus on molecules and chemical descriptors?
Yes, you are absolutely correct. This is indeed cheminformatics.
Quick and dirty Python code showing that log-transformation reduces the skew of the IC50 distribution (assumes df_final from the notebook, with "standard_value" and "pIC50" columns):
***
import matplotlib.pyplot as plt

# Side-by-side histograms: raw IC50 values vs. log-transformed pIC50
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
x = [df_final["standard_value"], df_final["pIC50"]]
titles = ["Before_norm", "After_norm"]
for i, ax in enumerate(axes):
    ax.hist(x[i], color="red", bins=10)
    ax.set_title(titles[i])
plt.show()
***
Thank you for this amazing tutorial. I am getting an error at removing the intermediate bioactivity class as:
AttributeError: 'DataFrame' object has no attribute 'bioactivity_class'
Could you please tell me how to solve this error?
I am also facing the same problem @Data Professor may you kindly assist. Thank you.
Very helpful video! One question. Where are the intermediate classes coming from? I do not see that in my data and I'm getting errors in my codes.
Thanks very much for your efforts.
Would you consider doing omics analysis tutorials with heavy focus on graphical visualization of omics data. As you know, visualization is very key in omics bioinformatics. Thanks.
Great suggestion! There are a lot of great visualizations that one can do for analyzing such datasets. I'll definitely do some data viz videos for some of the bioinformatics datasets that are provided in this Bioinformatics tutorial series.
Hello sir, at 9:49, where you used codeocean's code to calculate the descriptors: while working on an EGFR dataset, I am getting this error:
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors._CalcMolWt(NoneType)
did not match C++ signature:
_CalcMolWt(RDKit::ROMol mol, bool onlyHeavy=False)
Can you please tell me a way, how to resolve it?
Hi, this seems to be similar to your error
github.com/rdkit/conda-rdkit/issues/87
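Some background on the _CalcMolWt(NoneType) error quoted in this thread: Chem.MolFromSmiles returns None, rather than raising an error, when it cannot parse a SMILES string, and the descriptor calculation then fails on that None. A minimal sketch of screening the SMILES first is shown here. The helper split_parsable and the stand-in parser toy_parse are hypothetical and exist only so the sketch runs without RDKit; the SMILES strings are made up.

```python
def split_parsable(smiles_list, parse):
    """Split SMILES into (valid, rejected) using `parse`, which must return None on failure."""
    valid, rejected = [], []
    for smiles in smiles_list:
        # Non-string entries (for example NaN) are rejected without being parsed
        mol = parse(smiles) if isinstance(smiles, str) else None
        if mol is None:
            rejected.append(smiles)
        else:
            valid.append(smiles)
    return valid, rejected

# Stand-in parser for demonstration only: it merely checks that brackets balance
def toy_parse(smiles):
    return smiles if smiles.count("(") == smiles.count(")") else None

valid, rejected = split_parsable(["CCO", "CC(=O", "c1ccccc1"], toy_parse)
print("kept:", valid, "rejected:", rejected)
```

With RDKit installed, passing Chem.MolFromSmiles as the parse argument (after from rdkit import Chem) applies the real parser, and the rejected list shows which rows to inspect or drop.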
I'm just a newbie to bioinformatics. I watch your videos everyday before going to bed though I don't understand much of your coding. 😊😊😊. Hope that next year, I can finish my own bioinformatics project. Thank you so much for giving me this much motivation to follow my dream.
Awesome, thank you! I'll also make more introductory-level content. I have just given an introductory lecture on computational drug discovery and will be uploading it to the channel soon.
Great video, working my way through the series. I was just wondering, how come 'standard_value' entries greater than 1 x 10^8 are not just discarded?
Great question, although they are extreme outliers, they also can serve as data samples that we can use as compounds belonging to the inactive class.
Really thankful to you for sharing very helpful and informative content. Can you also make some tutorials on deep learning methods like GANs on bio sequential data?
One more great video on data science.
I am beginner and learning a lot from your RUclips channel, it's really helping me a lot. Thank you so much.
I have a kind request to you, could you please provide me some research papers on artificial intelligence in inventory and supply chain management with python code.
Thank you again....
Thanks for watching, and glad you're finding the content on this channel helpful. I would recommend checking out paperswithcode.com/ and, for supply chain papers with code, try paperswithcode.com/search?q_meta=&q=supply+chain
Hope this helps.
@@DataProfessor Thank you so much.
Ohhhh wowwwww lovely, great work keep it up...
Thanks!
@@DataProfessor Pleasure is all mine..
Can we get dopamine & serotonin levels using any electrodes?
If possible can you please let me know the electrode which can be used for this please.
very nice ,thanks .
Thank you for watching 😃
Such great videos... I am following this series sincerely. I have a question: is there any way to directly load the CSV file from the first Colab file into the current Colab file, instead of downloading it from the first file and then uploading it to the other one?
Thank you so much for giving so much knowledge to those in need, totally free of charge.
Thanks for watching! That's a great question. Correct me if I'm wrong, but I don't think it is possible to read a file directly from the first Colab file. It is, however, possible to save files generated in one Colab notebook directly into your Google Drive, from which you can access them in your subsequent Colab notebooks.
@@DataProfessor I am not sure, but I think that converting the file from .ipynb to .py and then running it in the current Colab file would generate the saved file, which we could then use. I don't know whether it will work.
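The Google Drive route mentioned in the reply could look roughly like this. It is a sketch under assumptions: the helper save_for_next_notebook, the file name bioactivity_data.csv and the toy data are all made up, and a temporary directory stands in for the Drive folder so that the code also runs outside Colab.

```python
import os
import tempfile

import pandas as pd

def save_for_next_notebook(df, folder, filename="bioactivity_data.csv"):
    """Write df as CSV into folder (created if needed) and return the full path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path

# In Colab, `folder` would point inside the mounted Drive after running:
#   from google.colab import drive
#   drive.mount('/content/drive')
# A temporary directory stands in for it here so the sketch runs anywhere.
folder = os.path.join(tempfile.mkdtemp(), "data")

# Toy data (made up)
df = pd.DataFrame({"molecule_chembl_id": ["A", "B"], "standard_value": [750.0, 100000.0]})
path = save_for_next_notebook(df, folder)

# The second notebook mounts Drive the same way and reads the file back
df_again = pd.read_csv(path)
print(df_again)
```

Because the file lives in Drive rather than in the notebook's temporary storage, it survives the runtime reset between sessions.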
Thank you! Could you please add more content regarding "R programming to solving biological problems"? This would be a great help.
Thanks for the suggestion, I am planning on releasing more R content. I have just filmed the first episode, stay tuned for that.
@@DataProfessor I must say this channel is great. The teaching method and contents are amazing.
@@GardeningWorld857 Appreciate the kind words, 😊
thanks so much sir for this great tutorial
You are most welcome
What to do if the numbers dont match exactly with the ones shown in the video
Thank you very much for this helpful series
Thanks Hassan for the kind words and glad you’ve found the series helpful😃
Thanks, it is very helpful to me.
Glad it was helpful!
Hi, can you please tell me where I can learn the basics needed to understand your videos better? I'm a newbie to this field and I'm looking forward to learning it. Please suggest books or articles that would help me understand your project better.
That's a great question! To better answer your question, I am writing a Medium article on Bioinformatics 101 that is due to be released soon (my Medium profile is medium.com/@chanin.nantasenamat). In the meantime, you might want to check out the 2-part introductory video on Bioinformatics 101 that I made; there's also a 2-hour lecture that provides the basics of Computational Drug Discovery. They are all in the playlist bit.ly/dataprofessor-bioinformatics
Dear Professor, please make a video on how to download whole SNP data, preprocess it and carry out further analysis.
I used a different data set. I searched compounds associated with estrogen receptors. and I picked human estrogen receptor alpha...
I keep getting an error from one line in your lipinski function.
mol = Chem.MolFromSmiles(elem)
TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
I think there is an issue with the smiles data...but I am not sure how to fix it since there are about 3000 items...
Hi, could you try running in increments. You can subset the data such as df[0:100], df[100:200], etc.
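One way to act on the suggestion above without slicing by hand is to scan the column for entries that are not strings. The TypeError says a float was passed where a SMILES string was expected, and a missing value (NaN) is a float. This is only a minimal sketch, assuming the SMILES values are available as a plain Python list; the helper name find_non_string_entries and the sample data are hypothetical and not part of the tutorial notebook.

```python
def find_non_string_entries(smiles_values):
    """Return (position, value) pairs for entries that are not strings."""
    return [(i, v) for i, v in enumerate(smiles_values) if not isinstance(v, str)]


# Hypothetical sample: two valid SMILES strings and one missing value (NaN is a float)
sample = ["CCO", float("nan"), "c1ccccc1"]
bad = find_non_string_entries(sample)
print(bad)
```

With a DataFrame, the list could come from list(df.canonical_smiles), and the reported positions show which rows to inspect or drop.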
@@DataProfessor Thank you. That definitely helped. I was able to eliminate 30 rows that were a problem. I am really sorry to keep asking but after that, I developed another issue.
I think something is not in the right format or not the right type.
this is the error code:
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors._CalcMolWt(NoneType)
did not match C++ signature:
_CalcMolWt(RDKit::ROMol mol, bool onlyHeavy=False)
@@DataProfessor I figured out the issue! So no worries. Thanks again for this awesome tutorial. I am going to submit my project for an interview. hopefully, I get the job.
@@hyrunnisa997 Awesome, good luck with the interview 😊
I would like to learn NGS data analysis. Can you do some tutorials on that?
Hi, thanks for the suggestion, I am afraid that would be outside my expertise, sorry. However, I may show some basic sequence analysis using the BioPython library in the future.
@@DataProfessor okay sir ..thanks for your reply
@@DataProfessor I am waiting for the Biopython video.
Professor, if one decided to use a different standard value such as potency, how would that change the application of this project?
I have two questions about this process. First, if no 'active' compounds exist for a given protein, can these statistical tests be performed between the intermediate and inactive compounds instead? Secondly, if we 'fail to reject H0' on all of the Mann-Whitney tests, should we discard the data and choose a different protein instead of progressing to the next steps? Thanks.
Thank you Professor for this great lecture. I have a question about the interpretation of Lipinski's descriptors. After doing it by myself, I found that none of the descriptors show a significant difference. What does that mean? Thank you in advance, prof.
Hi, that's a great question! Lipinski's descriptors usually don't contribute to highly predictive models of biological activity. For that, there are several other descriptor types, such as the PubChem fingerprint (computed via the PaDEL-Descriptor software). The Lipinski descriptors are used to evaluate the drug-likeness of compounds. The 4 descriptors provide a rough idea as to whether compounds are "drug-like" or not.
@@DataProfessor thank you so much for your explanation prof. Have a great day.
Me too, none of the Lipinski descriptors show a significant difference between the active and inactive classes, and neither does the pIC50.
Please help I am getting an error
TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
I am having the same problem. Were you able to fix it?
@@fernandoarmandomartinezurr3921 yes
The problem was with the SMILES notation
@@meenavinaykumar452 how did you fix it?
Quick question: when you removed the intermediate class, how come the output showed 119 rows by 8 columns, but the data remained 133 by 8?
Right, once we get 119 rows after removing the intermediate class, we should also assign that result to a variable that we can use in subsequent steps.
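The point about assigning the filtered result to a variable can be sketched as follows. This is an illustrative example with made-up data, not the notebook's actual cell; the column name bioactivity_class and the variable names df_final and df_2class follow the names used in this thread, and keeping only the two wanted labels (rather than excluding "intermediate") is a choice made here so that rows with a missing class are dropped as well.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's dataframe
df_final = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "bioactivity_class": ["active", "intermediate", "inactive", None],
})

# Filtering returns a new dataframe; df_final itself is left unchanged,
# so the result has to be assigned and used in the later steps.
mask = df_final["bioactivity_class"].isin(["active", "inactive"])
df_2class = df_final[mask].reset_index(drop=True)

print(df_final.shape, df_2class.shape)
```

Any later plot or test that still refers to df_final will keep seeing the unfiltered rows, which would explain a count that appears not to change.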
Thank you so much for this video!!
"min() arg is an empty sequence"
I am getting this error during the plots. What should I do?
Can you please answer my question? In the step that generates the box plot for pIC50, there wasn't a threshold value between the active and inactive classes, so when I then applied the Mann-Whitney U test, it showed "same distribution" for all the Lipinski descriptors. Did I miss something, or is it normal to get this result?
Thank You
✨
There were 2 threshold values used to define the active and inactive classes, most likely less than 1 µM and greater than 100 µM, respectively.
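The two-threshold rule described above can be sketched as a small function. This is a hedged illustration only: the function name classify_bioactivity is hypothetical, the default cut-offs simply restate the "most likely" values from the reply (1 µM = 1,000 nM and 100 µM = 100,000 nM) and should be checked against the notebook, and treating the boundaries as inclusive is an assumption.

```python
def classify_bioactivity(standard_value_nm, active_max_nm=1000.0, inactive_min_nm=100000.0):
    """Label a compound from its standard value in nM using two cut-offs."""
    value = float(standard_value_nm)
    if value <= active_max_nm:
        return "active"
    if value >= inactive_min_nm:
        return "inactive"
    # Everything between the two cut-offs gets an explicit label,
    # so no compound is left with a missing class.
    return "intermediate"


for v in (500, 50000, 200000):
    print(v, classify_bioactivity(v))
```

Returning an explicit "intermediate" label in the final branch matters: if that branch is missing, those compounds end up without a class and a later filter on the label will not remove them.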
@@DataProfessor can I send you my results? I noticed that the data set has been changed from the one you're using in the video.
@Data Professor Please where can I get access to the csv file (The one in the df variable) ?
I need help: one line of your code is not working. It says that no registered C++ converter could handle the SMILES value. Only this line fails, and it is an important single line of code. Please reply.
Can you please explain why you wrote the three lines of code before '! conda install -c rdkit rdkit -y'? Also, I am using a Mac, so what lines should I use since mine is not Linux? My code shows that the 'rdkit' module is not found when I try to import it. Please help!
Hi, those 3 lines are for installing conda on Linux (Ubuntu) in Google Colab, since the notebook resets on every logout. If you already have conda installed, you can ignore those 3 lines. If you have already installed rdkit on your Mac, then you can delete that line as well.
Can I normalize custom columns using MinMaxScaler( ) or StandardScaler( ) instead of calculating pIC50?
The problem with my data is that all reported compounds are active. What to do?
No worries, if all reported compounds are active that is great news for you, you have a set of potent compounds. In this case, I would recommend to build regression models.
@@DataProfessor So skip Mann Whitney and do a regression analysis? Is that what you mean?
While importing rdkit, I am getting the error /usr/local/lib/python3.7/site-packages/rdkit/__init_.py import error undefined symbol: py__open. Any idea where the path of rdkit is defined?
I am using Colab and have already installed conda and rdkit as per the instructions. I also installed miniconda3 in my home directory (but I am not sure how to change the path so that the program looks for rdkit in my home directory instead of python3.7/site-packages).
Any idea how to calculate pKa and LogD values?
thank you
When I run the Lipinski descriptors function on the dataset (df_lipinski), I'm getting this error: TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
Can someone help me solve this? I can't understand what the problem is.
Apart from bioinformatics, what are the other branches of science where data science is most relevant? Can you tell me?
Thanks Mandar for the question. I would say that data science can be used in all branches of science, healthcare, engineering and business. The reason for saying this is that all of these generate data in one form or another (text, numbers, images, videos, audio, etc.). As patterns are inherently present in these big data, it follows that data science has great potential for making sense of them. Hope this helps.
Is the standard value for each compound already in nM?
Why is it that for some compounds the units are written as uM, and for others as nM?
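If the units really do differ between rows, one cautious approach is to convert everything to nM before computing pIC50. The sketch below is an illustration under assumptions: the helper names to_nanomolar and pic50_from_nm and the unit table are hypothetical, and it presumes the unit of each value is known (for example from a units column in the exported data). It uses the standard definition pIC50 = -log10(IC50 in mol/L).

```python
import math

# Conversion factors from common molar units to nanomolar
UNIT_TO_NM = {"nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def to_nanomolar(value, unit):
    """Convert a concentration to nM; raises KeyError for an unknown unit."""
    return float(value) * UNIT_TO_NM[unit]


def pic50_from_nm(value_nm):
    """pIC50 = -log10(IC50 in mol/L), with the IC50 given in nM."""
    return -math.log10(value_nm * 1e-9)


value_nm = to_nanomolar(1, "uM")
print(value_nm, round(pic50_from_nm(value_nm), 3))
```

Failing loudly on an unknown unit is deliberate here, so that rows with unexpected units are noticed rather than silently mixed in.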
Hi Professor, I've tried changing my target protein, but in Part 2 I'm getting an error and I'm stuck there. It would be helpful if you could kindly help me with the problem. In the cell df_lipinski = lipinski(df_clean_smiles.canonical_smiles) I'm getting the following error: ArgumentError Traceback (most recent call last)
in ()
----> 1 df_lipinski = lipinski(df_clean_smiles.canonical_smiles)
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors._CalcMolWt(NoneType)
did not match C++ signature:
_CalcMolWt(RDKit::ROMol mol, bool onlyHeavy=False)
I had the same error. In my case, the problem was NaN values in the canonical_smiles column. I solved it by removing the rows with missing values in that column, with something like:
df3 = df2[df2.canonical_smiles.notna()]
df3
I hope it helps you.
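The fix described above can be tried on a small made-up table first. This is a minimal sketch with hypothetical data; the column name canonical_smiles and the variable names df2 and df3 follow the comment above, and the reset_index call is an addition so that row positions stay contiguous after filtering.

```python
import pandas as pd

# Hypothetical data: the second row has no SMILES string
df2 = pd.DataFrame({
    "molecule_chembl_id": ["X1", "X2", "X3"],
    "canonical_smiles": ["CCO", None, "c1ccccc1"],
})

# Keep only rows where a SMILES string is present
df3 = df2[df2["canonical_smiles"].notna()].reset_index(drop=True)

print(len(df2), len(df3))
```

Note that this removes missing values only. A string that is present but is not valid SMILES still makes Chem.MolFromSmiles return None, and passing None on to a descriptor function is what produces the NoneType ArgumentError, so those rows need to be filtered out separately.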
Dear sir, how do I install and switch to other data sources or database libraries? Thank you.
For example, $ ! pip install PubChemPy
$ import pandas as pd
$ from pubchempy import *
Is this correct?
This is literally awesome and a very very useful tutorial. Could you please make some tutorials on various packages such as protr, PyBiomed, propy, iFeature? Thanks would be helpful.
Noted, thanks for the kind suggestions.
AttributeError: 'DataFrame' object has no attribute 'bioactivity_class'. I have this problem and I do not know how to fix it.
Hi professor, I'm trying to do a project on this topic. Everything was going well until I stumbled on
df_norm = norm_value(df_combined)
df_norm
This code does not run. Please help me fix this problem. Thank you.
Please check the df_combined variable to see if the dataframe is as expected. Also go through the norm_value function line by line to check whether it is generating the expected data transformation.
Professor, how do we download the final data frame to be used in Part3?
How do I use this code on the Windows 10 operating system? Kindly let me know.
Hi, this code runs in a Python environment, which can be used on any OS (Windows, Linux, macOS). You can install Python via Anaconda, Miniconda or directly from python.org. On top of Python, you can code in a Jupyter notebook (by first installing the jupyter library) or in an IDE (Spyder, PyCharm). Alternatively, you can work in the cloud using Google Colab (a Jupyter notebook in the cloud).
Amazing series on Drug discovery using ML. I have a question on installing rdkit in my virtual environment. Despite successful installation, import rdkit throws an error. Could you please, help me with it?
Did you use conda to install rdkit?
@@DataProfessor Yes, I did. I'm trying to install it on Ubuntu 18.04. I have set the python version of my virtual environment as 2.7.18 and then tried installing using conda.
@@sruthisampath3432 I think the problem lies in Python 2.7. Can you try installing 3.7 and then installing rdkit using conda?
@@DataProfessor I did try installing it with 3.7 in Ubuntu. Didn't work. But the same works in Windows😄
@@sruthisampath3432 Which command did you use to install rdkit? Was it: conda install -c conda-forge rdkit
Why use the U test here instead of another test?
Can you make a video on data science for genomics.
Thanks for the suggestion, will certainly consider that for a future video.
Dear professor, do you have any literature that corresponds to this Drug Discovery video? Thanks
This article from our research group would be relevant peerj.com/articles/2322/
@@DataProfessor Thank you very much!
Error in Calculate descriptors step:
TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
How to solve?
@dataprofessor I also experienced this error. Please, how can it be resolved?
Thank you very much. When will you release the part 3?
Filming of Part 3 is finished and it is currently being edited. Should be out soon hopefully by tomorrow.
Hi, Can you please tell me where I can find a research paper with code? Thanks
Here it is paperswithcode.com 😃
@@DataProfessor That's quick. Thanks for adding useful videos. Can you please make videos of analyzing from PubMed gene data
Thanks for the suggestion, I’ll see what I can do 😃
I am getting traceback for the code after defining Lipinski function. What to do? Someone please help.
Are you running from the Google Colab? Try running again in sequential order, if the error is still there, can you paste the entire error message here?
@@DataProfessor I tried running it again in sequential order but still getting traceback.
TypeError Traceback (most recent call last)
in ()
----> 1 df_lipinski = lipinski(df.canonical_smiles)
in lipinski(smiles, verbose)
5 moldata= []
6 for elem in smiles:
----> 7 mol=Chem.MolFromSmiles(elem)
8 moldata.append(mol)
9
TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
@@maithilipokle4437 Was rdkit installed? Was this run on Colab or in a local Jupyter notebook?
@@DataProfessor yes I did install rdkit and conda (the first set of code). I ran it on Colab.
@@maithilipokle4437 Hi, there are 2 parts to this tutorial. Here’s what you need to do:
1. Run Part 1 and save the generated bioactivity data CSV file to your local computer.
2. Run Part 2 and make sure to upload the CSV file from Part 1 to it. Do the upload before running the cell that reads the bioactivity data file.
Hopefully this should work.
Sir, please give an explanation of the code from the link.
Hello! Thank you for this amazing tutorial! I just have a small problem that I would like some help with. Whenever I run the norm_value() function I keep getting the error below. Does anyone know how I can solve this?
Cell In[53], line 1
----> 1 df_norm = norm_value(df_combined)
2 df_norm
Cell In[49], line 10, in norm_value(input)
7 norm.append(i)
9 input['standard_value_norm'] = norm
---> 10 x = input.drop('standard_value', 1)
12 return x
TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given
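The positional form of drop used here (passing the axis as a second positional argument) was deprecated and no longer works in recent pandas versions. Use this instead: x = input.drop(columns='standard_value')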
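For anyone who wants the whole function rather than the single line, the traceback above shows enough of norm_value to reconstruct it. This is a sketch reconstructed from those visible lines, not the author's exact cell: the parameter is renamed from input to df to avoid shadowing the Python built-in, the cap of 100000000 comes from the traceback shown elsewhere in this thread, and working on a copy of the dataframe is an addition. The sample data is made up.

```python
import pandas as pd


def norm_value(df, cap=100000000):
    """Cap standard_value at `cap` and return it as standard_value_norm."""
    norm = []
    for value in df["standard_value"]:
        if value > cap:
            value = cap
        norm.append(value)

    out = df.copy()  # leave the caller's dataframe untouched
    out["standard_value_norm"] = norm
    # Keyword form: works on both older and newer pandas versions
    return out.drop(columns="standard_value")


# Hypothetical two-row example
df_combined = pd.DataFrame({"standard_value": [50.0, 300000000.0]})
df_norm = norm_value(df_combined)
print(df_norm)
```

An equivalent keyword form is out.drop("standard_value", axis=1); only the bare positional 1 is rejected.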
@@psychotropicalfunk Can you please write the code for that? I'm a little confused.
Dear Data Professor,
Thank you for your great lecture.
I need your help.
Once I run the following input:
df_norm = norm_value(df_combined)
df_norm
The output Is the following:
TypeError Traceback (most recent call last)
in ()
----> 1 df_norm = norm_value(df_combined)
2 df_norm
in norm_value(input)
3
4 for i in input['standard_value']:
----> 5 if i > 100000000:
6 i = 100000000
7 norm.append(i)
TypeError: '>' not supported between instances of 'str' and 'int'
Thank you in advance for your help
Python raises this error because a value of string type is being compared with an integer using the > operator. What you need to do is first convert the value to a number, then try again. Hint: int(i), or float(i) if the values contain decimals.
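Instead of converting inside the loop, the whole column can be converted once before calling the function. A minimal sketch with made-up data, assuming the column is named standard_value as in the traceback above; dropping the rows that fail to parse is a choice made for this example.

```python
import pandas as pd

# Hypothetical column that was read in as text
df_combined = pd.DataFrame({"standard_value": ["750.0", "1200", "not reported"]})

# errors="coerce" turns anything that cannot be parsed into NaN
df_combined["standard_value"] = pd.to_numeric(df_combined["standard_value"], errors="coerce")

# Drop the rows that could not be parsed
df_combined = df_combined.dropna(subset=["standard_value"]).reset_index(drop=True)

print(df_combined["standard_value"].tolist())
```

A plain int("750.0") raises a ValueError, so a float conversion is the safer default for this kind of data. Rows are dropped here rather than filled with 0 because a value of 0 would be read as an extremely potent compound in later steps.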
@@DataProfessor thank you very much!! I solved that problem typing the following input:
df_combined.standard_value.describe()
pd.to_numeric(df_combined.standard_value, errors='coerce').fillna(0, downcast='infer')
Now when I run:
df_norm.standard_value_norm.describe()
the output is:
count 1.320000e+02
mean 2.126145e+07
std 4.101124e+07
min 5.000000e+01
25% 1.070000e+04
50% 2.415000e+04
75% 3.000000e+05
max 1.000000e+08
Name: standard_value_norm, dtype: float64
It is different from your lecture because I used another dataset, but now it is working well!! Many thanks.
May I suggest some new lectures in chemoinformatics, in particular in protein-ligand interaction or protein structure prediction. This topic is very hot!!
Have a nice weekend and thank you again!!
@@gabrieleblanca4702 Glad it's working out for you. Thanks for the suggestion. Protein-ligand interaction by means of molecular docking, I presume? I will try to package the topic so that it is of wide interest and relevant to data science. Will definitely consider this for future videos.
@@DataProfessor absolutely yes! Molecular docking should be of great relevance! I can't wait to watch it!!
Thank you again for your availability. Have a nice day💪
This is so great. I just started learning data science and ML, and these are the projects I am hoping to do. Also, I reached out to you on Instagram. Please kindly check it out, I have some questions regarding it. Thank you so much.
Thanks for the kind comment Kelvin. Okay I will check the instagram and get back.
Do you have a video explaining rdkit for conformational search and energy minimization? It would be very useful for me!
Not yet, I'll consider this for future videos.
Running 'CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb' on a Windows machine, and after running df_norm = norm_value(df_combined) I am getting:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[49], line 1
----> 1 df_norm = norm_value(df_combined)
2 df_norm
Cell In[47], line 10, in norm_value(input)
7 norm.append(i)
9 input['standard_value_norm'] = norm
---> 10 x = input.drop('standard_value', 1)
12 return x
TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given
This also occurs with the coronavirus example in the same set. Any help or solution would be appreciated.
Thanks
19:58 time stamp what is happening with the active and inactive classes? 🥲