Bioinformatics Project from Scratch - Drug Discovery Part 2 (Exploratory Data Analysis)

  • Published: 6 Feb 2025

Comments • 195

  • @DataProfessor
    @DataProfessor  3 years ago +6

    👉Watch this video next (How to learn data science in 2021) ruclips.net/video/oR670Txwh88/видео.html
    Support this Channel 👇👇👇
    🌟 Buy me a coffee www.buymeacoffee.com/dataprofessor
    🌟 Download Kite for FREE www.kite.com/get-kite/?
    👉 Subscribe to this YouTube channel ruclips.net/user/dataprofessor
    👉 Join the Newsletter of Data Professor newsletter.dataprofessor.org
    👉 Blogs on Medium medium.dataprofessor.org/

  • @misganamengistu5503
    @misganamengistu5503 4 years ago +13

    Professor, you really saved my life. I am in biotech and I was desperate about my M.Sc. dissertation since all labs are still closed due to COVID. Bless you.

    • @DataProfessor
      @DataProfessor  4 years ago +3

      Glad the content is helpful for your Master's thesis 😊

  • @danielgiraldo3692
    @danielgiraldo3692 11 months ago +7

    For anyone getting an error when importing rdkit: install it manually in a new cell by running !pip install rdkit, then run the original cell again to get Chem.
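
The workaround in the comment above can be sketched as a small helper. This is only a sketch, not the notebook's own code: the name ensure_installed is hypothetical, and it assumes pip is available for the running interpreter (as it normally is in Colab).

```python
import importlib.util
import subprocess
import sys


def ensure_installed(module_name, pip_name=None):
    """Install a package with pip only if its module cannot be found.

    Hypothetical helper (not part of the tutorial notebook). Returns True
    when the module can be located after the call.
    """
    if importlib.util.find_spec(module_name) is None:
        # Equivalent to running `!pip install <name>` in a notebook cell.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


# In the notebook one would call this before the original import cell:
# ensure_installed("rdkit")
# from rdkit import Chem
```

Calling the helper for a module that is already present does nothing, so it is safe to leave at the top of a notebook that gets re-run.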

  • @tiamat1628
    @tiamat1628 1 year ago +1

    Bro is handing out master's thesis projects on a golden plate, bless you prof

  • @abdulmujeebonawole3608
    @abdulmujeebonawole3608 2 years ago +2

    If it were possible to like this video a thousand times, I would. Thank you so much, Data Professor.

  • @KenJee_ds
    @KenJee_ds 4 years ago +5

    Finally part 2! Great stuff. Loved the EDA!

    • @DataProfessor
      @DataProfessor  4 years ago

      Thanks Ken for the encouragement!

    • @jesussaves5556
      @jesussaves5556 4 years ago

      @DataProfessor Hi Professor. I really appreciate all that you do to teach us data science and bioinformatics. I just wish I had learned of your channel a year ago.
      I was going through the code posted for Part 2 and noticed a problem when I ran it. The part under "Removing the 'intermediate' bioactivity class" left the intermediate values as NaN, and I got 133 rows instead of 119. These intermediates with NaN values were not removed as the code says. I then replaced the original command
      df_2class = df_final[df_final.bioactivity_class != "intermediate"]
      with
      df_2class = df_final[df_final.bioactivity_class.notna()]
      which removed all the NaN values that I assumed to be intermediates. That is also when I saw 119 rows instead of 133. I also noticed that my scatter plot has fewer active dots, and the pIC50 values for the active and inactive classes differ from your box plot. My p-value is also significantly different from yours: 0.061356. I have no idea what I did wrong. My GitHub is cnewton1428/Bioinformatics-python "Copy_of_CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb". Any help would be very much appreciated!

    • @eyupbilgi3191
      @eyupbilgi3191 3 years ago

      @jesussaves5556 In the concise version of the notebook, the else statement of the classification is commented out with #. If you delete the # so that it becomes code again and rerun, you will get the right output.
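
A minimal sketch of why the missing else branch matters. The function names are hypothetical and the two cutoffs (1000 nM and 10000 nM) are illustrative assumptions; substitute whatever thresholds your copy of the notebook uses.

```python
def classify_ic50(ic50_nm, active_max=1000.0, inactive_min=10000.0):
    """Label one IC50 value (in nM) as active, inactive or intermediate."""
    if ic50_nm >= inactive_min:
        return "inactive"
    elif ic50_nm <= active_max:
        return "active"
    else:
        # This is the branch that must not be commented out.
        return "intermediate"


def labels_without_else(values, active_max=1000.0, inactive_min=10000.0):
    """Reproduce the bug: intermediate values get no label at all."""
    labels = []
    for v in values:
        if v >= inactive_min:
            labels.append("inactive")
        elif v <= active_max:
            labels.append("active")
        # else branch missing, so nothing is appended for this value
    return labels


ic50_values = [500.0, 5000.0, 20000.0]
good = [classify_ic50(v) for v in ic50_values]
bad = labels_without_else(ic50_values)
print(good)  # one label per value
print(bad)   # shorter than the input
```

Without the else branch the label list is shorter than the data, so when it is attached to the DataFrame the trailing rows are likely to show NaN and the remaining labels may no longer line up with their rows. That would plausibly explain plots and p-values that differ from the video.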

  • @krzheph7373
    @krzheph7373 4 years ago +1

    Best hands on bioinformatics YT tutorial - Jing jing! ขอบคุณมาก! :)

    • @DataProfessor
      @DataProfessor  4 years ago

      Wow, thanks for the kind comments! ขอบคุณครับ

  • @sakshichaudhary9869
    @sakshichaudhary9869 1 year ago +1

    Thank you for making such an effort to create content like this. It means a lot! Love from India

  • @acidic_ph
    @acidic_ph 4 years ago +2

    Thank you for a wonderful walk-through and great tips on how to use RDKit! Looking forward to similar educational videos in the realm of drug discovery!

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Thanks Ebrahim for watching, more drug discovery videos are in the pipeline 😊

  • @aashishkatyal
    @aashishkatyal 3 years ago +1

    The Difficult Concepts Made Easy by DataProfessor... Thank you, Sir.

  • @jubayerhossain8812
    @jubayerhossain8812 4 years ago +1

    Professor, I love your lecture. Thank you very much from a Bangladeshi learner.

  • @leowu9906
    @leowu9906 3 years ago +2

    Thank you, Professor, for making this excellent video for getting started with drug discovery.
    I have a few questions regarding the "Converting IC50 to pIC50" part. Could you point me in a direction to study further?
    1. You cap the value of IC50 before converting to pIC50. Will that affect the results of the analysis? Or, once the value is large enough, can we treat them all as the same thing?
    2. Why would we want to avoid negative values? How does it affect the analysis?
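
To make the two questions above concrete, here is a small sketch of the conversion, pIC50 = -log10(IC50 in mol/L), with and without a cap. The function name is hypothetical, and the cap of 1e8 nM is an assumption taken from the 1 x 10^8 figure mentioned elsewhere in this thread.

```python
import math


def pic50_from_nm(ic50_nm, cap_nm=1e8):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L).

    cap_nm is the assumed ceiling applied before the log; pass None to skip it.
    """
    if cap_nm is not None:
        ic50_nm = min(ic50_nm, cap_nm)
    molar = ic50_nm * 1e-9  # nM -> mol/L
    return -math.log10(molar)


for value in (1000.0, 1e8, 1e10):
    capped = pic50_from_nm(value)
    uncapped = pic50_from_nm(value, cap_nm=None)
    print(value, round(capped, 3), round(uncapped, 3))
```

With the cap, every value at or above 1e8 nM collapses to the same pIC50 of about 1, so those compounds become indistinguishable from each other but still sit clearly at the inactive end. Without it, anything above 1e9 nM (that is, above 1 mol/L) produces a negative pIC50.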

  • @petitmodel111
    @petitmodel111 2 years ago +2

    Good evening Data Professor, I am currently a student in Machine learning in Cameroon and as part of a project I would like to set up a model capable of predicting the behavior of proteins expressed by cancer cells when they are subjected to certain drugs. But I am a bit lost on the approach to adopt, I would like to please have some advice (What is the dataset to use?, the most appropriate models?, how to manage negative examples? etc)

  • @dr.surajitdebnath38
    @dr.surajitdebnath38 4 years ago +1

    Thank you, Professor, for your videos... an absolute blessing in the resource-limited setups where I work

  • @nojerama788
    @nojerama788 16 days ago

    Absolutely love the video! Apologies for posting on such an old video, but I have one question. I'm getting an error when computing df_norm around 13:15, saying that too many arguments are being passed to the function, although I have not added any arguments to the code. This is the exact error: "TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given". I'm unsure where the extra argument could have come from.
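
A likely cause, sketched under the assumption that the notebook's norm_value function calls drop with the axis passed positionally (for example df.drop('standard_value', 1)): recent pandas releases accept only the labels positionally, so the axis has to be given by keyword. The small DataFrame and its IDs below are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the tutorial's DataFrame.
df = pd.DataFrame({
    "molecule_chembl_id": ["A", "B"],
    "standard_value": [750.0, 3.0e8],
})

# Older style, rejected by pandas 2.x with the TypeError quoted above:
# df.drop("standard_value", 1)

# Keyword forms, accepted by both older and newer pandas:
dropped = df.drop(columns="standard_value")
same = df.drop("standard_value", axis=1)
print(list(dropped.columns))
```

The "third" positional argument in the error message is the DataFrame itself, which Python counts along with the label and the axis.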

  • @meenavinaykumar452
    @meenavinaykumar452 2 years ago +2

    Dear Professor,
    I have one query regarding the statistical significance found when checking the LogP values (Mann-Whitney test). If there is no difference between active and inactive compounds, what does that mean?

    • @DataProfessor
      @DataProfessor  2 years ago +1

      We are testing whether there is a statistically significant difference between the actives and inactives for a descriptor of interest. If there is a difference, it means that this particular descriptor is crucial for a compound being active or inactive. In a nutshell, we are figuring out which descriptors are important for a compound being active (or inactive).

    • @meenavinaykumar452
      @meenavinaykumar452 2 years ago +1

      @DataProfessor Thank you so much for your time.
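
The test discussed in this exchange can be sketched in plain Python. This is a simplified illustration, not the notebook's code: it uses a normal approximation with no tie or continuity correction, the descriptor values are made up, and alpha = 0.05 is an assumed significance level. In practice a library routine such as scipy.stats.mannwhitneyu would be the usual choice.

```python
import math


def mann_whitney_u(x, y):
    """U statistic for x versus y, with a normal-approximation p-value.

    Simplified sketch: no tie or continuity correction, so the p-value is
    only approximate (and rough for small samples).
    """
    n1, n2 = len(x), len(y)
    # Count pairs where the x value exceeds the y value; ties count half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_two_sided


active = [7.1, 6.8, 7.5, 6.9, 7.3]    # made-up descriptor values
inactive = [4.2, 5.0, 4.8, 5.5, 4.1]  # made-up descriptor values
u, p = mann_whitney_u(active, inactive)
alpha = 0.05  # assumed significance level
print(u, p, "different distribution" if p <= alpha else "same distribution")
```

A result of "same distribution" only says that this one descriptor does not separate the two classes at the chosen alpha; it does not say the compounds are indistinguishable in every respect.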

  • @ImportData1
    @ImportData1 4 years ago +2

    From my experience, if you don't apply a log transformation, the scatter plot looks very skewed, which makes it hard for us to see the pattern! So we apply a log transformation so that the scatter plot is interpretable!

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Thanks for sharing your answer 😃 Right, the distribution becomes more uniform (less skewed) after log transformation.

    • @ImportData1
      @ImportData1 4 years ago +1

      @DataProfessor 😬👍

  • @yinyang6058
    @yinyang6058 4 years ago

    Thank you, professor! This is exactly what I need for now!

  • @PutaCaliente
    @PutaCaliente 4 years ago +2

    Great video, thanks for publishing this quality content! A question - I was curious as to whether this project would be considered chemoinformatics more than bioinformatics due to the focus on molecules and chemical descriptors?

    • @DataProfessor
      @DataProfessor  4 years ago

      Yes, you are absolutely correct. This is indeed cheminformatics.

  • @Kane9530
    @Kane9530 3 years ago +1

    Quick and dirty Python code: log transformation reduces the skew of the IC50 distribution:
    ***
    import matplotlib.pyplot as plt
    # Side-by-side histograms of raw IC50 (standard_value) and pIC50 from df_final
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    x1 = df_final.loc[:, "standard_value"]
    x2 = df_final.loc[:, "pIC50"]
    x = [x1, x2]
    titles = ['Before_norm', 'After_norm']
    for i, ax in enumerate(axes):
        ax.hist(x[i], color='red', bins=10)
        ax.set_title(titles[i])
    ***

  • @priyankapawar538
    @priyankapawar538 3 years ago +1

    Thank you for this amazing tutorial. I am getting an error at removing the intermediate bioactivity class as:
    AttributeError: 'DataFrame' object has no attribute 'bioactivity_class'
    Could you please tell me how to solve this error?

    • @kudakwashenyambo6023
      @kudakwashenyambo6023 3 years ago

      I am also facing the same problem. @Data Professor, could you kindly assist? Thank you.

  • @theodoreguo1531
    @theodoreguo1531 2 years ago

    Very helpful video! One question: where are the intermediate classes coming from? I do not see them in my data and I'm getting errors in my code.

  • @remia5
    @remia5 4 years ago +1

    Thanks very much for your efforts.
    Would you consider doing omics analysis tutorials with a heavy focus on graphical visualization of omics data? As you know, visualization is key in omics bioinformatics. Thanks.

    • @DataProfessor
      @DataProfessor  4 years ago

      Great suggestion! There are a lot of great visualizations that one can do for analyzing such datasets. I'll definitely do some data viz videos for some of the bioinformatics datasets that are provided in this Bioinformatics tutorial series.

  • @RohitKumar-vr8yt
    @RohitKumar-vr8yt 3 years ago +1

    Hello sir, at 9:49, where you used codeocean's code to calculate the descriptors: while working with an EGFR dataset, I am getting this error:
    ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors._CalcMolWt(NoneType)
    did not match C++ signature:
    _CalcMolWt(RDKit::ROMol mol, bool onlyHeavy=False)
    Can you please tell me how to resolve it?

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Hi, this seems to be similar to your error
      github.com/rdkit/conda-rdkit/issues/87

  • @huynhthichung9497
    @huynhthichung9497 3 years ago +1

    I'm just a newbie to bioinformatics. I watch your videos every day before going to bed, though I don't understand much of your coding 😊😊😊. I hope that next year I can finish my own bioinformatics project. Thank you so much for giving me this much motivation to follow my dream.

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Awesome, thank you! I'll also make more introductory-level content. I have just given an introductory lecture on computational drug discovery and will be uploading it to the channel soon.

  • @justindreyer6110
    @justindreyer6110 3 years ago +1

    Great video, working my way through the series. I was just wondering: how come 'standard_value' entries > 1 x 10^8 are not just discarded?

    • @DataProfessor
      @DataProfessor  3 years ago

      Great question. Although they are extreme outliers, they can still serve as data samples for compounds belonging to the inactive class.

  • @mramzanshahidkhan3917
    @mramzanshahidkhan3917 3 years ago +1

    Really thankful to you for sharing very helpful and informative content. Can you also make some tutorials on deep learning methods like GANs on bio sequential data?

  • @VLM234
    @VLM234 4 years ago +1

    One more great video on data science.
    I am a beginner and learning a lot from your YouTube channel; it's really helping me a lot. Thank you so much.
    I have a kind request: could you please point me to some research papers on artificial intelligence in inventory and supply chain management with Python code?
    Thank you again....

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Thanks for watching, and glad you're finding the content on this channel helpful. I would recommend checking out paperswithcode.com/ and, for supply chain papers with code, try paperswithcode.com/search?q_meta=&q=supply+chain
      Hope this helps.

    • @VLM234
      @VLM234 4 years ago +1

      @DataProfessor Thank you so much.

  • @muhammadjamalahmed8664
    @muhammadjamalahmed8664 4 years ago +1

    Ohhhh wowwwww lovely, great work keep it up...

  • @mohamadsabreen2360
    @mohamadsabreen2360 1 year ago

    Can we get dopamine & serotonin levels using any electrodes?
    If possible can you please let me know the electrode which can be used for this please.

  • @pcliang2693
    @pcliang2693 4 years ago +2

    Very nice, thanks.

  • @VLM234
    @VLM234 4 years ago +2

    Such great videos... I am following this series sincerely... I have a question: is there any way to directly access the CSV file from the first Colab notebook in the current one, instead of downloading it from the first and then uploading it to the other?
    Thank you so much for giving so much knowledge to the needy totally free of charge...

    • @DataProfessor
      @DataProfessor  4 years ago +2

      Thanks for watching! That's a great question. Correct me if I'm wrong, but I don't think it is possible to access a file directly from the first Colab notebook. It is, however, possible to save files generated in one Colab notebook directly to your Google Drive, from which you can access them in your subsequent Colab notebooks.

    • @VLM234
      @VLM234 4 years ago

      @DataProfessor I am not sure, but I think converting the file from .ipynb to .py and then running it in the current Colab notebook would produce the saved file, which we could then use... I don't know whether it will work...

  • @GardeningWorld857
    @GardeningWorld857 3 years ago +1

    Thank you! Could you please add more content regarding "R programming to solving biological problems"? This would be a great help.

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Thanks for the suggestion, I am planning on releasing more R content. I have just filmed the first episode, stay tuned for that.

    • @GardeningWorld857
      @GardeningWorld857 3 years ago +1

      @DataProfessor I must say this channel is great. The teaching method and content are amazing.

    • @DataProfessor
      @DataProfessor  3 years ago +1

      @GardeningWorld857 Appreciate the kind words 😊

  • @tallyme3625
    @tallyme3625 3 years ago +1

    thanks so much sir for this great tutorial

  • @alexwayne5368
    @alexwayne5368 1 year ago

    What should I do if the numbers don't match exactly with the ones shown in the video?

  • @hassanbelarbi5185
    @hassanbelarbi5185 4 years ago +1

    Thank you very much for this helpful series

    • @DataProfessor
      @DataProfessor  4 years ago

      Thanks Hassan for the kind words and glad you’ve found the series helpful😃

  • @taosun8549
    @taosun8549 4 years ago +1

    Thanks, it is very helpful to me.

  • @alhenaminhaz
    @alhenaminhaz 4 years ago +1

    Hi, can you please tell me where I can learn all the basics to understand your videos better? I'm a newbie to this field and I'm looking forward to learning it. Please suggest books or articles to help me understand your project even better.

    • @DataProfessor
      @DataProfessor  4 years ago

      That's a great question! To better answer your question, I am writing a Medium article on Bioinformatics 101 that is due to be released soon (my Medium profile is medium.com/@chanin.nantasenamat). In the meantime you might want to check out the 2-part introductory video on Bioinformatics 101 that I made; there's also a 2-hour lecture that covers the basics of computational drug discovery. They are all in the playlist bit.ly/dataprofessor-bioinformatics

  • @madhavanjn
    @madhavanjn 4 years ago

    Dear Professor, please make a video on how to download whole SNP data, preprocess it, carry out further analysis, etc.

  • @hyrunnisa997
    @hyrunnisa997 3 years ago +2

    I used a different data set. I searched for compounds associated with estrogen receptors, and I picked human estrogen receptor alpha...
    I keep getting an error from one line in your lipinski function.
    mol = Chem.MolFromSmiles(elem)
    TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
    I think there is an issue with the SMILES data... but I am not sure how to fix it since there are about 3000 items...

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Hi, could you try running in increments? You can subset the data such as df[0:100], df[100:200], etc.

    • @hyrunnisa997
      @hyrunnisa997 3 years ago

      @DataProfessor Thank you. That definitely helped. I was able to eliminate 30 rows that were a problem. I am really sorry to keep asking, but after that I ran into another issue.
      I think something is not in the right format or not the right type.
      this is the error code:
      ArgumentError: Python argument types in
      rdkit.Chem.rdMolDescriptors._CalcMolWt(NoneType)
      did not match C++ signature:
      _CalcMolWt(RDKit::ROMol mol, bool onlyHeavy=False)

    • @hyrunnisa997
      @hyrunnisa997 3 years ago +1

      @DataProfessor I figured out the issue! So no worries. Thanks again for this awesome tutorial. I am going to submit my project for an interview. Hopefully, I get the job.

    • @DataProfessor
      @DataProfessor  3 years ago +1

      @hyrunnisa997 Awesome, good luck with the interview 😊

  • @gayathrielangovan2225
    @gayathrielangovan2225 4 years ago +2

    I would like to learn NGS data analysis... can you do some tutorials on that?

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Hi, thanks for the suggestion, I am afraid that would be outside my expertise, sorry. However, I may show some basic sequence analysis using the BioPython library in the future.

    • @gayathrielangovan2225
      @gayathrielangovan2225 4 years ago

      @DataProfessor Okay sir... thanks for your reply

    • @shwetaredkar734
      @shwetaredkar734 4 years ago

      @DataProfessor I am waiting for the Biopython video.

  • @abolajishiwoku32
    @abolajishiwoku32 4 years ago

    Professor, if one decided to use a different standard value such as potency, how would that change the application of this project?

  • @hunterfitzpatrick8389
    @hunterfitzpatrick8389 3 years ago

    I have two questions about this process. First, if no "active" compounds exist for a given protein, can these statistical tests be performed between the intermediate and inactive compounds instead? Second, if we "fail to reject H0" on all of the Mann-Whitney tests, should we discard the data and choose a different protein instead of progressing to the next steps? Thanks.

  • @alfonlinata6797
    @alfonlinata6797 4 years ago +1

    Thank you, Professor, for this great lecture. I have a question about the interpretation of Lipinski's descriptors. After doing it myself, I found that none of the descriptors shows a significant difference. What does that mean? Thank you in advance, prof.

    • @DataProfessor
      @DataProfessor  4 years ago +2

      Hi, that's a great question! Lipinski's descriptors usually don't contribute to highly predictive models of biological activity. For that there are several descriptor types, such as the PubChem fingerprint (computed via the PaDEL-Descriptor software). The Lipinski descriptors are used to evaluate the drug-likeness of compounds. The 4 descriptors provide a rough idea as to whether compounds are "drug-like" or not.

    • @alfonlinata6797
      @alfonlinata6797 4 years ago

      @DataProfessor Thank you so much for your explanation, prof. Have a great day.

    • @wilzgaming777
      @wilzgaming777 3 years ago

      Me too: none of the Lipinski descriptors shows a significant difference between the active and inactive classes, and neither does pIC50

  • @meenavinaykumar452
    @meenavinaykumar452 3 years ago +1

    Please help I am getting an error
    TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float

    • @fernandoarmandomartinezurr3921
      @fernandoarmandomartinezurr3921 3 years ago

      I am having the same problem... were you able to fix it?

    • @meenavinaykumar452
      @meenavinaykumar452 3 years ago

      @fernandoarmandomartinezurr3921 Yes.
      The problem was with the SMILES notation

    • @mahima5781
      @mahima5781 2 years ago

      @meenavinaykumar452 How did you fix it?

  • @virginiahperiah5777
    @virginiahperiah5777 4 years ago +1

    Quick question: when you removed the intermediate class, how come it showed 119 rows by 8 columns, but the data remained the same at 133 by 8?

    • @DataProfessor
      @DataProfessor  4 years ago

      Right, we probably have to check that, once we get 119 rows after removing the intermediate class, we also assign the result to a variable that we can use for the subsequent steps.

  • @musicalrea5433
    @musicalrea5433 4 years ago

    Thank you so much for this video!!

  • @ShiwanaGhai
    @ShiwanaGhai 1 year ago

    "min() arg is an empty sequence"
    I am getting this error while plotting... what should I do?

  • @ingyghanm5402
    @ingyghanm5402 2 months ago

    Can you please answer my question? In the step of generating the box plot for pIC50, there wasn't a threshold value between the active and inactive classes, so when I then applied the Mann-Whitney U test, it showed "same distribution" for all the Lipinski descriptors. Did I miss something, or is it normal to get this result?
    Thank you

    • @DataProfessor
      @DataProfessor  2 months ago

      There were 2 threshold values used to define the active and inactive classes, most likely less than 1 uM and greater than 100 uM, respectively

    • @ingyghanm5402
      @ingyghanm5402 2 months ago

      @DataProfessor Can I send you my results? I noticed that the data set has changed from the one you're using in the video.

  • @benjamintwumasi2480
    @benjamintwumasi2480 2 years ago

    @Data Professor Please where can I get access to the csv file (The one in the df variable) ?

  • @mayurjoshi1452
    @mayurjoshi1452 11 months ago

    I need some help: one line of your code is not working. It reports that no registered C++ converter could handle the SMILES. Only this line doesn't work, and it's an important single line of code. Please reply.

  • @tejiyo
    @tejiyo 3 years ago +1

    Can you please explain why you wrote the three lines of code before '! conda install -c rdkit rdkit -y'? Also, I am using a Mac, so which lines should I use, since mine is not Linux? My code shows that the 'rdkit' module is not found when I try to import it. Please help!

    • @DataProfessor
      @DataProfessor  3 years ago

      Hi those 3 lines are for installing conda in Linux (Ubuntu) for the Google Colab since the notebook resets every logout. If you already have conda installed, you can ignore those 3 lines. If you have installed rdkit on your mac, then you can also delete that line as well.

  • @sanam6866
    @sanam6866 2 years ago

    Can I normalize custom columns using MinMaxScaler( ) or StandardScaler( ) instead of calculating pIC50?

  • @maithilipokle4437
    @maithilipokle4437 3 years ago +1

    The problem with my data is that all reported compounds are active. What to do?

    • @DataProfessor
      @DataProfessor  3 years ago +1

      No worries, if all reported compounds are active, that is great news for you: you have a set of potent compounds. In this case, I would recommend building regression models.

    • @maithilipokle4437
      @maithilipokle4437 3 years ago

      @DataProfessor So skip Mann-Whitney and do a regression analysis? Is that what you mean?

  • @sumitkohli8628
    @sumitkohli8628 1 year ago

    While importing rdkit, I am getting the error /usr/local/lib/python3.7/site-packages/rdkit/__init_.py import error undefined symbol: py__open. Any idea where the path of rdkit is defined?
    I am using Colab and have already installed conda and rdkit as per the instructions. I also installed miniconda3 in my home directory (but I am not sure how to change the path so that the program looks for rdkit in my home directory instead of python3.7/site-packages)

  • @rochishrvevo6136
    @rochishrvevo6136 3 years ago

    Any idea how to calculate pKa and LogD values?

  • @SanamenteInsano
    @SanamenteInsano 2 years ago +1

    thank you

  • @gustavovenanciodasilva1174
    @gustavovenanciodasilva1174 2 years ago

    When I run the Lipinski descriptors function on the dataset (df_lipinski), I'm getting this error: TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
    Can someone help me solve this? I can't understand what the problem is

  • @mandarvaidya7947
    @mandarvaidya7947 4 years ago +1

    Apart from bioinformatics, what are the other branches of science where data science is most relevant? Can you tell me?

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Thanks Mandar for the question. I would say that data science can be used in all branches of science, healthcare, engineering and business. The reason is that all of these generate data in one form or another (text, numbers, images, videos, audio, etc.). As patterns are inherently present in these big data, it follows that data science has great potential for making sense of them. Hope this helps.

  • @sofijastanojlovic1761
    @sofijastanojlovic1761 3 months ago

    Is the standard value for each compound already in nM?

    • @sofijastanojlovic1761
      @sofijastanojlovic1761 3 months ago

      Why is it that for some the units are given as uM, and for some as nM?

  • @nehabiswas9921
    @nehabiswas9921 3 years ago +1

    Hi Professor, I've tried changing my target protein but in part 2 I'm getting an error and got stuck there. It will be helpful if you kindly help me with the problem : df_lipinski = lipinski(df_clean_smiles.canonical_smiles) in this cell I'm getting error which says as follows: ArgumentError Traceback (most recent call last)
    in ()
    ----> 1 df_lipinski = lipinski(df_clean_smiles.canonical_smiles)
    ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors._CalcMolWt(NoneType)
    did not match C++ signature:
    _CalcMolWt(RDKit::ROMol mol, bool onlyHeavy=False)

    • @jaimegonzalez8187
      @jaimegonzalez8187 3 years ago +1

      I had the same error. In my case, the problem was NaN in the canonical_smiles column. I solved it by removing the rows with NA values in the canonical_smiles column with something like
      df3 = df2[df2.canonical_smiles.notna()]
      I hope it helps you
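
The same idea, sketched in plain Python for readers who want to see what is being filtered. The helper name and sample values are hypothetical. A NaN is a float, which matches the "Python object of type float" in the TypeError reported in this thread, and a None molecule reaching the descriptor calculation is consistent with the "_CalcMolWt(NoneType)" message.

```python
def keep_valid_smiles(values):
    """Keep only non-empty string entries; NaN (a float) and None are dropped."""
    return [s for s in values if isinstance(s, str) and s.strip()]


# Hypothetical column contents, including the kinds of gaps that cause errors.
raw = ["CCO", float("nan"), "c1ccccc1", None, ""]
clean = keep_valid_smiles(raw)
print(clean)

# With RDKit installed, a second check is still worthwhile, because
# Chem.MolFromSmiles returns None for strings it cannot parse:
# from rdkit import Chem
# mols = [m for m in (Chem.MolFromSmiles(s) for s in clean) if m is not None]
```

Filtering before the descriptor step also avoids having to hunt for bad rows by slicing the data in chunks.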

  • @tkanchanok
    @tkanchanok 3 years ago

    Dear sir, how do I switch to, and install, other source types or database libraries? Thank you.
    For example, $ ! pip install PubChemPy
    $ import pandas as pd
    $ from pubchempy import *
    Is this correct?

  • @shwetaredkar734
    @shwetaredkar734 4 years ago +4

    This is literally awesome and a very very useful tutorial. Could you please make some tutorials on various packages such as protr, PyBiomed, propy, iFeature? Thanks would be helpful.

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Noted, thanks for the kind suggestions.

  • @moa4f223
    @moa4f223 3 years ago

    AttributeError: 'DataFrame' object has no attribute 'bioactivity_class'. I have this problem and I do not know how to fix it

  • @bikashthapa8622
    @bikashthapa8622 7 months ago

    Hi Professor, I'm trying to do a project on this topic. Everything was going well until I stumbled on
    df_norm = norm_value(df_combined)
    df_norm
    This code does not run please help me fix this problem. Thank you

    • @DataProfessor
      @DataProfessor  7 months ago

      Please check the df_combined variable to see if the dataframe is as expected. Also go through the norm_value function line by line to check whether it is generating the expected data transformation

  • @kylledennicesaligumba821
    @kylledennicesaligumba821 4 years ago

    Professor, how do we download the final data frame to be used in Part3?

  • @nishthamalhotra8027
    @nishthamalhotra8027 3 years ago +1

    How do I use these codes in windows 10 operating system? Kindly let me know about this

    • @DataProfessor
      @DataProfessor  3 years ago

      Hi, these codes can be used from within Python environment which can be used on any OS (Windows, Linux, Mac OSX). You can install Python via anaconda, mini-conda or directly from python.org. On top of Python, you can code in a Jupyter notebook (by first installing the jupyter library) or in an IDE (Spyder, PyCharm). Alternatively you can also work on the cloud using Google Colab (the Jupyter notebook on the cloud).

  • @sruthisampath3432
    @sruthisampath3432 3 years ago +1

    Amazing series on drug discovery using ML. I have a question about installing rdkit in my virtual environment. Despite a successful installation, import rdkit throws an error. Could you please help me with it?

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Did you use conda to install rdkit?

    • @sruthisampath3432
      @sruthisampath3432 3 years ago

      @DataProfessor Yes, I did. I'm trying to install it on Ubuntu 18.04. I set the Python version of my virtual environment to 2.7.18 and then tried installing using conda.

    • @DataProfessor
      @DataProfessor  3 years ago

      @sruthisampath3432 I think the problem lies in Python 2.7. Can you try installing 3.7 and then installing rdkit using conda?

    • @sruthisampath3432
      @sruthisampath3432 3 years ago

      @DataProfessor I did try installing it with 3.7 in Ubuntu. Didn't work. But the same works in Windows 😄

    • @DataProfessor
      @DataProfessor  3 years ago

      @sruthisampath3432 Which command did you use to install rdkit? Was it: conda install -c conda-forge rdkit

  • @Nolan_daily
    @Nolan_daily 1 year ago

    Why use the U test here instead of another test?

  • @microscience3879
    @microscience3879 3 years ago +1

    Can you make a video on data science for genomics?

    • @DataProfessor
      @DataProfessor  3 years ago

      Thanks for the suggestion, will certainly consider that for a future video

  • @pcliang2693
    @pcliang2693 4 years ago

    Dear professor, do you have any corresponding literature in this Drug Discovery video? Thanks

    • @DataProfessor
      @DataProfessor  4 years ago

      This article from our research group would be relevant peerj.com/articles/2322/

    • @pcliang2693
      @pcliang2693 4 years ago

      @DataProfessor Thank you very much, thanks!

  • @sumanjoseph-w8c
    @sumanjoseph-w8c 1 year ago

    Error in Calculate descriptors step:
    TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float
    How to solve?

    • @ojochenemienejoh8209
      @ojochenemienejoh8209 1 year ago

      @dataprofessor I also experienced this error. Please, how can it be resolved?

  • @MrEnriquefirst
    @MrEnriquefirst 4 years ago

    Thank you very much. When will you release Part 3?

    • @DataProfessor
      @DataProfessor  4 years ago

      Filming of Part 3 is finished and it is currently being edited. It should be out soon, hopefully by tomorrow.

  • @padmashreerao9162
    @padmashreerao9162 4 years ago +1

    Hi, Can you please tell me where I can find a research paper with code? Thanks

    • @DataProfessor
      @DataProfessor  4 years ago

      Here it is paperswithcode.com 😃

    • @padmashreerao9162
      @padmashreerao9162 4 years ago +1

      @DataProfessor That's quick. Thanks for adding useful videos. Can you please make videos on analyzing PubMed gene data?

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Thanks for the suggestion, I’ll see what I can do 😃

  • @maithilipokle4437
    @maithilipokle4437 4 years ago +1

    I am getting a traceback from the code after defining the Lipinski function. What should I do? Someone please help.

    • @DataProfessor
      @DataProfessor  4 years ago

      Are you running this on Google Colab? Try running the cells again in sequential order. If the error is still there, can you paste the entire error message here?

    • @maithilipokle4437
      @maithilipokle4437 4 years ago +3

      @@DataProfessor I tried running it again in sequential order but I am still getting a traceback:
      TypeError Traceback (most recent call last)
      in ()
      ----> 1 df_lipinski = lipinski(df.canonical_smiles)
      in lipinski(smiles, verbose)
      5 moldata= []
      6 for elem in smiles:
      ----> 7 mol=Chem.MolFromSmiles(elem)
      8 moldata.append(mol)
      9
      TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string from this Python object of type float

    • @DataProfessor
      @DataProfessor  4 years ago

      @@maithilipokle4437 Was rdkit installed? Was this run on Colab or in a local Jupyter notebook?

    • @maithilipokle4437
      @maithilipokle4437 4 years ago +1

      @@DataProfessor Yes, I did install rdkit and conda (the first set of code). I ran it on Colab.

    • @DataProfessor
      @DataProfessor  4 years ago

      @@maithilipokle4437 Hi, there are 2 parts to this tutorial. Here’s what you need to do:
      1. Run Part 1 and save the generated bioactivity data CSV file to your local computer.
      2. Run Part 2 and make sure to upload the CSV file from Part 1 to it. Do the upload before running the cell that reads the bioactivity data file.
      Hopefully this should work.

  • @shivamdubey4783
    @shivamdubey4783 2 years ago

    Sir, please give an explanation of the code from the link.

  • @habib139
    @habib139 1 year ago

    Hello! Thank you for this amazing tutorial! I just have a small problem that I would like some help with. Whenever I run the norm_value() function I keep getting the error below. Does anyone know how I can solve this?
    Cell In[53], line 1
    ----> 1 df_norm = norm_value(df_combined)
    2 df_norm
    Cell In[49], line 10, in norm_value(input)
    7 norm.append(i)
    9 input['standard_value_norm'] = norm
    ---> 10 x = input.drop('standard_value', 1)
    12 return x
    TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given

    • @psychotropicalfunk
      @psychotropicalfunk 1 year ago

      Passing the axis to drop() as a positional argument was deprecated and has been removed in pandas 2.0. Use this instead: x = input.drop(columns='standard_value')

    • @krishna-s4w
      @krishna-s4w 22 days ago

      @@psychotropicalfunk Can you please write the code for that? I'm a little confused.
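Here is a sketch of the full function with that one-line change. It is reconstructed from the tracebacks quoted in these comments, so the original notebook may differ in small details. The small df_combined table is made-up sample data with placeholder IDs:

```python
import pandas as pd

def norm_value(input):
    # Cap every standard_value at 100,000,000.
    norm = []
    for i in input['standard_value']:
        if i > 100000000:
            i = 100000000
        norm.append(i)

    input['standard_value_norm'] = norm
    # Name the axis explicitly; drop('standard_value', 1) fails on pandas 2.x.
    x = input.drop(columns='standard_value')
    return x

# Made-up sample data standing in for the notebook's df_combined.
df_combined = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL_A', 'CHEMBL_B'],
    'standard_value': [7200.0, 250000000.0],
})
df_norm = norm_value(df_combined)
print(df_norm)
```

The keyword form drop(columns=...) also works on older pandas versions, so the same cell should run whether or not the environment has been upgraded.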

  • @gabrieleblanca4702
    @gabrieleblanca4702 4 years ago +1

    Dear Data Professor,
    Thank you for your great lecture.
    I need your help.
    Once I run the following input:
    df_norm = norm_value(df_combined)
    df_norm
    The output is the following:
    TypeError Traceback (most recent call last)
    in ()
    ----> 1 df_norm = norm_value(df_combined)
    2 df_norm
    in norm_value(input)
    3
    4 for i in input['standard_value']:
    ----> 5 if i > 100000000:
    6 i = 100000000
    7 norm.append(i)
    TypeError: '>' not supported between instances of 'str' and 'int'
    Thank you in advance for your help

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Python raises this error because the values are strings, and a string cannot be compared with an integer using the > operator. You need to convert each value to a number first, then try again. Hint: int(i)
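One way to do the conversion hinted at above is to convert the whole column before calling norm_value. This is a sketch on made-up data, not the notebook's own code. It uses pandas.to_numeric rather than int(i), because int() raises a ValueError on strings with a decimal point such as '7200.0'. Dropping the rows that cannot be parsed is an assumption about what you want; you may prefer to inspect them first.

```python
import pandas as pd

# Made-up sample: values read as text, one of them not a number at all.
df_combined = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],  # placeholder IDs
    "standard_value": ["7200.0", "not reported", "250000000"],
})

# Convert text to numbers; anything unparseable becomes NaN instead of raising.
df_combined["standard_value"] = pd.to_numeric(
    df_combined["standard_value"], errors="coerce"
)

# Drop the rows that could not be converted.
df_combined = df_combined.dropna(subset=["standard_value"]).reset_index(drop=True)

# The comparison that failed before now works on numeric values.
too_large = df_combined["standard_value"] > 100000000
print(too_large.tolist())
```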

    • @gabrieleblanca4702
      @gabrieleblanca4702 4 years ago +1

      @@DataProfessor Thank you very much!! I solved that problem by running the following:
      df_combined.standard_value.describe()
      df_combined['standard_value'] = pd.to_numeric(df_combined.standard_value, errors='coerce').fillna(0, downcast='infer')
      Now when I run:
      df_norm.standard_value_norm.describe()
      the output is:
      count 1.320000e+02
      mean 2.126145e+07
      std 4.101124e+07
      min 5.000000e+01
      25% 1.070000e+04
      50% 2.415000e+04
      75% 3.000000e+05
      max 1.000000e+08
      Name: standard_value_norm, dtype: float64
      It is different from your lecture because I used another dataset, but now it is working well!! Many thanks.
      May I suggest some new lectures in chemoinformatics, in particular in protein-ligand interaction or protein structure prediction. This topic is very hot!!
      Have a nice weekend and thank you again!!

    • @DataProfessor
      @DataProfessor  4 years ago

      @@gabrieleblanca4702 Glad it's working out for you. Thanks for the suggestion. Protein-ligand interaction by means of molecular docking, I presume? I will try to package the topic to be of wide interest and relevant to data science, and will definitely consider this for future videos.

    • @gabrieleblanca4702
      @gabrieleblanca4702 4 years ago +1

      @@DataProfessor absolutely yes! Molecular docking should be of great relevance! I can't wait to watch it!!
      Thank you again for your availability. Have a nice day💪

  • @kelvinlin5629
    @kelvinlin5629 4 years ago +1

    This is so great. I just started learning data science and ML, and these are the projects I am hoping to do. Also, I reached out to you on Instagram. Please kindly check it out; I have some questions regarding it. Thank you so much.

    • @DataProfessor
      @DataProfessor  4 years ago

      Thanks for the kind comment, Kelvin. Okay, I will check Instagram and get back to you.

  • @sebastianjorgecastro2452
    @sebastianjorgecastro2452 4 years ago +1

    Do you have a video explaining rdkit for conformational search and energy minimization? It would be very useful for me!

    • @DataProfessor
      @DataProfessor  4 years ago +1

      Not yet, I'll consider this for future videos.

  • @davidcovell-rn9sw
    @davidcovell-rn9sw 2 months ago

    Running 'CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb' on a Windows machine, after 'df_norm = norm_value(df_combined)' I am getting:
    ---------------------------------------------------------------------------
    TypeError Traceback (most recent call last)
    Cell In[49], line 1
    ----> 1 df_norm = norm_value(df_combined)
    2 df_norm
    Cell In[47], line 10, in norm_value(input)
    7 norm.append(i)
    9 input['standard_value_norm'] = norm
    ---> 10 x = input.drop('standard_value', 1)
    12 return x
    TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given
    This also occurs with the coronavirus example in this same set. Any help/solution would be appreciated.
    Thanks

  • @tinaparker9974
    @tinaparker9974 1 year ago

    At the 19:58 timestamp, what is happening with the active and inactive classes? 🥲