Multiple Sequence Alignment (MSA) in R (Bioinformatics S11E2)

  • Published: 12 Sep 2024

Comments • 32

  • @mateuslemos126
    @mateuslemos126 15 days ago +1

    I'm binge watching all your lectures!

  • @douglaslima9637
    @douglaslima9637 1 year ago +1

    Very good video. I'm now starting to study sequence alignment and this video clarified a lot of things for me. Thanks!

    • @DannyArends
      @DannyArends 1 year ago +1

      Happy to hear it was helpful! Thanks for leaving a comment 👍

  • @metarasouli
    @metarasouli 1 year ago +2

    Thanks for sharing this video. It was so helpful.

    • @DannyArends
      @DannyArends 1 year ago +1

      Glad it was helpful!

    • @metarasouli
      @metarasouli 1 year ago +1

      @DannyArends You made life live, Man.

  • @DannyArends
    @DannyArends 2 years ago

    Skip directly to the R part at 29:06 if you're not interested in the theory behind Multiple Sequence Alignment.

  • @andyderek3021
    @andyderek3021 2 years ago +1

    @Danny Arends this is a very informative lecture. I have been searching for similar lectures of this type. I am very interested in sequence alignment for my thesis. I would appreciate it if you could point me to a specific area to work on, perhaps within multiple sequence alignment: I want to know what precisely my research should focus on. My bachelor's background is in Computer Science and I am currently doing an M.Sc. in Life Science Informatics.

    • @DannyArends
      @DannyArends 2 years ago +1

      Hey Andy, thanks for leaving a comment. About your question: I think multiple sequence alignment is still an active field of development, and there are still many computational improvements that can be made to the current aligners out there. Look at the latest papers on Clustal Omega and Muscle and check out their code repositories. Often it's better to join an existing team of developers for e.g. an MSc project and contribute to an existing piece of software than to start your own aligner from scratch (more of a PhD-level task). Furthermore, I think there is currently a need and a research opportunity for aligners that can deal with alignment against a pangenome, compared to current aligners that use a reference genome.
      Good luck with your MSc and if you have more questions feel free to come back to me here or drop me an email.

  • @kushunadkat9087
    @kushunadkat9087 5 months ago +1

    How can I select a few of the most common sequences from the available sequences?

    • @DannyArends
      @DannyArends 5 months ago +1

      What do you mean by 'the most common sequences'? In theory you could make a dendrogram of the output from MSA and extract the clusters from it (using hcut on the dendrogram to cut it into sub-trees). Sequences within a cluster are more related to each other, so this would give you an overview of related sequences and different groups of sequences in your data.
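The dendrogram idea can be sketched in base R. This is only an illustration under stated assumptions: the four aligned toy sequences are made up, the distance is a simple fraction of differing alignment columns, and the tree is cut with `cutree()` from base R rather than the `hcut` mentioned in the reply.

```r
# Hypothetical, already-aligned toy sequences of equal length ('-' is a gap).
aligned <- c(seq_1 = "ACGTACGT",
             seq_2 = "ACGTACGA",
             seq_3 = "TTGTAC-T",
             seq_4 = "TTGTACCT")

# One row per sequence, one column per alignment position.
chars <- do.call(rbind, strsplit(aligned, ""))

# Pairwise distance: fraction of alignment columns in which two sequences differ.
n <- nrow(chars)
d <- matrix(0, n, n, dimnames = list(names(aligned), names(aligned)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mean(chars[i, ] != chars[j, ])
  }
}

# Build the dendrogram and cut it into k sub-trees (k = 2 is an arbitrary choice).
tree <- hclust(as.dist(d), method = "average")
clusters <- cutree(tree, k = 2)
print(clusters)
```

Sequences that receive the same cluster number sit in the same sub-tree; `plot(tree)` draws the dendrogram itself.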

    • @kushunadkat9087
      @kushunadkat9087 5 months ago +1

      @DannyArends I considered that option but I was wondering if there was a way (maybe by using BioPython or R) to, say, get the sequences with the highest scores or compare them to a pairwise score to find the most similar protein sequences.
      I’m sorry if the question is dumb or doesn’t make sense, I’m still a student and I’m not a bioinformatics or computer science major. 😅

    • @DannyArends
      @DannyArends 5 months ago +1

      No dumb questions, just trying to figure out your goal. It sounds like you want to score the sequences in the input set by their distance to the consensus sequence. I would run an initial alignment, get the consensus sequence, add that to the input, and align again. Then cluster the results and visualize them; this should allow you to identify the sequences closest to the consensus sequence.
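A rough base-R sketch of scoring sequences against a consensus. It is a shortcut, not the realign-and-cluster procedure described in the reply: it assumes the sequences are already aligned (the toy data are made up), takes a majority-vote consensus per column, and ranks sequences by the fraction of positions that differ from it.

```r
# Hypothetical, already-aligned toy sequences of equal length ('-' is a gap).
aligned <- c(seq_1 = "ACGTACGT",
             seq_2 = "ACGTACGA",
             seq_3 = "ACGTTCGT",
             seq_4 = "TTGTAC-T")

# One row per sequence, one column per alignment position.
chars <- do.call(rbind, strsplit(aligned, ""))

# Majority-vote consensus: the most frequent character in each column
# (ties go to the alphabetically first character).
consensus <- apply(chars, 2, function(column) names(which.max(table(column))))

# Fraction of positions at which each sequence differs from the consensus.
dist_to_consensus <- apply(chars, 1, function(row) mean(row != consensus))

# Smallest distance first: the sequences most similar to the consensus.
ranked <- sort(dist_to_consensus)
print(paste(consensus, collapse = ""))
print(ranked)
```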

    • @kushunadkat9087
      @kushunadkat9087 5 months ago +1

      @DannyArends thanks so much!
      There’s no goal, I’m just trying to understand. As a follow up question, wouldn’t that be time consuming and tedious for a bunch of sequences?

    • @DannyArends
      @DannyArends 5 months ago +1

      Not really, it would just require running the MSA twice. Time-wise it depends on the number of sequences, their length, and their complexity (DNA vs. amino acids). Depending on the goal there might be other ways to avoid having to run the MSA twice.

  • @szymonjakubowski3574
    @szymonjakubowski3574 1 year ago +1

    How are the gaps between contigs filled when constructing a scaffold?

    • @DannyArends
      @DannyArends 1 year ago +1

      If the length is known from paired-end sequencing, the gap is generally filled with N (which matches every base) to keep the genome length/positions in order. The gap is then resolved using primers on either end and Sanger sequencing of the PCR product.
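As a tiny illustration of the N-padding idea (the contigs and the gap length are made up), joining two contigs with a run of N keeps downstream coordinates where they would be in the full sequence:

```r
# Hypothetical contigs and a gap length estimated from paired-end reads.
contig_a <- "ACGTAC"
contig_b <- "GGTTA"
gap_length <- 4

# Pad the gap with N so that contig_b keeps its expected start position.
scaffold <- paste0(contig_a, strrep("N", gap_length), contig_b)
print(scaffold)
```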

    • @szymonjakubowski3574
      @szymonjakubowski3574 1 year ago +2

      @DannyArends Thank you for the answer. I'm really enjoying your lectures in this course :)

  • @nirjhar9725
    @nirjhar9725 1 year ago +1

    How do I read from a multi-FASTA file like
    >seq_1
    AAAAA
    >seq_2
    TTAA
    and so on

    • @DannyArends
      @DannyArends 1 year ago +1

      Interesting question, in R we can do the following:
      mdata
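A minimal base-R sketch of one way to read such a multi-FASTA file, not necessarily what the reply goes on to show. The helper name `read_fasta` and the temporary file are made up for illustration; packages such as Biostrings (`readDNAStringSet()`) and seqinr (`read.fasta()`) provide ready-made readers.

```r
# Hypothetical helper: read a multi-FASTA file into a named character vector.
read_fasta <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- startsWith(lines, ">")
  stopifnot(length(lines) > 0, is_header[1])

  # Every header starts a new record; cumsum() labels each line with its record.
  record <- cumsum(is_header)
  headers <- sub("^>", "", lines[is_header])

  # Join the (possibly wrapped) sequence lines of each record into one string.
  seqs <- vapply(split(lines[!is_header], record[!is_header]),
                 paste, character(1), collapse = "")
  names(seqs) <- headers[as.integer(names(seqs))]
  seqs
}

# Usage with the example from the question, written to a temporary file.
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">seq_1", "AAAAA", ">seq_2", "TTAA"), fasta_file)
mseqs <- read_fasta(fasta_file)
print(mseqs)
```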

  • @tovarvonbrandt7157
    @tovarvonbrandt7157 2 years ago +1

    For large sequences what should I use? Muscle?

    • @DannyArends
      @DannyArends 2 years ago +2

      It depends on the sequence length, number of sequences, DNA or protein, and the divergence between them. I don't think you can make a general statement, but in my experience Muscle is faster than Clustal when you have a large number of large sequences.

    • @tovarvonbrandt7157
      @tovarvonbrandt7157 2 years ago +1

      @DannyArends I actually managed to replicate a Mega tree using some parts of this code. I just had to open the file via the system instead of copying and pasting it, and change the cladogram.
      I tried both ClustalO and Muscle. In this case both yielded the same tree for a FASTA file with 22 nucleotide sequences (27k to 35k base pairs).

    • @tovarvonbrandt7157
      @tovarvonbrandt7157 2 years ago +1

      @DannyArends is there any way to see the AICs of an MSA done via a multi-sequence FASTA file? Like you do using Mega (models/find best DNA protein models).
      Also, when doing MSA in R, compared to Mega, how do you actually do the phylogeny for maximum likelihood and so on?
      Is that done and chosen automatically by the aligner?

    • @DannyArends
      @DannyArends 2 years ago +1

      Good to hear it worked; larger sequences do indeed go smoother from the HDD.
      As far as I am aware the MSA has no 'goodness of fit' parameter to compare different alignments, because the alignment is always the optimal one based on the input sequences, parameters, and substitution matrix chosen. In the end the researcher is responsible for aligning sensible things.
      The distance between individual sequences is generally computed after alignment using clustering. This step can of course be done using different methods (hclust, k-means, etc.), both supervised and unsupervised. At this part of the process you could probably use a maximum likelihood approach (e.g. hierarchical maximum likelihood clustering), which would give you a goodness-of-fit score for the tree (not the alignment).

    • @tovarvonbrandt7157
      @tovarvonbrandt7157 2 years ago +1

      @DannyArends thank you for your reply. After reading the R documentation for AIC and BIC, I think I need some protein models first in order to compare them via AIC.
      I will be attending your Twitch lectures, as I studied pure data science and am now studying biomed.
      It is hard to reprogram my brain to see things in a more biological way rather than as pure mathematics.

  • @zhenfeixie7286
    @zhenfeixie7286 2 years ago

    How do I set the reference sequence for alignment? Is there any way to split all the aligned sequences into a data.frame, so that each character of a sequence occupies one cell of the data.frame?

    • @DannyArends
      @DannyArends 2 years ago

      In multiple sequence alignment there is no reference; sequences get aligned based on their distances to each other, which is why it is important to only align sequences that are similar to begin with.
      The results from the msa() call in R will contain the aligned sequences (and the consensus sequence):
      library(msa)
      library(seqinr)
      strings

  • @alexhernandezherrera5198
    @alexhernandezherrera5198 2 years ago

    The msa package cannot be installed. Can you please help me?

    • @DannyArends
      @DannyArends 2 years ago

      Without knowing the error it will be difficult. Could you send me an email with a screenshot of the error you are getting?