Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extracting Unmapped Contigs

    Hello,

    I've assembled some contigs from a sequenced strain and aligned them back to the reference with Mauve. I'm interested in studying those contigs that did not align. How can I extract those contigs?

    Mauve can export a file with the positions of all those contigs in the input file so I can probably write a script to get the sequences, but I thought there's probably an easier way. Also, if you know a better software than Mauve, please tell me. My purpose is comparing several de-novo-assembled strains with each other and with the reference genome (~20MB) to see sequence variations.

    Thanks!

  • #2
    Well, Mauve is usually used for genome-to-genome alignments. Although it can be used to map a draft genome with many contigs to a more finished genome. Depending on the size of your contigs and the reference perhaps a mapper such as BBMap might work better -- you would at least get a BAM file that you could work with.

    But since you are already using Mauve, well, it seems simple to me to write a short script via 'cut', 'grep', 'paste', 'diff' etc. to pull out the non-mapping contigs.

    I know my response isn't much help. Sorry about that. Just random thoughts at the end of a long day.

    Comment


    • #3
      I don't know why I went and installed Mauve while I already had the BB package. But now that I tried BBMap, I don't see how a BAM file is better (I'm not even sure this output is BAM). Is there an option in BBMap to retain the contigs that don't get mapped?

      I'll be doing this for several assemblies so it'd be easier and faster if there was a simple flag for it, rather than having to massage the data every time with unix tools.

      Comment


      • #4
        BBMap can separate mapped and unmapped sequences, and keep them in fasta format, like this:

        mapPacBio.sh in=contigs.fasta ref=reference.fasta outm=mapped.fasta outu=unmapped.fasta

        That has a sequence length limit of 6000bp; longer sequences will be broken into fragments (bbmap.sh has a length limit of 600bp while mapPacBio is 6000bp). On the other hand, BBDuk is alignment-free and can do the same thing using kmer matching, without breaking up the input contigs:

        bbduk.sh in=contigs.fa ref=reference.fa outm=matching.fa outu=unmatching.fa k=31 mcf=0.7

        "mcf" means "min covered fraction", so this will make input sequences in which at least 70% of their bases are covered by reference kmers go to outm, and the others will go to outu.

        Another way of doing it is to shred the input (or reference) contigs and map them to the other, get a bed file of the resulting sam, and extract the covered regions. There are a lot of ways, I guess, depending on your specific goal.
        Last edited by Brian Bushnell; 10-13-2015, 02:09 PM.

        Comment


        • #5
          Thanks Brian! About BBMap, does that mean that I shouldn't be using it (or even mapPacBio) to map contigs with n50=91k into the reference (s. pombe)?

          Also could you please please go into the details of your last method?

          Comment


          • #6
            Originally posted by antifolate View Post
            Thanks Brian! About BBMap, does that mean that I shouldn't be using it (or even mapPacBio) to map contigs with n50=91k into the reference (s. pombe)?
            It really depends on your purpose, but for anything that requires the contigs to remain intact, it's not ideal; Seal (or an aligner for unlimited-length reads) would be better.

            Also could you please please go into the details of your last method?
            For example -

            bbduk.sh in=contigs.fa ref=reference.fa out=masked_contigs.fa kmask=N k=31

            That would produce a file like the contigs.fa, but with all bases covered by reference kmers masked (converted to N). This is handy because it still works even if both fastas contain a complete, single-contig genome, without shredding anything or changing any coordinates. Filtering methods would not really work well for single-contig genomes.

            Comment


            • #7
              @antifolate: You should post this question on Mauve list. Aaron will generally address mauve questions in a day or two. It may be possible to extract the info you need from one of the mauve files.

              Comment


              • #8
                @Brian Bushnell Is there a way to make bbduk.sh not count mismatches resulting from Ns towards the 0.7?

                Also, what is k?

                @GenoMax Can you link me to the Mauve list? I can't find it.

                Comment


                • #9
                  Originally posted by antifolate View Post
                  @Brian Bushnell Is there a way to make bbduk.sh not count mismatches resulting from Ns towards the 0.7?
                  You cannot do that with the "mcf" (min covered fraction) flag, but you can with the "mkf" (min kmer fraction) flag.

                  mcf=0.7: reference kmers must cover 70% of the sequence's bases.

                  mkf=0.7: sequence must share 70% of its valid kmers with the reference.

                  The distinction is subtle, but mkf takes into account Ns - sequence with Ns do not have valid kmers, so it is ignored with respect to the 70%. mcf does not take Ns into account. Not currently, at least. I might change that in the future.

                  Also, what is k?
                  "k" is the kmer length. 31 is BBDuk's maximum; you can make it shorter if you want, but specificity starts declining rapidly once you drop below 20.

                  Comment


                  • #10
                    God bless Brian Bushnell! Thanks for the quick reply!

                    Comment


                    • #11
                      Originally posted by antifolate View Post
                      @GenoMax Can you link me to the Mauve list? I can't find it.
                      mauve-users at lists.sourceforge.net

                      You may need to become a member first before you can post.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      7 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      7 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      49 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      66 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X