Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using BBMap to determine possible repeat scaffolds/contigs

    Hello,

    I used BBmap to find the coverage of a draft genome I have with this command:
    Code:
    bbmap.sh in1=reads1.fq in2=reads2.fq ref=scaffolds.fasta covstats=covstats.txt
    Now I'd like to use the coverage on each scaffold (reported in the output covstats.txt) to identify scaffolds that might be repeats.

    The way I want to do this is to look at the average coverage on all scaffolds (which I get from the stdout of BBMap), for example 70x, and see which scaffolds have double that coverage, 140x, or triple, 210x, and so on, implying those scaffolds are repeated once and twice, respectively. Do you think this is a reasonable approach to determine repeat scaffolds from an assembly?

    Please let me know what you think.

  • #2
    @Brian can tell you if that strategy is logical but if you feel they are repeats (contained within the scaffolds) then doing old fashioned dot plots may be one option to check on.

    Comment


    • #3
      Thanks @GenoMax. I'm more looking into those repeats that the assembly missed. Like instead of assembling two different scaffolds with average coverage, it assembles one scaffold with double the average coverage.

      I haven't worked with dot plots that much, except MUMmer's. Would a dot plot help me with this?

      The problem I'm having with my approach is that some scaffolds have a much higher coverage than the average. One of the scaffolds has 14k coverage (the average is 79) and is 22 kbp long. It's unlikely this scaffold is supposed to be assembled 176 more times...

      Comment


      • #4
        Coverage depth is a useful metric for detecting scaffold-sized repeats. Just remember that, depending upon your sample, higher-than-average coverage can also arise from plasmids, viruses, and/or microbial (or other) contamination. It might be useful to BLAST some of the high-coverage scaffolds and see what matches.

        Note that shorter repeats (≥read length or insert size) can also break scaffolds, but may be too short to assemble. An alternative approach is to estimate genome size and the amount of repetitive sequence from k-mer counts. Consistency with your scaffold data would provide more confidence that your analysis is correct.

        Comment


        • #5
          I just blasted the scaffolds with the highest coverage depth and two of them turned out to be mitochondrial. I guess I'll simply filter those two out. Thanks!

          I'm not sure I completely understand your suggestion with the kmer counts. Could you elaborate please?

          Comment


          • #6
            Originally posted by antifolate View Post
            Hello,

            I used BBmap to find the coverage of a draft genome I have with this command:
            Code:
            bbmap.sh in1=reads1.fq in2=reads2.fq ref=scaffolds.fasta covstats=covstats.txt
            Now I'd like to use the coverage on each scaffold (reported in the output covstats.txt) to identify scaffolds that might be repeats.

            The way I want to do this is to look at the average coverage on all scaffolds (which I get from the stdout of BBMap), for example 70x, and see which scaffolds have double that coverage, 140x, or triple, 210x, and so on, implying those scaffolds are repeated once and twice, respectively. Do you think this is a reasonable approach to determine repeat scaffolds from an assembly?

            Please let me know what you think.
            As long as your coverage is fairly unbiased (ideally, unamplified), this is a good strategy.

            The ability to accurately determine the coverage of a contig depends on its length, though - it won't be very accurate with 500bp contigs, while 10kb+ contigs should be quite accurate. Sometimes operating at the kmer level can give more accurate coverage for very short contigs; some assemblers put the kmer-based coverage in the contig name (e.g. "node=55,len=752,cov=23.4"). If you assemble from normalized data that number is not useful, though.

            I haven't worked with dot plots that much, except MUMmer's. Would a dot plot help me with this?
            JGI uses MUMmer all the time, particularly for identifying repeats in PacBio assemblies (they have the potential to produce very long repeats, which are usually correct, but not always) and it is very effective. But that's only if the repeat was actually assembled in multiple copies. Kmer-based assemblers will generally collapse repeats that are longer than the read length (or insert size, or largest kmer) into a single copy. Some more advanced assemblers (like Spades) may try to analyze the graph and make multiple copies of repeats, but sometimes that goes horribly wrong (like when assembling viruses). So MUMmer will not usually help you find repeats when aligning an Illumina draft assembly to itself, though it is useful when aligning a PacBio assembly to itself or a finished reference to itself. Or a draft to a very-closely-related finished assembly.

            The problem I'm having with my approach is that some scaffolds have a much higher coverage than the average. One of the scaffolds has 14k coverage (the average is 79) and is 22 kbp long. It's unlikely this scaffold is supposed to be assembled 176 more times...
            Try BLASTing it - I bet it's a mitochondria Or more precisely, part of a mito. The whole thing is probably bigger. If you are working on a prokaryote, then as HESmith suggests, a plasmid/virus/other contaminant is a possibility... though I've never seen a plasmid with such a high copy-count, and it's hard to get 22kbp contigs from wild (and sometimes even cultured) viruses without a lot of effort. If you are working with a eukaryote, it could also be another organelle such as a chloroplast, but a 176:1 ratio and 22kbp length matches many fungal mitochondrial contigs I've assembled. The length of the whole mitochondria is normally more like 50kbp-90kbp.

            P.S. Rats, I got ninja'd
            Last edited by Brian Bushnell; 01-06-2016, 07:32 PM.

            Comment


            • #7
              Originally posted by antifolate View Post
              I just blasted the scaffolds with the highest coverage depth and two of them turned out to be mitochondrial. I guess I'll simply filter those two out. Thanks!
              It's very useful for assembly releases to contain the mitochondria, especially if they are annotated as mitochondria. That is, if you plan to submit the assembly to NCBI or some other database. Even if you don't, it's useful to isolate the mitochondria for your own purposes. It's worth noting that mitochondria often share part of the host ribosomal DNA (which is one reason they don't assemble into a single contig, since that's a repeat). Presumably there is at least one other gene shared with the host, which is why they tend to form at least two contigs in a DeBruijn graph. The point being that if you filter out mito contigs, mito reads from the regions shared with the main genome will only map to the main genome, and thus give you inflated numbers.

              I'm not sure I completely understand your suggestion with the kmer counts. Could you elaborate please?
              You can run this:

              kmercountexact.sh in1=reads1.fq in2=reads2.fq khist=khist.txt peaks=peaks.txt

              You can plot the kmer histogram (khist) to see a visual representation of the amount of genome with various copy-counts, or look at the peaks file for a summary. The peaks file will (if the coverage distribution is unbiased) tell you exactly how much DNA is in each cell ("genome size"), the heterozygousity rate, repeat fraction, and so forth. It doesn't tell you which contigs are repeats, though, since it doesn't do assembly. For that, it's better to rely on the assembler's claimed fold coverage of each contig.
              Last edited by Brian Bushnell; 01-06-2016, 07:51 PM.

              Comment


              • #8
                Very helpful and informative, as always; thank you @Brian! I just tried kmercountexact.sh and the output is definitely useful. It's also consistent with what I got by plotting the coverage of each scaffold.

                Thanks for help everyone!

                Comment


                • #9
                  if you filter out mito contigs, mito reads from the regions shared with the main genome will only map to the main genome, and thus give you inflated numbers.
                  Do you mean filtering the mito contigs would inflate my numbers in the coverage analysis with BBMap? That makes sense. I meant I would filter them later, for my later steps which are mainly comparing the assembly to a reference.

                  Do you have a better way to filter the mitochondria data? I assumed filtering the mito reads out would risk loosing some data.

                  Comment


                  • #10
                    Originally posted by antifolate View Post
                    Do you mean filtering the mito contigs would inflate my numbers in the coverage analysis with BBMap? That makes sense. I meant I would filter them later, for my later steps which are mainly comparing the assembly to a reference.

                    Do you have a better way to filter the mitochondria data? I assumed filtering the mito reads out would risk loosing some data.
                    What I mean is, if you remove mito contigs and map all reads to the genome contigs only, some mito reads will map to genomic contigs (such as the ones with ribosomes) and inflate their coverage.

                    As for "a better way"... why do you want to filter out the mito? If you are looking for duplicate sequences, well, mito sequences ARE duplicated in large numbers. So, map to everything to generate coverage, then ignore the contigs you flagged as mitochondrial. If that's what you meant by "filter them later", then that's good.

                    Comment


                    • #11
                      @Brian I got you. I wantto filter the mitos out because, well, I just don't care about them; I want to study structural variation along different parts of the chromosomes. Finding possible duplicate scaffolds is something I thought would help me identify more SVs, but of course I want those to be accurate.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      9 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      49 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      67 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X