Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • antifolate
    Member
    • Aug 2015
    • 52

    Using BBMap to determine possible repeat scaffolds/contigs

    Hello,

    I used BBmap to find the coverage of a draft genome I have with this command:
    Code:
    bbmap.sh in1=reads1.fq in2=reads2.fq ref=scaffolds.fasta covstats=covstats.txt
    Now I'd like to use the coverage on each scaffold (reported in the output covstats.txt) to identify scaffolds that might be repeats.

    The way I want to do this is to look at the average coverage on all scaffolds (which I get from the stdout of BBMap), for example 70x, and see which scaffolds have double that coverage, 140x, or triple, 210x, and so on, implying those scaffolds are repeated once and twice, respectively. Do you think this is a reasonable approach to determine repeat scaffolds from an assembly?

    Please let me know what you think.
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    @Brian can tell you if that strategy is logical but if you feel they are repeats (contained within the scaffolds) then doing old fashioned dot plots may be one option to check on.

    Comment

    • antifolate
      Member
      • Aug 2015
      • 52

      #3
      Thanks @GenoMax. I'm more looking into those repeats that the assembly missed. Like instead of assembling two different scaffolds with average coverage, it assembles one scaffold with double the average coverage.

      I haven't worked with dot plots that much, except MUMmer's. Would a dot plot help me with this?

      The problem I'm having with my approach is that some scaffolds have a much higher coverage than the average. One of the scaffolds has 14k coverage (the average is 79) and is 22 kbp long. It's unlikely this scaffold is supposed to be assembled 176 more times...

      Comment

      • HESmith
        Senior Member
        • Oct 2009
        • 512

        #4
        Coverage depth is a useful metric for detecting scaffold-sized repeats. Just remember that, depending upon your sample, higher-than-average coverage can also arise from plasmids, viruses, and/or microbial (or other) contamination. It might be useful to BLAST some of the high-coverage scaffolds and see what matches.

        Note that shorter repeats (≥read length or insert size) can also break scaffolds, but may be too short to assemble. An alternative approach is to estimate genome size and the amount of repetitive sequence from k-mer counts. Consistency with your scaffold data would provide more confidence that your analysis is correct.

        Comment

        • antifolate
          Member
          • Aug 2015
          • 52

          #5
          I just blasted the scaffolds with the highest coverage depth and two of them turned out to be mitochondrial. I guess I'll simply filter those two out. Thanks!

          I'm not sure I completely understand your suggestion with the kmer counts. Could you elaborate please?

          Comment

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            Originally posted by antifolate View Post
            Hello,

            I used BBmap to find the coverage of a draft genome I have with this command:
            Code:
            bbmap.sh in1=reads1.fq in2=reads2.fq ref=scaffolds.fasta covstats=covstats.txt
            Now I'd like to use the coverage on each scaffold (reported in the output covstats.txt) to identify scaffolds that might be repeats.

            The way I want to do this is to look at the average coverage on all scaffolds (which I get from the stdout of BBMap), for example 70x, and see which scaffolds have double that coverage, 140x, or triple, 210x, and so on, implying those scaffolds are repeated once and twice, respectively. Do you think this is a reasonable approach to determine repeat scaffolds from an assembly?

            Please let me know what you think.
            As long as your coverage is fairly unbiased (ideally, unamplified), this is a good strategy.

            The ability to accurately determine the coverage of a contig depends on its length, though - it won't be very accurate with 500bp contigs, while 10kb+ contigs should be quite accurate. Sometimes operating at the kmer level can give more accurate coverage for very short contigs; some assemblers put the kmer-based coverage in the contig name (e.g. "node=55,len=752,cov=23.4"). If you assemble from normalized data that number is not useful, though.

            I haven't worked with dot plots that much, except MUMmer's. Would a dot plot help me with this?
            JGI uses MUMmer all the time, particularly for identifying repeats in PacBio assemblies (they have the potential to produce very long repeats, which are usually correct, but not always) and it is very effective. But that's only if the repeat was actually assembled in multiple copies. Kmer-based assemblers will generally collapse repeats that are longer than the read length (or insert size, or largest kmer) into a single copy. Some more advanced assemblers (like Spades) may try to analyze the graph and make multiple copies of repeats, but sometimes that goes horribly wrong (like when assembling viruses). So MUMmer will not usually help you find repeats when aligning an Illumina draft assembly to itself, though it is useful when aligning a PacBio assembly to itself or a finished reference to itself. Or a draft to a very-closely-related finished assembly.

            The problem I'm having with my approach is that some scaffolds have a much higher coverage than the average. One of the scaffolds has 14k coverage (the average is 79) and is 22 kbp long. It's unlikely this scaffold is supposed to be assembled 176 more times...
            Try BLASTing it - I bet it's a mitochondria Or more precisely, part of a mito. The whole thing is probably bigger. If you are working on a prokaryote, then as HESmith suggests, a plasmid/virus/other contaminant is a possibility... though I've never seen a plasmid with such a high copy-count, and it's hard to get 22kbp contigs from wild (and sometimes even cultured) viruses without a lot of effort. If you are working with a eukaryote, it could also be another organelle such as a chloroplast, but a 176:1 ratio and 22kbp length matches many fungal mitochondrial contigs I've assembled. The length of the whole mitochondria is normally more like 50kbp-90kbp.

            P.S. Rats, I got ninja'd
            Last edited by Brian Bushnell; 01-06-2016, 07:32 PM.

            Comment

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              Originally posted by antifolate View Post
              I just blasted the scaffolds with the highest coverage depth and two of them turned out to be mitochondrial. I guess I'll simply filter those two out. Thanks!
              It's very useful for assembly releases to contain the mitochondria, especially if they are annotated as mitochondria. That is, if you plan to submit the assembly to NCBI or some other database. Even if you don't, it's useful to isolate the mitochondria for your own purposes. It's worth noting that mitochondria often share part of the host ribosomal DNA (which is one reason they don't assemble into a single contig, since that's a repeat). Presumably there is at least one other gene shared with the host, which is why they tend to form at least two contigs in a DeBruijn graph. The point being that if you filter out mito contigs, mito reads from the regions shared with the main genome will only map to the main genome, and thus give you inflated numbers.

              I'm not sure I completely understand your suggestion with the kmer counts. Could you elaborate please?
              You can run this:

              kmercountexact.sh in1=reads1.fq in2=reads2.fq khist=khist.txt peaks=peaks.txt

              You can plot the kmer histogram (khist) to see a visual representation of the amount of genome with various copy-counts, or look at the peaks file for a summary. The peaks file will (if the coverage distribution is unbiased) tell you exactly how much DNA is in each cell ("genome size"), the heterozygousity rate, repeat fraction, and so forth. It doesn't tell you which contigs are repeats, though, since it doesn't do assembly. For that, it's better to rely on the assembler's claimed fold coverage of each contig.
              Last edited by Brian Bushnell; 01-06-2016, 07:51 PM.

              Comment

              • antifolate
                Member
                • Aug 2015
                • 52

                #8
                Very helpful and informative, as always; thank you @Brian! I just tried kmercountexact.sh and the output is definitely useful. It's also consistent with what I got by plotting the coverage of each scaffold.

                Thanks for help everyone!

                Comment

                • antifolate
                  Member
                  • Aug 2015
                  • 52

                  #9
                  if you filter out mito contigs, mito reads from the regions shared with the main genome will only map to the main genome, and thus give you inflated numbers.
                  Do you mean filtering the mito contigs would inflate my numbers in the coverage analysis with BBMap? That makes sense. I meant I would filter them later, for my later steps which are mainly comparing the assembly to a reference.

                  Do you have a better way to filter the mitochondria data? I assumed filtering the mito reads out would risk loosing some data.

                  Comment

                  • Brian Bushnell
                    Super Moderator
                    • Jan 2014
                    • 2709

                    #10
                    Originally posted by antifolate View Post
                    Do you mean filtering the mito contigs would inflate my numbers in the coverage analysis with BBMap? That makes sense. I meant I would filter them later, for my later steps which are mainly comparing the assembly to a reference.

                    Do you have a better way to filter the mitochondria data? I assumed filtering the mito reads out would risk loosing some data.
                    What I mean is, if you remove mito contigs and map all reads to the genome contigs only, some mito reads will map to genomic contigs (such as the ones with ribosomes) and inflate their coverage.

                    As for "a better way"... why do you want to filter out the mito? If you are looking for duplicate sequences, well, mito sequences ARE duplicated in large numbers. So, map to everything to generate coverage, then ignore the contigs you flagged as mitochondrial. If that's what you meant by "filter them later", then that's good.

                    Comment

                    • antifolate
                      Member
                      • Aug 2015
                      • 52

                      #11
                      @Brian I got you. I wantto filter the mitos out because, well, I just don't care about them; I want to study structural variation along different parts of the chromosomes. Finding possible duplicate scaffolds is something I thought would help me identify more SVs, but of course I want those to be accurate.

                      Comment

                      Latest Articles

                      Collapse

                      • SEQadmin2
                        From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                        by SEQadmin2


                        Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                        The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                        ...
                        Yesterday, 10:05 AM
                      • SEQadmin2
                        Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                        by SEQadmin2


                        With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                        Introduction

                        Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                        05-22-2026, 06:42 AM
                      • SEQadmin2
                        Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                        by SEQadmin2

                        Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                        Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                        05-06-2026, 09:04 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by SEQadmin2, Yesterday, 12:03 PM
                      0 responses
                      19 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, Yesterday, 11:40 AM
                      0 responses
                      14 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 05-28-2026, 11:40 AM
                      0 responses
                      29 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 05-26-2026, 10:12 AM
                      0 responses
                      31 views
                      0 reactions
                      Last Post SEQadmin2  
                      Working...