Unconfigured Ad

**GenoMax** · 01-06-2016, 03:12 PM

@Brian can tell you if that strategy is logical but if you feel they are repeats (contained within the scaffolds) then doing old fashioned dot plots may be one option to check on.

**antifolate** · 01-06-2016, 04:02 PM

Thanks @GenoMax. I'm more looking into those repeats that the assembly missed. Like instead of assembling two different scaffolds with average coverage, it assembles one scaffold with double the average coverage.

I haven't worked with dot plots that much, except MUMmer's. Would a dot plot help me with this?

The problem I'm having with my approach is that some scaffolds have a much higher coverage than the average. One of the scaffolds has 14k coverage (the average is 79) and is 22 kbp long. It's unlikely this scaffold is supposed to be assembled 176 more times...

**HESmith** · 01-06-2016, 04:39 PM

Coverage depth is a useful metric for detecting scaffold-sized repeats. Just remember that, depending upon your sample, higher-than-average coverage can also arise from plasmids, viruses, and/or microbial (or other) contamination. It might be useful to BLAST some of the high-coverage scaffolds and see what matches.

Note that shorter repeats (≥read length or insert size) can also break scaffolds, but may be too short to assemble. An alternative approach is to estimate genome size and the amount of repetitive sequence from k-mer counts. Consistency with your scaffold data would provide more confidence that your analysis is correct.

**antifolate** · 01-06-2016, 07:05 PM

I just blasted the scaffolds with the highest coverage depth and two of them turned out to be mitochondrial. I guess I'll simply filter those two out. Thanks!

I'm not sure I completely understand your suggestion with the kmer counts. Could you elaborate please?

**Brian Bushnell** · 01-06-2016, 07:29 PM

Originally posted by antifolate View Post

Hello,

I used BBmap to find the coverage of a draft genome I have with this command:

Code:

bbmap.sh in1=reads1.fq in2=reads2.fq ref=scaffolds.fasta covstats=covstats.txt

Now I'd like to use the coverage on each scaffold (reported in the output covstats.txt) to identify scaffolds that might be repeats.

The way I want to do this is to look at the average coverage on all scaffolds (which I get from the stdout of BBMap), for example 70x, and see which scaffolds have double that coverage, 140x, or triple, 210x, and so on, implying those scaffolds are repeated once and twice, respectively. Do you think this is a reasonable approach to determine repeat scaffolds from an assembly?

Please let me know what you think.

As long as your coverage is fairly unbiased (ideally, unamplified), this is a good strategy.

The ability to accurately determine the coverage of a contig depends on its length, though - it won't be very accurate with 500bp contigs, while 10kb+ contigs should be quite accurate. Sometimes operating at the kmer level can give more accurate coverage for very short contigs; some assemblers put the kmer-based coverage in the contig name (e.g. "node=55,len=752,cov=23.4"). If you assemble from normalized data that number is not useful, though.

I haven't worked with dot plots that much, except MUMmer's. Would a dot plot help me with this?

JGI uses MUMmer all the time, particularly for identifying repeats in PacBio assemblies (they have the potential to produce very long repeats, which are usually correct, but not always) and it is very effective. But that's only if the repeat was actually assembled in multiple copies. Kmer-based assemblers will generally collapse repeats that are longer than the read length (or insert size, or largest kmer) into a single copy. Some more advanced assemblers (like Spades) may try to analyze the graph and make multiple copies of repeats, but sometimes that goes horribly wrong (like when assembling viruses). So MUMmer will not usually help you find repeats when aligning an Illumina draft assembly to itself, though it is useful when aligning a PacBio assembly to itself or a finished reference to itself. Or a draft to a very-closely-related finished assembly.

The problem I'm having with my approach is that some scaffolds have a much higher coverage than the average. One of the scaffolds has 14k coverage (the average is 79) and is 22 kbp long. It's unlikely this scaffold is supposed to be assembled 176 more times...

Try BLASTing it - I bet it's a mitochondria

Or more precisely, part of a mito. The whole thing is probably bigger. If you are working on a prokaryote, then as HESmith suggests, a plasmid/virus/other contaminant is a possibility... though I've never seen a plasmid with such a high copy-count, and it's hard to get 22kbp contigs from wild (and sometimes even cultured) viruses without a lot of effort. If you are working with a eukaryote, it could also be another organelle such as a chloroplast, but a 176:1 ratio and 22kbp length matches many fungal mitochondrial contigs I've assembled. The length of the whole mitochondria is normally more like 50kbp-90kbp.

P.S. Rats, I got ninja'd

**Brian Bushnell** · 01-06-2016, 07:41 PM

Originally posted by antifolate View Post

I just blasted the scaffolds with the highest coverage depth and two of them turned out to be mitochondrial. I guess I'll simply filter those two out. Thanks!

It's very useful for assembly releases to contain the mitochondria, especially if they are annotated as mitochondria. That is, if you plan to submit the assembly to NCBI or some other database. Even if you don't, it's useful to isolate the mitochondria for your own purposes. It's worth noting that mitochondria often share part of the host ribosomal DNA (which is one reason they don't assemble into a single contig, since that's a repeat). Presumably there is at least one other gene shared with the host, which is why they tend to form at least two contigs in a DeBruijn graph. The point being that if you filter out mito contigs, mito reads from the regions shared with the main genome will only map to the main genome, and thus give you inflated numbers.

I'm not sure I completely understand your suggestion with the kmer counts. Could you elaborate please?

You can run this:

kmercountexact.sh in1=reads1.fq in2=reads2.fq khist=khist.txt peaks=peaks.txt

You can plot the kmer histogram (khist) to see a visual representation of the amount of genome with various copy-counts, or look at the peaks file for a summary. The peaks file will (if the coverage distribution is unbiased) tell you exactly how much DNA is in each cell ("genome size"), the heterozygousity rate, repeat fraction, and so forth. It doesn't tell you which contigs are repeats, though, since it doesn't do assembly. For that, it's better to rely on the assembler's claimed fold coverage of each contig.

**antifolate** · 01-06-2016, 09:01 PM

Very helpful and informative, as always; thank you @Brian! I just tried kmercountexact.sh and the output is definitely useful. It's also consistent with what I got by plotting the coverage of each scaffold.

Thanks for help everyone!

**antifolate** · 01-06-2016, 09:19 PM

if you filter out mito contigs, mito reads from the regions shared with the main genome will only map to the main genome, and thus give you inflated numbers.

Do you mean filtering the mito contigs would inflate my numbers in the coverage analysis with BBMap? That makes sense. I meant I would filter them later, for my later steps which are mainly comparing the assembly to a reference.

Do you have a better way to filter the mitochondria data? I assumed filtering the mito reads out would risk loosing some data.

**Brian Bushnell** · 01-06-2016, 09:43 PM

Originally posted by antifolate View Post

Do you mean filtering the mito contigs would inflate my numbers in the coverage analysis with BBMap? That makes sense. I meant I would filter them later, for my later steps which are mainly comparing the assembly to a reference.

Do you have a better way to filter the mitochondria data? I assumed filtering the mito reads out would risk loosing some data.

What I mean is, if you remove mito contigs and map all reads to the genome contigs only, some mito reads will map to genomic contigs (such as the ones with ribosomes) and inflate their coverage.

As for "a better way"... why do you want to filter out the mito? If you are looking for duplicate sequences, well, mito sequences ARE duplicated in large numbers. So, map to everything to generate coverage, then ignore the contigs you flagged as mitochondrial. If that's what you meant by "filter them later", then that's good.

**antifolate** · 01-06-2016, 11:14 PM

@Brian I got you. I wantto filter the mitos out because, well, I just don't care about them; I want to study structural variation along different parts of the chromosomes. Finding possible duplicate scaffolds is something I thought would help me identify more SVs, but of course I want those to be accurate.

Topics	Statistics	Last Post
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, Yesterday, 12:03 PM	0 responses 19 views 0 reactions	Last Post by SEQadmin2 Yesterday, 12:03 PM
DNA Methylation Study Reveals How Epigenetic Changes Pass Between Generations by SEQadmin2 Started by SEQadmin2, Yesterday, 11:40 AM	0 responses 14 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:40 AM
MetaBeeAI Helps Scientists Process Research Literature Faster by SEQadmin2 Started by SEQadmin2, 05-28-2026, 11:40 AM	0 responses 29 views 0 reactions	Last Post by SEQadmin2 05-28-2026, 11:40 AM
Scientists Solve a 25-Year Mystery in RNA Interference by SEQadmin2 Started by SEQadmin2, 05-26-2026, 10:12 AM	0 responses 31 views 0 reactions	Last Post by SEQadmin2 05-26-2026, 10:12 AM

Unconfigured Ad

Using BBMap to determine possible repeat scaffolds/contigs

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News