SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
mask contigs in scaffolds by length Wallysb01 Bioinformatics 0 04-20-2013 08:36 AM
break scaffolds with many Ns to contigs yzzhang Bioinformatics 2 02-01-2013 07:50 AM
Contigs Vs Scaffolds for Assembly Analysis narain Bioinformatics 5 10-14-2011 07:15 AM
How can I determine order of scaffolds from de novo sequencing? odysseus Bioinformatics 0 03-21-2010 08:05 PM
How can I determine order of scaffolds from de novo sequencing? odysseus Introductions 0 03-21-2010 04:45 AM

Reply
 
Thread Tools
Old 01-06-2016, 10:53 AM   #1
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default Using BBMap to determine possible repeat scaffolds/contigs

Hello,

I used BBmap to find the coverage of a draft genome I have with this command:
Code:
bbmap.sh in1=reads1.fq in2=reads2.fq ref=scaffolds.fasta covstats=covstats.txt
Now I'd like to use the coverage on each scaffold (reported in the output covstats.txt) to identify scaffolds that might be repeats.

The way I want to do this is to look at the average coverage on all scaffolds (which I get from the stdout of BBMap), for example 70x, and see which scaffolds have double that coverage, 140x, or triple, 210x, and so on, implying those scaffolds are repeated once and twice, respectively. Do you think this is a reasonable approach to determine repeat scaffolds from an assembly?

Please let me know what you think.
antifolate is offline   Reply With Quote
Old 01-06-2016, 02:12 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

@Brian can tell you if that strategy is logical but if you feel they are repeats (contained within the scaffolds) then doing old fashioned dot plots may be one option to check on.
GenoMax is offline   Reply With Quote
Old 01-06-2016, 03:02 PM   #3
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default

Thanks @GenoMax. I'm more looking into those repeats that the assembly missed. Like instead of assembling two different scaffolds with average coverage, it assembles one scaffold with double the average coverage.

I haven't worked with dot plots that much, except MUMmer's. Would a dot plot help me with this?

The problem I'm having with my approach is that some scaffolds have a much higher coverage than the average. One of the scaffolds has 14k coverage (the average is 79) and is 22 kbp long. It's unlikely this scaffold is supposed to be assembled 176 more times...
antifolate is offline   Reply With Quote
Old 01-06-2016, 03:39 PM   #4
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 498
Default

Coverage depth is a useful metric for detecting scaffold-sized repeats. Just remember that, depending upon your sample, higher-than-average coverage can also arise from plasmids, viruses, and/or microbial (or other) contamination. It might be useful to BLAST some of the high-coverage scaffolds and see what matches.

Note that shorter repeats (≥read length or insert size) can also break scaffolds, but may be too short to assemble. An alternative approach is to estimate genome size and the amount of repetitive sequence from k-mer counts. Consistency with your scaffold data would provide more confidence that your analysis is correct.
HESmith is offline   Reply With Quote
Old 01-06-2016, 06:05 PM   #5
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default

I just blasted the scaffolds with the highest coverage depth and two of them turned out to be mitochondrial. I guess I'll simply filter those two out. Thanks!

I'm not sure I completely understand your suggestion with the kmer counts. Could you elaborate please?
antifolate is offline   Reply With Quote
Old 01-06-2016, 06:29 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by antifolate View Post
Hello,

I used BBmap to find the coverage of a draft genome I have with this command:
Code:
bbmap.sh in1=reads1.fq in2=reads2.fq ref=scaffolds.fasta covstats=covstats.txt
Now I'd like to use the coverage on each scaffold (reported in the output covstats.txt) to identify scaffolds that might be repeats.

The way I want to do this is to look at the average coverage on all scaffolds (which I get from the stdout of BBMap), for example 70x, and see which scaffolds have double that coverage, 140x, or triple, 210x, and so on, implying those scaffolds are repeated once and twice, respectively. Do you think this is a reasonable approach to determine repeat scaffolds from an assembly?

Please let me know what you think.
As long as your coverage is fairly unbiased (ideally, unamplified), this is a good strategy.

The ability to accurately determine the coverage of a contig depends on its length, though - it won't be very accurate with 500bp contigs, while 10kb+ contigs should be quite accurate. Sometimes operating at the kmer level can give more accurate coverage for very short contigs; some assemblers put the kmer-based coverage in the contig name (e.g. "node=55,len=752,cov=23.4"). If you assemble from normalized data that number is not useful, though.

Quote:
I haven't worked with dot plots that much, except MUMmer's. Would a dot plot help me with this?
JGI uses MUMmer all the time, particularly for identifying repeats in PacBio assemblies (they have the potential to produce very long repeats, which are usually correct, but not always) and it is very effective. But that's only if the repeat was actually assembled in multiple copies. Kmer-based assemblers will generally collapse repeats that are longer than the read length (or insert size, or largest kmer) into a single copy. Some more advanced assemblers (like Spades) may try to analyze the graph and make multiple copies of repeats, but sometimes that goes horribly wrong (like when assembling viruses). So MUMmer will not usually help you find repeats when aligning an Illumina draft assembly to itself, though it is useful when aligning a PacBio assembly to itself or a finished reference to itself. Or a draft to a very-closely-related finished assembly.

Quote:
The problem I'm having with my approach is that some scaffolds have a much higher coverage than the average. One of the scaffolds has 14k coverage (the average is 79) and is 22 kbp long. It's unlikely this scaffold is supposed to be assembled 176 more times...
Try BLASTing it - I bet it's a mitochondria Or more precisely, part of a mito. The whole thing is probably bigger. If you are working on a prokaryote, then as HESmith suggests, a plasmid/virus/other contaminant is a possibility... though I've never seen a plasmid with such a high copy-count, and it's hard to get 22kbp contigs from wild (and sometimes even cultured) viruses without a lot of effort. If you are working with a eukaryote, it could also be another organelle such as a chloroplast, but a 176:1 ratio and 22kbp length matches many fungal mitochondrial contigs I've assembled. The length of the whole mitochondria is normally more like 50kbp-90kbp.

P.S. Rats, I got ninja'd

Last edited by Brian Bushnell; 01-06-2016 at 06:32 PM.
Brian Bushnell is offline   Reply With Quote
Old 01-06-2016, 06:41 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by antifolate View Post
I just blasted the scaffolds with the highest coverage depth and two of them turned out to be mitochondrial. I guess I'll simply filter those two out. Thanks!
It's very useful for assembly releases to contain the mitochondria, especially if they are annotated as mitochondria. That is, if you plan to submit the assembly to NCBI or some other database. Even if you don't, it's useful to isolate the mitochondria for your own purposes. It's worth noting that mitochondria often share part of the host ribosomal DNA (which is one reason they don't assemble into a single contig, since that's a repeat). Presumably there is at least one other gene shared with the host, which is why they tend to form at least two contigs in a DeBruijn graph. The point being that if you filter out mito contigs, mito reads from the regions shared with the main genome will only map to the main genome, and thus give you inflated numbers.

Quote:
I'm not sure I completely understand your suggestion with the kmer counts. Could you elaborate please?
You can run this:

kmercountexact.sh in1=reads1.fq in2=reads2.fq khist=khist.txt peaks=peaks.txt

You can plot the kmer histogram (khist) to see a visual representation of the amount of genome with various copy-counts, or look at the peaks file for a summary. The peaks file will (if the coverage distribution is unbiased) tell you exactly how much DNA is in each cell ("genome size"), the heterozygousity rate, repeat fraction, and so forth. It doesn't tell you which contigs are repeats, though, since it doesn't do assembly. For that, it's better to rely on the assembler's claimed fold coverage of each contig.

Last edited by Brian Bushnell; 01-06-2016 at 06:51 PM.
Brian Bushnell is offline   Reply With Quote
Old 01-06-2016, 08:01 PM   #8
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default

Very helpful and informative, as always; thank you @Brian! I just tried kmercountexact.sh and the output is definitely useful. It's also consistent with what I got by plotting the coverage of each scaffold.

Thanks for help everyone!
antifolate is offline   Reply With Quote
Old 01-06-2016, 08:19 PM   #9
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default

Quote:
if you filter out mito contigs, mito reads from the regions shared with the main genome will only map to the main genome, and thus give you inflated numbers.
Do you mean filtering the mito contigs would inflate my numbers in the coverage analysis with BBMap? That makes sense. I meant I would filter them later, for my later steps which are mainly comparing the assembly to a reference.

Do you have a better way to filter the mitochondria data? I assumed filtering the mito reads out would risk loosing some data.
antifolate is offline   Reply With Quote
Old 01-06-2016, 08:43 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by antifolate View Post
Do you mean filtering the mito contigs would inflate my numbers in the coverage analysis with BBMap? That makes sense. I meant I would filter them later, for my later steps which are mainly comparing the assembly to a reference.

Do you have a better way to filter the mitochondria data? I assumed filtering the mito reads out would risk loosing some data.
What I mean is, if you remove mito contigs and map all reads to the genome contigs only, some mito reads will map to genomic contigs (such as the ones with ribosomes) and inflate their coverage.

As for "a better way"... why do you want to filter out the mito? If you are looking for duplicate sequences, well, mito sequences ARE duplicated in large numbers. So, map to everything to generate coverage, then ignore the contigs you flagged as mitochondrial. If that's what you meant by "filter them later", then that's good.
Brian Bushnell is offline   Reply With Quote
Old 01-06-2016, 10:14 PM   #11
antifolate
Member
 
Location: Chesterfield, MO

Join Date: Aug 2015
Posts: 52
Default

@Brian I got you. I wantto filter the mitos out because, well, I just don't care about them; I want to study structural variation along different parts of the chromosomes. Finding possible duplicate scaffolds is something I thought would help me identify more SVs, but of course I want those to be accurate.
antifolate is offline   Reply With Quote
Reply

Tags
assembly, bbmap, coverage, repeats

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:00 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO