Hello,
I used BBMap to find the coverage of a draft genome I have with this command:
Code:
bbmap.sh in1=reads1.fq in2=reads2.fq ref=scaffolds.fasta covstats=covstats.txt
Now I'd like to use the coverage on each scaffold (reported in the output covstats.txt) to identify scaffolds that might be repeats.
The way I want to do this is to take the average coverage across all scaffolds (which I get from the stdout of BBMap), for example 70x, and see which scaffolds have double that coverage (140x), triple (210x), and so on, implying those scaffolds are present in two and three copies, respectively. Do you think this is a reasonable approach to determine repeat scaffolds from an assembly?
Please let me know what you think.
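For concreteness, here is a rough sketch of the thresholding described above. It assumes covstats.txt is tab-separated with a header line containing "#ID", "Avg_fold" and "Length" columns, which is worth checking against your own file. The function names and the demo table are made up for illustration, and rounding the coverage ratio to the nearest integer is just one possible way to call a copy number.

```python
import csv
import io


def read_covstats(handle):
    """Yield (scaffold_id, avg_fold, length) from a covstats-style table.

    Assumes tab-separated text with '#ID', 'Avg_fold' and 'Length' columns.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        yield row["#ID"], float(row["Avg_fold"]), int(row["Length"])


def flag_repeats(records, baseline=None, min_copies=2):
    """Return (scaffold_id, avg_fold, estimated_copies) for candidate repeats.

    baseline is the single-copy coverage (e.g. the 70x average from stdout).
    If omitted, the length-weighted mean of the per-scaffold averages is used.
    """
    records = list(records)
    if not records:
        raise ValueError("no scaffolds found")
    if baseline is None:
        total_len = sum(length for _, _, length in records)
        baseline = sum(cov * length for _, cov, length in records) / total_len
    flagged = []
    for name, cov, _length in records:
        copies = round(cov / baseline)  # nearest whole multiple of the baseline
        if copies >= min_copies:
            flagged.append((name, cov, copies))
    return flagged


# Hypothetical demo table; in practice, pass open("covstats.txt") instead.
DEMO = (
    "#ID\tAvg_fold\tLength\n"
    "scaffold_1\t70.0\t1000\n"
    "scaffold_2\t145.0\t500\n"
    "scaffold_3\t205.0\t200\n"
    "scaffold_4\t68.0\t2000\n"
)

for name, cov, copies in flag_repeats(read_covstats(io.StringIO(DEMO)), baseline=70.0):
    print(f"{name}\t{cov:.1f}x\t~{copies} copies")
```

Note that a high ratio on its own does not say why a scaffold is over-represented, so anything flagged this way still needs a closer look.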
Or more precisely, part of a mitochondrial genome. The whole thing is probably bigger. If you are working on a prokaryote, then as HESmith suggests, a plasmid, virus, or other contaminant is a possibility... though I've never seen a plasmid with such a high copy count, and it's hard to get 22 kbp contigs from wild (and sometimes even cultured) viruses without a lot of effort. If you are working with a eukaryote, it could also be another organelle such as a chloroplast, but a 176:1 coverage ratio and a 22 kbp length match many fungal mitochondrial contigs I've assembled. A whole mitochondrial genome is normally more like 50-90 kbp.