SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   BBMAP mapping tool (http://seqanswers.com/forums/showthread.php?t=76444)

mido1951 06-07-2017 06:40 AM

BBMAP mapping tool
 
Hello,
I wanted to know if bbmap gives me the following informations because I have not found in the tool's description.
Does BBmap only give me statistics on the mapping of reads on the reference genome?
Can it give me information if the reads cover the entire reference genome ?
thanks

HESmith 06-07-2017 07:49 AM

You can determine how much of your genome is covered, and at what read depth, using BEDTools 'genomecov' command (see here for description).

mido1951 06-07-2017 07:59 AM

so BBmap do not do that?
i must use BEDToos?
Thanks

HESmith 06-07-2017 08:05 AM

Try BBMap's 'covstats' command - I think it provides a summary of coverage.

Brian Bushnell 06-07-2017 09:44 AM

Yep - you can use "covstats=covstats.txt covhist=covhist.txt" to print the coverage information of the whole genome as well as on a per-scaffold basis, and also give a distribution of base coverage over the genome. For more information see the "Coverage output parameters" section of bbmap.sh. Once the reads are mapped, using the sam or bam file, you can do the same thing with pileup.sh.

mido1951 06-08-2017 03:04 AM

thank you for your response.
Can you explain me this output please.

Average coverage: 25,49
Standard deviation: 6,38
Percent scaffolds with any coverage: 100,00
Percent of reference bases covered: 99,85


and in covstats.txt:

Code:

#ID        Avg_fold        Length        Ref_GC        Covered_percent        Covered_bases        Plus_reads        Minus_reads        Read_GC        Median_fold        Std_Dev
gi|556503834|ref|NC_000913.3| Escherichia coli str. K-12 substr. MG1655, complete genome        26,6251        4641652        0,5079        99,8479        4634590        123276        118657        0,5042        26        6,38


GenoMax 06-08-2017 04:43 AM

Did you inspect the alignment using IGV or a genome browser of some kind? It should be reasonably clear what those numbers mean. Looks like you have a small fraction of the genome (0.15%) that does not have at least one read covering it.

Differences like this could be due to your strain being slightly different than the reference sequence.

mido1951 06-08-2017 04:50 AM

I did a mapping of the long reads on a reference genome using bbmap (with covstats option). I wanted to know if all the reference genome is covred by the long reads.

I do not mean what the different words mean: "Average coverage, Standard deviation, Percent scaffolds with any coverage, Percent of reference bases covered".??

How can I know if the whole reference genomeis covered by long reads?

Thanks.

GenoMax 06-08-2017 04:54 AM

Since the Percent of reference bases covered: 99,85 is not 100% there are at least 0.15% of bases that are not covered by one read, long or otherwise.

Brian Bushnell 06-08-2017 10:04 AM

Bacteria have circular genomes, represented in fasta files as linear with a breakpoint somewhere. Mapping (and thus coverage calculation) is less accurate at the ends due to the artificial break; BBMap will place a read spanning the break on either the left end or the right end, but not both. You might want to look at the ends to see if the uncovered bases are there.

mido1951 06-09-2017 12:44 AM

hello Brian,
Can you explain me the algorithm of BBmap please?
Thanks

Brian Bushnell 06-09-2017 10:52 AM

That would require weeks of work and is out of the scope of this forum... but I suggest you read this paper:

http://bib.irb.hr/datoteka/773712.Ig..._diplomski.pdf

mido1951 06-11-2017 09:31 AM

ok. thank you for the link.
I have one more question to the output file of BBmap.
This is an output file:
Code:

Read 1 data:            pct reads      num reads      pct bases          num bases

mapped:                  99,7746%        2365043        99,8117%        1168097245
unambiguous:            96,5758%        2289219        96,6551%        1131155596
ambiguous:                3,1988%          75824        3,1566%          36941649
low-Q discards:          0,0000%              0        0,0000%                  0

perfect best site:      36,0494%          854511        36,4545%          426626981
semiperfect site:        36,0505%          854537        36,4556%          426639736

Match Rate:                  NA              NA        99,5357%        1166772129
Error Rate:              63,8415%        1510509        0,4634%            5431521
Sub Rate:                13,2884%          314407        0,0677%            794173
Del Rate:                56,8578%        1345272        0,3512%            4117267
Ins Rate:                15,1778%          359111        0,0444%            520081
N Rate:                  0,0128%            302        0,0009%              10862

Average coverage:                      96,08
Standard deviation:                    26,28
Percent scaffolds with any coverage:    100,00
Percent of reference bases covered:    99,28

can you explain me "perfect best site" and "semiperfect site"
and
if "pct reads" is the percent of mapped reads. are there "num reads" are the number of mapped reads? because if t'is the number of mapped reads i haven't 2365043 reads?

and why the Error rate for "pct reads" is high?
thanks for your help.

Brian Bushnell 06-11-2017 12:39 PM

"perfect" means every base in the read matched the reference (no mismatches, indels, or Ns). "semiperfect" is similar but it allows Ns in the reference.

The "num reads" column in "mapped" row indicates the number of reads that were mapped.

Don't worry about the high error rate. That just indicates that 63.8% of your reads had at least one error (mismatching base, indel, etc). The more important number is the % of bases, which is only 0.46%, so the reads have roughly 99.5% identity to the reference.

mido1951 06-11-2017 02:11 PM

Quote:

Originally Posted by Brian Bushnell (Post 208305)
The "num reads" column in "mapped" row indicates the number of reads that were mapped.

Sorry but i had juste ~27000 reads not 2365043??
thank your for your response.

Brian Bushnell 06-12-2017 09:29 AM

Quote:

Originally Posted by mido1951 (Post 208306)
Sorry but i had juste ~27000 reads not 2365043??
thank your for your response.

Oh - I see the confusion. BBMap is not capable of mapping unlimited-length reads. It can only map reads up to 600bp. mapPacBio.sh is a version of BBMap that supports longer reads, up to 6000bp. Reads longer than that will be shredded into shorter pieces which are mapped and tracked individually. So it's not ideal for long read platforms for Nanopore when you need full-length alignments.


All times are GMT -8. The time now is 03:48 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.