Seqanswers Leaderboard Ad

**Brian Bushnell** · 11-16-2015, 11:12 AM

Originally posted by sk8bro View Post

You wrote: "But it should be "unmapped+mapped1+mapped2=used"

I just wanted to confirm that it is okay that the numbers also don't fit this formula.

611706+374+844 = 612924 != 613590

Yes, I know

It only works for single-ended reads.

Then, if I wanted to report the mapping rate, is this the correct formula?

mapped = used-unmapped/used

Depends on what you want to report, but yes, the number of reads that get sent to outm are "used-unmapped". The number of reads that actually mapped (meaning they match the reference) is mapped1+mapped2, which is slightly different (for paired reads).

Is there a way to export to outm only if both pairs map? Default I think is to export if either pair maps?

Yes, you can use the "pairedonly" flag, which will a pair mapped only if both reads map, and in the correct orientation (same contig, opposite strands, within the maximum insert size).

**sk8bro** · 11-16-2015, 11:43 AM

That all sounds great

Have you ever compared to SNAP in terms of human read removal?

**Brian Bushnell** · 11-16-2015, 11:59 AM

SNAP is faster than BBMap, but much less sensitive. I would never use it for most mapping tasks, but for contamination removal, it might be acceptable because you want fairly low sensitivity anyway, to avoid false positives. A more fair comparison in that case might be BBDuk vs SNAP, as they both use long kmers, and would probably have similar speed and sensitivity.

But no, I have not specifically looked at SNAP for contamination removal.

**sk8bro** · 11-17-2015, 10:13 AM

Thank you for the explanation and your extensive suite of tools.

**sk8bro** · 11-19-2015, 08:46 AM

Hi Brian, I was wondering about this... If bam/bai is what I want at the end of the day, do you still recommend fastq output and then conversion?

"BBSplit is recommended for fastq and fasta output, not for sam/bam output"

It seems that sam output with this option would give me what I'm after without any extra work:

bs=<file> Write a shell script to 'file' that will turn the sam output into a sorted, indexed bam file.

**Brian Bushnell** · 11-19-2015, 09:35 AM

BBSplit can output sam files, and my original goal was to have it do so, for convenience. And it works fine for reads that only map to one of the references. But unfortunately, with "ambig2=all", reads that map to multiple references will be printed to each sam file with the same mapping information. In other words, if a read mapped to both e.coli and human, it would be printed in both the e.coli and human outputs (which is correct), but in both cases it would be reported as mapping to the same organism (e.g. in both files it would be listed as mapping to e.coli). Which is probably not what you want.

With "ambiguous2=toss" or "best" this is not a problem, but with "all" or "split" it is. And it turns out this is rather tricky to fix, so I haven't allocated time to do it. Since my use-cases (and those of JGI in general) use BBSplit to produce fastq output anyway, it seems low-priority. Obviously, with fastq, this is not an issue because fastq output does not retain the mapping information.

So, the easiest way to ensure that people get correct results is to only use BBSplit for fasta/fastq output. Then if you want sam/bam, map the individual output files to their individual genome.

Sorry for the inconvenience!

**sk8bro** · 12-18-2015, 12:30 PM

Hi Brian, a few more questions.

Can BBsplit output the unmapped reads? It didn't look in the description like it could, so I have been doing a parallel BBmap to the concatenation of each of BBsplit's reference fastas to get the unmapped output. However, even with what I think are the same parameters specified to each program, I get slightly different results when I count up how many reads map in each case. Here is a comparison, and you can see that a few samples are a few read pairs off.

BBmap BBsplit
72 72
46 46
408 408
37800 37800
281614 281614
6854 6854
114028 114034
40788 40788
6508 6508
193476 193480

Here are the 2 commands with options

bbsplit.sh
k=8
in=BBmap/Sample_1P.fastq
in2=BBmap/Sample_2P.fastq
ref="$amplicon_dir"
ambiguous2=split
basename=BBsplit/Sample_%_#P.fastq
refstats=Amplicon_counts_Sample.txt
minid=0.95
idfilter=0.95
vslow
maxindel=100
nzo=f
pairedonly

bbmap.sh
k=8
minid=0.95
idfilter=0.95
maxindel=100
vslow
minhits=1
ambiguous=best
path="$agg_file"
pairedonly
build=1
printunmappedcount
-Xmx23g
in=BBmap/Sample_#P.fastq
outm=BBmap/Sample.mapped.sam
outu=BBmap/Sample.unmapped.sam

Also, in investigating the unmapped reads, it is not clear to me why some read pairs are unmapped. While I realize I have required 95% identity for both reads, I find many pairs that satisfy this in the unmapped output.

Thanks!

**Brian Bushnell** · 12-18-2015, 04:55 PM

BBSplit supports the "outu" flag, so yes, it can capture unmapped reads. The number of reads mapping can vary slightly when mapping paired reads multithreaded, particularly with the "pairedonly" flag or a high minid/idfilter, from run to run; this is because the pair distance is calculated dynamically on a per-thread basis, and how close a pair distance is to the average influences the mapping score, which determines whether or not the reads are mapped (or paired). When multithreaded, which reads are processed by which thread is nondeterministic, so the exact value of the average pair distance will be slightly different. I assume this is causing the discrepancy.

You have set both a 95% identity cutoff and "pairedonly", meaning both reads must be at least 95% identity and they must be in a properly-paired orientation. Perhaps you can map the unmapped reads with BBMap using default settings, and add the flag "idtag" which will print the calculated percent identity in the sam output? I would be interested in seeing if there are indeed pairs, in a proper orientation, with both at least 95% identity as calculated by BBMap.

**sk8bro** · 12-18-2015, 10:53 PM

I am trying that now.. but I changed BBsplit command by reducing indel size to 20, removing vslow (what does that do anyway?), and minid is default (idfilter is strong than minid right?).

Still running, but I notice slightly higher mapping rates with these changes... can you explain?

Thank you for your help.

**Brian Bushnell** · 12-19-2015, 06:08 PM

vslow increases sensitivity, by not discarding very common kmers. This is not very important with default K=13, but becomes more important with short kmers, since there are fewer of them.

minid is loose, and mainly affects speed. "minid=90", for example, does not ensure that there will be no hits with under 90% identity; rather, it ensures that any hit as low as 90% identity, and possibly lower, will be reported. Idfilter is strict; "idfilter=X" ensures that any alignment with ID under X will be terminated.

**sk8bro** · 12-19-2015, 10:25 PM

Okay, so I should probably keep vslow in there, since I use 8mers. Here are some examples of default BBmap mapping of my unmapped by BBsplit reads:

HWI-D00294:202:C7FK3ANXX:6:2213:12108:58098 1:N:0:AAGAGGCACTCTCTAT 97 XXX, complete genome 743 1X117= XXX, complete genome 274 0 XXX [email protected]>F0C=C.FG=FGEEEB.@@BBBGBGG NM:i:1 AM:i:43 YI:f:99.15

HWI-D00294:202:C7FK3ANXX:6:2213:12108:58098 1:N:0:AAGAGGCACTCTCTAT 145 XXX, complete genome 274 3 107=1X7= XXX, complete genome 7 0 XXX 8C

DGGG<BBGGGEGEGGBGD>=GGBDGGGGGCGDEBDEGDFGD>GFEGGGGGGGDDGGDGC1GDGGFFGGEGGGGGAGGGGDG<BGEEGG@FGEGFDF1GGFC>?;GGBA>@A XT:A:R NM:i:1 AM:i:3 YI:f:99.13

HWI-D00294:202:C7FK3ANXX:6:2212:12674:82633 1:N:0:AAGAGGCACTCTCTAT 99 XXX, complete genome 140 6=1X35=1X59=1X21= = 264 388 XXX BBB@@<FGGGGGEGGDGGGGGGGGGGGFFFGGGGGGEG1FGGGGGGGGGGGGGGGGGGGGGGBG;:EFGGFGGGGGF>EFGGGGDFFCCCCFGGGGGG8A@FCGGCGGGBGGGG=@GGGGGGGG NM:i:3 AM:i:40 YI:f:97.58

HWI-D00294:202:C7FK3ANXX:6:2212:12674:82633 1:N:0:AAGAGGCACTCTCTAT 147 XXX, complete genome 264 3 53=1X33=1X22=1X14= = 1 -388 XXX DGG:GE:.<GCDGGGGGDDDBGGGGDEGGGGGCGGFBGGGGGGGGGFGGGDGGGGCGGGGGGGGGGGGGGEGGGGFGGGFF:EGGGB/GGGGGGGGGGGFEFF1FFDGGGGGGGGGGGGGBBBBB XT:A:R NM:i:3AM:i:3 YI:f:97.60

The second pair, seems proper so I don't understand how it is unmapped. The first pair, I think the problem may be that my reference is actually a concatenation of sub-references (fasta file with multiple > entries). If Read1 aligns best to sub-ref1, while Read2 aligns best to sub-ref2, is the insert size too big and therefore not proper? Is there a way to get around this behavior to keep the best read alignment within acceptable insert size?

Thanks,

Also, those changes to BBsplit gave me ~1.5M more reads mapping. Still seems that I didn't change much that should have mattered... 127,208,572 vs 125,675,984

**sk8bro** · 12-20-2015, 03:59 AM

sooo. Ok, I tried increasing both pairlen and rescuedist to 200,000. That gave another small increase in mapped read count. I realllly hope I can keep with BBsplit, but it seems I either need to change my reference or find a way to get the following:

I do want both pairs to map, and to different strand,, and each with 95% identity.. but the following are all fine given my reference of similar concatenated sub-reference sequences. At worst case, I have 603 similar sub-references for a reference, so I really would rather not split them out and generate outputs for each.

---> <---
<--- --->
---> //huge insert// <---
<--- //huge insert// --->

Should I get rid of the 'pairedonly' and output as SAM and filter on SAM flags? I know I would need to lose the ambiguous2=split for SAM output.

Thanks for letting me know what you think to try!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 50 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News