SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Fastq sample files from Illumina and IonTorrent gmaster Bioinformatics 1 08-11-2015 01:23 AM
How to fix fastq files from a bacteria sample? ymc Bioinformatics 2 05-30-2015 08:47 AM
How to randomly remove portions of the raw reads from the FASTQ file choijae3 Bioinformatics 5 01-08-2014 07:27 AM
Split Large FASTQ file in small FASTQ files with user defined number of reads Windows deepbiomed Bioinformatics 3 04-04-2013 07:14 AM
How to filter a possible DNA contamination in fastq sample reads Jluis Bioinformatics 2 07-10-2012 02:14 AM

Reply
 
Thread Tools
Old 11-19-2015, 08:35 AM   #21
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBSplit can output sam files, and my original goal was to have it do so, for convenience. And it works fine for reads that only map to one of the references. But unfortunately, with "ambig2=all", reads that map to multiple references will be printed to each sam file with the same mapping information. In other words, if a read mapped to both e.coli and human, it would be printed in both the e.coli and human outputs (which is correct), but in both cases it would be reported as mapping to the same organism (e.g. in both files it would be listed as mapping to e.coli). Which is probably not what you want.

With "ambiguous2=toss" or "best" this is not a problem, but with "all" or "split" it is. And it turns out this is rather tricky to fix, so I haven't allocated time to do it. Since my use-cases (and those of JGI in general) use BBSplit to produce fastq output anyway, it seems low-priority. Obviously, with fastq, this is not an issue because fastq output does not retain the mapping information.

So, the easiest way to ensure that people get correct results is to only use BBSplit for fasta/fastq output. Then if you want sam/bam, map the individual output files to their individual genome.

Sorry for the inconvenience!
Brian Bushnell is offline   Reply With Quote
Old 12-18-2015, 11:30 AM   #22
sk8bro
Member
 
Location: boston

Join Date: Feb 2012
Posts: 30
Default

Hi Brian, a few more questions.

Can BBsplit output the unmapped reads? It didn't look in the description like it could, so I have been doing a parallel BBmap to the concatenation of each of BBsplit's reference fastas to get the unmapped output. However, even with what I think are the same parameters specified to each program, I get slightly different results when I count up how many reads map in each case. Here is a comparison, and you can see that a few samples are a few read pairs off.

BBmap BBsplit
72 72
46 46
408 408
37800 37800
281614 281614
6854 6854
114028 114034
40788 40788
6508 6508
193476 193480

Here are the 2 commands with options

bbsplit.sh
k=8
in=BBmap/Sample_1P.fastq
in2=BBmap/Sample_2P.fastq
ref="$amplicon_dir"
ambiguous2=split
basename=BBsplit/Sample_%_#P.fastq
refstats=Amplicon_counts_Sample.txt
minid=0.95
idfilter=0.95
vslow
maxindel=100
nzo=f
pairedonly

bbmap.sh
k=8
minid=0.95
idfilter=0.95
maxindel=100
vslow
minhits=1
ambiguous=best
path="$agg_file"
pairedonly
build=1
printunmappedcount
-Xmx23g
in=BBmap/Sample_#P.fastq
outm=BBmap/Sample.mapped.sam
outu=BBmap/Sample.unmapped.sam

Also, in investigating the unmapped reads, it is not clear to me why some read pairs are unmapped. While I realize I have required 95% identity for both reads, I find many pairs that satisfy this in the unmapped output.

Thanks!

Last edited by sk8bro; 12-18-2015 at 11:38 AM.
sk8bro is offline   Reply With Quote
Old 12-18-2015, 03:55 PM   #23
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBSplit supports the "outu" flag, so yes, it can capture unmapped reads. The number of reads mapping can vary slightly when mapping paired reads multithreaded, particularly with the "pairedonly" flag or a high minid/idfilter, from run to run; this is because the pair distance is calculated dynamically on a per-thread basis, and how close a pair distance is to the average influences the mapping score, which determines whether or not the reads are mapped (or paired). When multithreaded, which reads are processed by which thread is nondeterministic, so the exact value of the average pair distance will be slightly different. I assume this is causing the discrepancy.

You have set both a 95% identity cutoff and "pairedonly", meaning both reads must be at least 95% identity and they must be in a properly-paired orientation. Perhaps you can map the unmapped reads with BBMap using default settings, and add the flag "idtag" which will print the calculated percent identity in the sam output? I would be interested in seeing if there are indeed pairs, in a proper orientation, with both at least 95% identity as calculated by BBMap.
Brian Bushnell is offline   Reply With Quote
Old 12-18-2015, 09:53 PM   #24
sk8bro
Member
 
Location: boston

Join Date: Feb 2012
Posts: 30
Default

I am trying that now.. but I changed BBsplit command by reducing indel size to 20, removing vslow (what does that do anyway?), and minid is default (idfilter is strong than minid right?).

Still running, but I notice slightly higher mapping rates with these changes... can you explain?

Thank you for your help.
sk8bro is offline   Reply With Quote
Old 12-19-2015, 05:08 PM   #25
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

vslow increases sensitivity, by not discarding very common kmers. This is not very important with default K=13, but becomes more important with short kmers, since there are fewer of them.

minid is loose, and mainly affects speed. "minid=90", for example, does not ensure that there will be no hits with under 90% identity; rather, it ensures that any hit as low as 90% identity, and possibly lower, will be reported. Idfilter is strict; "idfilter=X" ensures that any alignment with ID under X will be terminated.
Brian Bushnell is offline   Reply With Quote
Old 12-19-2015, 09:25 PM   #26
sk8bro
Member
 
Location: boston

Join Date: Feb 2012
Posts: 30
Default

Okay, so I should probably keep vslow in there, since I use 8mers. Here are some examples of default BBmap mapping of my unmapped by BBsplit reads:

HWI-D00294:202:C7FK3ANXX:6:2213:12108:58098 1:N:0:AAGAGGCACTCTCTAT 97 XXX, complete genome 743 1X117= XXX, complete genome 274 0 XXX BFGGFGGGGGGGGEBGGGFGGF@FDGFGGGGGEDGG...EDGGG>F0C=C.FG=FGEEEB.@@BBBGBGG NM:i:1 AM:i:43 YI:f:99.15

HWI-D00294:202:C7FK3ANXX:6:2213:12108:58098 1:N:0:AAGAGGCACTCTCTAT 145 XXX, complete genome 274 3 107=1X7= XXX, complete genome 7 0 XXX 8CDGGG<BBGGGEGEGGBGD>=GGBDGGGGGCGDEBDEGDFGD>GFEGGGGGGGDDGGDGC1GDGGFFGGEGGGGGAGGGGDG<BGEEGG@FGEGFDF1GGFC>?;GGBA>@A XT:A:R NM:i:1 AM:i:3 YI:f:99.13

HWI-D00294:202:C7FK3ANXX:6:2212:12674:82633 1:N:0:AAGAGGCACTCTCTAT 99 XXX, complete genome 140 6=1X35=1X59=1X21= = 264 388 XXX BBB@@<FGGGGGEGGDGGGGGGGGGGGFFFGGGGGGEG1FGGGGGGGGGGGGGGGGGGGGGGBG;:EFGGFGGGGGF>EFGGGGDFFCCCCFGGGGGG8A@FCGGCGGGBGGGG=@GGGGGGGG NM:i:3 AM:i:40 YI:f:97.58

HWI-D00294:202:C7FK3ANXX:6:2212:12674:82633 1:N:0:AAGAGGCACTCTCTAT 147 XXX, complete genome 264 3 53=1X33=1X22=1X14= = 1 -388 XXX DGG:GE:.<GCDGGGGGDDDBGGGGDEGGGGGCGGFBGGGGGGGGGFGGGDGGGGCGGGGGGGGGGGGGGEGGGGFGGGFF:EGGGB/GGGGGGGGGGGFEFF1FFDGGGGGGGGGGGGGBBBBB XT:A:R NM:i:3AM:i:3 YI:f:97.60


The second pair, seems proper so I don't understand how it is unmapped. The first pair, I think the problem may be that my reference is actually a concatenation of sub-references (fasta file with multiple > entries). If Read1 aligns best to sub-ref1, while Read2 aligns best to sub-ref2, is the insert size too big and therefore not proper? Is there a way to get around this behavior to keep the best read alignment within acceptable insert size?

Thanks,

Also, those changes to BBsplit gave me ~1.5M more reads mapping. Still seems that I didn't change much that should have mattered... 127,208,572 vs 125,675,984
sk8bro is offline   Reply With Quote
Old 12-20-2015, 02:59 AM   #27
sk8bro
Member
 
Location: boston

Join Date: Feb 2012
Posts: 30
Default

sooo. Ok, I tried increasing both pairlen and rescuedist to 200,000. That gave another small increase in mapped read count. I realllly hope I can keep with BBsplit, but it seems I either need to change my reference or find a way to get the following:

I do want both pairs to map, and to different strand,, and each with 95% identity.. but the following are all fine given my reference of similar concatenated sub-reference sequences. At worst case, I have 603 similar sub-references for a reference, so I really would rather not split them out and generate outputs for each.

---> <---
<--- --->
---> //huge insert// <---
<--- //huge insert// --->

Should I get rid of the 'pairedonly' and output as SAM and filter on SAM flags? I know I would need to lose the ambiguous2=split for SAM output.

Thanks for letting me know what you think to try!
sk8bro is offline   Reply With Quote
Reply

Tags
demultiplexing, human reads

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:44 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO