SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 03:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 10:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 07:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 04:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 02:10 AM

Reply
 
Thread Tools
Old 05-11-2015, 10:30 AM   #201
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

1) LMP maps as "outie", so the left read is reverse and the right read is forward. Fragment maps as "innie" like a normal fragment library, so left read forward and right read reverse.
2) The insert size of the fragment pairs depends on your shearing length, while the insert size of the LMP pairs depends on something else, like the transposase concentration. In my data, frag peaked at 319bp and LMP peaked at 2605bp (see image).

3) This was the output for one of our libraries:
Code:
Long Mate Pairs: 851440 reads (42.57%) 98864654 bases (32.74%)
Fragment Pairs: 655758 reads (32.79%) 82605742 bases (27.35%)
Unknown Pairs: 496638 reads (24.83%) 74992338 bases (24.83%)
Singletons: 170008 reads (8.50%) 9856956 bases (3.26%)

Adapters Detected: 751681 (75.17%)
Bases Recovered: 266319690 (88.19%)
I think 40% as fragment is reasonable; I assume the exact amount depends on reagent concentrations and times.

I designed splitnextera.sh to recover more pairs as LMP and fewer as fragments, compared to NXTrim - you have some leeway in deciding which one you want to recover. In general, it should run faster, recover more reads as LMP, and generally recover a higher number of total reads and bases. But that's just based on my reading the NXTrim description; I have not actually tested them on the same dataset, so I'd be interested if you could give us some numbers.

Incidentally, I'm considering upgrading it to increase the number of fragments and reduce the amount of unknown, but the difference would probably be slight since unknown is already mostly LMP.
Attached Images
File Type: png nextera_lmp.png (34.0 KB, 141 views)

Last edited by Brian Bushnell; 05-11-2015 at 10:33 AM.
Brian Bushnell is offline   Reply With Quote
Old 05-11-2015, 01:09 PM   #202
blsfoxfox
Member
 
Location: auburn

Join Date: Jan 2013
Posts: 12
Default

Quote:
Originally Posted by Brian Bushnell View Post
1) LMP maps as "outie", so the left read is reverse and the right read is forward. Fragment maps as "innie" like a normal fragment library, so left read forward and right read reverse.
2) The insert size of the fragment pairs depends on your shearing length, while the insert size of the LMP pairs depends on something else, like the transposase concentration. In my data, frag peaked at 319bp and LMP peaked at 2605bp (see image).

3) This was the output for one of our libraries:
Code:
Long Mate Pairs: 851440 reads (42.57%) 98864654 bases (32.74%)
Fragment Pairs: 655758 reads (32.79%) 82605742 bases (27.35%)
Unknown Pairs: 496638 reads (24.83%) 74992338 bases (24.83%)
Singletons: 170008 reads (8.50%) 9856956 bases (3.26%)

Adapters Detected: 751681 (75.17%)
Bases Recovered: 266319690 (88.19%)
I think 40% as fragment is reasonable; I assume the exact amount depends on reagent concentrations and times.

I designed splitnextera.sh to recover more pairs as LMP and fewer as fragments, compared to NXTrim - you have some leeway in deciding which one you want to recover. In general, it should run faster, recover more reads as LMP, and generally recover a higher number of total reads and bases. But that's just based on my reading the NXTrim description; I have not actually tested them on the same dataset, so I'd be interested if you could give us some numbers.

Incidentally, I'm considering upgrading it to increase the number of fragments and reduce the amount of unknown, but the difference would probably be slight since unknown is already mostly LMP.
Your answer is always informative and helpful!

Thanks!
blsfoxfox is offline   Reply With Quote
Old 05-11-2015, 03:46 PM   #203
Len Trigg
Registered Vendor
 
Location: New Zealand

Join Date: Jun 2011
Posts: 29
Default

Hi Brian,

Could you please clarify the license for BBMap? I thought I might try a couple of runs with it for fun, but once I installed it I saw conflicting license information. In the top level of the tar file there is a license.txt that looks like a fairly permissive license (it doesn't look like one of the standard OSI open source licenses, but that's probably fine), yet within docs/readme.txt it explicitly states that the software is to be used for non-commercial use only.

Cheers,
Len.
__________________
Len Trigg, Ph.D.
Real Time Genomics
www.realtimegenomics.com
Len Trigg is offline   Reply With Quote
Old 05-11-2015, 04:51 PM   #204
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Len,

To summarize, it appears that BBMap/BBTools is usable by any individual or organization, for any purpose (as long as the disclaimers are followed), without any need to recompense Berkeley Labs in any way. So a for-profit enterprise may run it without any restrictions.

The text about it being free for noncommercial use was added by me, to clarify, in case people were wondering, since initially it was made clear to me that that was the case by the legal department but I was not really sure what the status was with regards to commercial use (it was hard for me to get them to state things explicitly). Subsequently, however, BBMap was made part of Geneious, after they contacted Berkeley Lab and were told by the legal department that it's essentially an unrestricted license and commercial organizations do not need to pay anything. I should probably remove that line from the readme.
Brian Bushnell is offline   Reply With Quote
Old 05-11-2015, 04:59 PM   #205
Len Trigg
Registered Vendor
 
Location: New Zealand

Join Date: Jun 2011
Posts: 29
Default

Brian,

Thanks for the reply, that clears things up nicely! (I agree with you about adjusting the readme.)

Cheers,
Len.
__________________
Len Trigg, Ph.D.
Real Time Genomics
www.realtimegenomics.com
Len Trigg is offline   Reply With Quote
Old 05-12-2015, 09:27 AM   #206
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default bbmap settings

Hi Brian,
I was trying to use bbmap to remap some PacBio reads to some reference sequences in order to generate the detailed cigar string bbmap includes in SAM output. I am trying to generate mappings with sequences that are longer than a reference on one or more ends, but matching at 99% identity with the overlap portion. Previously i was using Geneious to do mappings and their tool seems to be doing the soft clipping i am looking for, but does not output the right type of cigar string. Could you help me with the specific settings to use with bbmap to include these mapping situations in the output?
lankage is offline   Reply With Quote
Old 05-12-2015, 12:49 PM   #207
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

For PacBio reads, you need to use mapPacBio.sh rather than BBMap.sh. It's the same algorithm, but with different parameters and support for longer reads.

mapPacBio.sh ref=reference.fasta in=reads.fasta out=mapped.sam idfilter=0.99 local

(if the reads are in fastq format, then also add the flag "maxlen=6000").

"idfilter=0.99" will limit the mappings to reads with 99% identity. The local flag will soft-clip the overhang bases before the idfilter is applied, so only the overlapping bases will be considered.
Brian Bushnell is offline   Reply With Quote
Old 05-15-2015, 01:47 PM   #208
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default

Hi Brian,
i took your advice with using the "local" parameter with mapPacBio.sh, but I am still seeing reads which perfectly map to a reference in the overlap and have ends that should be soft clipped be absent in the output file. Additionally, if i run mapPacBio.sh in isolation with just one read (fasta) and one reference (fasta), i get the expected output. The soft clipped perfect mappings seem to only get missed in the larger file.

output when running in isolation:
Code:
@HD     VN:1.4  SO:unsorted
@SQ     SN:16120_FcGR1_776delG_Ctg25    LN:894
@PG     ID:BBMap        PN:BBMap        VN:33.65        CL:java -Djava.library.path=/Users/mgraham/Desktop/PB6-FCGR/bbmap/jni/ -ea -Xmx10g BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 ref=776delG.fasta in=lbc1_read.fasta out=test.sam outputunmapped=f idfilter=0.99 maxlen=6000 local -Xmx10g
lbc1_read_1165  0       16120_FcGR1_776delG_Ctg25       1       23      14S894=245S     *       0       0
       AGCTTGGAGACAACATGTGGTTCTTGACAGCTCTGCTCCTTTGGGTTCCAGTTGATGGGCAAGTGGATACCACAAAGGCAGTGATCACTTTGCAGCCTCCATGGGTCAGCGTGTTCCAAGAGGAAACTGTAACCTTACAGTGTGAGGTGCCCCGTCTGCCTGGGAGCAGCTCCACACAGTGGTTTCTCAATGGCACAGCCACTCAGACCTCGACTCCCAGCTACAGAATCACCTCTGCCAGTGTCAAGGACAGTGGTGAATACAGGTGCCAGAGAGGTCCCTCAGGGCGAAGTGACCCCATACAGCTGGAAATCCACAGAGACTGGCTACTACTGCAGGTATCCAGCAGAGTCTTCACAGAAGGAGAACCTCTGGCCTTGAGGTGTCATGCATGGAAGGATAAGCTGGTGTACAATGTGCTTTACTATCAAAATGGCAAAGCCTTTAAGTTTTTCTACCGGAATTCTCAACTCACCATTCTGAAAACCAACATAAGTCACAACGGCGCCTACCACTGCTCAGGCATGGGAAAGCATCGCTACACATCAGCAGGAGTATCTGTCACTGTGAAAGAGCTATTTCCAGCTCCAGTGCTGAATGCATCCGTGACATCCCCGCTCCTGGAGGGGAATCTGGTCACCCTGAGCTGTGAAACAAAGTTGCTTCTGCAGAGGCCTGGTTTGCAGCTTTACTTCTCCTTCTACATGGGCAGCAAGACCCTGCGAGGCAGGAACACGTCCTCTGAATACCAAATACTAACTGCTAGAAGAGAAGACTCTGGTTTTACTGGTGCGAGGCCACCACAGAAGACGGAAATGTCCTTAAGCGCAGCCCTGAGTTGGAGCTTCAAGTGCTTGGCCTCCAGTTACCAACTCCTGTCTGGCTTCATGTCCTTTTCTATCTGGTAGTGGGAATAATGTTTTTAGTGAACACTGTTCTCTGGGTGACAATACGTAAAGAACTGAAAAGAAAGAAAAAGTGGAATTTAGAAATCTCTTTGGATTCTGCTCATGAGAAGAAGGTAACTTCCAGCCTTCAAGAAGACAGACATTTAGAAGAAGAGCTGAAGAGTCAGGAACAAGAATAAAGAACAGCTGCAGGAAGGGGTGCACCGGAAGGAGCCTGAGGAGGCCAAGTAGCAGCAGCTCAGTGG       *       NM:i:0  AM:i:23
I can email a compressed set of files to demonstrate this if you'd be willing to take the time to look into it.

Last edited by Brian Bushnell; 05-15-2015 at 02:06 PM.
lankage is offline   Reply With Quote
Old 05-15-2015, 02:08 PM   #209
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Yes, please do email it, and I'll take a look.
Brian Bushnell is offline   Reply With Quote
Old 05-15-2015, 03:18 PM   #210
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

First off, I'm using v34.94 and you're using v33.64, so there may be a bit of a discrepancy. But it looks like BBMap is refusing to map to that location (which has an 894bp exact match, cigar 14S894=245S) because there is another location that scores so much higher: 1073 matches and 1bp deletion, cigar 14S765=1D308=66S.

BBMap has some heuristics that skip searching for sites that are substantially lower scoring than others, and it considers 1073 matches, 1 deletion, and 80 soft-clipped bases to be much higher scoring than 894 matches and 259 soft-clipped bases. You can make the read align there several ways:

Code:
mapPacBio.sh in=lbc1_read.fasta out=mapped2.sam ow strictmaxindel=0

mapPacBio.sh in=lbc1_read.fasta out=mapped2.sam ow semiperfectmode

bbmapskimmer.sh in=lbc1_read.fasta out=mapped2.sam ow kfilter=800
Any of those will give you the alignment in question, but they each have other disadvantages - the first one will not allow any indels, the second does not allow any indels or mismatches, just clipping (it's very fast, though), and the third requires a minimum of 800bp consecutive exact match. So they work in this case but none will ensure you get all 99% identity alignments.

You can also forcibly align a single query sequence to everything in the reference like this:
Code:
msa.sh in=full_reference.fasta out=mapped.sam msa=MultiStateAligner9PacBio query=AGCTTGGAGACAACATGTGGTTCTTGACAGCTCTGCTCCTTTGGGTTCCAGTTGATGGGCAAGTGGATACCACAAAGGCAGTGATCACTTTGCAGCCTCCATGGGTCAGCGTGTTCCAAGAGGAAACTGTAACCTTACAGTGTGAGGTGCCCCGTCTGCCTGGGAGCAGCTCCACACAGTGGTTTCTCAATGGCACAGCCACTCAGACCTCGACTCCCAGCTACAGAATCACCTCTGCCAGTGTCAAGGACAGTGGTGAATACAGGTGCCAGAGAGGTCCCTCAGGGCGAAGTGACCCCATACAGCTGGAAATCCACAGAGACTGGCTACTACTGCAGGTATCCAGCAGAGTCTTCACAGAAGGAGAACCTCTGGCCTTGAGGTGTCATGCATGGAAGGATAAGCTGGTGTACAATGTGCTTTACTATCAAAATGGCAAAGCCTTTAAGTTTTTCTACCGGAATTCTCAACTCACCATTCTGAAAACCAACATAAGTCACAACGGCGCCTACCACTGCTCAGGCATGGGAAAGCATCGCTACACATCAGCAGGAGTATCTGTCACTGTGAAAGAGCTATTTCCAGCTCCAGTGCTGAATGCATCCGTGACATCCCCGCTCCTGGAGGGGAATCTGGTCACCCTGAGCTGTGAAACAAAGTTGCTTCTGCAGAGGCCTGGTTTGCAGCTTTACTTCTCCTTCTACATGGGCAGCAAGACCCTGCGAGGCAGGAACACGTCCTCTGAATACCAAATACTAACTGCTAGAAGAGAAGACTCTGGTTTTACTGGTGCGAGGCCACCACAGAAGACGGAAATGTCCTTAAGCGCAGCCCTGAGTTGGAGCTTCAAGTGCTTGGCCTCCAGTTACCAACTCCTGTCTGGCTTCATGTCCTTTTCTATCTGGTAGTGGGAATAATGTTTTTAGTGAACACTGTTCTCTGGGTGACAATACGTAAAGAACTGAAAAGAAAGAAAAAGTGGAATTTAGAAATCTCTTTGGATTCTGCTCATGAGAAGAAGGTAACTTCCAGCCTTCAAGAAGACAGACATTTAGAAGAAGAGCTGAAGAGTCAGGAACAAGAATAAAGAACAGCTGCAGGAAGGGGTGCACCGGAAGGAGCCTGAGGAGGCCAAGTAGCAGCAGCTCAGTGG
...although this is not really what I designed that program for.

So, overall, I don't think you can make BBMap do exactly what you want in general because its opinions of "best alignment" are slightly different than yours, but you could map everything in semiperfectmode and then just re-map the unmapped reads (which you separate using outm and outu streams) allowing 99% identity. That's probably the closest you can get.
Brian Bushnell is offline   Reply With Quote
Old 05-22-2015, 11:29 AM   #211
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default

Hi Brian,
Thanks for helping with the PacBio mapping issues. I have another, unrelated question about usage of bbmap. Is there a way to have bbmap output ALL mappings that meet a certain threshold specified with options such as subfilter=X? Using the ambiguous=all parameter, I cant get all equivalent mappings reported for a read, but can I use options to return all mappings that score >= a threshold specified with options like subfilter? Although this isn't the intended usage of the program, I would like to be able to search large sequence sets with primer sequences and look for opportunities where primer pairs sit in proximity to each other while allowing some mismatching.
For example, if a set of primers is designed to perfectly match an intended target site, can I use bbmap to explore less perfect sites for hybridization opportunities?

Last edited by lankage; 05-22-2015 at 11:32 AM.
lankage is offline   Reply With Quote
Old 05-22-2015, 11:40 AM   #212
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

BBMap is designed to find the best mapping, and heuristics will cause it to ignore mappings that are valid but substantially worse. Therefore, I made a different version of it, BBMapSkimmer, which is designed to find all of the mappings above a certain threshold. The shellscript is bbmapskimmer.sh and the usage is similar to bbmap.sh or mapPacBio.sh. For primers, which I assume will be short, you may wish to use a lower than default K of, say, 10 or 11, and add the "slow" flag.

I also wrote another pair of programs specifically for working with primer pairs, msa.sh and cutprimers.sh. msa.sh will forcibly align a primer sequence (or a set of primer sequences) against a set of reference sequences to find the single best matching location per reference sequence - in other words, if you have 3 primers and 100 ref sequences, it will output a sam file with exactly 100 alignments - one per ref sequence, using the primer sequence that matched best. Of course you can also just run it with 1 primer sequence.

So you run msa twice - once for the left primer, and once for the right primer - and generate 2 sam files. Then you feed those into cutprimers.sh, which will create a new fasta file containing the sequence between the primers, for each reference sequence. We used these programs to synthetically cut V4 out of full-length 16S sequences.

I should say, though, that the primer sites identified are based on the normal BBMap scoring, which is not necessarily the same as where the primers would bind naturally, though with highly conserved regions there should be no difference.
Brian Bushnell is offline   Reply With Quote
Old 05-26-2015, 09:11 AM   #213
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default BBmapskimmer.sh

Hi Brian,
I have been experimenting with bbmapskimmer.sh and had a question about flags in the SAM output. I have been looking at a pair of primers matched against the Macaca fasicularis reference genome V5 for the potential to amplify unwanted sites. I have some python code that reads in sam output and draws a visualization of a primer to subject sequence alignment. I noticed in one case, the sam output has the 256 flag for a sub alignment of a sequence that would seem to suggest that the reverse complement of the query sequence (primer) maps to chromosome 12 at the given position. When i pull out Macaca reference genomic sequence at the given coordinates, I see a match to the 5' - 3' FWD orientation of the primer, rather than the reverse complement of the primer. Is there not a way to tell what orientation the query sequence matches the reference if the query hit is stored as a sub mapping?

Sam output:
@HD VN:1.4 SO:unsorted
@SQ SN:10 LN:96509753
@SQ SN:11 LN:137757926
@SQ SN:12 LN:132586672
@SQ SN:13 LN:111193037
@SQ SN:14 LN:130733371
@SQ SN:15 LN:112612857
@SQ SN:16 LN:80997621
@SQ SN:17 LN:96864807
@SQ SN:18 LN:75711847
@SQ SN:19 LN:59248254
@SQ SN:1 LN:227556264
@SQ SN:20 LN:78541002
@SQ SN:2 LN:192460366
@SQ SN:3 LN:192294377
@SQ SN:4 LN:170955103
@SQ SN:5 LN:189454096
@SQ SN:6 LN:181584905
@SQ SN:7 LN:171882078
@SQ SN:8 LN:146850525
@SQ SN:9 LN:133195287
@SQ SN:MT LN:16575
@SQ SN:X LN:152835861
@PG ID:BBMap PN:BBMap VN:34.92 CL:java -Djava.library.path=/slipstream/home/graham/bbmap/jni/ -ea -Xmx160531m align2.BBMapPacBioSkimmer build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambig=all minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=primers.fas ref=mafa5.fa out=primerhits.4.sam k=6 slow subfilter=4
FCGR2A.3F 16 1 90182861 40 19= * 0 0 ACTGTGCCAACGTCCAGTC * NM:i:0 AM:i:40 NH:i:2
FCGR2A.3F 256 12 127476172 32 1=1X17= * 0 0 * * NM:i:1 AM:i:32 NH:i:2
FCGR2A.3R 0 1 90169340 40 20= * 0 0 TTGTCATCCACTCAGCAAGC * NM:i:0 AM:i:40 NH:i:2
FCGR2A.3R 256 3 184441394 32 9=1X10= * 0 0 * * NM:i:1 AM:i:32 NH:i:2

This line in question:
FCGR2A.3F 256 12 127476172 32 1=1X17= * 0 0 * * NM:i:1 AM:i:32 NH:i:2



My output:
[PRE]
*** FCGR2A.3F 1 RC 90182861 19=
ACTGTGCCAACGTCCAGTC
|||||||||||||||||||
ACTGTGCCAACGTCCAGTCGGGTT
*** FCGR2A.3F 12 RC 127476172 1=1X17=
ACTGTGCCAACGTCCAGTC
| |||||||||||||||||
GGCTGGACGTTGGCACAGTGACCA

*** FCGR2A.3R 1 FWD 90169340 20=
TTGTCATCCACTCAGCAAGC
||||||||||||||||||||
TTGTCATCCACTCAGCAAGCTGAGA
*** FCGR2A.3R 3 FWD 184441394 9=1X10=
TTGTCATCCACTCAGCAAGC
||||||||| ||||||||||
TTGTCATCCTCTCAGCAAGCCCAAG[/PRE]

Last edited by lankage; 05-26-2015 at 09:13 AM.
lankage is offline   Reply With Quote
Old 05-26-2015, 10:35 AM   #214
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

The first alignment has bitflag 16, indicating it is reverse-complemented, and the second has bitflag 256, indicating it is forward (but a secondary alignment). Since the primary alignment is reverse-complemented, the bases are displayed reverse-complemented (required by SAM specification). So, they need to be reverse-complemented again before displaying them for the second alignment.

You can optionally add the flag "saa=f" (secondaryalignmentasterisks=false) which will make BBMap print out the query bases in the correct orientation for every alignment, rather than just the primary.
Brian Bushnell is offline   Reply With Quote
Old 05-28-2015, 11:36 AM   #215
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default mapPacBio.sh - soft clipped mappings

Hi Brian, yet again another question about mapPacBio.sh. I would like to use mapPacBio.sh to map full length MHC allele sequences from pac bio sequencing against a reference dataset which contains partial MHC allele sequences, and separate out those that represent extensions of alleles in the dataset, and those that are putative novels. I would like to be able to specify perfect identity to a reference within the overlap portion but when i use the "idfilter=1 local" parameter, I get no mappings. If i use idfilter=.99, i can see the type of mapping i expect to find as below with only soft clipped bases, but also have mappings with mismatches. Is there a way to restrict the mapping results to those with ONLY soft clipped bases?

bbmap/mapPacBio.sh in=filtered_contigs.fasta ref=Mafa_MHC-I_coding_throughPacBio5-15.05.22.fasta outm=filtered_contigs_mapped.sam outu=filtered_contigs_unmapped.fasta idfilter=.99 local -Xmx4g


m150505_221415_42134_c100814192550000001823181311031594_s1_p0/161219/ccs/lbc13.merged.zerolen.fastq_1;size=22; 0 Mafa_B_148_01_01 1 43 9S1080= *
lankage is offline   Reply With Quote
Old 05-28-2015, 11:45 AM   #216
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Ah... that's unintended. When you set "idfilter=1" it turns on "perfectmode", which was unintentional; I'll fix that. It SHOULD turn on "semiperfectmode", which is what you need - no substitutions or indels, but going off the end of a reference is allowed.

So, in this case, take out the "idfilter" flag and add "semiperfectmode", and you'll get the results you want. Although 100% perfect, non-soft-clipped reads will still be present. You could remove those with a double-mapping, though, like this:

mapPacBio.sh in=reads.fq out=semiperfect.fq semiperfectmode

mapPacBio.sh in=semiperfect.fq outm=perfect.sam outu=clipped.sam perfectmode


-Brian
Brian Bushnell is offline   Reply With Quote
Old 05-31-2015, 12:34 PM   #217
blsfoxfox
Member
 
Location: auburn

Join Date: Jan 2013
Posts: 12
Default

Quote:
Originally Posted by Brian Bushnell View Post
Sure. For simplicity, I will assume the reads are interleaved in a single file initially (you can transform between interleaved and dual-file reads with reformat.sh if you want). First thing is to trim the adapters with BBDuk:

Code:
bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=23 mink=12 hdist=1 tbo tpe ref=nextera_LMP_adapter.fa.gz
nextera_LMP_adapter.fa.gz is in /bbmap/resources/.

Next, if you want to remove contaminants/spikeins such as PhiX, you can do so now with a filtering pass of BBDuk, but that's optional.

Finally, split the reads into LMP pairs, short pairs, and singletons:

Code:
splitnextera.sh in=trimmed.fq out=longmatepairs.fq outf=fragmentpairs.fq outu=unknownpairs.fq outs=singletons.fq mask=t
At this point you could do quality-trimming if you want to on the individual sub-libraries; it's important not to do so sooner as it interferes with the ability to detect junctions. If all you care about is the LMP data you can just keep longmatepairs.fq and throw away the others, but theoretically, you might be able to get enough fragment pairs and singletons to assemble the contigs from those, then use the LMP for scaffolding, and thus get a complete single-library assembly. That's what Illumina is encouraging, at any rate.

The majority of the reads in the "unknown" bin were LMP, not short-fragment reads, in the one library I tested (and also according to Illumina's data), but I would not really consider it safe to use for scaffolding OR contig-building, and would probably throw it away unless you don't have enough data.

Hi Brian,

Thesedays I am thinking of several assembly questions related to what you said above.
1) After removing those "pair-end" and "single-end" contamination, why can't we reverse_complement those LMP reads and treat them as pair-end reads for the contig assembly stage? Maybe due to the larger variance of the LMP insert sizes?
2) For those "unknown" reads, I think it is also helpful to assemble them as single ends, right?
blsfoxfox is offline   Reply With Quote
Old 06-01-2015, 10:15 AM   #218
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by blsfoxfox View Post
Hi Brian,

Thesedays I am thinking of several assembly questions related to what you said above.
1) After removing those "pair-end" and "single-end" contamination, why can't we reverse_complement those LMP reads and treat them as pair-end reads for the contig assembly stage? Maybe due to the larger variance of the LMP insert sizes?
You can... whether that's a good idea depends on the assembler. Some assemblers probably use LMP reads for kmers just like the pair-end reads. Particularly if you don't have enough coverage of PE reads, using the LMP reads that way is probably fine, though they will probably have a more biased coverage distribution than the PE reads, which could interfere with the heuristics of some assemblers.
Quote:
2) For those "unknown" reads, I think it is also helpful to assemble them as single ends, right?
Virtually all of the unknown-binned reads are non-overlapping LMP reads that you can't assemble into single reads (in the data I have examined). They end up in the unknown bin because the junction adapter was not visible in either read. But that usually means that the junctions were in the unsequenced portion between the two reads. This really depends on your insert size distribution (the physical insert size of the sequenced fragments, not the insert size of the long transposased pieces). If you fragment to substantially longer than 2x read length, a lot of LMP pairs will end up in the unknown bin because the junction is in the unsequenced middle. If you fragment to a shorter insert size, such that most of your pairs overlap, the unknown bin would consist more of PE reads.

The latest version of splitnextera has an option to attempt to merge the reads by overlap before looking for the junction. That way, it is better able to determine whether a pair belongs in the unknown bin - if they overlap, and do not contain a junction, they go to singleton rather than unknown; if they do contain a junction, they go to LMP; so the only pairs that end up in unknown are the ones that don't overlap AND don't have a junction adapter. So, fewer reads end up in unknown; but it will only be useful on libraries that have overlapping fragments.

Last edited by Brian Bushnell; 06-01-2015 at 10:18 AM.
Brian Bushnell is offline   Reply With Quote
Old 06-01-2015, 10:56 AM   #219
blsfoxfox
Member
 
Location: auburn

Join Date: Jan 2013
Posts: 12
Default

Quote:
Originally Posted by Brian Bushnell View Post
You can... whether that's a good idea depends on the assembler. Some assemblers probably use LMP reads for kmers just like the pair-end reads. Particularly if you don't have enough coverage of PE reads, using the LMP reads that way is probably fine, though they will probably have a more biased coverage distribution than the PE reads, which could interfere with the heuristics of some assemblers.

Virtually all of the unknown-binned reads are non-overlapping LMP reads that you can't assemble into single reads (in the data I have examined). They end up in the unknown bin because the junction adapter was not visible in either read. But that usually means that the junctions were in the unsequenced portion between the two reads. This really depends on your insert size distribution (the physical insert size of the sequenced fragments, not the insert size of the long transposased pieces). If you fragment to substantially longer than 2x read length, a lot of LMP pairs will end up in the unknown bin because the junction is in the unsequenced middle. If you fragment to a shorter insert size, such that most of your pairs overlap, the unknown bin would consist more of PE reads.

The latest version of splitnextera has an option to attempt to merge the reads by overlap before looking for the junction. That way, it is better able to determine whether a pair belongs in the unknown bin - if they overlap, and do not contain a junction, they go to singleton rather than unknown; if they do contain a junction, they go to LMP; so the only pairs that end up in unknown are the ones that don't overlap AND don't have a junction adapter. So, fewer reads end up in unknown; but it will only be useful on libraries that have overlapping fragments.
Hi Brian,

Thanks for your reply!
I understand your points of the differences between "unknown" and "single-end" reads. However, I am not going to merge the "unknown" pairs into one single read, but treat one pair of reads as two single reads only for the contig assembly step. By doing this, we neither throw the "unknown" reads away nor rely on their mate-pair information to do scaffolding as insurance. Is it valid to you?
blsfoxfox is offline   Reply With Quote
Old 06-01-2015, 11:03 AM   #220
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Oh, yes, that sounds perfectly fine. You may want to adapter-trim the unknown reads first (treating them as singletons) by specifying the junction-adapter as the adapter sequence, like this:

bbduk.sh in=unknown.fq int=f out=trimmed.fq adapter=CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG k=21 mink=5 ktrim=r hdist=1

That will ensure you trim junction adapter from the end of the read, in case it was present but not detected because it was just a few bases. Otherwise the unknown bin may be enriched for reads with junction adapter at the end, that was just too short to be positively identified. Technically it could be present on the left side, as well. Making a base-frequency histogram of the unknown bin might be useful; I have not done that.

Last edited by Brian Bushnell; 06-01-2015 at 11:10 AM.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:48 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO