View Single Post
Old 02-24-2014, 10:09 AM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by luc View Post
Dear Brian,

thanks a lot for posting your aligner package.
Thanks to the PacBio option it is not strictly a short read aligner - but for which max read read length would you recommend it (e.g are merged 500 bp Miseq reads OK for the short read version; are PacBio reads longer than the 6 kb mentioned for version 25 allowed for the latest version)?
Would you recommend any specific settings for the alignment of RNA seq reads?
Luc,

Unfortunately, the limit is still 6kb. I am working on increasing that but it may take a while. For PacBio reads, you should use the script mapPacBio8k.sh (which despite the name only goes up to 6k), and the PacBio reads should be in fasta format. Reads in fasta format that are longer than 6000bp will be automatically broken into 6000bp pieces and their names appended with "_1", "_2", and so forth. But the vast majority of the PacBio reads will be under 6kbp and thus won't be fragmented.

For any Illumina reads, including merged Miseq, use bbmap.sh. It handles reads up to 500bp. The only exception is if you want to map Illumina reads to PacBio reads (for error correction), in which case you should use mapPacBio.

bbmap.sh (which calls BBMap.class) has mutation penalties tuned for Illumina reads, while mapPacBio8k.sh (which calls BBMapPacBio.class) has mutation penalties tuned for PacBio reads and a longer maximum read length, and is slower. You can still use mapPacBio8k.sh for any really long reads, though, like when mapping one assembly to another.

For RNA-seq (using Illumina reads), I would recommend a command line like this:

bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam maxindel=100000 xstag=firststrand intronlen=10 ambig=random

This will look for introns up to 100kbp long, and flag every gap longer than 10bp with cigar symbol 'N' (skipped) rather than 'D' (deletion). Ambiguously-mapped reads will go to a random top-scoring site, which is generally best for quantification. XS tags, needed by Cufflinks, will be produced according to the "first strand" library protocol (if you don't know what protocol is used, then set it to xstag=unstranded).

The 'maxindel' flag can be increased or decreased as needed. Many plants and fungi tend to have very short introns mostly under 400bp, while human average is much higher and has a few that are even longer than 100kbp. It should be set to above the size of the largest introns you expect, but higher values decrease accuracy, so I don't recommend setting it above 500000 or so. The default is 16000 which is fine for the plants and fungi I've worked with but if you are working with vertebrates then 200000 would probably be appropriate.

Last edited by Brian Bushnell; 02-24-2014 at 03:19 PM.
Brian Bushnell is offline   Reply With Quote