Old 01-14-2015, 11:17 PM   #2
Brian Bushnell

For RNA-seq data from a eukaryote, if you have overlapping paired reads, the best way to get the insert size distribution is to merge the reads by overlap. Merging does not use a reference, so the result is not affected by introns. You can do that with BBMerge like this:

bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist_merging.txt loose

It is best in this case to NOT quality-trim: trimming shortens the reads, so pairs from longer inserts may no longer overlap and will drop out of the histogram. Trimming adapters is fine.

If the reads are not overlapping, I recommend BBMap, which calculates insert size in such a way that reads spanning introns still yield the correct insert size:

bbmap.sh ref=reference.fasta in1=read1.fq in2=read2.fq ihist=ihist_mapping.txt out=mapped.sam maxindel=200000

For either program, if you only need the histogram and not the full sam file, you can cap processing at, say, 1 million read pairs by adding the flag "reads=1000000". The program will then stop early; 1 million pairs is plenty to calculate an insert size distribution.
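
Once either command has written its ihist file, you can pull a quick summary out of it with awk. This is only a rough sketch: it assumes the histogram has header lines starting with "#" followed by tab-separated rows of insert size and count, so check the top of your own file first. The file name ihist_example.txt, its contents, and the function name summarize_ihist are made up for illustration.

```shell
# Hypothetical sample histogram so the snippet runs on its own;
# in practice point IHIST at ihist_merging.txt or ihist_mapping.txt.
IHIST=ihist_example.txt
printf '#InsertSize\tCount\n100\t2\n150\t6\n200\t2\n' > "$IHIST"

# Assumed layout: '#'-prefixed header lines, then size<TAB>count rows.
summarize_ihist() {
  LC_ALL=C awk -F'\t' '
    /^#/ || NF < 2 { next }                       # skip headers and malformed lines
    {
      n   += $2                                   # total pairs counted
      sum += $1 * $2                              # weighted sum of insert sizes
      if ($2 > best) { best = $2; mode = $1 }     # track the most common size
    }
    END {
      if (n == 0) { print "no data"; exit 1 }
      printf "pairs=%d mean=%.1f mode=%d\n", n, sum / n, mode
    }' "$1"
}

summarize_ihist "$IHIST"
```

The mean here is weighted by the count column, so it reflects only the pairs that were actually merged or mapped.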