Hi SEQanswers community!
So I'm not sure if this has been posted - I poked around, but didn't find anything specific enough so I'll just ask:
I realize that specifying the mean insert size and std. dev. when mapping RNA-seq data with TopHat is important because it can improve mapping results. I also know that the best way to get these values is to map your reads first and then empirically determine them with a program such as 'getinsertsize.py' or the Picard tools (?).
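For what it's worth, the calculation itself is simple: as I understand it, tools like getinsertsize.py essentially summarize the TLEN field (column 9) of properly paired alignments. A rough sketch with samtools + awk (assuming samtools is installed; `aligned.bam` is a placeholder name):

```shell
# Sketch: estimate mean insert size and std. dev. from a BAM file.
# TLEN (field 9) is positive for the leftmost mate of each pair, so
# filtering on $9 > 0 counts each pair once.
samtools view aligned.bam \
  | awk '$9 > 0 { s += $9; ss += $9*$9; n++ }
         END { m = s/n; printf "mean=%.1f sd=%.1f\n", m, sqrt(ss/n - m*m) }'
```

Note that for RNA-seq, pairs spanning introns inflate TLEN, so this is only a rough estimate.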
Before mapping my RNA-seq reads, I performed quality trimming using Trimmomatic-0.32.
So my question is: in what order am I supposed to perform these steps to determine the insert size and std. dev.? Will quality trimming skew the results of the insert size calculation? I've got 14 different read sets (each from a different time point, control vs. KO tissue), and I'm also not sure whether I need to perform this estimation for every single read set (i.e., do I need to find the mean insert size separately for E11.5KO, E11.5Het, E12.5KO, E12.5Het, ... E17.5KO, E17.5Het?).
By read set, I mean E11.5KO is one read set, and E11.5Het is another read set, etc.
What I've done is:
1. Quality trim the reads
2. Map E11.5Het (an arbitrary choice)
3. Run 'getinsertsize.py' on the output (the BAM output converted to SAM)
4. Remap all of the reads with the new parameters
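In case it's unclear, here's a command-level sketch of the steps above. All file names and index names are placeholders, and the -r/--mate-std-dev values are made-up examples — check each tool's docs for the exact flags:

```shell
# 1. Quality trim (Trimmomatic PE: paired/unpaired outputs for each mate)
java -jar trimmomatic-0.32.jar PE R1.fastq R2.fastq \
    R1.paired.fq R1.unpaired.fq R2.paired.fq R2.unpaired.fq \
    SLIDINGWINDOW:4:20

# 2. Preliminary map with TopHat defaults
tophat -o prelim_out genome_index R1.paired.fq R2.paired.fq

# 3. Estimate insert size from the preliminary alignment
samtools view prelim_out/accepted_hits.bam > prelim.sam
getinsertsize.py prelim.sam

# 4. Remap with the estimated parameters. Note: TopHat's -r expects the
#    mate *inner* distance (mean insert size minus both read lengths),
#    not the full insert size, so subtract 2x the read length.
tophat -r 100 --mate-std-dev 40 -o final_out genome_index \
    R1.paired.fq R2.paired.fq
```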
Does this seem legitimate? Or should I go back and map everything with default parameters first, calculate the insert size for each read set, and then remap everything using the calculated mean insert sizes/std. devs.?
I'm happy to clarify anything that doesn't make sense!
Cheers,
Paul