Unconfigured Ad

**Nicolas** · 12-28-2011, 12:42 PM

Hi,

What people typically do is to map the library (or a subset) to the genome (with Bowtie / BWA), compute an estimate of the fragment size (both mean and Standard Deviation, option --mate-std-dev).
Don't forget to substract the size of the reads to the mean fragment size and use this as your "mate-inner-dist".
A possible command-line to calculate this is: (this assume that the read length is constant)

Code:

samtools view -F 0x4 File.mapped.bam |  awk '{if ($9 >0) {sum+=$9;sumsq+=$9*$9;N+=1}} END {print "mean = " sum/N " SD=" sqrt(sumsq/N - (sum/N)**2)}'

Alternatively, if you can estimate the fragment size during the library preparation, that should also work.

I have no idea how sensitive this parameter is and its effect on the resulting mapping. Anybody has comments on that?

Hope that helps,

Nicolas

**pinki999** · 02-11-2012, 02:38 AM

Hi Nicholas,

When I used your one line command to calculate the mean fragment size, I got mean = approx. 158. I am working on PE data with 2x100bp read length. In this case, does the insert size = 158-200 or 200-158?

Thanks

**kopi-o** · 02-11-2012, 02:59 AM

Remember that for RNA-seq, estimating insert size by mapping to the genome will give misleading results (because of splicing). Perhaps mapping to a transcriptome reference would be better.

**Nicolas** · 02-11-2012, 10:00 AM

Originally posted by pinki999 View Post

Hi Nicholas,

When I used your one line command to calculate the mean fragment size, I got mean = approx. 158. I am working on PE data with 2x100bp read length. In this case, does the insert size = 158-200 or 200-158?

Thanks

Hi,

Using the command-line above will summarize the values in field 9 ($9), which, in the SAM format indicates the distance between the "left" position of each reads in the pair. So, the inner distance required in the -r option by Tophat should be 158 - 100. Substract only once the size of the reads, not twice!

[EDIT] I got it wrong, sorry. Field 9 is the oberved Template length, and therefore the inner-distance is equal to "Template length - 2x size of the reads".

**Nicolas** · 02-11-2012, 10:02 AM

Originally posted by kopi-o View Post

Remember that for RNA-seq, estimating insert size by mapping to the genome will give misleading results (because of splicing). Perhaps mapping to a transcriptome reference would be better.

Sure. The point here is to map to the genome to estimate the size of the insert (from full length alignments), and feed this parameter into Tophat, to map to the transcriptome (or at least, to allow splicing inside the reads).

The first step (mapping to the genome) can usually be done on a small subset of the raw data.

**pinki999** · 02-11-2012, 01:19 PM

Hi Nicholas,

The following is an alignment from the bowtie output:
HWI-ST751:96

07KCACXX:5:1101:2896:1974 163 chr1 15755557 255 101M = 15755677 221 TGTGCCTGCCGCCTGCCCTCCACACATCCCTGTCCCCCCAACCCGGGAACCCCTGCCCTCCTCCAGCAGGCCGCACCGCCCCTGGGGCCCCCTGCCAGCCC CCBFFFFFHHHHHJJJJJJJJJJJJJJJJJJIHJJJJJJJJJJIJHHFCDDDDDDDDDDDDDDDDDDDDDDDDDDDDBBDDDBBBDDDDDDDBDDDCCDDD XA:i:0 MD:Z:101 NM:i:0
HWI-ST751:96

07KCACXX:5:1101:2896:1974 83 chr1 15755677 255 101M = 15755557 -221 GCAGAAGAGATAGAATCAGGGCTGCCCCCACAGAGTGGGACCCAAGGGGCTAATTGGAGGCACGAGGGGACCCCTCCCCAGGGCCTTTTCCTCCTCTGCGT ###########@:AA9?9BDB<599BDC>(::5,53<DDDDDDDDDEEEFFFFFHHHHHGIHIIIJJJIGJJJJJJJJJJJJIJJJJJGHHHHFDD?FCCB XA:i:0 MD:Z:101 NM:i:0

If 9th column is the distance between the left position of each reads in the pair then the difference between their POS should be equal to the 9th column right? But here the difference between 15755557 and 15755677 is 120. Am I missing something here?

Thank you

**Nicolas** · 02-11-2012, 02:42 PM

Originally posted by pinki999 View Post

Hi Nicholas,

If 9th column is the distance between the left position of each reads in the pair then the difference between their POS should be equal to the 9th column right? But here the difference between 15755557 and 15755677 is 120. Am I missing something here?

Thank you

Oups, my bad. Field 9 is actually the observed template length, and therefore, in your case, the inner-distance is negative... I never had to run Tophat with a negative inner-distance, but you should try ;-)
If it doesn't work, set it to 0.
I never did a full test of the sensibility of this parameter, I am not sure how important this is.
If anybody knows of such a test, or is willing to try, please let us know...

**kopi-o** · 02-11-2012, 03:01 PM

Say you have an expressed transcript containing two exons separated by a 500-bp intron, and you align to the genome. Your paired-end read then aligns, using BWA or Bowtie, to exon 1 with read 1, and to exon 2 with read 2, with no overlapping of the splice junctions. Now you will estimate the fragment size to be 500 bp longer than it actually is.

I don't know how common this scenario is, but it would mess up the estimate. This would not happen with a transcriptome reference.

**Jon_Keats** · 02-12-2012, 08:39 AM

This is our standard method. Align the first 5 million reads using bwa against a transcriptome reference (ensembl has easy download). Run Picard CalcualteInsertMetrics on the bam file, extract out the insert size and standard deviation values. Subtract the combined read length from the calculated insert size to get the inner-mate distance. The pass that calculated value and the standard deviation to tophat for the alignment. On our end it uses about 20 minutes of computation time for bwa aln, bwa sampe, samtools sam to bam, samtools sort, picard insertmetrics on 5 million reads. We also use just the first million, not much effect of the numbers but if you want a pretty histogram picture it looks better.

**ajgentles** · 02-29-2012, 09:47 PM

unexpectedly small ineset sizes

Thanks for the description on estimating insert-size with bwa and picard. I just tried it with some data we have and am a bit surprised by the results. Bioanalyzer results on the sample suggested that mean RNA fragment length was around 300 bp, and we opted for 2x100 bp paired-end Illumina. Data quality is excellent, but when I follow bwa+picard tools aligning to refseq, it reports a mean insert size of ~160 bp. This seems to be significantly smaller than it should be, and implies that the reads overlap in many cases. The estimated inner mate distance would be negative (160-2*100).

Am I misinterpreting the picard output ? thanks !

**kopi-o** · 03-01-2012, 01:15 AM

Are you sure your fragment sizes from Bioanalyzer don't include adapters? That might explain the discrepancy.

I don't think you are misinterpreting the Picard output - if it gives 160 bp as mean insert size, then on average read 1 and read 2 are overlapping.

**ajgentles** · 03-01-2012, 09:40 AM

Originally posted by kopi-o View Post

Are you sure your fragment sizes from Bioanalyzer don't include adapters? That might explain the discrepancy.

I don't think so - the reported length including the adapters is >400 bp. Given that the reads appear to overlap, it's not clear to me what to tell tophat for the mate pair inner distance. Does it handle negative values ?

**kopi-o** · 03-01-2012, 10:11 AM

Yes, I believe Tophat does handle negative values.

**Jon_Keats** · 03-01-2012, 10:47 PM

The bioanalyzer trace you would get includes the adaptors so your 300bp bioanalyzer result does include the 120bp of adaptors so the average insert would be around 180bp, pretty close to the sequencer output of around 160bp. This is case and point why I do it this way as lab based estimation is often way off. Why trust the lab tech who always reports some generic value when you have millions of data points from which you can defined the true insert size and thus for tophat calculate the inner-distance.

As mention it does take negative values, the caveat being for some applications like hybrid-transcript identification the absence of a gap ruins many of the current methods.

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 7 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 12 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 20 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 54 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

help for tophat -r

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News