SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Reply
 
Thread Tools
Old 12-26-2011, 10:35 PM   #1
AronaldJ
Junior Member
 
Location: whole transcriptome

Join Date: Oct 2010
Posts: 8
Default help for tophat -r

Hello,

I am interested in NCBI GEO "GSM765405:CshlLong_RnaSeq_K562_cell_longPolyA", but there is not fragment length.
a description for "readtype description: Paired 76 n.t. directed reads
readtype: 2x76D
replicate description: tier 1
replicate: 1
Instrument model: Illumina Genome Analyzer II
rnaextract description: Poly(A)+ RNA longer than 200 nt
rnaextract: longPolyA".

I did not know fragment length for tophat software, "tophat -r ."
"-r/--mate-inner-dist <int> This is the expected (mean) inner distance between mate pairs. For, example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired end runs."

Generally speaking, Illumina paired-end fragment is 200bp, I confuse to select -r 50.

Thanks for any comments in advance.
I am looking forward to your letter.

AronaldJ
AronaldJ is offline   Reply With Quote
Old 12-28-2011, 11:42 AM   #2
Nicolas
Member
 
Location: new york city

Join Date: Apr 2009
Posts: 40
Default

Hi,

What people typically do is to map the library (or a subset) to the genome (with Bowtie / BWA), compute an estimate of the fragment size (both mean and Standard Deviation, option --mate-std-dev).
Don't forget to substract the size of the reads to the mean fragment size and use this as your "mate-inner-dist".
A possible command-line to calculate this is: (this assume that the read length is constant)

Code:
samtools view -F 0x4 File.mapped.bam |  awk '{if ($9 >0) {sum+=$9;sumsq+=$9*$9;N+=1}} END {print "mean = " sum/N " SD=" sqrt(sumsq/N - (sum/N)**2)}'
Alternatively, if you can estimate the fragment size during the library preparation, that should also work.

I have no idea how sensitive this parameter is and its effect on the resulting mapping. Anybody has comments on that?

Hope that helps,

Nicolas
Nicolas is offline   Reply With Quote
Old 02-11-2012, 01:38 AM   #3
pinki999
Member
 
Location: Germany

Join Date: Oct 2010
Posts: 37
Default

Hi Nicholas,

When I used your one line command to calculate the mean fragment size, I got mean = approx. 158. I am working on PE data with 2x100bp read length. In this case, does the insert size = 158-200 or 200-158?

Thanks
pinki999 is offline   Reply With Quote
Old 02-11-2012, 01:59 AM   #4
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Remember that for RNA-seq, estimating insert size by mapping to the genome will give misleading results (because of splicing). Perhaps mapping to a transcriptome reference would be better.
kopi-o is offline   Reply With Quote
Old 02-11-2012, 09:00 AM   #5
Nicolas
Member
 
Location: new york city

Join Date: Apr 2009
Posts: 40
Default

Quote:
Originally Posted by pinki999 View Post
Hi Nicholas,

When I used your one line command to calculate the mean fragment size, I got mean = approx. 158. I am working on PE data with 2x100bp read length. In this case, does the insert size = 158-200 or 200-158?

Thanks
Hi,

Using the command-line above will summarize the values in field 9 ($9), which, in the SAM format indicates the distance between the "left" position of each reads in the pair. So, the inner distance required in the -r option by Tophat should be 158 - 100. Substract only once the size of the reads, not twice!

[EDIT] I got it wrong, sorry. Field 9 is the oberved Template length, and therefore the inner-distance is equal to "Template length - 2x size of the reads".

Last edited by Nicolas; 02-11-2012 at 01:44 PM.
Nicolas is offline   Reply With Quote
Old 02-11-2012, 09:02 AM   #6
Nicolas
Member
 
Location: new york city

Join Date: Apr 2009
Posts: 40
Default

Quote:
Originally Posted by kopi-o View Post
Remember that for RNA-seq, estimating insert size by mapping to the genome will give misleading results (because of splicing). Perhaps mapping to a transcriptome reference would be better.
Sure. The point here is to map to the genome to estimate the size of the insert (from full length alignments), and feed this parameter into Tophat, to map to the transcriptome (or at least, to allow splicing inside the reads).

The first step (mapping to the genome) can usually be done on a small subset of the raw data.
Nicolas is offline   Reply With Quote
Old 02-11-2012, 12:19 PM   #7
pinki999
Member
 
Location: Germany

Join Date: Oct 2010
Posts: 37
Default

Hi Nicholas,

The following is an alignment from the bowtie output:
HWI-ST751:9607KCACXX:5:1101:2896:1974 163 chr1 15755557 255 101M = 15755677 221 TGTGCCTGCCGCCTGCCCTCCACACATCCCTGTCCCCCCAACCCGGGAACCCCTGCCCTCCTCCAGCAGGCCGCACCGCCCCTGGGGCCCCCTGCCAGCCC CCBFFFFFHHHHHJJJJJJJJJJJJJJJJJJIHJJJJJJJJJJIJHHFCDDDDDDDDDDDDDDDDDDDDDDDDDDDDBBDDDBBBDDDDDDDBDDDCCDDD XA:i:0 MD:Z:101 NM:i:0
HWI-ST751:9607KCACXX:5:1101:2896:1974 83 chr1 15755677 255 101M = 15755557 -221 GCAGAAGAGATAGAATCAGGGCTGCCCCCACAGAGTGGGACCCAAGGGGCTAATTGGAGGCACGAGGGGACCCCTCCCCAGGGCCTTTTCCTCCTCTGCGT ###########@:AA9?9BDB<599BDC>(::5,53<DDDDDDDDDEEEFFFFFHHHHHGIHIIIJJJIGJJJJJJJJJJJJIJJJJJGHHHHFDD?FCCB XA:i:0 MD:Z:101 NM:i:0

If 9th column is the distance between the left position of each reads in the pair then the difference between their POS should be equal to the 9th column right? But here the difference between 15755557 and 15755677 is 120. Am I missing something here?

Thank you
pinki999 is offline   Reply With Quote
Old 02-11-2012, 01:42 PM   #8
Nicolas
Member
 
Location: new york city

Join Date: Apr 2009
Posts: 40
Default

Quote:
Originally Posted by pinki999 View Post
Hi Nicholas,

If 9th column is the distance between the left position of each reads in the pair then the difference between their POS should be equal to the 9th column right? But here the difference between 15755557 and 15755677 is 120. Am I missing something here?

Thank you

Oups, my bad. Field 9 is actually the observed template length, and therefore, in your case, the inner-distance is negative... I never had to run Tophat with a negative inner-distance, but you should try ;-)
If it doesn't work, set it to 0.
I never did a full test of the sensibility of this parameter, I am not sure how important this is.
If anybody knows of such a test, or is willing to try, please let us know...
Nicolas is offline   Reply With Quote
Old 02-11-2012, 02:01 PM   #9
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Say you have an expressed transcript containing two exons separated by a 500-bp intron, and you align to the genome. Your paired-end read then aligns, using BWA or Bowtie, to exon 1 with read 1, and to exon 2 with read 2, with no overlapping of the splice junctions. Now you will estimate the fragment size to be 500 bp longer than it actually is.

I don't know how common this scenario is, but it would mess up the estimate. This would not happen with a transcriptome reference.
kopi-o is offline   Reply With Quote
Old 02-12-2012, 07:39 AM   #10
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

This is our standard method. Align the first 5 million reads using bwa against a transcriptome reference (ensembl has easy download). Run Picard CalcualteInsertMetrics on the bam file, extract out the insert size and standard deviation values. Subtract the combined read length from the calculated insert size to get the inner-mate distance. The pass that calculated value and the standard deviation to tophat for the alignment. On our end it uses about 20 minutes of computation time for bwa aln, bwa sampe, samtools sam to bam, samtools sort, picard insertmetrics on 5 million reads. We also use just the first million, not much effect of the numbers but if you want a pretty histogram picture it looks better.
Jon_Keats is offline   Reply With Quote
Old 02-29-2012, 08:47 PM   #11
ajgentles
Junior Member
 
Location: Planet Earth

Join Date: Jun 2011
Posts: 4
Default unexpectedly small ineset sizes

Thanks for the description on estimating insert-size with bwa and picard. I just tried it with some data we have and am a bit surprised by the results. Bioanalyzer results on the sample suggested that mean RNA fragment length was around 300 bp, and we opted for 2x100 bp paired-end Illumina. Data quality is excellent, but when I follow bwa+picard tools aligning to refseq, it reports a mean insert size of ~160 bp. This seems to be significantly smaller than it should be, and implies that the reads overlap in many cases. The estimated inner mate distance would be negative (160-2*100).

Am I misinterpreting the picard output ? thanks !
ajgentles is offline   Reply With Quote
Old 03-01-2012, 12:15 AM   #12
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Are you sure your fragment sizes from Bioanalyzer don't include adapters? That might explain the discrepancy.

I don't think you are misinterpreting the Picard output - if it gives 160 bp as mean insert size, then on average read 1 and read 2 are overlapping.
kopi-o is offline   Reply With Quote
Old 03-01-2012, 08:40 AM   #13
ajgentles
Junior Member
 
Location: Planet Earth

Join Date: Jun 2011
Posts: 4
Default

Quote:
Originally Posted by kopi-o View Post
Are you sure your fragment sizes from Bioanalyzer don't include adapters? That might explain the discrepancy.
I don't think so - the reported length including the adapters is >400 bp. Given that the reads appear to overlap, it's not clear to me what to tell tophat for the mate pair inner distance. Does it handle negative values ?
ajgentles is offline   Reply With Quote
Old 03-01-2012, 09:11 AM   #14
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Yes, I believe Tophat does handle negative values.
kopi-o is offline   Reply With Quote
Old 03-01-2012, 09:47 PM   #15
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

The bioanalyzer trace you would get includes the adaptors so your 300bp bioanalyzer result does include the 120bp of adaptors so the average insert would be around 180bp, pretty close to the sequencer output of around 160bp. This is case and point why I do it this way as lab based estimation is often way off. Why trust the lab tech who always reports some generic value when you have millions of data points from which you can defined the true insert size and thus for tophat calculate the inner-distance.

As mention it does take negative values, the caveat being for some applications like hybrid-transcript identification the absence of a gap ruins many of the current methods.
Jon_Keats is offline   Reply With Quote
Old 03-23-2012, 05:20 AM   #16
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

Hello all,

I have similar problems as I am currently determining my inner distance between the paired end reads and have received much help from this forum thread.

My reads are Illumina HiSeq RNA-seq paired end reads. From the Lab I received the information that the inner distance between the paired end reads is between 100-150 bp as the reads are 100 bp long and size selected for 300-350 bp.

I have mapped some reads in TopHat both with mean inner distance 125 and 150 and SD 25 and 50. I got fairly good results in my opinion. In one sample with 35 million reads, and mean inner distance of 150, 27 million were mapped and out of these, 90% were properly paired. When I used -r 125, 26 million were mapped (88% properly paired).

However, I was uncertain about the inner distance so I followed the advice given in this thread and elsewhere. So I mapped some samples with Bowtie to the transcriptome, got sam-files, used samtools to convert to bam and sort. I used both the samtools script given above and the Picard tools InsertSizeMetrics, which resulted in Mean insert size= 150 and SD= ~30. This would mean that my mean inner distance is 150-2*100 = -50. When looking manually at the coordinates in the Bowtie-produced bam-file, the paired reads give coordinates differing from each other ranging from 15-115 bp. This also points to a mean inner distance of -50, since they seem to be overlapping somewhat.

All is well so far, but when I tried mapping some samples with my newly found mean inner distance of -50, the number of reads that mapped decreased in one sample from 2.5 million to 2.3 million and the above sample from 26-27 million to 26 million. Not much but the biggest difference was in the flagstat output properly paired reads. Before it was 90%, but now with -50 mean inner distance this number was only 57%!

To conclude what I just said:
- Lab technician reports a inner distance between paired ends to 100-150.
- Picard and samtools workflow with Bowtie mapping to transcriptome reports distance as -50. But different SD values? Which is correct?
- When mapping with -r -50 instead of 150 I get slightly fewer reads mapped and a major decrease in properly paired reads. (SD=30 given the Picard output)


This has made me confused and I appreciate any help and tips I can get as to why I get these results and what the true mean inner distance and standard deviation is.

Last edited by glados; 03-27-2012 at 05:10 AM.
glados is offline   Reply With Quote
Old 03-23-2012, 03:15 PM   #17
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Couple of questions:

1) Which version of tophat are you using?

2) To calculate mapped reads from the tophat created BAM are you using samtools flagstat or Picard collect alignment metrics? If the former us the latter.
Jon_Keats is offline   Reply With Quote
Old 03-23-2012, 04:08 PM   #18
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

Quote:
Originally Posted by Jon_Keats View Post
Couple of questions:

1) Which version of tophat are you using?

2) To calculate mapped reads from the tophat created BAM are you using samtools flagstat or Picard collect alignment metrics? If the former us the latter.
1) I am using the latest tophat version, I think it's the 1.4.1. (am on a different computer at the moment)
2) I have used both but my summary on properly paired reads are from samtools flagstat. If I remember correctly, Picard's CollectAlignmentSummaryMetrics does not output the number of properly paired reads. Though I am still uncertain what this says and if it is important.

I wonder if it has something to do with the difference between mate pairs and paired end reads? My data is paired end, but tophat doesn't seem to distinguish between these two types. I am also amazed that tophat managed to map so well with -r 150 if the true -r is -50.

edit: I have also visualized the bam files in a genome browser and the pairs are overlapping each other very much, even though they were mapped with -r 150, so the -50 seems appropriate. Still unsure about the decrease on properly paired reads though.

edit2: Seems like this guy Jim has similar results with slightly more reads mapped and higher % properly paired reads when increasing his mean inner distance to higher than it should be. Perhaps it has something to do with multiple mappings? I have the --max-multihits option set to 10 (default is 20).

Last edited by glados; 03-27-2012 at 05:10 AM.
glados is offline   Reply With Quote
Old 03-24-2012, 10:15 PM   #19
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Just as shown in the link you provided have you looked to see if the "unique" mapping events are inconsistent? Samtools only reports the total alignment events while picard provides the unique alignment events. Alternatively, follow the links manual method "cut | sort | uniq" to get these statistics.

Are you using a GTF to do the phase I transcriptome alignment supported in tophat 1.4.x?
Jon_Keats is offline   Reply With Quote
Old 03-25-2012, 01:40 AM   #20
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

glados: I would also very much like to know why you are seeing the results that you see. The -r value of -50 is almost (but not quite) consistent with the info you are getting from the lab minus the length of the adapters. But the puzzling thing is, like you said, that you are seeing a larger percentage of properly paired reads with the larger r value.

Also, how does the -r value change the number of mapped reads? I had, perhaps na´vely, thought that the mapping per se is not affected, just the properly paired flags.
kopi-o is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:24 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO