Seqanswers Leaderboard Ad

**glados** · 03-23-2012, 05:20 AM

Hello all,

I have similar problems as I am currently determining my inner distance between the paired end reads and have received much help from this forum thread.

My reads are Illumina HiSeq RNA-seq paired end reads. From the Lab I received the information that the inner distance between the paired end reads is between 100-150 bp as the reads are 100 bp long and size selected for 300-350 bp.

I have mapped some reads in TopHat both with mean inner distance 125 and 150 and SD 25 and 50. I got fairly good results in my opinion. In one sample with 35 million reads, and mean inner distance of 150, 27 million were mapped and out of these, 90% were properly paired. When I used -r 125, 26 million were mapped (88% properly paired).

However, I was uncertain about the inner distance so I followed the advice given in this thread and elsewhere. So I mapped some samples with Bowtie to the transcriptome, got sam-files, used samtools to convert to bam and sort. I used both the samtools script given above and the Picard tools InsertSizeMetrics, which resulted in Mean insert size= 150 and SD= ~30. This would mean that my mean inner distance is 150-2*100 = -50. When looking manually at the coordinates in the Bowtie-produced bam-file, the paired reads give coordinates differing from each other ranging from 15-115 bp. This also points to a mean inner distance of -50, since they seem to be overlapping somewhat.

All is well so far, but when I tried mapping some samples with my newly found mean inner distance of -50, the number of reads that mapped decreased in one sample from 2.5 million to 2.3 million and the above sample from 26-27 million to 26 million. Not much but the biggest difference was in the flagstat output properly paired reads. Before it was 90%, but now with -50 mean inner distance this number was only 57%!

To conclude what I just said:
- Lab technician reports a inner distance between paired ends to 100-150.
- Picard and samtools workflow with Bowtie mapping to transcriptome reports distance as -50. But different SD values? Which is correct?
- When mapping with -r -50 instead of 150 I get slightly fewer reads mapped and a major decrease in properly paired reads. (SD=30 given the Picard output)

This has made me confused and I appreciate any help and tips I can get as to why I get these results and what the true mean inner distance and standard deviation is.

**Jon_Keats** · 03-23-2012, 03:15 PM

Couple of questions:

1) Which version of tophat are you using?

2) To calculate mapped reads from the tophat created BAM are you using samtools flagstat or Picard collect alignment metrics? If the former us the latter.

**glados** · 03-23-2012, 04:08 PM

Originally posted by Jon_Keats View Post

Couple of questions:

1) Which version of tophat are you using?

2) To calculate mapped reads from the tophat created BAM are you using samtools flagstat or Picard collect alignment metrics? If the former us the latter.

1) I am using the latest tophat version, I think it's the 1.4.1. (am on a different computer at the moment)
2) I have used both but my summary on properly paired reads are from samtools flagstat. If I remember correctly, Picard's CollectAlignmentSummaryMetrics does not output the number of properly paired reads. Though I am still uncertain what this says and if it is important.

I wonder if it has something to do with the difference between mate pairs and paired end reads? My data is paired end, but tophat doesn't seem to distinguish between these two types. I am also amazed that tophat managed to map so well with -r 150 if the true -r is -50.

edit: I have also visualized the bam files in a genome browser and the pairs are overlapping each other very much, even though they were mapped with -r 150, so the -50 seems appropriate. Still unsure about the decrease on properly paired reads though.

edit2: Seems like this guy Jim has similar results with slightly more reads mapped and higher % properly paired reads when increasing his mean inner distance to higher than it should be. Perhaps it has something to do with multiple mappings? I have the --max-multihits option set to 10 (default is 20).

**Jon_Keats** · 03-24-2012, 10:15 PM

Just as shown in the link you provided have you looked to see if the "unique" mapping events are inconsistent? Samtools only reports the total alignment events while picard provides the unique alignment events. Alternatively, follow the links manual method "cut | sort | uniq" to get these statistics.

Are you using a GTF to do the phase I transcriptome alignment supported in tophat 1.4.x?

**kopi-o** · 03-25-2012, 01:40 AM

glados: I would also very much like to know why you are seeing the results that you see. The -r value of -50 is almost (but not quite) consistent with the info you are getting from the lab minus the length of the adapters. But the puzzling thing is, like you said, that you are seeing a larger percentage of properly paired reads with the larger r value.

Also, how does the -r value change the number of mapped reads? I had, perhaps naïvely, thought that the mapping per se is not affected, just the properly paired flags.

**glados** · 03-26-2012, 05:29 AM

Thank you both for your thoughts and questions.
Jon Keats, tophat version was indeed 1.4.1. I have not used the GTF option in tophats. Would that be a good idea? Do you recommend it?

I got an email from the lab also. It seems like I was correct and they had forgot about the length of the adaptors.. Even more incentive to anyone else that reads this to calculate your paired read inner distance based on your reads, and don't trust the lab's information.

Here's a more detailed summary of one of my samples when mapping with different --mate-inner-dist and --mate-std-dev parameters (all other parameters are the same).

Mean=150, SD=50
26178262 mapped (100.00%:-nan%)
23314350 + 0 properly paired (89.06%:-nan%)

Mean=125, SD=25
26178201 + 0 mapped (100.00%:-nan%)
23068014 + 0 properly paired (88.12%:-nan%)

Mean=-50, SD=30
26175715 + 0 mapped (100.00%:-nan%)
15012438 + 0 properly paired (57.35%:-nan%)

As you can see, very little change in number of mapped reads but drastic decrease in properly paired reads.

This below is from the Picard tools MarkDuplicates. These numbers are only from the mapping with Mean=150 because the 125 and -50 gives very very similar numbers (almost identical percent_duplication numbers).

UNPAIRED_READS_EXAMINED = 2062968
READ_PAIRS_EXAMINED = 11419719
UNMAPPED_READS = 0
UNPAIRED_READ_DUPLICATES = 1511616
READ_PAIR_DUPLICATES = 2358529
READ_PAIR_OPTICAL_DUPLICATES = 1325726
PERCENT_DUPLICATION = 0.250123
ESTIMATED_LIBRARY_SIZE = 45900892

Am I right in interpreting the field PERCENT_DUPLICATION meaning that 0.25% of the reads are duplicates? Since all three mappings gave similar numbers of duplicates I don't think this is the reason why the number of properly paired reads decreases. What I'm now trying to figure out is how tophat determines when a read is properly paired or not, but it's not very easy to find this information.

Another question I have that you might be able to answer so I can start mapping with the correct inner distance. I subtracted the read lengths and the adaptors from the size selected fragments and got closer to the number -50 but not quite. Is there anything more that you should subtract besides both adaptor lengths? barcodes?

If you have any more ideas or suggestions, please let me know. I will keep you updated if I find out more.

**kopi-o** · 03-26-2012, 07:30 AM

Is there anything more that you should subtract besides both adaptor lengths? barcodes?

If the samples were prepped using the TruSeq kit (were they?) then the adapters should be 121 bp or something like that in total, and the barcode is included in that. I don't know about other kits.

I guess I have previously assumed that the properly paired flag is set when a pair of reads map within (mate-inner-dist +- mate-std-dev) of each other, but now I don't know.

**glados** · 03-26-2012, 08:01 AM

Yes the adaptors are 121 bp total. I also got additional information from the lab that the size selection was on fragments 280-350 bp. So if 280 - 2*100 - 121 = -41

Perhaps tophat doesn't flag the pairs as properly paired when given a negative mean inner distance value?

**icebsd** · 03-27-2012, 05:00 AM

Hi galdos,

I'm just having the same problem.

In my experiment, the fragment size is around 260bp, and the read length is 90PE.

At first my calculation was wrong. I thought the inner distance should be around 100. After that I realized the correct mean inner distance should be 260 - 121 - 90*2 ~= -40, so I run tophat again.

When I compared the results by "samtools flagstat", the proper aligned paired-ends was about 97% for inner distance 100, and about 50% for inner distance -40.

Then I compared the sam files generated by tophat (converted from bam file), I found that for most of the reads, the alignment were the same, only the flags were different. If I remember correctly, the flags generated with -r 100 were 2 more than the flags generated with -r -40, for example, for the same read, the flag is 83 for -r 100, and 81 for -r -40. According to samtools definition, a flag of 0x2 means "each segment properly aligned according to the aligner".

I'm still trying to figure our how tophat works with negative inner distance.

Originally posted by glados View Post

Thank you both for your thoughts and questions.
Jon Keats, tophat version was indeed 1.4.1. I have not used the GTF option in tophats. Would that be a good idea? Do you recommend it?

I got an email from the lab also. It seems like I was correct and they had forgot about the length of the adaptors.. Even more incentive to anyone else that reads this to calculate your paired read inner distance based on your reads, and don't trust the lab's information.

Here's a more detailed summary of one of my samples when mapping with different --mate-inner-dist and --mate-std-dev parameters (all other parameters are the same).

As you can see, very little change in number of mapped reads but drastic decrease in properly paired reads.

This below is from the Picard tools MarkDuplicates. These numbers are only from the mapping with Mean=150 because the 125 and -50 gives very very similar numbers (almost identical percent_duplication numbers).

Am I right in interpreting the field PERCENT_DUPLICATION meaning that 0.25% of the reads are duplicates? Since all three mappings gave similar numbers of duplicates I don't think this is the reason why the number of properly paired reads decreases. What I'm now trying to figure out is how tophat determines when a read is properly paired or not, but it's not very easy to find this information.

Another question I have that you might be able to answer so I can start mapping with the correct inner distance. I subtracted the read lengths and the adaptors from the size selected fragments and got closer to the number -50 but not quite. Is there anything more that you should subtract besides both adaptor lengths? barcodes?

If you have any more ideas or suggestions, please let me know. I will keep you updated if I find out more.

**Jon_Keats** · 03-27-2012, 01:27 PM

If you are using human data I'd recommend using the GTF option in general.

From your mark duplicates table you have 24,902,406 unique alignment events [(READ_PAIRS_EXAMINED x 2) + UNPAIRED_READS_EXAMINED] in the -r 150 test. Look to see if that number is also changing as you modify -r

The percent duplicates you are seeing is fairly good or typical of illumina mRNAseq libraries with > 20 million fragments.

**icebsd** · 03-29-2012, 10:18 PM

I did a comparison of a sample dataset with different inner distances.

The mean library size is 260, and reads are 90PE, so the theoretical inner distance should be around -40.
--------------------------------------------------------
dist: 150
--------------------------------------------------------
49031 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
49031 + 0 mapped (100.00%:-nan%)
49031 + 0 paired in sequencing
24512 + 0 read1
24519 + 0 read2
48236 + 0 properly paired (98.38%:-nan%)
48420 + 0 with itself and mate mapped
611 + 0 singletons (1.25%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
--------------------------------------------------------
dist: 100
--------------------------------------------------------
49029 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
49029 + 0 mapped (100.00%:-nan%)
49029 + 0 paired in sequencing
24511 + 0 read1
24518 + 0 read2
47858 + 0 properly paired (97.61%:-nan%)
48418 + 0 with itself and mate mapped
611 + 0 singletons (1.25%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
--------------------------------------------------------
dist: -40
--------------------------------------------------------
49037 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
49037 + 0 mapped (100.00%:-nan%)
49037 + 0 paired in sequencing
24515 + 0 read1
24522 + 0 read2
38204 + 0 properly paired (77.91%:-nan%)
48426 + 0 with itself and mate mapped
611 + 0 singletons (1.25%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
--------------------------------------------------------

It seems that there is trend, the longer the inner distance is, the higher the rate of properly paired will be.

**glados** · 03-30-2012, 12:29 AM

Indeed icebsd, it seems like that.
I got a response from Cole Trapnell earlier where he said that I shouldn't worry too much about the negative inner distance since both tophat and cufflinks are able to handle that well. I asked about the properly pairing rules as discussed in this thread, and he said he would forward this question to Daehwan as he has been working on that tophat code recently. I have not gotten a response from Daehwan, if I do I will update in this thread. Perhaps he is willing to comment himself in this thread.

edit: Jon Keats: The [(READ_PAIRS_EXAMINED x 2) + UNPAIRED_READS_EXAMINED] are almost identical for the three different mappings. It is less than 20 reads difference.

**Jon_Keats** · 04-01-2012, 09:02 PM

Okay so this is an important point. You need to watch the number of uniquely mapped reads NOT the total number of reads mapped, at least in my opinion (To do so use Picard CollectAlignmentMetrics over Samtools Flagstat). Second watch the percentage of aligned reads which are unique events. What I think you are seeing is that as you alter the -r value you are increasing the number of non-unique mappings (ie. reads that map to multiple locations, tophat will print the same read multiple times in the bam file) while not altering the number of uniquely mapped reads. At the end of the day you want to limit the number of multiple mapped reads as these reads will contaminate your gene expression estimates.

**glados** · 04-03-2012, 02:22 AM

Jon Keats, as I mention earlier the number of duplicates is approximately the same in the three different mappings. I have used Picards CollectAlignmentSummaryMetrics in addition to the MarkDuplicates and it does not give much information, mostly zeros in every column (which I have noticed to be the case for other users as well, so it seems to be in order). I can extract one column for you to see:

Code:

TOTAL_READS
[U]mean150[/U]	
PAIR	24902406
[U]mean125	[/U]
PAIR	24902400
[U]mean-50	[/U]
PAIR	24902359

Here's from the MarkDuplicatesMetrics:

Code:

[U]mean150[/U]
Unpaired_read_duplicates	read_pair_duplicates	read_pair_optical_duplicates	percent_duplication
1511616	2358529	1325726	0.250123			
[U]mean125	[/U]		
unpaired_read_duplicates	read_pair_duplicates	read_pair_optical_duplicates	percent_duplication
1511397	2358693	1325905	0.250128
[U]mean-50	[/U]				
unpaired_read_duplicates	read_pair_duplicates	read_pair_optical_duplicates	percent_duplication
1511513	2358631	1325846	0.250128

I still think that TopHat flags more reads as properly paired when given a higher mean inner distance, though I'm unsure of why that is. It would be best if someone from the TopHat team could give us clarity as to why, and in the meantime I think it's a good idea not to trust the reads flagged as properly paired too much.

**vkartha** · 04-07-2012, 07:11 PM

transcriptome mapping

Originally posted by Nicolas View Post

Hi,

What people typically do is to map the library (or a subset) to the genome (with Bowtie / BWA), compute an estimate of the fragment size (both mean and Standard Deviation, option --mate-std-dev).
Don't forget to substract the size of the reads to the mean fragment size and use this as your "mate-inner-dist".
A possible command-line to calculate this is: (this assume that the read length is constant)

Code:

samtools view -F 0x4 File.mapped.bam |  awk '{if ($9 >0) {sum+=$9;sumsq+=$9*$9;N+=1}} END {print "mean = " sum/N " SD=" sqrt(sumsq/N - (sum/N)**2)}'

Alternatively, if you can estimate the fragment size during the library preparation, that should also work.

I have no idea how sensitive this parameter is and its effect on the resulting mapping. Anybody has comments on that?

Hope that helps,

Nicolas

Hi Nicolas, I need to do this step, but I was wondering what the command for using 2 paired end read files (sample_R1.fastq and sample_R2.fastq) to map to a reference transcriptome would be?
How would we go about mapping a few (as you said a million or so) reads to a ref transcriptome?

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News