SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Reply
 
Thread Tools
Old 03-26-2012, 06:29 AM   #21
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

Thank you both for your thoughts and questions.
Jon Keats, tophat version was indeed 1.4.1. I have not used the GTF option in tophats. Would that be a good idea? Do you recommend it?

I got an email from the lab also. It seems like I was correct and they had forgot about the length of the adaptors.. Even more incentive to anyone else that reads this to calculate your paired read inner distance based on your reads, and don't trust the lab's information.

Here's a more detailed summary of one of my samples when mapping with different --mate-inner-dist and --mate-std-dev parameters (all other parameters are the same).

Quote:
Mean=150, SD=50
26178262 mapped (100.00%:-nan%)
23314350 + 0 properly paired (89.06%:-nan%)

Mean=125, SD=25
26178201 + 0 mapped (100.00%:-nan%)
23068014 + 0 properly paired (88.12%:-nan%)

Mean=-50, SD=30
26175715 + 0 mapped (100.00%:-nan%)
15012438 + 0 properly paired (57.35%:-nan%)
As you can see, very little change in number of mapped reads but drastic decrease in properly paired reads.

This below is from the Picard tools MarkDuplicates. These numbers are only from the mapping with Mean=150 because the 125 and -50 gives very very similar numbers (almost identical percent_duplication numbers).
Quote:
UNPAIRED_READS_EXAMINED = 2062968
READ_PAIRS_EXAMINED = 11419719
UNMAPPED_READS = 0
UNPAIRED_READ_DUPLICATES = 1511616
READ_PAIR_DUPLICATES = 2358529
READ_PAIR_OPTICAL_DUPLICATES = 1325726
PERCENT_DUPLICATION = 0.250123
ESTIMATED_LIBRARY_SIZE = 45900892
Am I right in interpreting the field PERCENT_DUPLICATION meaning that 0.25% of the reads are duplicates? Since all three mappings gave similar numbers of duplicates I don't think this is the reason why the number of properly paired reads decreases. What I'm now trying to figure out is how tophat determines when a read is properly paired or not, but it's not very easy to find this information.

Another question I have that you might be able to answer so I can start mapping with the correct inner distance. I subtracted the read lengths and the adaptors from the size selected fragments and got closer to the number -50 but not quite. Is there anything more that you should subtract besides both adaptor lengths? barcodes?

If you have any more ideas or suggestions, please let me know. I will keep you updated if I find out more.

Last edited by glados; 03-26-2012 at 07:50 AM.
glados is offline   Reply With Quote
Old 03-26-2012, 08:30 AM   #22
kopi-o
Senior Member
 
Location: Stockholm, Sweden

Join Date: Feb 2008
Posts: 319
Default

Quote:
Is there anything more that you should subtract besides both adaptor lengths? barcodes?
If the samples were prepped using the TruSeq kit (were they?) then the adapters should be 121 bp or something like that in total, and the barcode is included in that. I don't know about other kits.

I guess I have previously assumed that the properly paired flag is set when a pair of reads map within (mate-inner-dist +- mate-std-dev) of each other, but now I don't know.
kopi-o is offline   Reply With Quote
Old 03-26-2012, 09:01 AM   #23
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

Yes the adaptors are 121 bp total. I also got additional information from the lab that the size selection was on fragments 280-350 bp. So if 280 - 2*100 - 121 = -41

Perhaps tophat doesn't flag the pairs as properly paired when given a negative mean inner distance value?

Last edited by glados; 03-27-2012 at 06:07 AM.
glados is offline   Reply With Quote
Old 03-27-2012, 06:00 AM   #24
icebsd
Junior Member
 
Location: ap

Join Date: Jul 2009
Posts: 3
Default

Hi galdos,

I'm just having the same problem.

In my experiment, the fragment size is around 260bp, and the read length is 90PE.

At first my calculation was wrong. I thought the inner distance should be around 100. After that I realized the correct mean inner distance should be 260 - 121 - 90*2 ~= -40, so I run tophat again.

When I compared the results by "samtools flagstat", the proper aligned paired-ends was about 97% for inner distance 100, and about 50% for inner distance -40.

Then I compared the sam files generated by tophat (converted from bam file), I found that for most of the reads, the alignment were the same, only the flags were different. If I remember correctly, the flags generated with -r 100 were 2 more than the flags generated with -r -40, for example, for the same read, the flag is 83 for -r 100, and 81 for -r -40. According to samtools definition, a flag of 0x2 means "each segment properly aligned according to the aligner".

I'm still trying to figure our how tophat works with negative inner distance.



Quote:
Originally Posted by glados View Post
Thank you both for your thoughts and questions.
Jon Keats, tophat version was indeed 1.4.1. I have not used the GTF option in tophats. Would that be a good idea? Do you recommend it?

I got an email from the lab also. It seems like I was correct and they had forgot about the length of the adaptors.. Even more incentive to anyone else that reads this to calculate your paired read inner distance based on your reads, and don't trust the lab's information.

Here's a more detailed summary of one of my samples when mapping with different --mate-inner-dist and --mate-std-dev parameters (all other parameters are the same).



As you can see, very little change in number of mapped reads but drastic decrease in properly paired reads.

This below is from the Picard tools MarkDuplicates. These numbers are only from the mapping with Mean=150 because the 125 and -50 gives very very similar numbers (almost identical percent_duplication numbers).


Am I right in interpreting the field PERCENT_DUPLICATION meaning that 0.25% of the reads are duplicates? Since all three mappings gave similar numbers of duplicates I don't think this is the reason why the number of properly paired reads decreases. What I'm now trying to figure out is how tophat determines when a read is properly paired or not, but it's not very easy to find this information.

Another question I have that you might be able to answer so I can start mapping with the correct inner distance. I subtracted the read lengths and the adaptors from the size selected fragments and got closer to the number -50 but not quite. Is there anything more that you should subtract besides both adaptor lengths? barcodes?

If you have any more ideas or suggestions, please let me know. I will keep you updated if I find out more.
icebsd is offline   Reply With Quote
Old 03-27-2012, 02:27 PM   #25
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

If you are using human data I'd recommend using the GTF option in general.

From your mark duplicates table you have 24,902,406 unique alignment events [(READ_PAIRS_EXAMINED x 2) + UNPAIRED_READS_EXAMINED] in the -r 150 test. Look to see if that number is also changing as you modify -r

The percent duplicates you are seeing is fairly good or typical of illumina mRNAseq libraries with > 20 million fragments.
Jon_Keats is offline   Reply With Quote
Old 03-29-2012, 11:18 PM   #26
icebsd
Junior Member
 
Location: ap

Join Date: Jul 2009
Posts: 3
Default

I did a comparison of a sample dataset with different inner distances.

The mean library size is 260, and reads are 90PE, so the theoretical inner distance should be around -40.
--------------------------------------------------------
dist: 150
--------------------------------------------------------
49031 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
49031 + 0 mapped (100.00%:-nan%)
49031 + 0 paired in sequencing
24512 + 0 read1
24519 + 0 read2
48236 + 0 properly paired (98.38%:-nan%)
48420 + 0 with itself and mate mapped
611 + 0 singletons (1.25%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
--------------------------------------------------------
dist: 100
--------------------------------------------------------
49029 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
49029 + 0 mapped (100.00%:-nan%)
49029 + 0 paired in sequencing
24511 + 0 read1
24518 + 0 read2
47858 + 0 properly paired (97.61%:-nan%)
48418 + 0 with itself and mate mapped
611 + 0 singletons (1.25%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
--------------------------------------------------------
dist: -40
--------------------------------------------------------
49037 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
49037 + 0 mapped (100.00%:-nan%)
49037 + 0 paired in sequencing
24515 + 0 read1
24522 + 0 read2
38204 + 0 properly paired (77.91%:-nan%)
48426 + 0 with itself and mate mapped
611 + 0 singletons (1.25%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
--------------------------------------------------------

It seems that there is trend, the longer the inner distance is, the higher the rate of properly paired will be.
icebsd is offline   Reply With Quote
Old 03-30-2012, 01:29 AM   #27
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

Indeed icebsd, it seems like that.
I got a response from Cole Trapnell earlier where he said that I shouldn't worry too much about the negative inner distance since both tophat and cufflinks are able to handle that well. I asked about the properly pairing rules as discussed in this thread, and he said he would forward this question to Daehwan as he has been working on that tophat code recently. I have not gotten a response from Daehwan, if I do I will update in this thread. Perhaps he is willing to comment himself in this thread.

edit: Jon Keats: The [(READ_PAIRS_EXAMINED x 2) + UNPAIRED_READS_EXAMINED] are almost identical for the three different mappings. It is less than 20 reads difference.

Last edited by glados; 03-30-2012 at 01:52 AM.
glados is offline   Reply With Quote
Old 04-01-2012, 10:02 PM   #28
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Okay so this is an important point. You need to watch the number of uniquely mapped reads NOT the total number of reads mapped, at least in my opinion (To do so use Picard CollectAlignmentMetrics over Samtools Flagstat). Second watch the percentage of aligned reads which are unique events. What I think you are seeing is that as you alter the -r value you are increasing the number of non-unique mappings (ie. reads that map to multiple locations, tophat will print the same read multiple times in the bam file) while not altering the number of uniquely mapped reads. At the end of the day you want to limit the number of multiple mapped reads as these reads will contaminate your gene expression estimates.
Jon_Keats is offline   Reply With Quote
Old 04-03-2012, 03:22 AM   #29
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

Jon Keats, as I mention earlier the number of duplicates is approximately the same in the three different mappings. I have used Picards CollectAlignmentSummaryMetrics in addition to the MarkDuplicates and it does not give much information, mostly zeros in every column (which I have noticed to be the case for other users as well, so it seems to be in order). I can extract one column for you to see:

Code:
TOTAL_READS
mean150	
PAIR	24902406
mean125	
PAIR	24902400
mean-50	
PAIR	24902359
Here's from the MarkDuplicatesMetrics:
Code:
mean150
Unpaired_read_duplicates	read_pair_duplicates	read_pair_optical_duplicates	percent_duplication
1511616	2358529	1325726	0.250123			
mean125			
unpaired_read_duplicates	read_pair_duplicates	read_pair_optical_duplicates	percent_duplication
1511397	2358693	1325905	0.250128
mean-50					
unpaired_read_duplicates	read_pair_duplicates	read_pair_optical_duplicates	percent_duplication
1511513	2358631	1325846	0.250128
I still think that TopHat flags more reads as properly paired when given a higher mean inner distance, though I'm unsure of why that is. It would be best if someone from the TopHat team could give us clarity as to why, and in the meantime I think it's a good idea not to trust the reads flagged as properly paired too much.
glados is offline   Reply With Quote
Old 04-07-2012, 08:11 PM   #30
vkartha
Member
 
Location: Boston

Join Date: Feb 2012
Posts: 28
Default transcriptome mapping

Quote:
Originally Posted by Nicolas View Post
Hi,

What people typically do is to map the library (or a subset) to the genome (with Bowtie / BWA), compute an estimate of the fragment size (both mean and Standard Deviation, option --mate-std-dev).
Don't forget to substract the size of the reads to the mean fragment size and use this as your "mate-inner-dist".
A possible command-line to calculate this is: (this assume that the read length is constant)

Code:
samtools view -F 0x4 File.mapped.bam |  awk '{if ($9 >0) {sum+=$9;sumsq+=$9*$9;N+=1}} END {print "mean = " sum/N " SD=" sqrt(sumsq/N - (sum/N)**2)}'
Alternatively, if you can estimate the fragment size during the library preparation, that should also work.

I have no idea how sensitive this parameter is and its effect on the resulting mapping. Anybody has comments on that?

Hope that helps,

Nicolas
Hi Nicolas, I need to do this step, but I was wondering what the command for using 2 paired end read files (sample_R1.fastq and sample_R2.fastq) to map to a reference transcriptome would be?
How would we go about mapping a few (as you said a million or so) reads to a ref transcriptome?
vkartha is offline   Reply With Quote
Old 04-10-2012, 10:55 AM   #31
Nicolas
Member
 
Location: new york city

Join Date: Apr 2009
Posts: 40
Default

Hi,
It would work the same if your File.mapped.bam has been aligned to a transcriptome index.

By the way, I have a general question related to this -r parameter:
Did you guys tried to omit this parameter (or as I did, accidentally forgot to put it)? I know the manual says "There is no default, and this parameter is required for paired end runs.", but it does work quite well if you forget it!
I tested with and without the option, on a same dataset and the difference was very marginal: 1 junction was not found without the parameter (out of >100,000) and 2000 alignments were missing (out of >700,000).
I have no idea what the default value is. The accepted_hits.bam file is well formatted and the reads are properly paired.
Similarly, I always wanted to test a bunch of values and see how sensible this parameter is. My assumption is "not very much", but I may be wrong. Did anybody tried?

Please let me know if I am the only one to get to work without the -r parameter.
Nicolas is offline   Reply With Quote
Old 04-12-2012, 03:50 AM   #32
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

Hi Nicolas.

TopHat doesn't have much need for the mean inner distance nowadays, and will even less in the future, according to Cole Trapnell. If you don't specify -r with a value it will default to 50 as it says in the release notes from an earlier version, they just haven't updated the manual yet.
glados is offline   Reply With Quote
Old 04-17-2012, 08:03 AM   #33
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

I looked again at the Picard MarkDuplicates output and I think I may have interpreted the percent_duplication=0.250123 wrong. Perhaps they mean it is as high as 25%, not 0.25% as I initially thought. But when reading about what both samtools rmdup and Picardtools MarkDuplicates actually does: Marking reads as duplicates if they have the same 5' coordinates, then it doesn't seem strange that so many reads were marked as duplicates, considering that my inner distance is -50 and my reads are very overlapping so sometimes the reads will start on the same coordinate.
glados is offline   Reply With Quote
Old 04-17-2012, 08:30 AM   #34
swebb
Junior Member
 
Location: Edinburgh

Join Date: Jun 2009
Posts: 7
Default

[QUOTE=Jon_Keats;68654]Couple of questions:

Last edited by swebb; 04-17-2012 at 08:32 AM.
swebb is offline   Reply With Quote
Old 07-09-2012, 07:30 AM   #35
lzlgboy
Junior Member
 
Location: Houston

Join Date: Jun 2011
Posts: 6
Default

I did some experiments for the same sample using Tophat 1.3.3. It seems the -r parameter of Tophat will have influence on cufflinks analysis. When I use a wrong -r parameter, the Tophat output.bam file change not much( you can see from IGV). But when it comes to cufflinks, some genes that have a FPKM value(read coverage in IGV) now have a zero FPKM. So I guess that, cufflinks will need the proper paired information to estimate gene's FPKM.
I am wondering if anyone observe that.
lzlgboy is offline   Reply With Quote
Old 12-06-2012, 02:16 AM   #36
figo1019
Member
 
Location: germany

Join Date: Jun 2012
Posts: 32
Default setting up the -r parameter in tophat

Quote:
Originally Posted by glados View Post
Hi Nicolas.

TopHat doesn't have much need for the mean inner distance nowadays, and will even less in the future, according to Cole Trapnell. If you don't specify -r with a value it will default to 50 as it says in the release notes from an earlier version, they just haven't updated the manual yet.
Hi Glados

Reading the thread and your comments. I have a query about setting up the parameter in -r. I also tried to get the information of insert size from the sequencing lab and they said me it to be 180 bp for 96 bp illumina paired end reads. So what can be the ideal set up for -r for me will it be (180- (2x94))=-8. Do I need to have the -8 (aaprox -10) as - r or Another way could be like i keep it default.

Regards
figo1019 is offline   Reply With Quote
Old 12-12-2012, 02:53 AM   #37
glados
Member
 
Location: Aperture Science

Join Date: Mar 2012
Posts: 59
Default

Dear figo1019.

Make sure the lab meant 180 as insert size and not inner distance. If I were you, I would try with both the default and -8 as inner distance. See if you get any differences in tophat output. You can view the assembly in a genome browser, like IGV, and see if the reads on average overlap approximately 8 bp. If you are certain it is -8, I would use it in tophat.
glados is offline   Reply With Quote
Old 04-09-2013, 03:38 PM   #38
hollyyang
Junior Member
 
Location: Chicago

Join Date: Apr 2013
Posts: 1
Default How to estimate the fragment length for single-end RNAseq reads

Hi Nicolas,

I have been stuck in estimating the fragment length for single-end RNAseq reads.

I applied your script (http://seqanswers.com/forums/showthread.php?t=16472) to my bowtie2 results but got no results.
my command:
samtools view -S -F 0x4 Ery_rep1.ip.sam | awk '{if ($9 >0) {sum+=$9;sumsq+=$9*$9;N+=1}} END {print "mean = " sum/N " SD=" sqrt(sumsq/N - (sum/N)**2)}'

output:
[samopen] SAM header is present: 22 sequences.
awk: (FILENAME=- FNR=18471674) fatal: division by zero attempted

My sam file looks like below, it seems that the inferred fragment lengths are to be zero. Is it because the my reads are single-end reads?

42A08AAXX:6:098:00152:01628/1 16 chr1 3011033 42 51M * 0 0 TCACCTGTTCTTCTCACTGTTGTGGCCTGAGTCAGAAC
AACTAGAGTCCTC ############################9.(<.6BA/@8>([email protected]= AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19G31 -
YT:Z:UU
42A08AAXX:6:097:00856:00542/1 16 chr1 3011742 42 51M * 0 0 AGCTCTGTGTTCTGCTTGAGCTGACTCTCTAGACAGCT
ATGTGGATATTTC >;A:B>>[email protected][email protected]<@B>BBBBBBBABBBBBCBBBBBCBBBBCBCBBAB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:U
U


Thanks,
Holly
hollyyang is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:10 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO