SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
RPKM and multiple reads, tophat and cufflinks mgogol Bioinformatics 6 09-03-2012 02:46 PM
very high RPKM values from Cufflink honey Bioinformatics 21 03-08-2012 06:00 AM
generating P values from a RPKM score Kieran Mace Bioinformatics 0 09-12-2011 10:34 AM
How to find DE genes using RPKM values? casshyr Bioinformatics 2 10-08-2010 07:03 AM
tophat, not producing rpkm values with -G option warrenemmett Bioinformatics 6 06-02-2010 07:31 AM

Reply
 
Thread Tools
Old 09-23-2009, 04:14 PM   #1
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default Tophat : percentage of reads contributing to RPKM values

Regarding the accepted_hits.sam, left_kept_reads.fq.candidate_hits.sam, and the left_kept_reads.fq files. How do I interpret their meaning? I'm asking because we'd like to calculate what percentage of the original reads are contributing to the calculated RPKM values I get in the Tophat output. I assume the accepted_hits.sam file is literally all of the successfully aligned source reads from Bowtie but how should I interpret these other files?

Thanks.
sdriscoll is offline   Reply With Quote
Old 10-12-2009, 06:24 PM   #2
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default the same question

I have the same question.
In the accepted_hits.sam file, pair-end can not be distinguished. So how can I get the total reads number that contribute to the RPKM?
thanks a lot!

Last edited by pengchy; 10-12-2009 at 06:28 PM.
pengchy is offline   Reply With Quote
Old 10-12-2009, 09:26 PM   #3
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

The left_ and right_kept_reads.fq.candidate_hits.sam files are temporary files (and an upcoming version of TopHat will delete them upon exit). Essentially, they are the preliminary alignments of reads *before* mate pairs are considered when picking the final mappings for each read. In other words, if for a mate M = (l_read, r_read), l_read has 2 alignments, and r_read has one (and it's sufficiently close to one of the l_read alignments), then left_kept_reads.fq.candidate_hits.sam will contain two hits for l_read.

accepted_hits.sam contains only the best alignments for each read. So in the example above, accepted_hits.sam would contain the alignment for r_read along with the nearby alignment for l_read. Note that the pairing information *can* be recovered from this file - see the SAM specification for more details. Briefly, there are columns in each alignment for the coordinate of the mapped mate alignment, if any. Also, the flag field contains a bit describing whether the mate of an alignment was also aligned.

Thus, you should only pay attention to accepted_hits.sam, especially because the other SAM files will be eliminated in future releases as the pipeline finishes.

To use these alignments to compute expression values, I *strongly* urge you to feed them to http://cufflinks.cbcb.umd.edu, which can accept a reference GTF if you wish. TopHat's RPKM calculation will be removed in the next release, as Cufflinks does a better job, and I can't afford to maintain the similar code in two places. Also, the quantitation step in Cufflinks is quite fast - so if you decide to use a different reference GTF, you don't have to re-run the whole TopHat mapping step, you can simply re-run the quantitation on the TopHat alignments you already produced.
Cole Trapnell is offline   Reply With Quote
Old 10-13-2009, 04:23 AM   #4
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

Hi Cole,
Thanks for your explanation. Follow your instruction, I also calculated the RPKM using Cufflinks, in which sam file (accepted_hits.sam) is produced by TopHat and the gff file is same to TopHat. Curiously, the results of the two methods are greatly different. Only half genes/transcripts, say 14828 to 29280, are detected by Cufflinks compare to TopHat.
The parameter used by Cufflinks is:
cufflinks -m 200 -I 10000 -s 20 -G ...
and TopHat:
tophat -o ... -p 4 -r 200 --mate-std-dev 20 -I 10000 -G ...

Any advice is appreciated. Thanks!
pengchy is offline   Reply With Quote
Old 10-13-2009, 07:54 AM   #5
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Hmm - Note that Cufflinks expects a GTF file, not a GFF file. TopHat will eventually be moving to GTF as well.
Cole Trapnell is offline   Reply With Quote
Old 10-13-2009, 06:33 PM   #6
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

Quote:
Originally Posted by Cole Trapnell View Post
Hmm - Note that Cufflinks expects a GTF file, not a GFF file. TopHat will eventually be moving to GTF as well.
Hi Cole,
I have rerun the Cufflinks with GTF file, but the difference between the two methods is not change. 14828 and 22916(RPKM>0) genes are detected by Cufflinks and TopHat respectively. The total annotated gene number is 29857 of the analyzed organism. Furthermore, the RPKMs from TopHat are approximately twofold higher than Cufflinks.

Which result should be chosen?

Last edited by pengchy; 10-13-2009 at 06:35 PM.
pengchy is offline   Reply With Quote
Old 10-13-2009, 06:46 PM   #7
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

We should figure out what's causing the discrepancy before deciding that either of them are right, especially since Cufflinks has only been released for a week.

Can you provide an example of a gene for which the RPKM is between the two programs? Can you also post somewhere the alignments produced by TopHat that fall within that gene locus?

Also, can you provide a gene (and the alignments) that is reported as having non-zero expression by TopHat, but which is not reported at all by Cufflinks?
Cole Trapnell is offline   Reply With Quote
Old 10-13-2009, 10:10 PM   #8
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

######################
One Example that expressed in TopHat with RPKM 258.8160, but not recorded in Cufflinks
1. GTF:
scaffold402 GeneWise CDS 379956 380045 . + 0 gene_id "Gene1"; transcript_id "Gene1.1";

2. The hits in the file accepted_hits.sam with the start position in the range between 379956 and 380045.
less accepted_hits.sam |awk '$3=="scaffold402" && $4>=379956 && $4<=380045' > scaffold402.sam
wc -l scaffold402.sam
202 scaffold402.sam

2.1 example lines of scaffold402.sam
-bash-3.00$ more scaffold402.sam
FC42DB6AAXX:3:96:868:809#0 137 scaffold402 379958 255 75M * 0 0 ATTGAAGTATAACAGCAAACTGATGAAC
AGCATCTGGGGCCTCTACAACCGCTATTCAGTGCATAATTTTAAGAA Vbaaab\Oa\aa`aab`\VaaXYa\^``b^_X[a_bb`\aSX___WW\FVTVXJY___ZSWIWa^HY_a_H[aa` NM:i
:0
FC42DB6AAXX:3:36:1163:1019#0 137 scaffold402 379961 255 75M * 0 0 GAAGTATAACAGCAAACTGATGAACAGC
ATCTGGGGCCTCTACAACCGCTATTCAGTGCATAATTTTAAGAAAAA abaW`aa`b``aaa`a^`b]_b^aZ`a\a_`__aS__`]^_^X_`X\^ZX^^ZSWWPXZXW][PUZYLTWZ_U]B NM:i
:0
FC42DB6AAXX:3:10:938:1849#0 147 scaffold402 379966 255 75M = 379733 0 ATAACACAAAACTGATGAACAGCATCTG
GGGCCTCTACAACCGCTATTCAGTGCATAATTTTAAGAAAAATACTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBWUJY_Waa^aaaaaaaaaaaaaaaa`abbbaab`aba_bba NM:i
:2
FC42DB6AAXX:3:72:1605:210#0 73 scaffold402 379966 255 75M * 0 0 ATAACAGCAAACTGATGAACAGCATCTG
GGGCCTCTACAACCGCTATTCAGTGCATAATTTTAAGAAAAATACTG aaaa`aa_`aa`aa_aa_a_a_\_^][W^_`^_]Z\Y__`]__U\\[]W\\W\ZYVY]VY_XMVWNV\[WQTRTT NM:i
:0
FC42DB6AAXX:3:114:158:914#0 147 scaffold402 379969 255 75M = 379735 0 ACAACAAACTGATAAACAGCATCTGGGG
CCTCTACAACCGCTATTCAGTGCATAATTTTAAGAAAAATACTGACG BBBBBBBBBBBBBBBBBBBBBBBBBBBVYMSSNLW`Zaa`aaaaaaa_a`baaaaaabbababbbbba_bba`ba NM:i
:2
FC42DB6AAXX:3:120:664:268#0 99 scaffold402 379970 255 75M = 380051 0 CAGCAAACTGATGAACAGCATCTGGGGC
CTCTACAACCGCTATTCAGTGCATAATTTTAAGAAAAATACTGACGC ````babab_abaaa`b_\b`b]^^ab``aaaaaa`]Xb[`baaaaaYa]aaaaaba_]]_\_`Y_a_]`]X[_P NM:i
:0
FC42DB6AAXX:3:14:1004:888#0 73 scaffold402 379970 255 75M * 0 0 CAGCAAACTGATGAACAGCATCTGGGGC
CTCTACAACCGCTATTCAGTGCATAATTTTAAGAAAAATACTGACGC abb```babbbbb_]_a``_ab]_\\___`aaa]```X^_a^a__``^a]_____Z`^^VXV\XT\[[Y`\YPTT NM:i
:0
FC42DB6AAXX:3:20:1179:1647#0 73 scaffold402 379970 255 75M * 0 0 TAGCAAACTGATGAACAGCATCTGGGGC
CTCTACAACCGCTATTCAGTGCATAATTTTAAGAAAAATACTGACGC abbabbbaa\aaaabaabaaaaaaa_aaaab`a`_aa___^V^``XUVZ`aa`U^____]_]_`_\T_WVYZ_VV NM:i
:1

################
Several Examples with expression different at approximately twofold:

"tophat" "cufflinks"
38.909133 9.36612
0.91067 0.354771
41.939884 11.8654
51.855117 26.4923
29.918255 14.8333
0.958243 0.498132
9.877271 3.23085
42.750148 7.27455
7.309547 2.96622
34.138915 15.9528
0.515681 0.268071
26.973343 7.07882
17.023328 8.72268
5.106998 2.65482
2.672484 0.542681
19.564314 5.65249
75.505939 11.4093
2.761388 1.25562
30.870286 8.01352
1.258868 0.394501
48.208016 19.7907
19.472759 10.0725
10.026528 3.33728
512.270034 251.421
23.124383 8.41853
1.573585 0.818012
46.433153 20.5306
130.854092 64.0969
13.534309 6.6825
1.227912 0.546913
6.979133 2.34747
45.60238 20.1194
28.006887 13.6131
5.616077 0.33712
8.535561 4.40489
5.303665 1.89437
3.534253 0.975873
6.929768 3.60237
7.30238 0.470069
7.297792 1.9941
88.198285 2.91368
14.978767 5.7685
45.052355 20.7238
6.299281 2.91945
11.335606 0.79464
3.633373 1.48333

##################

Thanks!

Last edited by pengchy; 10-13-2009 at 11:55 PM.
pengchy is offline   Reply With Quote
Old 10-13-2009, 10:24 PM   #9
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Thanks for the details. What I'd really like to do is reproduce the situation on my own. Can you please make a tarball containing both scaffold402.sam and a small GTF file containing the records for gene BGIBMGA009001-PA?

That way, I can at least run this data through Cufflinks in the debugger to check the values.

If you have time, it would also be good to have a small bowtie index for the genome sequence around that gene, along with the raw reads that were mapped in scaffold402.sam

Thanks very much for the feedback.
Cole Trapnell is offline   Reply With Quote
Old 10-15-2009, 02:13 AM   #10
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

The test data has been send to you, wait for your reply.
Thanks!
pengchy is offline   Reply With Quote
Old 10-15-2009, 08:38 AM   #11
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Thanks so much for the test data. I will look at this over the next few days and post the outcome back to this thread.
Cole Trapnell is offline   Reply With Quote
Old 10-15-2009, 10:12 PM   #12
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Hey Cole - would it be possible in an upcoming release of Cufflinks for you to write the gene expression files with all genes from the GTF file and not only those that have non-zero expressions? In order for one to compare one set to another it would be helpful to have consistent row alignments between those sets of gene expression files. I've written a script to do this already but I feel like it would be a great help for people using this program who don't code.

Just to clarify my question at the top of the thread - and I'm not sure you addressed it - is there a way to determine the percentage of source reads that were successfully aligned? Or does the accepted_hits.sam file contain only unique read alignments (one alignment max per read)? For example when my lab got it's first set of sequencing done they had the alignment run with ELAND which produced a table that showed us this percentage. If you could explain a way I could figure that out from the output files it would be a big help.

Thanks!

A quick question about Cufflinks. I've got output from several data files in the same experiments where we are comparing wild type mice to mutant mice and while comparing the gene expressions I see a lot of the time the expression values for genes that should be similar between two samples jump to other isoforms of the gene. For example I've seen a value of 50 on a gene at a certain isoform and then in the next lane that value of 50 is in a different isoform. What we are wondering is if that is something we should ignore and just interpret the expression of that gene to be the same...but if that is the case then when we are comparing the mutant to the wild type, and we don't have that control, how can we trust that the differences in expression are actual differences or just these isoform jumping events?

Last edited by sdriscoll; 10-15-2009 at 10:24 PM.
sdriscoll is offline   Reply With Quote
Old 10-15-2009, 10:58 PM   #13
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

Hi sdriscoll/Cole,
According to my understand (learn from the manual), TopHat (v1.0.11) use the reads that hit on the exon regions at less than 40 times. And the multihits on the same location are filtered. But the accepted_hits.sam file contain some lines that not consistent with these thoughts:

##these lines are the multihits on the different location resulted from the tandem repeat.
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262507 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262511 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262515 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262519 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262523 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262527 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262531 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262535 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262539 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262543 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262547 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 25 scaffold507 262551 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1
FC42DB6AAXX:3:97:225:1087#0 89 scaffold507 262503 0 75M * 0 0 ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCATCTATC T[]\WZZYZ]Y][^]XZ__`]```^````aaaaaaaaaa_a`baaa`aa_b`aaaaaaa`aaabb_`bbababa` NM:i:1

##these three lines confused me that why the last two lines are different, because they are the same reads that hit on the same location?
FC42DB6AAXX:3:91:1346:410#0 163 scaffold507 50287 255 75M = 50369 0 TTCGAGAATTACCTGAAAGGTCGCAAACGCGCCATTTGGGTGTCTGTCTCGAACGATCTCAAGTATGATGCGGAG abbbbbbabbbaabb`b_S_\`baaba_bab``aaaaaab^b`_ab[^a`aZ`Za_`Z`Z__aVa\X\_\[]_XB NM:i:1
FC42DB6AAXX:3:91:1346:410#0 19 scaffold507 50369 255 29M168N46M = 50287 0 TAAAAGACATAGGTGCATCCAAAATTGAGGTACATGCCTTGAATAAATTCAAATACGCCAAGATCTCGTCGGCCA LZ```Y\`\MVSZ\^_aZ`]^``__`\_Z^[X]]`Y```_]`aaaaa_abaa`^Zbaaaaaaaab_a_aaabaaaNM:i:3 XS:A:+ NS:i:0
FC42DB6AAXX:3:91:1346:410#0 83 scaffold507 50369 255 29M168N46M = 50287 0 TAAAAGACATAGGTGCATCCAAAATTGAGGTACATGCCTTGAATAAATTCAAATACGCCAAGATCTCGTCGGCCA LZ```Y\`\MVSZ\^_aZ`]^``__`\_Z^[X]]`Y```_]`aaaaa_abaa`^Zbaaaaaaaab_a_aaabaaaNM:i:1 XS:A:+ NS:i:0

##these lines indicate that these reads hit on the same location (both end)

FC42DB6AAXX:3:108:1657:59#0 99 scaffold507 64 255 75M = 155 0 CAGCTTCCTCTCTGATTACAAGTCAATTAAATGTGATAATTGAGACGAAATAATTTTTTTCTTCCAGTGACTTCT `bb`aaabaaaaaaaaba`]^a^`a`aaa``aa]a`a_aa`a^a`\a^_``a\``__``_R]][QW]Y[[U[\P[ NM:i:0
FC42DB6AAXX:3:43:411:1766#0 99 scaffold507 64 255 75M = 155 0 CAGCTTCCTCTCTGATTACAAGTCAATTAAATGTGATAATTGAGACGAAATAATTTTTTTCTTCCAGTGACTTCT aaaaaaa_aaZaa]``a`__]___``aa`^aaaX_]a^___MXVRR^X^_][[]^]^\__Z]^^WQSVPVY]ZXZ NM:i:0
FC42DB6AAXX:3:62:1385:1546#0 163 scaffold507 64 255 75M = 155 0 CAGCTTCCTCTCTGATTACAAGTCAATTAAATGTGATAATTGAGACGAAATAATTTTTTTCTTCCAGTGACTTCT `baaaa_abb_a`Z]baaaa`a`aabb`aaa`XVaab`aa_JV^a^a`\`^_^_`a``^aUa_W]T[TPY\Y[X_ NM:i:0
FC42DB6AAXX:3:7:1652:518#0 163 scaffold507 64 255 75M = 155 0 CAGCTTCCTCTCTGATTACAAGTCAATTAAATGTGATAATTGAGACGAAATAATTTTTTTCTTCCAGTGACTTCT _bbaaaaababab_`aaaaaaaa`a^aaa_`bb]a_a_]aaa_a`^a`^a``^a^aaa`aUaaT[]_Z_[[^[T[ NM:i:0
pengchy is offline   Reply With Quote
Old 10-15-2009, 11:52 PM   #14
arthur.yxt
Member
 
Location: Berlin

Join Date: Oct 2009
Posts: 10
Default to pengchy

##these three lines confused me that why the last two lines are different, because they are the same reads that hit on the same location?
FC42DB6AAXX:3:91:1346:410#0 163 scaffold507 50287 255 75M = 50369 0 TTCGAGAATTACCTGAAAGGTCGCAAACGCGCCATTTGGGTGTCTGTCTC GAACGATCTCAAGTATGATGCGGAG abbbbbbabbbaabb`b_S_\`baaba_bab``aaaaaab^b`_ab[^a`aZ`Za_`Z`Z__aVa\X\_\[]_XB NM:i:1
FC42DB6AAXX:3:91:1346:410#0 19 scaffold507 50369 255 29M168N46M = 50287 0 TAAAAGACATAGGTGCATCCAAAATTGAGGTACATGCCTTGAATAAATTC AAATACGCCAAGATCTCGTCGGCCA LZ```Y\`\MVSZ\^_aZ`]^``__`\_Z^[X]]`Y```_]`aaaaa_abaa`^Zbaaaaaaaab_a_aaabaaaNM:i:3 XS:A:+ NS:i:0
FC42DB6AAXX:3:91:1346:410#0 83 scaffold507 50369 255 29M168N46M = 50287 0 TAAAAGACATAGGTGCATCCAAAATTGAGGTACATGCCTTGAATAAATTC AAATACGCCAAGATCTCGTCGGCCA LZ```Y\`\MVSZ\^_aZ`]^``__`\_Z^[X]]`Y```_]`aaaaa_abaa`^Zbaaaaaaaab_a_aaabaaaNM:i:1 XS:A:+ NS:i:0


To me, these alignments seem correct. As it can be mapped to multiple location, it's mate can also be. Then it is a question that in each location, the relative position of the #1 and #2 is different.
So flag 163 = it's paired and it's pair can be mapped and it's mate on "-" and it's the "right one" or #2
flag 19 = it's paired and it's pair can be mapped and it's on "-"
flag 83 = it's paired and it's pair can be mapped and it's on "-" and it's the "left one" or #1
arthur.yxt is offline   Reply With Quote
Old 10-16-2009, 12:09 AM   #15
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

Hi Arthur,
Thanks for your reply.
I mean the following lines:

FC42DB6AAXX:3:91:1346:410#0 19 scaffold507 50369 255 29M168N46M = 50287 0 TAAAAGACATAGGTGCATCCAAAATTGAGGTACATGCCTTGAATAAATTC AAATACGCCAAGATCTCGTCGGCCA LZ```Y\`\MVSZ\^_aZ`]^``__`\_Z^[X]]`Y```_]`aaaaa_abaa`^Zbaaaaaaaab_a_aaabaaaNM:i:3 XS:A:+ NS:i:0
FC42DB6AAXX:3:91:1346:410#0 83 scaffold507 50369 255 29M168N46M = 50287 0 TAAAAGACATAGGTGCATCCAAAATTGAGGTACATGCCTTGAATAAATTC AAATACGCCAAGATCTCGTCGGCCA LZ```Y\`\MVSZ\^_aZ`]^``__`\_Z^[X]]`Y```_]`aaaaa_abaa`^Zbaaaaaaaab_a_aaabaaaNM:i:1 XS:A:+ NS:i:0

It seems they are the same only with different flag value, as you mentioned flag 19 and 83. The line with flag 19 seems redundant, isn't it?
pengchy is offline   Reply With Quote
Old 10-16-2009, 12:47 AM   #16
arthur.yxt
Member
 
Location: Berlin

Join Date: Oct 2009
Posts: 10
Default

well i don't know. In fact I find it kinda strange that the flag=19 alignment holds mismatch=3, but flag=83 alignment holds mismatch=1. Can a sequence map to one location but with different mismatches?
arthur.yxt is offline   Reply With Quote
Old 10-16-2009, 12:54 AM   #17
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

It is indeed strange. But there are 258*3 lines like this of the total 76308 lines in accepted_hits.sam file.
pengchy is offline   Reply With Quote
Old 10-16-2009, 08:18 AM   #18
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

so your files show that it is for sure mapping the same read to multiple locations. if that's the case then what i have to do is scan my sam files and create a count myself filtering out duplicates. then i can use that count to calculate my percentages.
sdriscoll is offline   Reply With Quote
Old 10-16-2009, 03:25 PM   #19
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by sdriscoll View Post
Hey Cole - would it be possible in an upcoming release of Cufflinks for you to write the gene expression files with all genes from the GTF file and not only those that have non-zero expressions? In order for one to compare one set to another it would be helpful to have consistent row alignments between those sets of gene expression files. I've written a script to do this already but I feel like it would be a great help for people using this program who don't code.

Just to clarify my question at the top of the thread - and I'm not sure you addressed it - is there a way to determine the percentage of source reads that were successfully aligned? Or does the accepted_hits.sam file contain only unique read alignments (one alignment max per read)? For example when my lab got it's first set of sequencing done they had the alignment run with ELAND which produced a table that showed us this percentage. If you could explain a way I could figure that out from the output files it would be a big help.

Thanks!

A quick question about Cufflinks. I've got output from several data files in the same experiments where we are comparing wild type mice to mutant mice and while comparing the gene expressions I see a lot of the time the expression values for genes that should be similar between two samples jump to other isoforms of the gene. For example I've seen a value of 50 on a gene at a certain isoform and then in the next lane that value of 50 is in a different isoform. What we are wondering is if that is something we should ignore and just interpret the expression of that gene to be the same...but if that is the case then when we are comparing the mutant to the wild type, and we don't have that control, how can we trust that the differences in expression are actual differences or just these isoform jumping events?
To answer the original question: the accepted_hits.sam file may contain more than alignment for each read, as long as they are "equally good" according to TopHat. The specific rules for determining goodness of an alignment will be added to the manual when I get to time to do it. Briefly, TopHat prefers to map mates within a gene's length of each other, contiguously over spliced, and with as few mismatches as possible. These rules aren't perfect, but they work well for me. So to get the actual percentage of source reads that map, you'll need to awk out the unique list of read ids from the SAM file (adding back the /1 and /2 as necessary), and then diff that against the ids from your FASTQ file.

To answer your other questions:

1) About the GTF files: I don't understand - isn't cuffcompare already matching these guys up for you?

2) About the differential expression vs. differential splicing. This kinds of analysis comes with a somewhat complicated statistics problem that we are still addressing. I hope to have a tool (and associated documentation explaining our thinking and approach) added to the Cufflinks distribution in a few weeks that handles this. Keep an eye on the site. I can't say too much more about this right now.
Cole Trapnell is offline   Reply With Quote
Old 10-16-2009, 04:01 PM   #20
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Thanks Cole. Great work, by the way. These tools are moving things forward around here in a big way.
sdriscoll is offline   Reply With Quote
Reply

Tags
rna-seq, tophat

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:46 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO