SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics
Similar Threads
Thread Thread Starter Forum Replies Last Post
TopHat -paired end vs single end reads adarshjose RNA Sequencing 10 06-12-2012 06:15 PM
how to determine strand from tophat output for paired-end RNA-seq data jay2008 Bioinformatics 1 05-30-2012 04:46 AM
Bowtie output from paired end reads godzilla07 Bioinformatics 0 01-06-2011 11:36 AM
Does Cufflinks support single-end and paired end data together ? ersenkavak Bioinformatics 1 10-22-2010 07:26 AM
Mosaik, gigaBayes and paired-end output Gianza Bioinformatics 0 08-24-2010 08:06 AM

Reply
 
Thread Tools
Old 07-22-2010, 01:33 AM   #1
hoho
Junior Member
 
Location: Europe

Join Date: Dec 2009
Posts: 3
Exclamation Strange/erroneous TopHat output with paired-end data

Several members have noted strange bugs in the TopHat accepted_hits.sam file. One bug mentioned by many people (see here, here and here), is that the MRNM-field (7th row) in the SAM-file is reported as '=' even if the mate reference sequence name is different (e.g. different chromosomes). This is contradictory to how the SAM-format is defined, and will cause highly unpredictable behavior.

I recently tested the newest TopHat (v. 1.0.14) and the error is still present, even though the authors seem to be aware of the issue!

In addition I noticed additional strange behavior, that I will show with an example:

HWUSI-EAS697:1:73:6760:7084#0 147 chr1 21032 255 76M = 21062 0 GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NM:i:0 NH:i:1
HWUSI-EAS697:1:73:6760:7084#0 99 chr9 21062 255 76M = 21032 0 CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@? NM:i:0 NH:i:1


Here you see two mate-pairs apparently uniquely mapped to chr1 and chr9. Obviously the MRNM-problem is still there, but I got a bit suspicious about the results as well. If I blat both sequences, I get these results (only best-hits shown):

Blat-result 1st seq:
1 76 76 100.0% 9 + 21145 21220 76
1 76 76 100.0% 19 + 62640 62715 76
1 76 76 100.0% 1 + 21032 21107 76

Blat-result 2nd seq:
1 76 76 100.0% 15 - 102510140 102510215 76
1 76 76 100.0% 9 + 21062 21137 76

As you can see, the mapping info in the SAM-output is indeed corresponding to the BLAT-output. However, what happened to the other hits? In fact, in this case anyone can see that the correct pairing should be on chr9 only, and not between chr1 and chr9! Does anyone else see this issue? It appears to be very prevalent, and it almost appears as if TopHat just chooses randomly one of the hits from each paired end.
hoho is offline   Reply With Quote
Old 08-31-2010, 09:36 AM   #2
Uwe Appelt
Member
 
Location: Heidelberg, Germany

Join Date: Oct 2009
Posts: 27
Default

Hi Santa ;-)

blating the reads provided above against HG19 indeed returns some more alignments that are at least close to optimal and thus maybe considered by tophat (didn't looked at the alignment details)

(your upper read)
76 1 76 76 100.0% 9 + 21145 21220 76
76 1 76 76 100.0% 19 + 62640 62715 76
76 1 76 76 100.0% 1 + 21032 21107 76
74 1 76 76 98.7% 2 - 114349912 114349987 76
74 1 76 76 98.7% 15 - 102510057 102510132 76
74 1 76 76 98.7% 12 - 82572 82647 76

(your lower read)
76 1 76 76 100.0% 15 - 102510140 102510215 76
76 1 76 76 100.0% 9 + 21062 21137 76
75 1 75 76 100.0% 19 + 62557 62631 75
75 1 75 76 100.0% 1 + 20949 21023 75
74 1 76 76 98.7% 2 - 114349995 114350070 76
74 1 76 76 98.7% 12 - 82656 82731 76

Looking at these data (indicating there're up to even six valid "paired" alignments) i would suspect that the SAM-alignments provided above are simply not all originally provided by tophat (did you do a grep?) and thus just appear to be mismatched. It is by the way the case that tophat does NOT sort the SAM-output by read ids!

Cheers Uwe

Last edited by Uwe Appelt; 08-31-2010 at 09:43 AM.
Uwe Appelt is offline   Reply With Quote
Old 08-31-2010, 11:46 AM   #3
hoho
Junior Member
 
Location: Europe

Join Date: Dec 2009
Posts: 3
Default

Thanks for the reply, Uwe.

The tophat results I show are all the results provided for these reads, so something strange is happening.

If you focus on the best hits (100% over the entire length), you do not have any redundancy. The optimal solution would be to choose the pair on chr9. When I think of it, it could be that this issue is caused by the fact that bowtie is used without the --best option within tophat. This is easily fixed, so I will do a test to see if this is indeed the reason.
hoho is offline   Reply With Quote
Old 09-01-2010, 01:52 AM   #4
Uwe Appelt
Member
 
Location: Heidelberg, Germany

Join Date: Oct 2009
Posts: 27
Default

In this case, i wouldn't mess up any alignment settings, because it would mean that there is indeed a serious problem! Assuming that the SAM-flags (147 and 99 as provided in post #1) indicate a proper paired alignment for that read pair, i would suspect that there is a bug in the code parts that crossmatch the individual alignments of the two mates. It at least appears suspicious to me that the two reads are aligned to similar positions of about 21kb (but in fact on different chromosomes), which could mean that chromosomal cross-mismatch is ignored, if at least the positions are adjacent enough?! This is, however, pure speculation!
Uwe Appelt is offline   Reply With Quote
Old 09-01-2010, 01:59 AM   #5
hoho
Junior Member
 
Location: Europe

Join Date: Dec 2009
Posts: 3
Default

This appeared suspicious to me at first also. However, there are lots of examples similar to the one I mention in post #1, where the alignments are much farther apart.

It is however possible that tophat does not try to optimally pair up the alignment of the two pairs in any particular way though, and this would be more bad design than a bug.
hoho is offline   Reply With Quote
Old 09-01-2010, 03:25 AM   #6
Uwe Appelt
Member
 
Location: Heidelberg, Germany

Join Date: Oct 2009
Posts: 27
Default

I tend to disagree. Assuming the initially provided SAM output is complete, we definitely have a bug here, because -as i mentioned- we cannot deny, what the SAM-flags are telling! And they indicate, we have two mate-reads properly(!) aligned, while RNAME and MRNM clearly point to something different. Thus it's not an optimization problem, it's simply inconsistent, isn't it.

ps: you might want to have a look at the SAM-flag definitions @ http://samtools.sourceforge.net/SAM1.pdf (see section 2.2.2)
Uwe Appelt is offline   Reply With Quote
Old 09-01-2010, 04:34 AM   #7
Uwe Appelt
Member
 
Location: Heidelberg, Germany

Join Date: Oct 2009
Posts: 27
Default Tophat Bug in Paired-End Mode? (reproducable)

I've just tried to reproduce the tophat behavior described in post #1 and tophat indeed appears to report buggy alignments in this particular case:

sa_1.txt:
Code:
@HWUSI-EAS697:1:73:6760:7084#0/1
CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT
+HWUSI-EAS697:1:73:6760:7084#0/1
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?
sa_2.txt:
Code:
@HWUSI-EAS697:1:73:6760:7084#0/2
CCAGCTTTGAACCCGTCCCCTTAAACATGCTCCTCCTGCACGGAAGAGACAGGGGCAGGGGAGAGACTCTCTCCCC
+HWUSI-EAS697:1:73:6760:7084#0/2
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDADDDBDCACAC;D=@:=@DBD>BC
cmd-line:
Code:
./tophat --mate-inner-dist 250 --keep-tmp /home/extuser/data/_indices/HG19 sa_1.txt sa_2.txt
Results are:

tophat_out/tmp/left_kept_reads.bwtout:
Code:
1	+	9	21061	CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?	0	
1	-	12	82655	AGCTCAGGCCCCAGCACAGCCCGGCCTCTGGCCTCACTGGCGTCTGTGCCCAGTGACGCAGGCAGGTGAGCTCCTG	?@A=A?C4AAC?<7?8C@=CCCCBCDDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	0	74:T>G
1	+	1	20948	CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?	1	75:C>T
1	+	19	62556	CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?	1	75:C>T
1	-	15	102510139	AGCTCAGGCCCCAGCACAGCCCGGCCTCTGGCCTCACTGGCGTCTGTGCCCAGTGACGCAGGCAGGTGAGCTCCTG	?@A=A?C4AAC?<7?8C@=CCCCBCDDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	0	
1	-	2	114349994	AGCTCAGGCCCCAGCACAGCCCGGCCTCTGGCCTCACTGGCGTCTGTGCCCAGTGACGCAGGCAGGTGAGCTCCTG	?@A=A?C4AAC?<7?8C@=CCCCBCDDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	0	50:T>C
tophat_out/tmp/right_kept_reads.bwtout:
Code:
1	-	1	21031	GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG	CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	2	
1	-	19	62639	GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG	CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	2	
1	+	2	114349911	CCAGCTTTGAACCCGTCCCCTTAAACATGCTCCTCCTGCACGGAAGAGACAGGGGCAGGGGAGAGACTCTCTCCCC	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDADDDBDCACAC;D=@:=@DBD>BC	1	14:T>G
1	+	12	82571	CCAGCTTTGAACCCGTCCCCTTAAACATGCTCCTCCTGCACGGAAGAGACAGGGGCAGGGGAGAGACTCTCTCCCC	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDADDDBDCACAC;D=@:=@DBD>BC	1	14:T>G
1	-	9	21144	GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG	CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	2	
1	+	15	102510056	CCAGCTTTGAACCCGTCCCCTTAAACATGCTCCTCCTGCACGGAAGAGACAGGGGCAGGGGAGAGACTCTCTCCCC	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDADDDBDCACAC;D=@:=@DBD>BC	0	40:T>C
tophat_out/accepted_hits.sam:
Code:
@HD	VN:1.0	SO:sorted
@PG	ID:TopHat	VN:1.0.14	CL:./tophat --mate-inner-dist 250 --keep-tmp /home/extuser/data/_indices/HG19 sa_1.txt sa_2.txt
HWUSI-EAS697:1:73:6760:7084#0	147	1	21032	255	76M	=	21062	0	GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG	CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
HWUSI-EAS697:1:73:6760:7084#0	99	9	21062	255	76M	=	21032	0	CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?	NM:i:0	NH:i:1
While the left and right alignments pretty much correlate with post #2, accepted_hits.sam clearly reports an invalid alignment here.

@Cole: can u have a look at it?

Thanks and best!
Uwe
Attached Files
File Type: zip tophat_out.zip (23.5 KB, 3 views)

Last edited by Uwe Appelt; 09-01-2010 at 04:46 AM.
Uwe Appelt is offline   Reply With Quote
Old 03-09-2011, 12:21 PM   #8
jkorn
Junior Member
 
Location: Boston, MA

Join Date: Mar 2011
Posts: 3
Default Issue continues with TopHat 1.2.0

Hi Uwe,

Did you ever find a resolution to this error? I am having similar--but not identical--problems. I've narrowed the problem down to the tophat_reports program. In my case, alignments on the same chromosome are indeed preferred over alignments on different chromosomes. However, tophat is choosing alignments that result in a 160Mb mate pair distance (and includes 5 mismatches) over an alignment that results in a 20bp mate pair distance (and has 0 mismatches--but includes one intron).

Thanks,
Josh
jkorn is offline   Reply With Quote
Reply

Tags
accepted_hits.sam, bug, paired-end, tophat

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off



All times are GMT -8. The time now is 07:57 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2022, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO