Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Strange/erroneous TopHat output with paired-end data

    Several members have noted strange bugs in the TopHat accepted_hits.sam file. One bug mentioned by many people (see here, here and here), is that the MRNM-field (7th row) in the SAM-file is reported as '=' even if the mate reference sequence name is different (e.g. different chromosomes). This is contradictory to how the SAM-format is defined, and will cause highly unpredictable behavior.

    I recently tested the newest TopHat (v. 1.0.14) and the error is still present, even though the authors seem to be aware of the issue!

    In addition I noticed additional strange behavior, that I will show with an example:

    HWUSI-EAS697:1:73:6760:7084#0 147 chr1 21032 255 76M = 21062 0 GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NM:i:0 NH:i:1
    HWUSI-EAS697:1:73:6760:7084#0 99 chr9 21062 255 76M = 21032 0 CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@? NM:i:0 NH:i:1


    Here you see two mate-pairs apparently uniquely mapped to chr1 and chr9. Obviously the MRNM-problem is still there, but I got a bit suspicious about the results as well. If I blat both sequences, I get these results (only best-hits shown):

    Blat-result 1st seq:
    1 76 76 100.0% 9 + 21145 21220 76
    1 76 76 100.0% 19 + 62640 62715 76
    1 76 76 100.0% 1 + 21032 21107 76

    Blat-result 2nd seq:
    1 76 76 100.0% 15 - 102510140 102510215 76
    1 76 76 100.0% 9 + 21062 21137 76

    As you can see, the mapping info in the SAM-output is indeed corresponding to the BLAT-output. However, what happened to the other hits? In fact, in this case anyone can see that the correct pairing should be on chr9 only, and not between chr1 and chr9! Does anyone else see this issue? It appears to be very prevalent, and it almost appears as if TopHat just chooses randomly one of the hits from each paired end.

  • #2
    Hi Santa ;-)

    blating the reads provided above against HG19 indeed returns some more alignments that are at least close to optimal and thus maybe considered by tophat (didn't looked at the alignment details)

    (your upper read)
    76 1 76 76 100.0% 9 + 21145 21220 76
    76 1 76 76 100.0% 19 + 62640 62715 76
    76 1 76 76 100.0% 1 + 21032 21107 76
    74 1 76 76 98.7% 2 - 114349912 114349987 76
    74 1 76 76 98.7% 15 - 102510057 102510132 76
    74 1 76 76 98.7% 12 - 82572 82647 76

    (your lower read)
    76 1 76 76 100.0% 15 - 102510140 102510215 76
    76 1 76 76 100.0% 9 + 21062 21137 76
    75 1 75 76 100.0% 19 + 62557 62631 75
    75 1 75 76 100.0% 1 + 20949 21023 75
    74 1 76 76 98.7% 2 - 114349995 114350070 76
    74 1 76 76 98.7% 12 - 82656 82731 76

    Looking at these data (indicating there're up to even six valid "paired" alignments) i would suspect that the SAM-alignments provided above are simply not all originally provided by tophat (did you do a grep?) and thus just appear to be mismatched. It is by the way the case that tophat does NOT sort the SAM-output by read ids!

    Cheers Uwe
    Last edited by Uwe Appelt; 08-31-2010, 09:43 AM.

    Comment


    • #3
      Thanks for the reply, Uwe.

      The tophat results I show are all the results provided for these reads, so something strange is happening.

      If you focus on the best hits (100% over the entire length), you do not have any redundancy. The optimal solution would be to choose the pair on chr9. When I think of it, it could be that this issue is caused by the fact that bowtie is used without the --best option within tophat. This is easily fixed, so I will do a test to see if this is indeed the reason.

      Comment


      • #4
        In this case, i wouldn't mess up any alignment settings, because it would mean that there is indeed a serious problem! Assuming that the SAM-flags (147 and 99 as provided in post #1) indicate a proper paired alignment for that read pair, i would suspect that there is a bug in the code parts that crossmatch the individual alignments of the two mates. It at least appears suspicious to me that the two reads are aligned to similar positions of about 21kb (but in fact on different chromosomes), which could mean that chromosomal cross-mismatch is ignored, if at least the positions are adjacent enough?! This is, however, pure speculation!

        Comment


        • #5
          This appeared suspicious to me at first also. However, there are lots of examples similar to the one I mention in post #1, where the alignments are much farther apart.

          It is however possible that tophat does not try to optimally pair up the alignment of the two pairs in any particular way though, and this would be more bad design than a bug.

          Comment


          • #6
            I tend to disagree. Assuming the initially provided SAM output is complete, we definitely have a bug here, because -as i mentioned- we cannot deny, what the SAM-flags are telling! And they indicate, we have two mate-reads properly(!) aligned, while RNAME and MRNM clearly point to something different. Thus it's not an optimization problem, it's simply inconsistent, isn't it.

            ps: you might want to have a look at the SAM-flag definitions @ http://samtools.sourceforge.net/SAM1.pdf (see section 2.2.2)

            Comment


            • #7
              Tophat Bug in Paired-End Mode? (reproducable)

              I've just tried to reproduce the tophat behavior described in post #1 and tophat indeed appears to report buggy alignments in this particular case:

              sa_1.txt:
              Code:
              @HWUSI-EAS697:1:73:6760:7084#0/1
              CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT
              +HWUSI-EAS697:1:73:6760:7084#0/1
              CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?
              sa_2.txt:
              Code:
              @HWUSI-EAS697:1:73:6760:7084#0/2
              CCAGCTTTGAACCCGTCCCCTTAAACATGCTCCTCCTGCACGGAAGAGACAGGGGCAGGGGAGAGACTCTCTCCCC
              +HWUSI-EAS697:1:73:6760:7084#0/2
              CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDADDDBDCACAC;D=@:=@DBD>BC
              cmd-line:
              Code:
              ./tophat --mate-inner-dist 250 --keep-tmp /home/extuser/data/_indices/HG19 sa_1.txt sa_2.txt
              Results are:

              tophat_out/tmp/left_kept_reads.bwtout:
              Code:
              1	+	9	21061	CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?	0	
              1	-	12	82655	AGCTCAGGCCCCAGCACAGCCCGGCCTCTGGCCTCACTGGCGTCTGTGCCCAGTGACGCAGGCAGGTGAGCTCCTG	?@A=A?C4AAC?<7?8C@=CCCCBCDDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	0	74:T>G
              1	+	1	20948	CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?	1	75:C>T
              1	+	19	62556	CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?	1	75:C>T
              1	-	15	102510139	AGCTCAGGCCCCAGCACAGCCCGGCCTCTGGCCTCACTGGCGTCTGTGCCCAGTGACGCAGGCAGGTGAGCTCCTG	?@A=A?C4AAC?<7?8C@=CCCCBCDDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	0	
              1	-	2	114349994	AGCTCAGGCCCCAGCACAGCCCGGCCTCTGGCCTCACTGGCGTCTGTGCCCAGTGACGCAGGCAGGTGAGCTCCTG	?@A=A?C4AAC?<7?8C@=CCCCBCDDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	0	50:T>C
              tophat_out/tmp/right_kept_reads.bwtout:
              Code:
              1	-	1	21031	GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG	CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	2	
              1	-	19	62639	GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG	CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	2	
              1	+	2	114349911	CCAGCTTTGAACCCGTCCCCTTAAACATGCTCCTCCTGCACGGAAGAGACAGGGGCAGGGGAGAGACTCTCTCCCC	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDADDDBDCACAC;D=@:=@DBD>BC	1	14:T>G
              1	+	12	82571	CCAGCTTTGAACCCGTCCCCTTAAACATGCTCCTCCTGCACGGAAGAGACAGGGGCAGGGGAGAGACTCTCTCCCC	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDADDDBDCACAC;D=@:=@DBD>BC	1	14:T>G
              1	-	9	21144	GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG	CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	2	
              1	+	15	102510056	CCAGCTTTGAACCCGTCCCCTTAAACATGCTCCTCCTGCACGGAAGAGACAGGGGCAGGGGAGAGACTCTCTCCCC	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDADDDBDCACAC;D=@:=@DBD>BC	0	40:T>C
              tophat_out/accepted_hits.sam:
              Code:
              @HD	VN:1.0	SO:sorted
              @PG	ID:TopHat	VN:1.0.14	CL:./tophat --mate-inner-dist 250 --keep-tmp /home/extuser/data/_indices/HG19 sa_1.txt sa_2.txt
              HWUSI-EAS697:1:73:6760:7084#0	147	1	21032	255	76M	=	21062	0	GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG	CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
              HWUSI-EAS697:1:73:6760:7084#0	99	9	21062	255	76M	=	21032	0	CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT	CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@?	NM:i:0	NH:i:1
              While the left and right alignments pretty much correlate with post #2, accepted_hits.sam clearly reports an invalid alignment here.

              @Cole: can u have a look at it?

              Thanks and best!
              Uwe
              Attached Files
              Last edited by Uwe Appelt; 09-01-2010, 04:46 AM.

              Comment


              • #8
                Issue continues with TopHat 1.2.0

                Hi Uwe,

                Did you ever find a resolution to this error? I am having similar--but not identical--problems. I've narrowed the problem down to the tophat_reports program. In my case, alignments on the same chromosome are indeed preferred over alignments on different chromosomes. However, tophat is choosing alignments that result in a 160Mb mate pair distance (and includes 5 mismatches) over an alignment that results in a 20bp mate pair distance (and has 0 mismatches--but includes one intron).

                Thanks,
                Josh

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin


                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                  Today, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                37 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                40 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                35 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                54 views
                0 likes
                Last Post seqadmin  
                Working...
                X