Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • tophat + cufflinks: strand information

    I am using tophat.
    Does anybody know a way to get the strand information in the output sam file?

    I am interested in using cufflinks on tophat output. In the following is what I found on the cufflinks web page, it shows what I am after, how do I get that custom tag into the sam file?

    Here's an example of an alignment Cufflinks will accept:
    s6.25mer.txt-913508 16 chr1 4482736 255 14M431N11M * 0 0 \
    CAAGATGCTAGGCAAGTCTTGGAAG IIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 XS:A:-
    Note the use of the custom tag XS. This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from. While this tag can be applied to any alignment, including unspliced ones, it must be present for all spliced alignment records (those with a 'N' operation in the CIGAR string).

  • #2
    Hi there,

    I believe the second column (the flag field. For more information, please refer to SAM format definition) contains the strand information. So you could write a script to determine the strand for that alignment and append the XS:A: field at the end of each line.

    Comment


    • #3
      Yes, the 2nd column of sam file could contain this information, but it appears (at least in tophat 1.0.13) the output sam file - accepted_hits.sam - contains the numbers 0 or 16 in this column. I am not sure what this means.

      For the sam output of bowtie (0.12.5) it's the same story.

      I got around that by running bowtie such that it produces its standard output, there the + - strand information is preserved and then use bowtie2sam.pl (one of the samtools tools).

      I still don't know how to get the strand information out of tophat, if it's possible at all.

      Comment


      • #4
        The strand is 'encoded' in hexadecimal within the field values. So to get the strand information, you will have to do an "AND" operation on the flag field.

        In the SAM definition for the flag field:
        0x0010 strand of the query (1 for reverse)

        So if you do AND between the flag and 0x0010 and you get a 1, the entry is on the negative strand, else it's on the positive strand.

        Comment


        • #5
          That was the info I needed. Thanks.

          Comment


          • #6
            Hi Haneko,

            If it were so I suppose I shouldn't be getting the following from TopHat:

            Code:
            HWI-EAS202_170:5:4:655:1891     0       chr1    11102910        255     15M10277N21M    *       0       0       GGATCATCCACCATGTTACCGATAAGCACCAGTTCA    B>@A@BA@>@??BAB@B@@@@??A>@;@>6>@:?>@    NM:i:0  XS:A:+  NS:i:0
            HWI-EAS202_170:5:12:784:1887    16      chr1    11102913        255     12M10277N24M    *       0       0       TCATCCACCATGTTACCGATAAGCACCAGTTCAAAC    @@??AA9@AA@=AA9B@@A@BB?A9@BB@B@BBA@B    NM:i:0  XS:A:+  NS:i:0
            One read has flag 0 and the other 16, meaning one maps to the forward and the other to the reverse, yet both have XS set to +

            Did I get it wrong? I'm using TopHat v1.0.13
            Thanks

            Comment


            • #7
              Hi Angela,

              I'm using TopHat v1.0.13 too, but there is no these two columns "XS:A:+ NS:i:0" in my output files.

              So, I used this command to add "XS:A:" column to the .sam file for cufflinks:
              the flag column number is 2, when 16 then strand=- else strand='+'


              awk -F'\t' 'BEGIN { OFS="\t"} {if($2==16) {$(NF+1)="XS:A:-"; print $0,$(NF+1)} else {$(NF+1)="XS:A:+"; print $0,$(NF+1)}}' test.sam > test.sam2

              Comment


              • #8
                I believe the XS tag is to be understood as the strand of the transcript which was sequenced and not the strand which the read itself was aligned to. That is why you, Angela, see reads aligned to opposite strands with identical XS tag. Because of that, it would be unwise to simply insert the XS tag manually by a script if you wish to run Cufflinks as a downstream analysis.

                xhuister, have you checked if your TopHat output SAM contains the XS tag in the reads covering splice junctions but not in the reads contained within exons? Cufflinks only needs the tag for the reads covering splice junctions so if you just did a quick check of the TopHat output you might have missed it.

                Comment


                • #9
                  Hi Thomas,

                  Thanks. You are right, there is XS tag in some of the reads of TOPHAT output, but in only a very small number of the reads.

                  $ grep -c "XS:A" accepted_hits.sam
                  17060
                  wc -c accepted_hits.sam
                  64463256

                  About what your said "Cufflinks only needs the tag for the reads covering splice junctions "?
                  At first, I thought I could used .sam output of Bowtie, then add "XS" tag to the reads, then use cufflinks to find the transcript boundaries. Am I wrong if I do so?

                  Comment


                  • #10
                    Yes, it would be wrong to run Cufflinks on a SAM file produced directly by Bowtie. Bowtie does not identify the strand of the transcript (or rather the splice junctions) so you need to align your reads with TopHat and then run Cufflinks. TopHat runs Bowtie as part of its pipeline so you don't need to run anything before TopHat. Having said that, I haven't actually run Cufflinks on Bowtie output, it might be able to do reasonably well but there is simply no reason to do it since you get a much better estimate of transcript abundances by using TopHat and then Cufflinks.

                    The reason TopHat only reports a strand for the transcript when a read spans a splice junction is that it cannot determine the strand of the read if it is contained within an exon. TopHat determines the strand of a read that spans a splice junction by looking at the splice sites to determine what strand contains a valid GT-AG pair (or AT-AC in case of the minor spliceosome).
                    Last edited by Thomas Doktor; 05-09-2010, 06:06 AM.

                    Comment


                    • #11
                      Thanks Thomas, but I'm rather confused about the strand. I thought that when a read is mapped to the chromosome, then it should be + strand, when mapped to the reverse chromosome, then - strand.
                      But in what you said "it cannot determine the strand of the read within an exon", is this strand not the strand of read mapping to the chromosome?

                      Comment


                      • #12
                        Sorry, I was being unclear. TopHat can't determine what strand the original transcript came from, not the read itself, if the read is entirely contained within an exon.

                        Comment


                        • #13
                          Sorry, I'm still confused.

                          Given the followings:
                          chromosome AAGGGG....
                          read1 AAGGGG
                          read2 CCCCTT

                          Read1 will map to strand +, read2 strand -. If a transcript contains read1 then I think this transcript will be strand +. I don't know why the strand of this transcript is undetermined?

                          Comment


                          • #14
                            It depends on the way you prepared your library. If you used a non-strand-specific protocol (which most people still do) you're not actually sequencing transcripts, you're sequencing cDNA with two strands. So the read could come from either of the two strands of a cDNA and you don't have any information which of the two strands corresponds to the original mRNA strand. This can be inferred when a read spans a splice junctions because splice site are highly conserved at the first 2 bases and last 2 bases of an intron.
                            Another way of inferring directionality would be to look at the exon islands of the transcripts and identify valid open reading frames, but TopHat does not, to my knowledge, employ that strategy.
                            Last edited by Thomas Doktor; 05-09-2010, 10:06 AM.

                            Comment


                            • #15
                              Thank you Thomas. I think the library I analyzed is strand-specific, but I didn't see any option in Tophat to specify strand-specific or non-strand-specific, do you know how to set this in Tophat?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X