Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • flobpf
    Member
    • Apr 2010
    • 76

    #16
    Strandness in GTF/GFF file: Strand of the gene, the coding strand of the RNA, the non-template strand

    Strandness in column 2 of SAM file: The strand the read being mapped onto the reference

    Strandness of Tophat XS:A tag: Strand of the RNA that produce the read.


    What I don't understand is that "Strand of the RNA" and "Strand of the gene in GTF". I think they are the same thing?

    If I am correct, then for a gene A sit on the +ve strand. Then I should only say the read comes from this gene A if Tophat tag this pair with XS:A:+ve.
    Marcowanger, I think you are right. Please see this image attached for an illustrated explanation (and let me know if I'm wrong)


    So the read obtained from directional sequencing corresponds to the template strand of the gene (not coding strand). Having said that, I infer the following. However, if we use XS tag in TopHat, we'd get the sense/antisense wrong (thanks Marcowanger for pointing that this was not an HTSeq issue):


    Is my understanding correct? Or am I being mistaken?

    Flobpf

    UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.
    Last edited by flobpf; 06-10-2011, 08:15 AM. Reason: Update. Also see last post

    Comment

    • marcowanger
      Senior Member
      • Dec 2008
      • 273

      #17
      To strengthen our understanding on this issue, perhaps we should take a look at the paper on strand specific sequencing.

      I take this for an example. http://nar.oxfordjournals.org/conten...e123/suppl/DC1

      Please see below attachment:
      nar-00743-met-g-2009-File006.marco.pdf

      I have add my annotation to the above supplementary file (highlighting for better illustration). Basically if you read to page 4, we know the read's sequence is identical to the 1st strand cDNA (i.e. complementary to the DNA coding strand, a.k.a mRNA sequence with T-U substitution). Then flobpf you are right. The read should corresponds to DNA's template strand, which is also complementary to the strandness in GTF/GFF file (which indicates the gene coding strand)


      XS:A:+/-:

      This is the tophat's tag not HTSeq tag.

      If the interpretation of XS:A is right
      XS: "+" for GT-AG and "-" for CT-AC
      (http://seqanswers.com/forums/showthr...=tophat+strand)

      As HT-Seq count only the read map to the Gene coding strand, the read from a gene-coding-strand being +ve will be mapped to -ve DNA strand (A gene with template strand -ve, coding strand+ve produces a mRNA that will be mapped to template strand -ve), then being discarded by HT-seq.

      By the way, anyone know how HT-seq deals with paired-ended strand specific sequencing? Does it only look at Forward pair to infer strandness?


      Originally posted by flobpf View Post
      Marcowanger, I think you are right. Please see this image attached for an illustrated explanation (and let me know if I'm wrong)


      So the read obtained from directional sequencing corresponds to the template strand of the gene (not coding strand). Having said that, I infer the following. However, HTSeq-counts seems to guess the sense/antisense wrong:


      Is my understanding correct? Or am I being mistaken?

      Flobpf
      Last edited by marcowanger; 03-17-2011, 08:05 PM.
      Marco

      Comment

      • marcowanger
        Senior Member
        • Dec 2008
        • 273

        #18
        As Simon assumes
        I assume strand-specific RNA-Seq sequences the coding strand
        .

        Then it probably explains why I got extremely low count for nearly all of the genes.
        Marco

        Comment

        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #19
          Just to quickly clarify:

          htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

          With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

          I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

          I hope this makes sense.

          Simon

          Comment

          • marcowanger
            Senior Member
            • Dec 2008
            • 273

            #20
            Thanks Simon.

            Originally posted by Simon Anders View Post
            Just to quickly clarify:

            htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

            With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

            I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

            I hope this makes sense.

            Simon
            Marco

            Comment

            • flobpf
              Member
              • Apr 2010
              • 76

              #21
              Resolved

              UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.

              The correct option to use for Cufflinks is fr-secondstrand
              Flobpf

              Comment

              • marcowanger
                Senior Member
                • Dec 2008
                • 273

                #22
                Flobpf, nice to hear from you again.

                May you kindly do some illustration? I got the protocol from our sequencing center and it seems the other way round. Maybe we have different protocol?

                Marco

                Originally posted by flobpf View Post
                UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.

                The correct option to use for Cufflinks is fr-secondstrand
                Flobpf
                Marco

                Comment

                • flobpf
                  Member
                  • Apr 2010
                  • 76

                  #23
                  Depends on method

                  Originally posted by marcowanger View Post
                  Flobpf, nice to hear from you again.

                  May you kindly do some illustration? I got the protocol from our sequencing center and it seems the other way round. Maybe we have different protocol?

                  Marco
                  It depends on what method you are using. If you are using Illumina directional sequencing protocol, you get the coding strand sequence. With dUTP, its the template strand. The correct options to use are noted here now:

                  Comment

                  • marcowanger
                    Senior Member
                    • Dec 2008
                    • 273

                    #24
                    It make more sense. We used dUTP so we got template strand.

                    Originally posted by flobpf View Post
                    It depends on what method you are using. If you are using Illumina directional sequencing protocol, you get the coding strand sequence. With dUTP, its the template strand. The correct options to use are noted here now:
                    http://cufflinks.cbcb.umd.edu/manual.html#library
                    Marco

                    Comment

                    • marcowanger
                      Senior Member
                      • Dec 2008
                      • 273

                      #25
                      Dear Simon,

                      How can I modify the code to reverse the --strand from counting coding strand to template strand?

                      Originally posted by Simon Anders View Post
                      Just to quickly clarify:

                      htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

                      With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

                      I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

                      I hope this makes sense.

                      Simon
                      Marco

                      Comment

                      • Simon Anders
                        Senior Member
                        • Feb 2010
                        • 995

                        #26
                        I've added the '--standed=reverse' option promised above since then.

                        Comment

                        • marcowanger
                          Senior Member
                          • Dec 2008
                          • 273

                          #27
                          Dear Simon, I just downloaded 0.5.1.p3 .

                          It seems there is no --stranded=reverse?


                          Options:
                          -h, --help show this help message and exit
                          -m MODE, --mode=MODE mode to handle reads overlapping more than one
                          feature(choices: union, intersection-strict,
                          intersection-nonempty; default: union)
                          -s STRANDED, --stranded=STRANDED
                          whether the data is from a strand-specific assay
                          (default: yes)
                          is it still a hidden function?
                          Originally posted by Simon Anders View Post
                          I've added the '--standed=reverse' option promised above since then.
                          Marco

                          Comment

                          • marcowanger
                            Senior Member
                            • Dec 2008
                            • 273

                            #28
                            This is the error message I got from 0.5.1 p3

                            htseq-count: error: option -s: invalid choice: 'reverse' (choose from 'yes', 'no')
                            Marco

                            Comment

                            • Simon Anders
                              Senior Member
                              • Feb 2010
                              • 995

                              #29
                              Sorry, it seems I forgot to upload the latest version. Please try with 0.5.3.

                              Comment

                              • billstevens
                                Senior Member
                                • Mar 2012
                                • 120

                                #30
                                I'm sorry, I just read this entire thread, and I still have no idea if my data is strand-specific or not. The core I send it to uses the old TruSeq RNA kit, http://epigenome.usc.edu/docs/resour...15008136_A.pdf

                                I was looking at my accepted_hits.bam file from Tophat to figure it out, but the second column doesn't have a 0 or a 16. Here is the actual output:

                                HWI-1KL118:23:C0J57ACXX:8:1308:11486:195428 pPR1s chr1 565039 3 100M = 565073 135 CCGTCATCTACTCTACCATCTTTGCAGGCACACTCATCACAGCGCTAAGCTCGCACTGATTTTTTACCTGAGTAGGCCTAGAAATAAACATGCTAGCTTT @CCDDFFFHHDFHIJJJIEHIJJJFHIGHIEIIHFGHBGIIGHGHGHGGHGJJJBGHJCHHEEEDCEECECCCCDCCCCCCDDDDDCDDDDDDCDDDACC NM:i:0 NH:i:2 CC:Z:chrM CP:i:4490 HI:i:0

                                Can anyone help?

                                Comment

                                Latest Articles

                                Collapse

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, 06-09-2026, 11:58 AM
                                0 responses
                                22 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-05-2026, 10:09 AM
                                0 responses
                                28 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-04-2026, 08:59 AM
                                0 responses
                                39 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                61 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...