Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Strandness in GTF/GFF file: Strand of the gene, the coding strand of the RNA, the non-template strand

    Strandness in column 2 of SAM file: The strand the read being mapped onto the reference

    Strandness of Tophat XS:A tag: Strand of the RNA that produce the read.


    What I don't understand is that "Strand of the RNA" and "Strand of the gene in GTF". I think they are the same thing?

    If I am correct, then for a gene A sit on the +ve strand. Then I should only say the read comes from this gene A if Tophat tag this pair with XS:A:+ve.
    Marcowanger, I think you are right. Please see this image attached for an illustrated explanation (and let me know if I'm wrong)


    So the read obtained from directional sequencing corresponds to the template strand of the gene (not coding strand). Having said that, I infer the following. However, if we use XS tag in TopHat, we'd get the sense/antisense wrong (thanks Marcowanger for pointing that this was not an HTSeq issue):


    Is my understanding correct? Or am I being mistaken?

    Flobpf

    UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.
    Last edited by flobpf; 06-10-2011, 08:15 AM. Reason: Update. Also see last post

    Comment


    • #17
      To strengthen our understanding on this issue, perhaps we should take a look at the paper on strand specific sequencing.

      I take this for an example. http://nar.oxfordjournals.org/conten...e123/suppl/DC1

      Please see below attachment:
      nar-00743-met-g-2009-File006.marco.pdf

      I have add my annotation to the above supplementary file (highlighting for better illustration). Basically if you read to page 4, we know the read's sequence is identical to the 1st strand cDNA (i.e. complementary to the DNA coding strand, a.k.a mRNA sequence with T-U substitution). Then flobpf you are right. The read should corresponds to DNA's template strand, which is also complementary to the strandness in GTF/GFF file (which indicates the gene coding strand)


      XS:A:+/-:

      This is the tophat's tag not HTSeq tag.

      If the interpretation of XS:A is right
      XS: "+" for GT-AG and "-" for CT-AC
      (http://seqanswers.com/forums/showthr...=tophat+strand)

      As HT-Seq count only the read map to the Gene coding strand, the read from a gene-coding-strand being +ve will be mapped to -ve DNA strand (A gene with template strand -ve, coding strand+ve produces a mRNA that will be mapped to template strand -ve), then being discarded by HT-seq.

      By the way, anyone know how HT-seq deals with paired-ended strand specific sequencing? Does it only look at Forward pair to infer strandness?


      Originally posted by flobpf View Post
      Marcowanger, I think you are right. Please see this image attached for an illustrated explanation (and let me know if I'm wrong)


      So the read obtained from directional sequencing corresponds to the template strand of the gene (not coding strand). Having said that, I infer the following. However, HTSeq-counts seems to guess the sense/antisense wrong:


      Is my understanding correct? Or am I being mistaken?

      Flobpf
      Last edited by marcowanger; 03-17-2011, 08:05 PM.
      Marco

      Comment


      • #18
        As Simon assumes
        I assume strand-specific RNA-Seq sequences the coding strand
        .

        Then it probably explains why I got extremely low count for nearly all of the genes.
        Marco

        Comment


        • #19
          Just to quickly clarify:

          htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

          With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

          I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

          I hope this makes sense.

          Simon

          Comment


          • #20
            Thanks Simon.

            Originally posted by Simon Anders View Post
            Just to quickly clarify:

            htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

            With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

            I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

            I hope this makes sense.

            Simon
            Marco

            Comment


            • #21
              Resolved

              UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.

              The correct option to use for Cufflinks is fr-secondstrand
              Flobpf

              Comment


              • #22
                Flobpf, nice to hear from you again.

                May you kindly do some illustration? I got the protocol from our sequencing center and it seems the other way round. Maybe we have different protocol?

                Marco

                Originally posted by flobpf View Post
                UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.

                The correct option to use for Cufflinks is fr-secondstrand
                Flobpf
                Marco

                Comment


                • #23
                  Depends on method

                  Originally posted by marcowanger View Post
                  Flobpf, nice to hear from you again.

                  May you kindly do some illustration? I got the protocol from our sequencing center and it seems the other way round. Maybe we have different protocol?

                  Marco
                  It depends on what method you are using. If you are using Illumina directional sequencing protocol, you get the coding strand sequence. With dUTP, its the template strand. The correct options to use are noted here now:

                  Comment


                  • #24
                    It make more sense. We used dUTP so we got template strand.

                    Originally posted by flobpf View Post
                    It depends on what method you are using. If you are using Illumina directional sequencing protocol, you get the coding strand sequence. With dUTP, its the template strand. The correct options to use are noted here now:
                    http://cufflinks.cbcb.umd.edu/manual.html#library
                    Marco

                    Comment


                    • #25
                      Dear Simon,

                      How can I modify the code to reverse the --strand from counting coding strand to template strand?

                      Originally posted by Simon Anders View Post
                      Just to quickly clarify:

                      htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

                      With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

                      I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

                      I hope this makes sense.

                      Simon
                      Marco

                      Comment


                      • #26
                        I've added the '--standed=reverse' option promised above since then.

                        Comment


                        • #27
                          Dear Simon, I just downloaded 0.5.1.p3 .

                          It seems there is no --stranded=reverse?


                          Options:
                          -h, --help show this help message and exit
                          -m MODE, --mode=MODE mode to handle reads overlapping more than one
                          feature(choices: union, intersection-strict,
                          intersection-nonempty; default: union)
                          -s STRANDED, --stranded=STRANDED
                          whether the data is from a strand-specific assay
                          (default: yes)
                          is it still a hidden function?
                          Originally posted by Simon Anders View Post
                          I've added the '--standed=reverse' option promised above since then.
                          Marco

                          Comment


                          • #28
                            This is the error message I got from 0.5.1 p3

                            htseq-count: error: option -s: invalid choice: 'reverse' (choose from 'yes', 'no')
                            Marco

                            Comment


                            • #29
                              Sorry, it seems I forgot to upload the latest version. Please try with 0.5.3.

                              Comment


                              • #30
                                I'm sorry, I just read this entire thread, and I still have no idea if my data is strand-specific or not. The core I send it to uses the old TruSeq RNA kit, http://epigenome.usc.edu/docs/resour...15008136_A.pdf

                                I was looking at my accepted_hits.bam file from Tophat to figure it out, but the second column doesn't have a 0 or a 16. Here is the actual output:

                                HWI-1KL118:23:C0J57ACXX:8:1308:11486:195428 pPR1s chr1 565039 3 100M = 565073 135 CCGTCATCTACTCTACCATCTTTGCAGGCACACTCATCACAGCGCTAAGCTCGCACTGATTTTTTACCTGAGTAGGCCTAGAAATAAACATGCTAGCTTT @CCDDFFFHHDFHIJJJIEHIJJJFHIGHIEIIHFGHBGIIGHGHGHGGHGJJJBGHJCHHEEEDCEECECCCCDCCCCCCDDDDDCDDDDDDCDDDACC NM:i:0 NH:i:2 CC:Z:chrM CP:i:4490 HI:i:0

                                Can anyone help?

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                71 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                80 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X