Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Illumina Directional RNA-Seq

    Hi,

    We recently got single end directional mRNA-Seq data obtained using the new Illumina protocol. I'm trying to map it to the genome using TopHat and Cufflinks. For both programs to understand that the library prep was different, I'm using the option

    Code:
    --library-type fr-firststrand
    However, I'm not really sure whether the Illumina protocol only does first strand sequencing, or whether the library type should be standard Illumina or fr-secondstrand. I did it both ways (standard and firststrand) and found that the number of transcripts I get after Cufflinks is quite different. I know Illumina uses its own 5' and 3' adapters for defining the strand specificity, but not really clear how it does that/which strand it ends up sequencing?? Is the sequence we get that of the Coding Strand or the Template Strand?

    Please help.

    Thanks!
    Last edited by flobpf; 02-07-2011, 03:11 PM.

  • #2
    Update

    I think that the Illumina directional sequencing protocol only does first strand sequencing. So fr-firststrand may be the option to use for TopHat and Cufflinks. Similarly, if you want to find out # of reads mapping to annotated genes using htseq-counts function and want to know both sense and antisense expression of a gene, one would need to use the option --stranded=no else it will only give sense expression.
    Last edited by flobpf; 04-25-2011, 10:47 AM.

    Comment


    • #3
      I am a bit confused. According to Simon Anders

      htseq-counts assumes your RNA-Seq data to be strand-specific, i.e., it will only count those genes which map to the strand that the feature is on. If you want it to count reads on both strands, use the '--straded=no' option.
      I assume htseq count the read that map to the sense/antisense according the the strand of feature, so we should use --stranded=yes?

      But practically, I found if I use --stranded on my ss-RNAseq I got prettly low count for the genes..

      So I wonder which should I use.



      Originally posted by flobpf View Post
      So it turns out that the Illumina directional sequencing protocol only does first strand sequencing. So fr-firststrand is the option to use for TopHat and Cufflinks. Similarly, if you want to find out # of reads mapping to annotated genes using htseq-counts function and want to know both sense and antisense expression of a gene, one would need to use the option --stranded=no else it will only give sense expression.

      hope this helps someone in need...
      Marco

      Comment


      • #4
        In case of "--stranded=yes", only those reads are counted that map to the same strand as the gene sits on, i.e., I assume strand-specific RNA-Seq sequences the coding strand. This is how it is with the data from our lab (which has been made with an older pre-release version of Illumina's strand-specific protocol). To be honest, I didn't even think about that the reads could all end up on the opposite strand, but, of course, if the adapters are now ligated the other way round (P5 to 3', P7 to 5' of first strand), we can get it that way as well. So maybe, I should add a third setting, something like '--stranded=reverse'.

        Comment


        • #5
          HTSeq antisense values

          But practically, I found if I use --stranded on my ss-RNAseq I got prettly low count for the genes..

          So I wonder which should I use.
          I guess HTSeq is only mapping the reads on the same strand as the annotated feature. Using --stranded=no will (probably) consider all reads mapping to the same locus regardless of the strand, but still it will not give you the degree of antisense expression.

          What I did here was
          1) used -s=no option to count the reads regardless of whether they were sense or antisense to a gene
          2) Then, i wrote a custom python script to figure out from the readcounts file and the GFF file whether a given read was sense or antisense to a gene (0 = + strand and if gene = + strand, sense; 16=-strand and if gene = + strand, antisense)
          3) I created pseudo-annotations - geneA_forward and geneA_reverse and reported number of reads mapping to a given gene in either orientation

          To avoid these intermediate steps, --stranded=reverse option will be very helpful! In fact, if HTSeq-count could tell me the number of reads mapping to a gene in both sense and antisense, that would be the best!
          Last edited by flobpf; 03-16-2011, 04:27 PM. Reason: this approach is better

          Comment


          • #6
            Thanks for the clarification.

            Would you mind help me explain the following observation?

            I used htseq-count -t exon -i Name --mode=union --stranded=[yes/no] -o input.tagged.sam input.readName.sorted.sam hg19.Refseq.gff

            chr13 hg19_refGene gene 101183811 101185997 0.000000 - . ID=87769;Name=A2LD1
            chr13 hg19_refGene mRNA 101183811 101185997 0.000000 - . ID=NM_033110;Name=A2LD1;Parent=87769
            chr13 hg19_refGene exon 101183811 101184855 0.000000 - . ID=NM_033110.;Name=A2LD1;Parent=NM_033110
            chr13 hg19_refGene exon 101185817 101185997 0.000000 - . ID=NM_033110.;Name=A2LD1;Parent=NM_033110
            The gene "A2LD1" is coded by the -ve strand

            Then I have the reads mapped by BWA/Tophat. Both shows the reads should be mapped to -ve strand.

            BWA- the tagged SAM output by Htseq (0.47p2) with --stranded=no:

            GRC076:3:61:5267:19366#0 99 chr13 101183857 60 76M = 101183974 193 GCTGTGAAGCTTGATCTGAAAATTAGGAAGAATGGTGCTATACGCACTCTGATATACAGCATGCTTGCCCTGTGGA bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XF:Z:A2LD1
            GRC076:3:61:5267:19366#0 147 chr13 101183974 60 76M = 101183857 -193 CTCAGGGGTTACAGGCCAGAATCAGTATAGTAGGGCTGCTAAAATCTAGCGGCAACACCAAGTCTTTACACTGATC babbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XF:Z:A2LD1
            However, when I used --stranded=yes, then
            GRC076:3:61:5267:19366#0 99 chr13 101183857 60 76M = 101183974 193 GCTGTGAAGCTTGATCTGAAAATTAGGAAGAATGGTGCTATACGCACTCTGATATACAGCATGCTTGCCCTGTGGA bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XF:Z:no_feature
            GRC076:3:61:5267:19366#0 147 chr13 101183974 60 76M = 101183857 -193 CTCAGGGGTTACAGGCCAGAATCAGTATAGTAGGGCTGCTAAAATCTAGCGGCAACACCAAGTCTTTACACTGATC babbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XF:Z:no_feature

            I suppose when I use --stranded=yes, this pair should be still tagged to A2LD1, right? But It turns out it is only counted when stranded=no.

            p.s. I found out this problem when I used --stranded=yes to count the reads in my strand specific sequencing and found more than 90% of the reads are not counted, which is weird.
            Originally posted by Simon Anders View Post
            In case of "--stranded=yes", only those reads are counted that map to the same strand as the gene sits on, i.e., I assume strand-specific RNA-Seq sequences the coding strand. This is how it is with the data from our lab (which has been made with an older pre-release version of Illumina's strand-specific protocol). To be honest, I didn't even think about that the reads could all end up on the opposite strand, but, of course, if the adapters are now ligated the other way round (P5 to 3', P7 to 5' of first strand), we can get it that way as well. So maybe, I should add a third setting, something like '--stranded=reverse'.
            Last edited by marcowanger; 03-16-2011, 07:53 PM. Reason: supplementary
            Marco

            Comment


            • #7
              The protocol we used

              1> Poly-A selection, Fragmentation
              2> First strand cDNA synthesis
              3> Second strand syntehsis with dUNTP
              4> end repair,3' end A-tailing, Adapter ligation (i have no idea if P5 to 3', P7 to 5' of first strand was used)
              5> Size selection
              6> UDGase treatment

              I suppose in this protocol, we should have sequenced the coding strand?
              Marco

              Comment


              • #8
                One thing come into my mind.

                We used strand specific (ss) paired-end sequencing.

                And from illumina's website, the ss prep only works on single read flow cell only
                The Directional mRNA-Seq Sample Prep protocol is an experimental application of Illumina technology, requiring Illumina small RNA sequencing primers and the mRNA-Seq Sample Prep kit. This protocol is compatible with single-read flow cells only
                And Simon you mentioned

                This is how it is with the data from our lab (which has been made with an older pre-release version of Illumina's strand-specific protocol)
                Does it mean that HTSeq only count if both read of the pair are mapped into the strand as the gene sit on?

                If so, I think the situation we faced is that, we have paired-end sequencing so that the FORWARD read (with a tag 99) always get mapped into the coding strand. Therefore in Tophat using --library-type fr-firststrand, the read mapped to Forward strand has a tag XS:A:-, meaning the RNA that produce this read comes from -ve strand (coding strand). Then, the gene that code for this read should sit on +ve strand (to have a -ve coding strand). Therefore, this Forward read, if exists alone, should be counted.

                The complication comes in, when the reverse read of the pair is of opposite strand of the Forward (tag: 147), so that it behaves the other way round. Therefore, summed it up, the pair is considered not the be counted in the gene and is discarded. Is it true?

                So in the meantime, I should use --stranded=no, and wait for a htseq update?

                Originally posted by Simon Anders View Post
                In case of "--stranded=yes", only those reads are counted that map to the same strand as the gene sits on, i.e., I assume strand-specific RNA-Seq sequences the coding strand. This is how it is with the data from our lab (which has been made with an older pre-release version of Illumina's strand-specific protocol). To be honest, I didn't even think about that the reads could all end up on the opposite strand, but, of course, if the adapters are now ligated the other way round (P5 to 3', P7 to 5' of first strand), we can get it that way as well. So maybe, I should add a third setting, something like '--stranded=reverse'.
                Last edited by marcowanger; 03-16-2011, 09:23 PM. Reason: supp.
                Marco

                Comment


                • #9
                  Thanks flobpf

                  Originally posted by flobpf View Post
                  I guess HTSeq is only mapping the reads on the same strand as the annotated feature. Using --stranded=no will (probably) consider all reads mapping to the same locus regardless of the strand, but still it will not give you the degree of antisense expression.

                  What I did here was
                  1) used -s=no option to count the reads regardless of whether they were sense or antisense to a gene
                  2) Then, i wrote a custom python script to figure out from the readcounts file and the GFF file whether a given read was sense or antisense to a gene (0 = + strand and if gene = + strand, sense; 16=-strand and if gene = + strand, antisense)
                  3) I created pseudo-annotations - geneA_forward and geneA_reverse and reported number of reads mapping to a given gene in either orientation

                  To avoid these intermediate steps, --stranded=reverse option will be very helpful! In fact, if HTSeq-count could tell me the number of reads mapping to a gene in both sense and antisense, that would be the best!
                  Marco

                  Comment


                  • #10
                    I mean in the meantime, I should

                    1>
                    So in the meantime, I should use --stranded=no
                    2> Filter the tagged sam by Htseq -o, filter the reads mapped to the gene that are tagged by Tophat XS:A:[+/-] with the strand matching the Gene strand in the GTF/GFF file.

                    Is it true?
                    Marco

                    Comment


                    • #11
                      I suppose when I use --stranded=yes, this pair should be still tagged to A2LD1, right? But It turns out it is only counted when stranded=no.
                      I'm not sure why that is happening. I used tophat for mapping my reads, am not familiar with BWA.
                      I mean in the meantime, I should

                      1>
                      Quote:
                      So in the meantime, I should use --stranded=no
                      2> Filter the tagged sam by Htseq -o, filter the reads mapped to the gene that are tagged by Tophat XS:A:[+/-] with the strand matching the Gene strand in the GTF/GFF file.

                      Is it true?
                      As far as I understand, TopHat XS:A is not the right FLAG. Strandedness of the read is displayed in the second column (0 or 16)
                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

                      Comment


                      • #12
                        flobpf, I think that is 2 different issues we are talking about here.

                        The term "strandness" puzzled me so much.

                        Strandness of the read:
                        The second column is about the strandness of the read's sequence.
                        For example, when a read ATGC being mapped onto the forward strand of the DNA, then this read is a FORWARD READ , as indicated by the status in column 2.

                        Strandness of the RNA that produce this read:
                        But the Tophat XS tag correspond to the strandness that the RNA that code for this read. Think we have a strand specific illumina sequencing, we only sequence the first strand of the mRNA (yet the genes may express mRNA from either +ve strand / -ve strand of the DNA sequences). Therefore, when mRNA is converted to first strand cDNA and get sequenced back, its sequence still correspond to the coding strand of the DNA (i.e. the strand the gene sit on). So theoretically, I think that is the FORWARD read of the pair determines the strand of the mRNA molecules that the read come froms (PE sequencing in illumina produce FR read on opposite strand).



                        Originally posted by flobpf View Post
                        I'm not sure why that is happening. I used tophat for mapping my reads, am not familiar with BWA.


                        As far as I understand, TopHat XS:A is not the right FLAG. Strandedness of the read is displayed in the second column (0 or 16)
                        http://seqanswers.com/forums/showthread.php?t=10122
                        Last edited by marcowanger; 03-17-2011, 05:12 AM.
                        Marco

                        Comment


                        • #13
                          In fact I just used BWA to illustrate the scenario. I faced the problem using tophat BAM.....


                          Originally posted by flobpf View Post
                          I'm not sure why that is happening. I used tophat for mapping my reads, am not familiar with BWA.


                          As far as I understand, TopHat XS:A is not the right FLAG. Strandedness of the read is displayed in the second column (0 or 16)
                          http://seqanswers.com/forums/showthread.php?t=10122
                          Marco

                          Comment


                          • #14
                            Maybe first we have to clarify the terms so that we can discuss on the same line.

                            Please correct me if I am wrong,

                            Strandness in GTF/GFF file: Strand of the gene, the coding strand of the RNA, the non-template strand

                            Strandness in column 2 of SAM file: The strand the read being mapped onto the reference

                            Strandness of Tophat XS:A tag: Strand of the RNA that produce the read.


                            What I don't understand is that "Strand of the RNA" and "Strand of the gene in GTF". I think they are the same thing?

                            If I am correct, then for a gene A sit on the +ve strand. Then I should only say the read comes from this gene A if Tophat tag this pair with XS:A:+ve.
                            Last edited by marcowanger; 03-17-2011, 05:20 AM.
                            Marco

                            Comment


                            • #15
                              I think this "thread" should be moved to "bioinformatics"?
                              Marco

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              71 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              80 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X