Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #46
    Originally posted by carmeyeii View Post
    No prob.
    OK, so just to make sure I'm on the right track -- the sequence that is "spit out" by the sequencer is the actual sequence as seen by the camera... i.e, no base translation, just raw fluorescence -> letter .
    Correct.

    So if the first read of a pair is the "first [cDNA] strand", THAT one was the strand synthesized during the first round of sequencing, using as template the strand corresponding to the RNA molecule sequence and direction.

    C
    The first cDNA strand synthesized uses the actual RNA molecule as its template, so it seems a little convoluted to say "using as template the strand corresponding to the RNA molecule sequence and direction." (which almost made my head spin around like Linda Blair ).

    Comment


    • #47

      I know - my bad. i just wanted to say that it is the original RNA molecule, without it being "actually the RNA molecule", which of course we all know because this is cDNA we're talking about.

      Comment


      • #48
        Now the Illumina TruSeq stranded RNA-Seq kits use a dUTP second strand marking protocol (like ScriptSeq). The correct option for this protocol is fr-firststrand.
        And the correct option for htseq-count is --stranded=reverse?

        Comment


        • #49
          Now the Illumina TruSeq stranded RNA-Seq kits use a dUTP second strand marking protocol (like ScriptSeq). The correct option for this protocol is fr-firststrand.
          Originally posted by dvanic View Post
          And the correct option for htseq-count is --stranded=reverse?
          No, for this protocol you should use --stranded=yes. I was wrong.

          EDIT:

          Sorry, I was wrong in my original post, and dvanic was correct. If your library was prepared using the Illumina TruSeq Stranded RNA Kit use --stranded reverse in htseq-count.
          Last edited by kmcarr; 04-03-2014, 04:33 AM.

          Comment


          • #50
            Sorry, but I think you're wrong!

            I am using the currently available Illumina TrueSeq Stranded Total RNA Sample Preparation kit, the manual for which is Illumina part number 15031048 Rev. D. (googling that gives you the manual)

            First off - QC; I use rseqc as one of my QC tools and it shows me that my "strandedness" is relatively clean:
            Code:
            This is PairEnd Data
            Fraction of reads explained by "1++,1--,2+-,2-+": 0.0375
            Fraction of reads explained by "1+-,1-+,2++,2--": 0.9625
            Fraction of reads explained by other combinations: 0.0000
            So I can assume that for highly expressed housekeeping genes looking at the flags should be relatively informative.

            /not to mention that rseqc is actually giving me the answer (facepalm) I am looking for (looked at this after writing everything below):

            We can infer that this is a strand-specific RNA-seq data. Strandness of read1 is opposite to that of gene model, while strandness of read2 is is consistent with the strand of reference gene model.
            1+-,1-+,2++,2––
            read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
            read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
            read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
            read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
            When I pull out the reads from the human gapdh locus (irrespective of strand, using samtools) and look at them, I see that the majority of tags are 163/83:
            Code:
            # Count  Flag
                  1 345
                  6 153
                  7 73
                 17 145
                 17 97
                 53 385
                 76 417
                 88 339
                 88 419
                 90 337
                 93 369
                136 129
                158 99
                159 147
                187 113
                197 137
                753 89
               1796 161
               1859 81
              20382 163
              20382 83
            These tags mean:
            • 163 - read paired; read mapped in proper pair; mate reverse strand; second in pair - i.e. 2nd in pair is on (+) strand
            • 83 - read paired; read mapped in proper pair; read reverse strand; first in pair - i.e. 1st in pair is on (-) strand


            GAPDH in humans is on the (+) strand, so looking at Simon's response in the thread above I would think it's --stranded=reverse:

            The first mate aligns to the strand opposite to the gene, so you need --stranded=reverse.
            If I look at a case similar to Nico's, where the gene is on the (-) strand (tfrc, not as highly expressed as GAPDH in my dataset):
            Code:
            # Count  Flag
                  1 129
                  1 321
                  1 353
                  1 355
                  1 403
                  3 433
                  4 401
                 13 153
                 14 65
                 19 177
                 50 73
                280 97
                329 145
                782 99
                815 147
            Most of the reads are 99/146, just like in Nico's example...
            So looking purely at the data + Simon's comment above, it seems like I should be using --stranded=reverse


            __________________________________________________________________________


            When I use htseq-count on my sam files from the gapdh locus, if I use stranded=yes:
            Code:
            cat gapdh_yes.counttable1 | awk '{if ($2 != 0) print $0}'
            ENSG00000010295.14	18
            ENSG00000111640.9	18
            no_feature	23676
            alignment_not_unique	588
            When I use htseq-count on my sam files from just the gapdh locus, if I use stranded=reverse:
            Code:
            cat gapdh_reverse.counttable1 | awk '{if ($2 != 0) print $0}'
            ENSG00000010295.14	103
            ENSG00000111640.9	23279
            no_feature	330
            alignment_not_unique	588
            ENSG00000010295.14 is IFFO1, a much more lowly expressed gene on the (-) strand downstream of GAPDH.


            __________________________________________________________________________
            BUT I would like to logically understand what's going on.
            Looking at the information on the Illumina website, it says:
            How do the TruSeq Stranded mRNA and Total RNA protocols generate stranded cDNA?
            Strand specificity is achieved by replacing dTTP with dUTP in the Second Strand Marking Mix (SMM). The incorporation of dUTP in second strand synthesis effectively quenches the second strand during amplification, since the polymerase used in the assay will not incorporate past this nucleotide. Further specificity is achieved by addition of Actinomycin D to the First Strand Master Mix Act D (FSA). Actinomycin prevents spurious DNA dependent synthesis during first strand synthesis, while allowing RNA dependent synthesis.

            How do you ensure that different adapters are ligated to each end of a DNA fragment using this protocol?
            Illumina’s proprietary method ensures ligation of two different adapters in the required orientation to opposing ends of a DNA fragment. PCR selects for these and finalizes the construct ready for hybridizing onto the flow cell surface. Adapter sequences can be determined by sequencing the ligation fragments, but sequence information alone is not sufficient to uncover the method.

            How is strandedness maintained after DNA amplification?
            Strandedness is maintained via the directionality of the adapters. The p7 adapter will be on the 3' end of the cDNA strand. As a consequence, the cDNA strand is sequenced. Second strand synthesis is performed using dUTP in place of dUTP and a high fidelity polymerase that cannot process through dUTP template is used for PCR enrichment. As a result, only the first strand product is amplified and the strand information is retained based on the p5 and p7 adapter orientation.
            The P5 end is sequenced first, then the P7; another name for the P5 is the "Universal adapter" (UNad) and P7 is "Index-containing adapter" (IndexAd), because it has the index for de-multiplexing multiple libraries run on the same lane.

            Now for some ASCII art:

            Code:
            RNA 	5' ~~~~~~~~~~~~~~~~~~~~~~~~ 3'
            
            cDNA-1	3' ------------------------ 5' /Illumina 1st strand synthesis
            cDNA-2	5' ---U---U---U---U-------- 3' /Illumina 2nd strand synthesis
            # what you have at the end of the "Synthesize Second Strand cDNA" step pp. 69 of the manual

            # Next step - Adenylate 3' Ends
            A single ‘A’ nucleotide is added to the 3’ ends of the blunt fragments to prevent them from ligating to one another during the adapter ligation reaction. A corresponding single‘T’ nucleotide on the 3’ end of the adapter provides a complementary overhang for ligating the adapter to the fragment. This strategy ensures a low rate of chimera (concatenated template) formation.

            Code:
            # have in tube:
            cDNA-1	3' A------------------------ 5' /Illumina 1st strand + A
            cDNA-2	5' ---U---U---U---U--------A 3' /Illumina 2nd strand + A
            # Ligate Adapters
            This process ligates multiple indexing adapters to the ends of the ds cDNA, preparing them for hybridization onto a flow cell.
            Code:
            # have in tube:
            cDNA-1	3' IndexAd-A-------------------------UNad 5' /Illumina 1st strand + A
            cDNA-2	5' UnAd--U---U---U---U----------A IndexAd 3' /Illumina 2nd strand + A
            As said above in the Illumina documentation, "the p7 (indexing) adapter will be on the 3' end of the cDNA strand"

            # PCR-amplify, using a polymerase that doesn't amplify from the dUTP containing-strand:
            # have in tube:
            Code:
            cDNA-1	3' IndexAd-A-----------------------UNad 5'
            copy 	5' IndexAd-T-----------------------UNad 3'
            # Sequence, starting from adapter P5 (UNad) then P7 (IndexAd).
            # Following the schematic here (http://nextgen.mgh.harvard.edu/IlluminaChemistry.html), only one of those strands would be sequenced in the 1st round of sequencing - the one that is "stuck" to the flowcell by the indexing adapter end, so "copy" in my diagram:
            Code:
            stuck to flowcell-copy - 5' IndexAd-T-----------------------UNad 3'
            											   		 <------UNad 5'
            
            Results in Read 1 being  3' IndexAd-A---------------------- 5'
            This means the sequence of read 1 is identical to that of cDNA-1 above, meaning that is the complement of the original RNA - which is indeed what I see in the case of GAPDH and TFRC - 1st in pair is on opposite strand to the gene.

            # Then we have bridge amplification, followed by cleavage at the P7-attached fragments, meaning the sequence of the attached fragment is now that of "cDNA1"
            In Read 2 we sequence off the P7 adapter:
            Code:
            stuck to flowcell-cDNA1 - 5'-UNad--------------------------T-IndexAd 3'
            													<------A-IndexAd 5'
            
            Results in read2 being 	 3'-UNad---------------------------A 5'
            I.e. the sequence of the read is complementary to cDNA1, or the sequence of the original RNA, which is what we see: GAPDH read 2 is on the + strand, and TFRC is on the - strand, just like the gene annotations.

            Comment


            • #51
              Single Cell RNA seq

              Hi to all
              this is my first post in this forum, and I am sorry my english is not good so please dont mind. I am asking some guidance for RNA-seq differential expression.
              I want to do single cell sequencing (Illumina, PE) of human sample (before and after chemotherapy) to study differential expression, transcript discover using tophat, Cufflinks. I want to ask your openion :
              1. How much the seq read length should be (long or short) (100, 151, or more) ?
              2. How many replicate per sample ? for example, 1 sample of liver cells (before chemotherapy); how many replicates should be taken for sequencing (one time sequencing the sample or three times sequencing) ?
              3. How long will it take to finish 20 sample sequencing ?
              4. Is sequencing procedure for single-cell is same as normal sequencing ?

              Waiting for your kind replies.
              Thank you
              Nakiyami jp.

              Comment


              • #52
                Originally posted by flobpf View Post
                I think that the Illumina directional sequencing protocol only does first strand sequencing. So fr-firststrand may be the option to use for TopHat and Cufflinks. Similarly, if you want to find out # of reads mapping to annotated genes using htseq-counts function and want to know both sense and antisense expression of a gene, one would need to use the option --stranded=no else it will only give sense expression.
                Hi flobpf,
                I am a rookie in RNA-Seq. Now I am playing the analysis on two single-ended reads datasets, one is from Illumina, the other is from Proton.
                I have asked the people doing the NGS, so far I know that they are not sure which library type (fr-unstranded,fr-firststranded,fr-secondstranded), one thing with the library prep they use though is that it retains polarity, so there is directionality in the sequences.
                From this post, in my cases, should I use tophat/cufflinks(fr-firststrand) and htseq-counts(--stranded=reverse ) option? I am sorry for the naive question.Thanks!
                Last edited by super0925; 04-03-2014, 01:51 AM.

                Comment


                • #53
                  if it's single-end, you don't have to precise the type of library.

                  Comment


                  • #54
                    Originally posted by NicoBxl View Post
                    if it's single-end, you don't have to precise the type of library.
                    Library strandedness still matters when you are aligning single end reads. For Illumina TruSeq Stranded RNA libraries single end reads match the anti-sense strand of mRNA (aka reverse, firststrand of cDNA).

                    Tophat/cufflinks defaults to --fr-unstranded. Stranded, single end reads would align o.k. with this setting but you should use --fr-firststrand.

                    htseq-count defaults to --stranded=yes. For TruSeq Stranded RNA libraries this default is not appropriate and you should change it to --stranded=reverse, even if your reads are single-end.

                    Comment


                    • #55
                      Hi all,

                      I have 2 x 50 bp paired end data, library preparation from NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina®, dUTP method.

                      Data as: seq_1.txt and seq_2.txt

                      From the discussion in this post, seq_1.txt contains the sequence of the antisense strand and locates at the 3' end of the fragment, seq_2.txt contains sequence of sense strand and the 5' end.

                      So for Tophat, I will use this option: --library-type fr-firststrand. But I have one question about the order of the two files? I am not sure if the order of the files will give completely different results.

                      tophat --library-type fr-firststrand seq_1.txt seq_2.txt
                      or
                      tophat --library-type fr-firststrand seq_2.txt seq_1.txt

                      And for htseq-count, I think I should use stranded=reverse option, because my first reads are from the antisense (opposite) strand.

                      Thanks a lot for any comments.

                      Comment


                      • #56
                        You want
                        Code:
                        tophat --library-type fr-firststrand seq_1.txt seq_2.txt
                        Of course, I'd check that your libraries are indeed in the correct orientation by doing what dvanic did. I recently found out that my Ovation libraries (also using dUTP method) were in fact fr-secondstrand and stranded=yes

                        Comment


                        • #57
                          Thanks fanli! I have checked my reads. Seq_1.txt is from the antisense strand Seq_2.txt is from the sense strand on UCSC.

                          Then I do tophat --library-type fr-firststrand seq_1.txt seq_2.txt and stranded=reverse

                          I think tophat --library-type fr-firststrand seq_2.txt seq_1.txt will end up with less reads mapped.

                          Comment


                          • #58
                            Then I do tophat --library-type fr-firststrand seq_1.txt seq_2.txt and stranded=reverse
                            Yep. Tbh, I'm not sure what happens if you reverse the r1 and r2 files - I suppose most of the pairs get thrown away for being discordantly paired but have never tried.

                            Comment


                            • #59
                              Hi Guys

                              I did a tophat2 alignment of the stranded RNA-seq data generated with Trueseq libraries and I am getting strange/random distribution of the reads between sense and antisense strand:


                              End 1 Sense End 1 Antisense End 2 Sense End 2 Antisense End 1 % Sense End 2 % Sense
                              590,379 30,087,178 29,906,269 588,963 1.924 98.069
                              559,882 29,395,607 29,103,746 557,860 1.869 98.119
                              521,844 27,260,014 27,021,895 516,739 1.878 98.124
                              28,392,839 508,446 508,950 28,645,451 98.241 1.746
                              447,074 25,369,572 25,145,640 444,053 1.732 98.265
                              25,655,301 524,163 529,995 26,208,516 97.998 1.982
                              569,612 27,917,987 27,640,016 566,733 2 97.991
                              26,829,925 567,503 571,852 27,104,313 97.929 2.066
                              28,741,488 507,667 509,982 29,116,876 98.264 1.721
                              657,904 30,651,762 30,227,867 652,430 2.101 97.887
                              515,548 27,660,113 27,433,598 515,549 1.83 98.155
                              27,514,719 519,533 519,904 27,667,448 98.147 1.844


                              I would have expected majority of the reads mapping only to one of the strands and not randomly switching stranded as above.

                              Any help with this will be appreciated.

                              Thanks.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              34 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              72 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              82 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X