Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • truseq paired reads - what strandedness parameters in tophat?

    In RNA-seq, paired ends, stranded, done with TrueSeq - what parameter in tophat should be applied to see strandedness properly?

    fr-firststrand or fr-secondstrand ?

    I am a bit confused, seen also the hints on reverse complement or swapping the order of R1 and R2 in the command line. (http://www.biostars.org/p/64250/)

    Thanks for the hints!

  • #2
    There are now a handful of Library prep kits from Illumina with TruSeq in their name. Some are strand specific, some are not. The strand specific ones are fairly new. The only way to know for sure which one was used is to ask the person who created the sequencing library.

    If they used the TruSeq Stranded kit the proper parameter is fr-firststrand.

    If they used the earlier, unstranded TruSeq kit the proper parameter is fr-unstranded. This is the default setting for TopHat.

    ETA: O.K. I just re-read your post more carefully and see that you did specifically ask about the TruSeq stranded protocol. As I said above the correct --library-type parameter is fr-firststrand.
    Last edited by kmcarr; 04-22-2013, 07:13 AM. Reason: Reread original post.

    Comment


    • #3
      Thanks for the clear explanation!
      I'm checking the fr-firstrand now

      Comment


      • #4
        Truseq stranded RNA: Tuxedo library parameters

        I don't think this is correct.
        From the TruSeq stranded RNA manual:

        The cleaved RNA fragments are copied into
        first strand cDNA using reverse transcriptase and random primers, followed by second
        strand cDNA synthesis using DNA Polymerase I and RNase H. These cDNA fragments
        then have the addition of a single 'A' base and subsequent ligation of the adapter. The
        products are purified and enriched with PCR to create the final cDNA library.
        This implies to me that the second strand cDNA is what gets sequenced (as it has an A overhang and therefore gets the adaptor ligated. Tuxedo definitions of libraries say:

        fr-unstranded (default): Standard Illumina Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand.
        fr-firststrand: dUTP, NSR, NNSR Same as above except we enforce the rule that the right-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced.
        fr-secondstrand: Directional Illumina (Ligation), Standard SOLiD Same as above except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced.
        To me this implies the correct parameter is fr-secondstrand

        Comment


        • #5
          Originally posted by danwiththeplan View Post
          I don't think this is correct.
          From the TruSeq stranded RNA manual:
          The cleaved RNA fragments are copied into
          first strand cDNA using reverse transcriptase and random primers, followed by second
          strand cDNA synthesis using DNA Polymerase I and RNase H. These cDNA fragments
          then have the addition of a single 'A' base and subsequent ligation of the adapter. The
          products are purified and enriched with PCR to create the final cDNA library.
          This implies to me that the second strand cDNA is what gets sequenced (as it has an A overhang and therefore gets the adaptor ligated.
          The adapters, like the cDNA are double stranded. Both strands of the cDNA are ligated.
          Tuxedo definitions of libraries say:
          fr-unstranded (default): Standard Illumina Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand.

          fr-firststrand: dUTP, NSR, NNSR Same as above except we enforce the rule that the right-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced.

          fr-secondstrand: Directional Illumina (Ligation), Standard SOLiD Same as above except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced.
          To me this implies the correct parameter is fr-secondstrand
          In the Tuxedo documentation the "Directional Illumina (Ligation)" protocol referred to is a very old protocol from Illumina for creating strand specific RNA-Seq libraries; it is no longer used. The TruSeq Stranded RNA-Seq kits use a fairly standard dUTP second-strand making protocol. This means, as the Tuxedo documentation states, that you should specify your library-type as "fr-firststrand"

          Comment


          • #6
            Alright thanks for clearing that up.

            Comment


            • #7
              For my RNA-Seq data, I subsequently learned that the data was generated using a unstranded protocol. Before I got this information, I used the following hints from TopHat manual to try to figure things out.

              "I am not sure which library type to use (fr-firststrand or fr-secondstrand), what should I do?

              One possible way to figure out the correct library-type is to run TopHat with a small subset of the reads (e.g., 1M) as follows.
              1) run TopHat with fr-firststrand and count the number of junctions in junctions.bed (one of the output files from TopHat)
              2) run TopHat with fr-secondstrand and count the number of junctions in junctions.bed
              Since the splice junction finding algorithm of TopHat makes use of library-type information (if provided), one of the two TopHat runs would result in many more splice junctions than the other one. You can then use the library type that gives more junctions. If this is not the case TopHat might not work well with your sequencing protocol. Please let us know more details about your protocol so we can add support for new library types."

              > more fr-firststrand/junctions.bed |wc -l
              151757
              > more fr-secondstrand/junctions.bed |wc -l
              151901
              > more fr-unstranded/junctions.bed |wc -l
              157557

              Indeed setting fr-unstranded option gave the most junctions. I however wonder whether this ~20% extra junction is in line with "many more splice junctions" expected when setting the correct --library-type option?

              Comment


              • #8
                check this analysis guide from Illumina, it should be
                fr-unstranded for Trueseq regular
                fr-firststrand for Trueseq stranded

                Comment


                • #9
                  Help understanding the --library-type paramater for TruSeq Stranded llibraries

                  Sorry to drag this thread to the top again but I'm not sure I understand why the fr-firststrand parameter is recommended for mapping Illumina TruSeq Stranded libraries.

                  From my understanding of the protocol I'm looking at, library generation proceeds as follows:

                  I. single-strand cDNA is synthesised from template RNA using RT. This will of course be reverse-complementary to the RNA.

                  II. a second cDNA strand is synthesised with the inclusion of dUTP instead of dTTP, thus the cDNA strand with the same orientation as the template RNA is tagged with dUTP.

                  III. During PCR enrichment of the library, a polymerase that cannot amplify through dUTP is used, thus only the 'first', non-dUTP cDNA strand (which is reverse-complementary to the original RNA) is available as a template.

                  From that I would have thought the correct choice would be fr-secondstrand. To test this, I took ~2 million reads from a library made with this protocol and performed the following tests with Tophat and htseq-count.

                  First, I mapped the reads using all four possible combinations of the --library-type parameter and the order of reads:

                  Code:
                  tophat -p 4 -o $OUTDIR.fs --mate-inner-dist 131 \
                  	--mate-std-dev 61 --max-intron-length 5000 \
                  	--library-type fr-firststrand --no-mixed \
                  	--transcriptome-index $INDEX $GENOME \
                  	$READS1 $READS2 
                  
                  tophat -p 4 -o $OUTDIR.ss --mate-inner-dist 131 \
                  	--mate-std-dev 61 --max-intron-length 5000 \
                  	--library-type fr-secondstrand --no-mixed \
                  	--transcriptome-index $INDEX $GENOME \
                  	$READS1 $READS2 
                  
                  tophat -p 4 -o $OUTDIR.fsrev --mate-inner-dist 131 \
                  	--mate-std-dev 61 --max-intron-length 5000 \
                  	--library-type fr-firststrand --no-mixed \
                  	--transcriptome-index $INDEX $GENOME \
                  	$READS2 $READS1
                  
                  tophat -p 4 -o $OUTDIR.ssrev --mate-inner-dist 131 \
                  	--mate-std-dev 61 --max-intron-length 5000 \
                  	--library-type fr-secondstrand --no-mixed \
                  	--transcriptome-index $INDEX $GENOME \
                  	$READS2 $READS1
                  Looking at the junctions.bed files:

                  Code:
                  $ wc -l */junctions.bed
                     85334 OUTDIR.fs/junctions.bed
                     73254 OUTDIR.fsrev/junctions.bed
                     73308 OUTDIR.ss/junctions.bed
                     85334 OUTDIR.ssrev/junctions.bed
                  I get the most junctions using fr-firststrand READS1 READS2 or fr-secondstrand READS2 READS1. OK then... next I ran htseq-count on all four accepted_hits.bam files, and counted the number of reads mapping within genes for each of the above variations:

                  Code:
                       FS   FSREV      SS   SSREV 
                    42675 1219998   42117 1231029
                  So from this, it would appear that using fr-secondstrand and supplying the reads to tophat in reverse order is the way to go. But I can't really work out why this would be the case! Can anyone point out what I'm missing here?

                  Comment


                  • #10
                    HI TomHarrop,
                    I am new to RNAseq and running into the same problem now. I aligned my reads using fr-firststrand READS1 READS2 with tophat2 and then did HTseq count and a lot of my reads are not counted (they end up in the __no_feature)
                    did you ever realize why when you use fr-secondstrand READS2 READS1 you get more reads counted?
                    I used the trueseq stranded polyA illumina kit for paired end seq
                    thanks

                    Originally posted by TomHarrop View Post
                    Sorry to drag this thread to the top again but I'm not sure I understand why the fr-firststrand parameter is recommended for mapping Illumina TruSeq Stranded libraries.

                    From my understanding of the protocol I'm looking at, library generation proceeds as follows:

                    I. single-strand cDNA is synthesised from template RNA using RT. This will of course be reverse-complementary to the RNA.

                    II. a second cDNA strand is synthesised with the inclusion of dUTP instead of dTTP, thus the cDNA strand with the same orientation as the template RNA is tagged with dUTP.

                    III. During PCR enrichment of the library, a polymerase that cannot amplify through dUTP is used, thus only the 'first', non-dUTP cDNA strand (which is reverse-complementary to the original RNA) is available as a template.

                    From that I would have thought the correct choice would be fr-secondstrand. To test this, I took ~2 million reads from a library made with this protocol and performed the following tests with Tophat and htseq-count.

                    First, I mapped the reads using all four possible combinations of the --library-type parameter and the order of reads:

                    Code:
                    tophat -p 4 -o $OUTDIR.fs --mate-inner-dist 131 \
                    	--mate-std-dev 61 --max-intron-length 5000 \
                    	--library-type fr-firststrand --no-mixed \
                    	--transcriptome-index $INDEX $GENOME \
                    	$READS1 $READS2 
                    
                    tophat -p 4 -o $OUTDIR.ss --mate-inner-dist 131 \
                    	--mate-std-dev 61 --max-intron-length 5000 \
                    	--library-type fr-secondstrand --no-mixed \
                    	--transcriptome-index $INDEX $GENOME \
                    	$READS1 $READS2 
                    
                    tophat -p 4 -o $OUTDIR.fsrev --mate-inner-dist 131 \
                    	--mate-std-dev 61 --max-intron-length 5000 \
                    	--library-type fr-firststrand --no-mixed \
                    	--transcriptome-index $INDEX $GENOME \
                    	$READS2 $READS1
                    
                    tophat -p 4 -o $OUTDIR.ssrev --mate-inner-dist 131 \
                    	--mate-std-dev 61 --max-intron-length 5000 \
                    	--library-type fr-secondstrand --no-mixed \
                    	--transcriptome-index $INDEX $GENOME \
                    	$READS2 $READS1
                    Looking at the junctions.bed files:

                    Code:
                    $ wc -l */junctions.bed
                       85334 OUTDIR.fs/junctions.bed
                       73254 OUTDIR.fsrev/junctions.bed
                       73308 OUTDIR.ss/junctions.bed
                       85334 OUTDIR.ssrev/junctions.bed
                    I get the most junctions using fr-firststrand READS1 READS2 or fr-secondstrand READS2 READS1. OK then... next I ran htseq-count on all four accepted_hits.bam files, and counted the number of reads mapping within genes for each of the above variations:

                    Code:
                         FS   FSREV      SS   SSREV 
                      42675 1219998   42117 1231029
                    So from this, it would appear that using fr-secondstrand and supplying the reads to tophat in reverse order is the way to go. But I can't really work out why this would be the case! Can anyone point out what I'm missing here?

                    Comment


                    • #11
                      TomHarrop which options did you use when you did htseq-count?
                      --stranded=yes
                      --stranded=reverse
                      not sure it if matters

                      Comment


                      • #12
                        Originally posted by lm003 View Post
                        TomHarrop which options did you use when you did htseq-count?
                        --stranded=yes
                        --stranded=reverse
                        not sure it if matters
                        Yes, it matters a great deal. For TruSeq Stranded libraries it is --fr-firstrand for TopHat and --stranded=reverse for htseq-count. What these are both saying is that the read #1 matches the cDNA first-strand orientation, which is the reverse complement of the mRNA.

                        Comment


                        • #13
                          Thank you!

                          I actually just compared the two , here are the outcomes :
                          I am convinced --stranded=reverse is the right option for paired end first-stranded. (using Truseq illumina kit)

                          htseq-count --stranded=reverse file.sam file.genes.gtf

                          no_feature 194094
                          ambiguous 6540
                          too_low_aQual 0
                          not_aligned 0
                          alignment_not_unique 0


                          htseq-count --stranded=yes file.sam file.genes.gtf

                          no_feature 2223636
                          ambiguous 200
                          too_low_aQual 0
                          not_aligned 0
                          alignment_not_unique 0


                          Originally posted by kmcarr View Post
                          Yes, it matters a great deal. For TruSeq Stranded libraries it is --fr-firstrand for TopHat and --stranded=reverse for htseq-count. What these are both saying is that the read #1 matches the cDNA first-strand orientation, which is the reverse complement of the mRNA.

                          Comment


                          • #14
                            fastqc error (How do I post a new thread??)

                            $ fastqc abc.fastq
                            Output:
                            Started analysis of abc.fastq

                            Failed to process file abc.fastq
                            uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '@'
                            at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:134)
                            at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:105)
                            at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
                            at java.lang.Thread.run(Thread.java:724)

                            So, I looked at the file and it has @ in the first line and at the beginning of the reads....

                            $ head abc.fastq
                            @HWI-ST560:173:C5WEGACXX:2:1101:1907:2424 1:N:0:
                            ACGACGGTCTAAACCCTNGANNTCTCGGGNNNNNNNNNNNNNNNAAGAGCGNNNNNNNNGNNNTGCCGAGACCGATCTCGTATGCCGTCTTCT
                            +
                            DHHAFHIEGICH@HDGG#08##00?FHGI################################################################
                            @HWI-ST560:173:C5WEGACXX:2:1101:1860:2478 1:N:0:
                            AGCGCTCCGCCAGGGCCGTGGGCCGACCCCGGNNNNNNNNNTNNGAGGGCCTCACTAAACC
                            +
                            FDBFHHIHIJJJJJIJIJFHIJBHBGGFFFBB#########,##,++<@-828>+:>>A>?
                            @HWI-ST560:173:C5WEGACXX:2:1101:2019:2431 1:N:0:
                            AAACCCCCGGGACGGGGNCCNGCGGGGCANNNNNNNNNNNNNNNGGGGGGGNNNNNNNNGNNNGTTTTCGGGGGGCCAGGGGAAGGGAGAAGG

                            What do I do to fix this? Sorry for posting here, but I don't see a tab to post a new thread somehow...

                            Comment


                            • #15
                              Originally posted by lm003 View Post
                              TomHarrop which options did you use when you did htseq-count?
                              --stranded=yes
                              --stranded=reverse
                              not sure it if matters
                              Ah, this looks like it was exactly what I was missing. I was using --stranded=yes. Thanks for catching that and kmcarr for confirming. I switched to --stranded=reverse and now I have:
                              Code:
                                   FS   FSREV      SS   SSREV 
                              1231029   42109 1220563   42675
                              Last edited by TomHarrop; 02-09-2015, 01:51 AM. Reason: updated

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X