  • How many reads are needed for RNASeq experiments?

    Hello everybody,

    I am interested in the number of reads needed to perform DE analysis and to detect fusion genes and new transcripts. I searched the forum but did not find a clear answer. A good starting point is the ENCODE standards:
    1. DE analysis
      "Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M pair-end reads of length > 30NT, of which 20-25M are mappable to the genome or known transcriptome)."
    2. Gene fusion & novel transcripts detection
      "Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms… a minimum depth of 100-200 M 2 x 76 bp or longer reads is currently recommended."


    For DE analysis, the recommendation is 20-25M reads mappable to the genome or known transcriptome. However, "genome" and "transcriptome" are quite different targets, especially when we are interested in mRNA only and the libraries are ribodepleted!

    Here are some details about my experiment:

    My data are 101 bp paired-end reads. The libraries were ribodepleted. To find out what proportion of my reads aligns to the transcriptome, I filtered the reads with Trimmomatic, aligned them with TopHat2, then used htseq-count in intersection-nonempty mode to count the reads in each gene. Finally, I summed the reads over all genes.
    Do you think this is a good way to evaluate the proportion of reads mappable to the transcriptome?
    For the moment, I have between 1M and 20M reads mappable to the transcriptome per sample. We will run all the samples one more time.
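
A minimal sketch of the summing step described above, assuming the usual two-column, tab-separated htseq-count output (gene ID, count) with special counters at the end. Recent htseq-count versions prefix those counters with "__"; the unprefixed names handled here are an assumption about older releases. The function names and the toy input are hypothetical, not taken from my real data.

```python
import io

# Special counters that htseq-count appends after the per-gene rows.
# Recent versions prefix them with "__"; the unprefixed names below are
# an assumption about older releases.
LEGACY_SPECIAL = {
    "no_feature", "ambiguous", "too_low_aQual",
    "not_aligned", "alignment_not_unique",
}

def summarize_htseq(handle):
    """Return (reads assigned to genes, dict of special counters)."""
    assigned = 0
    special = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, count = fields[0], int(fields[1])
        if name.startswith("__") or name in LEGACY_SPECIAL:
            special[name] = count
        else:
            assigned += count
    return assigned, special

def assigned_fraction(assigned, total_reads):
    """Fraction of sequenced reads (or read pairs) counted in a gene."""
    return assigned / total_reads if total_reads else 0.0

# Toy input, not real data.
example = io.StringIO(
    "GENE_A\t120\n"
    "GENE_B\t0\n"
    "GENE_C\t380\n"
    "__no_feature\t300\n"
    "__ambiguous\t50\n"
    "__not_aligned\t150\n"
)
assigned, special = summarize_htseq(example)
print(assigned, special, assigned_fraction(assigned, 1000))
```

For paired-end data the denominator should be in the same unit as the counts; as far as I understand, htseq-count counts a read pair once, so the total would be the number of pairs rather than the number of individual reads.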

    Here come my questions:

    1) Is 25M reads mappable to the transcriptome a good target for DE analysis?

    2) Since I have 2 x 101 bp reads, is 100M filtered reads mappable to the genome enough for gene fusion and novel transcript detection?

    Thank you for your help,
    Jane
    Last edited by Jane M; 11-10-2013, 04:54 AM.

  • #2
    1. For DE analysis, it is not so much the number of reads that is important as the number of replicates.


    • #3
      1) Yes, 25M will be fine for gene level DE tests PER REPLICATE. You’ll want 3 or more replicates, if possible.

      2) 100M is probably good for that, yes.

      More is pretty much always better though, so get the highest number of replicates and depth that you can afford. And once you’re over about 40M raw reads per replicate, start adding replicates instead of depth.


      • #4
        Thank you for your answers.

        I agree, the number of replicates is essential. I forgot to mention my design in my first post. I paid attention to the number of replicates: I have 3 conditions with 6, 4 and 4 replicates respectively. Now I am worried about the number of reads.

        For read 2 (reverse), I have between 15M and 38M filtered reads per sample, plus 3 extreme values (2M, 8M and 111M). This is slightly worse than for read 1 (forward).

        Wallysb01, do you think that 40M raw reads per replicate is enough? Maybe it's OK with polyA+, but with ribodepletion I will end up with fewer than 10M reads, which is not enough even for DE.


        • #5
          It also depends, of course, on which transcripts you're interested in. For example, if we were studying photosynthesis in Arabidopsis, 40M raw reads would be overkill, since the genes responsible for photosynthesis have huge expression. But in our work, there are genes such as the "Fantastic Four" (FAF1-FAF4) that have very low read counts - well below 10 per lane at 40 M, so we can't study them at 40M. At 80M we'd have twice that, getting into usable stats for those.

          I studied this and found that by doubling the raw reads (20M, 40M, 80M) we pick up 3% more transcripts above 10 counts per doubling. That's not a lot, but it can matter like in the above example.
          Sam Hokin
          Computational Scientist, Carnegie and NCGR
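
The doubling argument above can be sketched as a simple linear extrapolation. This assumes that a gene's read count scales proportionally with sequencing depth, which ignores library complexity and duplicate reads; the function names and the example gene (5 reads at 40M) are hypothetical.

```python
import math

def expected_count(observed_count, current_depth, new_depth):
    """Scale an observed read count linearly to a different depth."""
    return observed_count * new_depth / current_depth

def depth_for_target(observed_count, current_depth, target_count):
    """Depth at which the expected count reaches target_count."""
    if observed_count <= 0:
        raise ValueError("cannot extrapolate from a zero count")
    return math.ceil(current_depth * target_count / observed_count)

# Hypothetical gene with 5 reads at 40M raw reads.
print(expected_count(5, 40_000_000, 80_000_000))  # 10.0
print(depth_for_target(5, 40_000_000, 10))        # 80000000
```

A gene with zero reads gives no basis for extrapolation, which is why the second function refuses that case instead of returning a number.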


          • #6
            Thank you samhokin for mentioning this point.
            Unfortunately, I cannot say if our transcripts of interest are strongly or weakly expressed.
            I am working with human samples, using a subpopulation of monocytes for all of them.
            The 3 conditions are: 6 patients with a specific mutation, 4 patients without this mutation and 4 healthy people.

            I am interested in differences between these groups (differentially expressed genes, novel transcripts or the presence of fusion genes). So a transcript might not be expressed in one condition but be strongly expressed in another. In that case, I should be able to detect it.
            If a transcript is weakly expressed in all conditions with no significant change, I am not interested in it.


            • #7
              Is your experiment/sequencing already done (sounds like it is)? You can always sequence more if you find that you do not have enough reads for the analysis.


              • #8
                Originally posted by GenoMax:
                "Is your experiment/sequencing already done (sounds like it is)?"
                Yes indeed, it's done. From the beginning we planned 2 sequencing runs (it was not possible to do everything in one run because of "the policy" at my institute) to achieve good coverage. In each run, all the lanes are multiplexed by 4.

                "You can always sequence more if you find that you do not have enough reads for the analysis."
                That's the question. I am trying to define what "enough reads" means for my 3 questions...


                • #9
                  Why not try to do the analysis first with what you currently have to see what you get before you decide on the next steps (which sample needs additional sequence)? Since your libraries are made you are only going to be sampling deeper from the same pool.


                  • #10
                    Originally posted by GenoMax:
                    "Why not try to do the analysis first with what you currently have to see what you get before you decide on the next steps (which sample needs additional sequence)? Since your libraries are made you are only going to be sampling deeper from the same pool."
                    Before the next sequencing run, I am checking how many reads I have in order to decide how much more I still need to sequence for each sample. I am trying to reach "sufficient coverage" because I lose one month per sequencing run and I only have 10 months left for my PhD...
                    For now, only one of the samples has more than 25M reads aligned to the transcriptome.
                    Of course, I could perform the DE analysis, and I probably will, but I keep in mind that what I can see at low coverage is likely not the main effect. Moreover, I do not have enough reads to detect new transcripts.

                    From this discussion, I would like to get an idea of what I should aim for. Thanks to all of you, it's pretty clear for DE analysis. For the detection of fusion genes and new transcripts, I don't know. I had 100M in mind, but Wallysb01 suggested 40M raw reads.
                    There is no magic number and we cannot say precisely what is needed for these questions, but I am trying to find a good compromise between time, cost and quality...
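
The per-sample planning described above can be sketched as follows. This is only an illustration: it assumes the second run will give the same aligned/raw yield as the first one for each sample, and the sample names, read numbers and the function name are hypothetical, not my real values.

```python
def extra_raw_reads(aligned_now, raw_now, target_aligned):
    """Raw reads still to sequence so that aligned reads reach the target.

    Assumes the next run has the same aligned/raw yield as the first run.
    """
    if aligned_now >= target_aligned:
        return 0
    if aligned_now <= 0 or raw_now <= 0:
        raise ValueError("need a non-zero yield from the first run")
    missing = target_aligned - aligned_now
    # Integer ceiling of missing / (aligned_now / raw_now).
    return (missing * raw_now + aligned_now - 1) // aligned_now

# Hypothetical first-run results: (reads aligned to transcriptome, raw reads).
samples = {
    "sample_A": (10_000_000, 40_000_000),
    "sample_B": (20_000_000, 50_000_000),
    "sample_C": (26_000_000, 60_000_000),
}
TARGET = 25_000_000  # transcriptome-aligned reads per sample for DE

for name, (aligned, raw) in samples.items():
    print(name, extra_raw_reads(aligned, raw, TARGET))
```

Samples that already exceed the target get 0, so the remaining lane capacity can go to the samples that are furthest behind.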


                    • #11
                      At our sequencing facility we aim for *at least* 30M reads per sample/replicate; this number is from our resident biostat expert.


                      • #12
                        Originally posted by westerman:
                        "At our sequencing facility we aim for *at least* 30M reads per sample/replicate; this number is from our resident biostat expert."
                        Thank you westerman! Is this threshold only for DE analysis or for all RNA-Seq applications? Is it the number of raw reads?


                        • #13
                          All. Most of our RNA-seq is de novo plant and animal. The threshold is for the number of filtered reads which, unless something is drastically wrong, will be within 5% of the raw reads.


                          • #14
                            OK, thank you very much.
                            May I ask you one last question? Do you use polyA+ or ribodepletion?


                            • #15
                              Being a service facility we do both.
