  • Is it possible to estimate mRNA-seq depth/coverage just with genome size?

    Hi,

    Being a newbie in NGS, I have a very basic question.

    I sequenced tissue mRNAs using a paired-end strategy.
    Is it possible to calculate the depth of an overall mRNA-seq experiment when no reference genome or transcriptome data are available (but knowing only the genome size)?

    Can we use the following formula, or is it correct only for calculating genome depth?
    coverage=(average length of reads)*(number of raw forward + reverse reads) / (haploid genome size).

    I also read (UCSC - ENCODE Project: http://genome.ucsc.edu/ENCODE/protoc...dards_V1.0.pdf) that we can estimate the depth using this formula:
    (number of NT sequenced / number of mRNA molecules per cell) / (average mRNA length)

    Am I wrong if I say that it seems very approximate to me?
    Because every transcript is expressed at a different level, does it make any sense to try to determine the depth of an RNA-seq experiment?


    Thanks for your help !
    Last edited by Alun3.1; 02-23-2015, 03:24 PM.

  • #2
    Originally posted by Alun3.1
    Hi,
    Being a newbie in NGS, I have a very basic question.

    I sequenced tissue mRNAs using a paired-end strategy.
    Is it possible to calculate the depth of an overall mRNA-seq experiment when no reference genome or transcriptome data are available (but knowing only the genome size)?
    Not in any sensible way.

    Originally posted by Alun3.1
    Can we use the following formula, or is it correct only for calculating genome depth?
    coverage=(average length of reads)*(number of raw forward + reverse reads) / (haploid genome size).
    That formula only makes sense for whole genome sequencing.
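
As a minimal sketch of the whole-genome case, the quoted formula can be written as a small Python function. The function name and the example numbers (2 x 20 million 100 bp reads, 1 Gb haploid genome) are illustrative assumptions, not values from this thread.

```python
def genome_coverage(mean_read_length: float, n_reads: int, haploid_genome_size: int) -> float:
    """Expected average depth C = L * N / G, meaningful for whole-genome sequencing only.

    n_reads counts forward and reverse reads together, as in the quoted formula.
    """
    if haploid_genome_size <= 0:
        raise ValueError("haploid_genome_size must be positive")
    return mean_read_length * n_reads / haploid_genome_size


# Hypothetical run: 20 million read pairs (40 million reads) of 100 bp, 1 Gb genome.
depth = genome_coverage(100, 40_000_000, 1_000_000_000)
print(f"{depth:.1f}x")
```

Applied to mRNA-seq, the same arithmetic still returns a number, but the denominator (genome size) no longer describes what was sequenced, which is why the result is not meaningful there.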

    Originally posted by Alun3.1
    I also read (UCSC - ENCODE Project: http://genome.ucsc.edu/ENCODE/protoc...dards_V1.0.pdf) that we can estimate the depth using this formula:
    (number of NT sequenced / number of mRNA molecules per cell) / (average mRNA length)

    Am I wrong if I say that it seems very approximate to me?
    No, you are not wrong: it is very approximate, and would only give you the coverage averaged over all transcripts. Such an average will be incorrect for most individual transcripts, as expression levels aren't normally distributed.
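
To illustrate why a single average says little, here is a rough Python sketch. The first function is the formula exactly as quoted above. The simulation is purely an assumption for illustration: it draws transcript abundances from a log-normal distribution (a common stand-in for skewed expression, not something measured in this thread), gives every transcript the same hypothetical length, and splits the sequenced bases in proportion to abundance.

```python
import random
import statistics


def average_transcript_coverage(nt_sequenced, mrna_molecules_per_cell, mean_mrna_length):
    """Formula as quoted in the thread: (NT / molecules per cell) / mean mRNA length."""
    return (nt_sequenced / mrna_molecules_per_cell) / mean_mrna_length


def simulate_per_transcript_coverage(n_transcripts, total_nt, transcript_length,
                                     sigma=2.0, seed=1):
    """Split total_nt among transcripts in proportion to simulated abundance.

    Abundances are log-normal (an illustrative assumption); all transcripts
    share one length so that only expression level varies.
    """
    rng = random.Random(seed)
    abundance = [rng.lognormvariate(0.0, sigma) for _ in range(n_transcripts)]
    total = sum(abundance)
    return [total_nt * a / total / transcript_length for a in abundance]


# Hypothetical experiment: 4 Gb of sequence, 10,000 transcripts of 2 kb each.
cov = simulate_per_transcript_coverage(10_000, 4_000_000_000, 2_000)
mean_cov = statistics.mean(cov)
median_cov = statistics.median(cov)
frac_below = sum(c < mean_cov for c in cov) / len(cov)

print(f"mean coverage:   {mean_cov:.1f}x")
print(f"median coverage: {median_cov:.1f}x")
print(f"transcripts below the mean: {frac_below:.0%}")
```

Under these assumptions the mean is pulled up by a handful of very abundant transcripts, so most transcripts sit well below the "average" depth.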

    Originally posted by Alun3.1
    Because every transcript is expressed at a different level, does it make any sense to try to determine the depth of an RNA-seq experiment?

    Thanks for your help !
    The average depth makes little sense.

    For RNA-Seq with differential expression analysis in mind, you usually select sequencing depth based on previous experience or some rule of thumb, as the exact numbers are unknown for your experiment (which is why you carry it out in the first place).

    A place to start would be the following: for a "standard" DE experiment with a typical "higher eukaryote" species, usually 10-50 million reads per sample are "enough". If you have many replicates, the lower end is usually fine. If you have few replicates, are interested in genes with generally very low expression (e.g. transcription factors), or are interested in fine-tuned gene regulation (small differences in expression between samples), the upper end would be recommended.
    For a prokaryote, 5-20 million reads are "enough" - if your rRNA depletion protocol works well.

    For RNA-Seq with transcriptome assembly as a primary goal, things change a bit as you can choose between different strategies. But I suppose you are interested in DE.


    • #3
      Strange. It seems like people use the term "depth of coverage" more often for RNAseq experiments, where it really doesn't make sense, than they do for DNAseq, where it does.

      --
      Phillip


      • #4
        Thanks sarvidsson !

        for a "standard" DE experiment, with a typical "higher eukaryote" species, usually 10-50 million reads per sample are "enough"
        So I assume it also depends on the species you study, the complexity of the transcriptome, the length of the reads (and the cost of the sequencing).
        What if you only focus on mRNAs and get the same number of reads (10-50 million)? As they are a (small) fraction of the total RNA, one could think that having 10-50 million reads from mRNA is more complete than 10-50 million reads from total RNA, right? Then you could potentially detect rare transcripts without needing 100-200 million reads?

        For RNA-Seq with transcriptome assembly as a primary goal, things change a bit as you can choose between different strategies. But I suppose you are interested in DE.
        Yes, I am more into DE. But if you want to assemble a transcriptome, I assume (depending on whether it is reference-based or de novo) you would definitely need more reads, and as long as possible?
        Last edited by Alun3.1; 02-23-2015, 06:10 PM.


        • #5
          Originally posted by Alun3.1
          So I assume it also depends on the species you study, the complexity of the transcriptome, the length of the reads (and the cost of the sequencing).
          Life is full of compromises

          Originally posted by Alun3.1
          What if you only focus on mRNAs and get the same number of reads (10-50 million)? As they are a (small) fraction of the total RNA, one could think that having 10-50 million reads from mRNA is more complete than 10-50 million reads from total RNA, right? Then you could potentially detect rare transcripts without needing 100-200 million reads?
          With undegraded RNA and a well-trained technician we typically get ~93-98 % mRNA, and with 30-50 million reads we typically see most known transcripts for the specific tissue (numbers depend on the complexity of the tissue and species).

          Some recommendations to read on the subject:

          Background: RNA-Seq is the recently developed high-throughput sequencing technology for profiling the entire transcriptome in any organism. It has several major advantages over current hybridization-based approach such as microarrays. However, the cost per sample by RNA-Seq is still prohibitive for most laboratories. With continued improvement in sequence output, it would be cost-effective if multiple samples are multiplexed and sequenced in a single lane with sufficient transcriptome coverage. The objective of this analysis is to evaluate what sequencing depth might be sufficient to interrogate gene expression profiling in the chicken by RNA-Seq.

          Results: Two cDNA libraries from chicken lungs were sequenced initially, and 4.9 million (M) and 1.6 M (60 bp) reads were generated, respectively. With significant improvements in sequencing technology, two technical replicate cDNA libraries were re-sequenced. Totals of 29.6 M and 28.7 M (75 bp) reads were obtained with the two samples. More than 90% of annotated genes were detected in the data sets with 28.7-29.6 M reads, while only 68% of genes were detected in the data set with 1.6 M reads. The correlation coefficients of gene expression between technical replicates within the same sample were 0.9458 and 0.8442. To evaluate the appropriate depth needed for mRNA profiling, a random sampling method was used to generate different number of reads from each sample. There was a significant increase in correlation coefficients from a sequencing depth of 1.6 M to 10 M for all genes except highly abundant genes. No significant improvement was observed from the depth of 10 M to 20 M (75 bp) reads.

          Conclusion: The analysis from the current study demonstrated that 30 M (75 bp) reads is sufficient to detect all annotated genes in chicken lungs. Ten million (75 bp) reads could detect about 80% of annotated chicken genes, and RNA-Seq at this depth can serve as a replacement of microarray technology. Furthermore, the depth of sequencing had a significant impact on measuring gene expression of low abundant genes. Finally, the combination of experimental and simulation approaches is a powerful approach to address the relationship between the depth of sequencing and transcriptome coverage.
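
The random sampling idea described in the abstract can be sketched in Python as follows. This is a toy illustration, not the paper's pipeline: the "library" is simulated (2,000 hypothetical genes with log-normal abundances, 100,000 reads), reads are represented only by the gene they came from, and a gene counts as detected when at least `min_reads` reads hit it. All names and numbers here are assumptions.

```python
import random


def detection_curve(reads, depths, min_reads=1, seed=0):
    """Count detected genes at several subsampling depths.

    The read list is shuffled once and prefixes of increasing size are used,
    so each smaller subsample is a random subset of the next larger one.
    """
    rng = random.Random(seed)
    shuffled = reads[:]
    rng.shuffle(shuffled)
    curve = {}
    for depth in sorted(depths):
        counts = {}
        for gene in shuffled[:depth]:
            counts[gene] = counts.get(gene, 0) + 1
        curve[depth] = sum(1 for c in counts.values() if c >= min_reads)
    return curve


# Simulated library: each read is labelled with the index of its gene of origin.
rng = random.Random(42)
n_genes = 2_000
weights = [rng.lognormvariate(0.0, 2.0) for _ in range(n_genes)]
library = rng.choices(range(n_genes), weights=weights, k=100_000)

curve = detection_curve(library, [1_000, 10_000, 100_000])
for depth, detected in curve.items():
    print(f"{depth:>7} reads: {detected} genes detected ({detected / n_genes:.0%})")
```

With real data the same loop would run over gene assignments from aligned reads, and plotting detected genes against depth shows where the curve flattens, i.e. where extra reads mostly re-sample abundant genes.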


          Originally posted by Alun3.1
          Yes, I am more into DE. But if you want to assemble a transcriptome, I assume (depending on whether it is reference-based or de novo) you would definitely need more reads, and as long as possible?
          IMO both are necessary - I'd recommend a wet-lab normalized cDNA library on 1/2 to 1 MiSeq V3 (2x300 bp) run (or possibly PacBio, we don't have one however) + whatever samples you would like to study the expression of, on as many HiSeq lanes as you need. Then normalize the HiSeq reads in silico and assemble the whole thing.


          • #6
            Originally posted by sarvidsson
            With undegraded RNA and a well-trained technician we typically get ~93-98 % mRNA, and with 30-50 million reads we typically see most known transcripts for the specific tissue (numbers depend on the complexity of the tissue and species).
            If your main method for determining which genes are expressed in a given tissue is sequencing 30-50 million reads from its transcriptome, then what you see when you sequence 30-50 million reads from a tissue will be
            "most known transcripts for the specific tissue".

            Which is fine. But the 30-50 million reads figure is just what is fashionable at the moment. Should it become possible to obtain 300-500 million reads per sample for around $500/€ 440, that will probably become the new standard.

            --
            Phillip


            • #7
              Originally posted by pmiguel
              If your main method for determining which genes are expressed in a given tissue is sequencing 30-50 million reads from its transcriptome, then what you see when you sequence 30-50 million reads from a tissue will be
              "most known transcripts for the specific tissue".

              Which is fine. But the 30-50 million reads figure is just what is fashionable at the moment. Should it become possible to obtain 300-500 million reads per sample for around $500/€ 440, that will probably become the new standard.
              Point taken. But if 300-500 million reads per sample were that cheap, for most research questions I'd rather analyze 5 times more samples at 60-100 million reads per sample, provided that library costs follow the same trend.


              • #8
                Thanks guys for your replies !


                • #9
                  Originally posted by sarvidsson
                  Point taken. But if 300-500 million reads per sample were that cheap, for most research questions I'd rather analyze 5 times more samples at 60-100 million reads per sample, provided that library costs follow the same trend.
                  And yet, there were DE experiments done on 1/4 PTP 454 runs that typically generated fewer than 200K reads, split among lots of samples. If a DE experiment that generated 40,000 reads per sample was considered reasonable back then (still with 3 replicates), then why don't people do 15 replicates now?

                  --
                  Phillip


                  • #10
                    Originally posted by pmiguel
                    And yet, there were DE experiments done on 1/4 PTP 454 runs that typically generated fewer than 200K reads, split among lots of samples. If a DE experiment that generated 40,000 reads per sample was considered reasonable back then (still with 3 replicates), then why don't people do 15 replicates now?
                    The library costs tend to be prohibitive for the academic customers we have - from 454 to Illumina these costs haven't dropped nearly as much as the sequencing costs have. So a "screen few samples with RNA-Seq, then validate on many samples by RT-qPCR" mentality is quite common. I could speculate on other reasons as well - e.g. statistical training is seldom attractive to biology PhD students here. The commercial customers we have are generally more interested in speedy results, so they tend to spend more money on RNA-Seq libraries... but this is just my current experience.
