  • Required sequencing depth for finding (nearly) all unique human transcripts

    Dear SEQanswers community,

    Does anyone know a study that estimates the required sequencing depth (number of mapped reads) for different sequencing technologies (454, Illumina, ABI) to identify N% of the unique transcripts in the human genome? In other words, what depth would be needed to cover 95% of the unique transcripts in my human sample? It strikes me that there does not seem to be a published consensus on the depth needed to reliably identify (nearly) all transcripts. This kind of information seems necessary for deciding whether several samples can be multiplexed within a run, as well as for estimating the suitability of long-read technology for whole-transcriptome RNA-Seq.

    Literature on the topic seems to be sparse: while reference [1] indicates that up to 80 million ABI reads in mouse may be necessary before the number of distinct transcripts identified reaches a plateau, study [2] suggests that about 3 million mappable Illumina reads from human are required before the discovery rate flattens. Does anyone know of equivalent data for 454, or can anyone share more comprehensive insights on this problem?

    [1] Wang et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet (2009) vol. 10 (1) pp. 57-63

    [2] Li et al. Determination of tag density required for digital transcriptome analysis: application to an androgen-sensitive prostate cancer model. Proc Natl Acad Sci USA (2008) vol. 105 (51) pp. 20179-84

  • #2
    There is some discussion of this topic for human transcriptomes sequenced by Illumina paired-end sequencing here: ALEXA-seq. Most of the relevant figures and text are in the supplementary materials. I'm sure there are comparable discussions for 454 and SOLiD.

    I agree that there is not a consensus. Part of the problem is that the answer to the question is highly dependent on the end goals of your analysis and how you define these end points. For example, you mention X number of reads are required before the discovery rate 'flattens'. Flat is a highly subjective term. Unless the slope of the line is 0, it is not flat. How flat is flat enough?

    The expression-level difference between the most lowly expressed gene and the most highly expressed one is very large (4 to 7 orders of magnitude, depending on how you measure/estimate). This means that when sampling randomly and noting newly discovered genes, the curve begins to flatten very quickly (as all the most highly expressed genes are observed). But many lowly expressed genes will still not have been observed or sequenced to your minimum depth requirement. The discovery rate slows, but unless you are only interested in the most highly expressed genes, you need to continue sequencing... If you want to cover 95% of base positions of 95% of expressed genes (including very lowly expressed ones), you may be surprised how much coverage you need. Unfortunately it also seems to depend a fair bit on the tissue you are studying, the manner of library preparation (normalized library or not?), etc.
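    To put some rough numbers on this, here is a small back-of-the-envelope simulation (my own sketch, not taken from any of the papers cited in this thread; the gene count, the log-uniform expression model, and the "detected = hit by at least one read" criterion are all illustrative assumptions):

    ```python
    # Sketch: why a discovery curve "flattens" long before lowly expressed
    # genes are seen. Gene count and the log-uniform expression model are
    # illustrative assumptions, not estimates from real data.
    import random

    random.seed(1)

    N_GENES = 20_000
    # Expression levels spanning 5 orders of magnitude (log-uniform).
    levels = [10 ** random.uniform(0, 5) for _ in range(N_GENES)]
    total = sum(levels)
    probs = [lvl / total for lvl in levels]

    def expected_detected(depth):
        """Expected number of genes hit by at least one read at a given depth.

        Per gene, the chance of being missed by all `depth` reads is
        (1 - p) ** depth, so the detection probability is its complement.
        """
        return sum(1 - (1 - p) ** depth for p in probs)

    for depth in (10**5, 10**6, 10**7, 10**8):
        frac = expected_detected(depth) / N_GENES
        print(f"{depth:>11,d} reads: {frac:6.1%} of genes detected")
    ```

    Even in this toy model the curve looks nearly saturated after a few million reads, while a substantial fraction of low-expression genes remains undetected until depths of 100M reads or more, which is exactly the "how flat is flat enough?" problem.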

    You can search the forums, but quickly here are some more posts relevant to your question: one, two, three.

    • #3
      Hi malachig,

      I was wondering if there are any new insights you could give me on the topic of RNA-Seq read depth. Assuming that the RNA samples are polyA-selected and the sequencing is done with 100-nucleotide paired-end reads, what number of reads per sample would be optimal to explore differential transcript expression across a high proportion of the transcriptome (even for genes expressed at a low level)?

      Are there any relevant review articles on this topic that you might be aware of? It is clear to me that tissue type (e.g. brain vs. liver), RNA preparation protocols, RNA quality (e.g. RIN), and the specific research questions for the RNA-Seq data will all have a great impact on the optimal read depth, and it would be great if some studies have already been performed to address some of these variables.

      Thank you,
      Alexandra

      • #4
        Thanks, malachig, for the insightful answer. Just to add to this thread: there is a recent paper on coverage estimates in monoculture bacterial transcriptomes that goes into some detail. Since it is on bacteria, the results are obviously not directly applicable to human. Also, this Genome Research paper and this Bioinformatics paper may be of interest. Perhaps we and others could return to this thread as new references turn up and add them here. Until then, 100M reads seems to be a good target for human.
        Last edited by schelhorn; 01-04-2013, 04:01 AM.

        • #5
          schelhorn, thank you for the references! They were very useful.

          • #6
            I was just reading a paper about NOISeq (Differential expression in RNA-seq: a matter of depth) and it reminded me of this thread. In the paper they state: "Some recent reports suggest that in a mammalian genome, about 700 million reads would be required to obtain accurate quantification of >95% of expressed transcripts (Blencowe et al. 2009) ..."
            I haven't checked the primary source, but maybe you will find your answer there. The full reference is:
            Blencowe et al. 2009: Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes. Genes Dev 23: 1379-1386

            Best,
            Simon
