  • What is good coverage on an RNA-seq experiment

    Hi,

    I just got my first RNA-seq dataset alignment results today, and I am wondering what would be considered a "good run".
    This is a pilot experiment, whose objective is to determine how many samples we can multiplex in one lane of Illumina HiSeq to get the information we want and the best trade off between cost, specificity and sensitivity.

    These are mouse brain samples. I am getting 75 million read pairs, of which 85% map consistently (the forward and reverse reads map to the same place) and uniquely. In total, 97% of the reads map.
    I am concerned about the coverage: about 10% of the bases in the mouse genome are covered by unique mappers and 3% by non-unique mappers.
    Is this typical? When people talk about 30X coverage of RNA-seq datasets, what is the coverage relative to? RefSeq mRNA?

    Right now I am running 3 samples per lane, and I want to know whether this coverage is enough for basic transcriptome analysis, whether I can multiplex more, or whether I should run fewer samples per lane.

    thanks for the help

  • #2
    You've got plenty of reads. If you check out the cufflinks paper you can see how much sequencing they did (i.e., a lot), and they also found that only about a third or a quarter of the reads were necessary for their transcriptome analysis. As they ran the analysis with more and more of the data they collected, they found no new information. You've also got very good alignment percentages, better than I have ever seen in data from my lab. Two years ago we were doing differential expression analysis with 18 million 36 bp single-end reads per sample and it worked fine.

    if you're only interested in differential expression you could collect fewer reads per sample. for splicing i think the number of reads you've collected is good.

    as far as the percentage of the mouse genome your data is aligning to - keep in mind that the genome is only about 5% genes (i think that's right). when people talk about fold coverage for an RNA-Seq experiment it's only relative to the parts of the genome that make RNA.
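To make the "relative to what?" point concrete, here is a rough sketch of how the same run gives very different fold-coverage figures depending on the denominator. Only the 75 million read pairs and the "about 5%" figure come from this thread; the 100 bp read length and the ~2.7 Gb mouse genome size are assumptions for illustration, and `fold_coverage` is a made-up helper name.

```r
# Fold coverage = total sequenced bases / size of the region the reads can come from.
fold_coverage <- function(n_reads, read_length, target_size_bp) {
  (n_reads * read_length) / target_size_bp
}

read_pairs     <- 75e6                # read pairs, from the original post
read_length    <- 100                 # bp per read (ASSUMED; not stated in the thread)
genome_size    <- 2.7e9               # approximate mouse genome size in bp (ASSUMED)
transcribed_bp <- 0.05 * genome_size  # the "about 5%" figure quoted above

# Each pair contributes two reads.
cov_genome      <- fold_coverage(2 * read_pairs, read_length, genome_size)
cov_transcribed <- fold_coverage(2 * read_pairs, read_length, transcribed_bp)

cat(sprintf("relative to whole genome:      %.1fx\n", cov_genome))
cat(sprintf("relative to transcribed bases: %.1fx\n", cov_transcribed))
```

Either figure is only an average. Expression levels span orders of magnitude, so the per-transcript coverage will be far above the average for abundant messages and far below it for rare ones.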
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */



    • #3
      Hi sdriscoll,
      Not sure I agree with the conclusions you draw from the cufflinks paper. (Trapnell et al., Nature Biotechnology, 2010, 28:311)

      If you accept Trapnell et al.'s calling transcripts present at >15 "fragments per kilobase of transcript per million reads -- FPKM" as "moderately expressed", then 25 million reads/sample might get you where you need to go for your analysis (Fig 4). Or not. Their example shows that at 20 million reads only around 50% of these "moderately expressed" messages will be estimated to within 15% of the value obtained from a data set using 140 million reads.

      If you are interested in genes that are expressed at levels much lower than "moderately expressed" transcripts, then the answer would seem to be "no, not even close to enough."

      The problem with questions like this one is that the questioner does not seem to be trying to match an analysis method to what is likely or possibly going on in the pool of transcripts present in the cells of a given sample (and from that derive some information about what is actually going on in these samples). Instead they seem to be asking "what is the standard practice of those deploying this methodology?" As if science is primarily a rules-based process and we are all part of some compliance-enforcing bureaucracy with no goals other than seeing that the "rules" are followed.

      My take is that if you are so deep into a process that you have lost track of what it is you are ultimately trying to achieve, you need to step back and re-think what it is you are doing and why you are doing it.

      --
      Phillip



      • #4
        i thought the question was pretty straightforward and i don't think you helped much - no offense of course.

        the question of how much sequencing depth is "good" is really an open-ended question, is all. maybe the best answer for now is "nobody knows". if we had some ultimate reference of expression values for genes that we could compare those calculated in our individual experiments to, then maybe we could figure it out. as of now people can only compare fractions of their data to their full data and produce plots that are basically useless insofar as they don't help anyone decide how many reads they should be collecting for reliable results.

        i know of the plots you're referring to from the cufflinks paper and they are similar to the plot from Wold's paper http://www.nature.com/nmeth/journal/...meth.1226.html where they only sequenced to 40 million reads. so their plot with 40 million reads looks like the plot from the cufflinks paper with 140 million reads. i've seen another paper where they sequenced to 1 billion reads and they, again, had the same plot. those plots are kind of senseless because they can't be used as an absolute reference for anyone else - they're only in terms of themselves. so you collect N reads, quantify expression, then re-quantify with 3/4, 1/2 and 1/4 of your reads and you find that the gene expression levels vary from their "final" value with all of your reads. i'm afraid you'll find the same thing no matter what your depth is. however there's no logic to only using fractions of your complete data set, so why bother comparing the complete set to fractions of the set.

        anyways, in my experience, having the coverage that LP_SEP23 says they have sounds like it will be a pretty solid dataset and i don't think they need to collect more reads per sample. they could and theoretically their expression values would be more reliable but down to what level - who knows.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */



        • #5
          Okay, but what does "solid" mean? For the transcripts detected at all (or above some threshold) you get nearly the same count over replicates in the set? If that is what you are looking for, fine.

          But there are lots of transcripts that are going to be present in an RNA sample that will not be detectable by sequencing 40 million reads. Could be they occur in only a small sub-population of the cells of a sample or are simply maintained at low levels in cells (or both).

          I agree my answer was not helpful. But I was trying not to validate the question -- as I said, I think it is the wrong one to ask. Or at least it is too vague to be answered.

          --
          Phillip



          • #6
            fair enough!

            so then can you tell me how many reads I should collect if i'm interested in genes expressed below the 5 FPKM level? is it worth it to spread a single sample across 8 lanes of a HiSeq 2000 run? those are the exact questions researchers want answers to. for example when you're proposing a grant for a sequencing based project and you have to think in terms of cost.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */



            • #7
              maybe this paper is useful for this discussion: http://genome.cshlp.org/content/earl....full.pdf+html. it would seem that depth results in detection of different types of RNAs. they also provide a very useful comparison of DE analysis (and of course introduce their own which appears to be less biased by sequencing depth and feature length plus has much lower FDR).
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */



              • #8
                Originally posted by sdriscoll:
                fair enough!

                so then can you tell me how many reads I should collect if i'm interested in genes expressed below the 5 FPKM level? is it worth it to spread a single sample across 8 lanes of a HiSeq 2000 run? those are the exact questions researchers want answers to. for example when you're proposing a grant for a sequencing based project and you have to think in terms of cost.
                I was going to write "I am in exactly the same position here in the Purdue genomics core." But I have never had anyone pose the question you just did. If they know what 5 FPKM means, they probably will already know how many reads they want. But I get the more general "how many reads/sample is enough." To tell you the truth I mainly just tell them how many reads/sample the head of our bioinformatics core recommends. If they don't like that answer I can expand somewhat. But unless you have some idea what the shape of your transcriptome is, there is not much to add other than "give it a shot and see what you get".

                --
                Phillip



                • #9
                  Originally posted by pmiguel:
                  "give it a shot and see what you get".
                  that seems to be the state-of-the-art.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */



                  • #10
                    Hey, I just wanted to say this was a very helpful discussion. As the only person in our lab who is just starting RNA-Seq, it can seem very intimidating, and in many things, there are set standards. I wanted to point out that sometimes when we first start, we don't realize how complex it all is. At the beginning, you can sometimes think,
                    "hey, what's a good depth for my genome? I'll figure out which things are interesting later."



                    • #11
                      so there's one thing that bothers me with these "depth" investigations where they collected N million reads, randomly subsetted the reads, did alignments and quantified expression at each subset. they find some variation across the subsets that decreases as they approach the full alignment.

                      wouldn't that be obvious though? i mean - as you're using more and more of the total amount of the data, wouldn't there naturally be less and less variation between adjacent subset sizes just due to the fact that you're including more of the data? so two random subsets of 10% of the data would naturally differ more than two random subsets of 75% of the data. there is potential for much more overlap between the two 75% subsets than the two 10% subsets.

                      i guess what i mean is it doesn't really make sense to compare the data to itself over and over again - that's why there aren't statistical tests designed for such comparisons and of course 75% of the data is going to look more like 100% of the data than 10% of the data. duh...right?

                      try this in R:

                      Code:
                      # 100 random values between 0 and 10
                      sset <- runif(100,max=10)
                      # compute mean of different random subsets at different percentages of the
                      # total set of numbers at 10% increments
                      mmean <- numeric(10)
                      # sample() draws i*10 of the 100 values without replacement
                      for(i in 1:10) { mmean[i] <- mean(sample(sset, i*10)) }
                      plot(mmean,type='b')
                      what you'll see is that the values tend to be closer to the mean of all the numbers as you use larger subsets of the data. so...duh?
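A slightly fuller sketch of the same point, for anyone who wants to see it without relying on a single draw per subset size. It repeats the subsetting many times and compares the observed spread of the subset means with what finite-population sampling theory would predict. The seed, the 2000 repeats and the 10% steps are arbitrary choices for illustration, not anything taken from the papers discussed.

```r
set.seed(1)  # arbitrary seed so the run is repeatable
full <- runif(100, max = 10)       # same toy data as above
fractions <- seq(0.1, 0.9, by = 0.1)

# For each fraction, draw many random subsets (without replacement) and record
# how far each subset mean lands from the mean of the full set.
spread <- sapply(fractions, function(f) {
  k <- round(f * length(full))
  devs <- replicate(2000, mean(sample(full, k)) - mean(full))
  sd(devs)
})

# Sampling without replacement from a finite set of size N predicts
# sd(subset mean) = sd(full) * sqrt((1 - f) / (f * N)),
# which shrinks to zero as f approaches 1 whatever the data are.
predicted <- sd(full) * sqrt((1 - fractions) / (fractions * length(full)))

print(round(cbind(fraction = fractions, observed = spread, predicted = predicted), 3))
```

The observed and predicted columns should track each other closely, which is the "duh" made explicit: convergence toward the full-data value is built into the subsetting design and says nothing by itself about whether the full data set is deep enough.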
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */



                      • #12
                        Originally posted by sdriscoll:
                        that seems to be the state-of-the-art.
                        LOL...now that is funny. But I thought RNAseq has been evolving as a replacement for the former "state of the art"...expression microarrays. Data I've seen in the past from one very large service provider compared the number of genes detected/quantified at a given number of reads with the number of genes detected/quantified using a microarray. Based on their work 20M reads was more than sufficient to outperform the microarray. But maybe that was too simplistic an analysis.



                        • #13
                          Originally posted by sdriscoll:
                          what you'll see is that the values tend to be closer to the mean of all the numbers as you use larger subsets of the data. so...duh?
                          Not sure why this bothers you. I think they are trying to address the question: How few reads am I likely to be able to get away with while retaining most of the information I need? So you can look at the curves they generate and dial in a number that might work for the genes you are interested in.

                          I don't think their results were intended to surprise.

                          --
                          Phillip



                          • #14
                            I think the deeper the better. No dataset has yet reached the maximum coverage needed to answer all questions.

                            If you only want to look at gene expression, probably ~60 million reads would be OK (that's my feeling).

                            But in my experience, we often hit the limits in our datasets (we have ~140 million 100 bp PE reads and several biological replicates, which is very important!)

                            What do you want to do after the global gene expression analysis? As soon as you look into specific candidates you will be happy for every read.

                            Often the expression of your gene of interest might be low (like an FPKM of 3) and you are interested in an alternative first exon. Then you might just end up having to compare 5 reads versus 1.

                            The problem is that expression levels are so different and small regions like exons have much lower read counts than entire genes.
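As a rough illustration of why short features of low-expressed genes end up with so few reads, the FPKM definition can be rearranged to give an expected fragment count. This is only a sketch: the 150 bp exon length, the three depths and the threshold of 10 fragments are assumptions picked for the example, `expected_fragments` is a made-up helper, and the Poisson model ignores fragment-length effects, mapping bias and biological variability.

```r
# Rearranged from: FPKM = fragments / (length in kb * mapped fragments in millions)
expected_fragments <- function(fpkm, feature_length_bp, mapped_fragments) {
  fpkm * (feature_length_bp / 1e3) * (mapped_fragments / 1e6)
}

fpkm        <- 3                      # low-expressed gene, as in the post above
exon_length <- 150                    # bp, ASSUMED length of an alternative first exon
depths      <- c(20e6, 40e6, 140e6)   # mapped fragments per sample (ASSUMED)

lambda <- expected_fragments(fpkm, exon_length, depths)

# Under a Poisson assumption, the chance of seeing at least 10 fragments
# on that exon: P(X >= 10) = P(X > 9).
p_at_least_10 <- ppois(9, lambda, lower.tail = FALSE)

print(data.frame(depth = depths, expected = lambda, p_at_least_10 = p_at_least_10))
```

Swapping in your own exon length and expression level gives a quick feel for whether a planned depth leaves you comparing a handful of reads or a usable count.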



                            • #15
                              Originally posted by DerSeb:
                              I think the deeper the better. No dataset has yet reached the maximum coverage needed to answer all questions.
                              Past a certain point you might begin to see limits imposed by the library itself. Your major bottleneck is how many molecules you have of your library prior to enrichment PCR. That is, you converted your RNA (probably after ribo-depleting and fragmenting) to DNA and ligated adapters onto it. At that point how many "good" library molecules do you have?

                              We actually have started doing qPCR at this step. The results are quite variable. But if you are seeing that your concentration is 10 pM -- that is ~6,000,000 "good" molecules/ul. For 20 ul of library you are looking at 120 million molecules. If you want to sequence 30 million reads -- no problem. But at 10x deeper you are guaranteed to be wasting more than 1/2 your reads on PCR duplicates.
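The arithmetic above can be sketched in a few lines of R. The 10 pM and 20 ul figures are from the post; the duplicate estimate assumes every library molecule is equally likely to be sequenced and that reads sample molecules at random with replacement, which is an idealisation rather than a description of any real library. `molecules_per_ul` and `duplicate_fraction` are made-up helper names.

```r
avogadro <- 6.022e23  # molecules per mole

# Molecules per microlitre from a molar concentration (mol/L -> molecules/ul)
molecules_per_ul <- function(conc_molar) conc_molar * avogadro / 1e6

library_molecules <- molecules_per_ul(10e-12) * 20   # 10 pM in 20 ul

# Expected duplicate fraction when n_reads are drawn at random (with
# replacement) from n_molecules distinct library molecules.
duplicate_fraction <- function(n_reads, n_molecules) {
  expected_unique <- n_molecules * (1 - exp(-n_reads / n_molecules))
  1 - expected_unique / n_reads
}

cat(sprintf("library molecules: %.3g\n", library_molecules))
for (n in c(30e6, 300e6)) {
  cat(sprintf("%4.0f million reads: ~%.0f%% expected duplicates\n",
              n / 1e6, 100 * duplicate_fraction(n, library_molecules)))
}
```

Independently of the sampling model, 300 million reads from about 120 million distinct molecules cannot contain more than 120 million unique ones, so at least 60% must be duplicates.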

                              --
                              Phillip
