  • pettervikman
    Member
    • Nov 2009
    • 23

    How many reads are acceptable from an RNA seq experiment

    Hi

    We have data from an RNA-seq experiment: 48 samples, Illumina v2.5. We had roughly the recommended number of clusters and an even distribution between the samples, so we've ended up with roughly 6-7 million paired reads, or 12-14 million single reads, per sample.

    I've heard people claim that you need at least 20-25 million reads per sample, so I'm wondering if anyone knows of, or has, an article that has looked at what a good read number is for an RNA-seq experiment. The data quality is really nice; if someone asks me how our runs look, I always show the FastQC report from this run...

    /Petter
  • cedance
    Senior Member
    • Feb 2011
    • 108

    #2
    I'd guess it depends on the analysis you want to do on the data, or the purpose of your experiment. Generally, for SNP calling, I'd suppose this number of reads is sufficient. However, if you are looking at gene expression, especially to detect differential expression of lowly expressed genes, then more reads would probably help.

    I'd love to see the FastQC results to see how good RNA-Seq data can look. The data I am working with look good after preprocessing (adapter clipping + quality trimming), but I have never seen a library that looked good enough in the raw data.
    Also, it would be great if you could tell us how much total RNA you used, and a bit about pre-amplification of the library: whether it was performed, how many cycles, etc.

    Thank you.


    • kopi-o
      Senior Member
      • Feb 2008
      • 319

      #3
      This is a hotly debated topic; see e.g. http://blog.fejes.ca/?p=607 where Anthony Fejes discusses a paper claiming that 500 million reads are needed to estimate transcription levels... There has been a kind of mini-trend lately, with several papers claiming that RNA-seq is actually not that good compared to microarrays unless you have very deep coverage.

      As cedance said, it really depends on what you are interested in. I have performed some simulations where I downsampled the data and looked at the resulting abundance estimates for isoforms from Cufflinks and other tools, and haven't seen that much difference beyond 10 million paired-end reads so far. Looking at the number of detected transcripts, it always grows with sequencing depth, but again the curve is almost flat after 10-20M reads in the cases I've looked at.
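A downsampling check of the kind described above could be sketched roughly as follows. This is only an illustration, assuming you already have a per-transcript read-count table; the toy counts, the `downsample` and `saturation_curve` helpers, and the `min_reads` detection threshold are all made up for the example and are not the actual procedure used in this thread.

```python
import random

def downsample(counts, n_reads, rng):
    """Draw n_reads reads without replacement from a {transcript: count} table."""
    # Expand the table into one entry per read, then subsample.
    pool = [tid for tid, c in counts.items() for _ in range(c)]
    sub = {}
    for tid in rng.sample(pool, n_reads):
        sub[tid] = sub.get(tid, 0) + 1
    return sub

def detected(counts, min_reads=1):
    """Number of transcripts with at least min_reads reads."""
    return sum(1 for c in counts.values() if c >= min_reads)

def saturation_curve(counts, depths, seed=0):
    """(depth, detected transcripts) for each requested subsampling depth."""
    rng = random.Random(seed)
    return [(d, detected(downsample(counts, d, rng))) for d in depths]

# Toy, hypothetical library: a few abundant transcripts and many rare ones.
toy_counts = {f"tx{i}": max(1, 2000 // (i + 1)) for i in range(200)}
total = sum(toy_counts.values())
depths = [total // 20, total // 10, total // 4, total // 2, total]
curve = saturation_curve(toy_counts, depths)
for depth, n_detected in curve:
    print(depth, n_detected)
```

With real data the subsampling would normally be done on the FASTQ or BAM files and followed by a fresh quantification run; expanding a count table in memory like this is only practical for small examples.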


      • adameur
        Member
        • Nov 2009
        • 23

        #4
        To make it even more complex, we have seen that polyA+ RNA gives a much higher fraction of reads mapping to exons compared to total RNA (rRNA depleted) where there are instead lots of intronic reads. Our explanation is that total RNA-seq captures lots of nascent transcripts that have not yet been fully transcribed while PolyA+ RNA-seq captures mainly mature transcripts (see http://dx.doi.org/10.1038/nsmb.2143).

        So I think fewer reads are required for polyA+ RNA-seq compared to total RNA-seq if you are interested in mRNA expression.


        • harryzs
          Member
          • Dec 2010
          • 30

          #5
          you should read this:


          • pettervikman
            Member
            • Nov 2009
            • 23

            #6
            Thanks for all the answers. I've decided to resequence a couple of samples to a much higher depth, as well as doing some data pooling, to see how things look in our system. I'm assuming that the coverage needed will depend on read length as well as read depth, and since we have 101 bp reads we might be better off. I'm also uncertain about the number of transcripts to expect. We're working with a highly specialised cell type, not a cell line, so I'm expecting fewer transcripts, and far from all that could exist, compared with the vast numbers found in immortalised cell lines.

            I'm also curious whether it depends much on the highly expressed genes in the sample, since they "steal" a lot of the data being produced. I know that it's possible to select the genes that one is interested in, but has anyone tried to remove the genes that are uninteresting/highly expressed to increase the coverage of the other genes? This would allow higher coverage even of genes that you don't know exist, in contrast to positive selection, where you only find what you expected to find.
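To put a rough number on how much highly expressed genes "steal", here is a small back-of-the-envelope sketch. The gene names and counts are hypothetical, and `depletion_gain` is an illustrative helper that assumes the freed reads are redistributed proportionally across the remaining genes, which is an idealisation.

```python
def depletion_gain(counts, remove):
    """Fold increase in reads for the remaining genes if the genes in `remove`
    are depleted and the total sequencing output stays the same.

    Assumes freed reads are shared out proportionally among remaining genes.
    """
    total = sum(counts.values())
    removed = sum(counts[g] for g in remove)
    if removed >= total:
        raise ValueError("cannot remove all reads")
    return total / (total - removed)

# Hypothetical counts: two very abundant genes take 60% of the reads.
toy = {
    "abundant_1": 4_000_000,
    "abundant_2": 2_000_000,
    "gene_a": 3_000_000,
    "gene_b": 1_000_000,
}
gain = depletion_gain(toy, ["abundant_1", "abundant_2"])
print(f"fold gain for remaining genes: {gain:.2f}")
```

Running the same calculation on your own count table, with the top few genes in `remove`, gives a quick upper bound on what a depletion step could buy.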

            I also wanted to attach a figure to show what I call high-quality data, since cedance asked for it, but the forum asks for a URL and I only have those figures on my computer, so I can't. Are there any nice (fast and simple) ways of doing this?


            • cedance
              Senior Member
              • Feb 2011
              • 108

              #7
              Pettervikman,
              About posting images/urls to images, I use imageshack to upload images and paste the url here with the URL button.


              • pettervikman
                Member
                • Nov 2009
                • 23

                #8
                A new try for the figures



                • cedance
                  Senior Member
                  • Feb 2011
                  • 108

                  #9
                  That looks really great. Could you also post the plots for "Sequence duplication levels" and "per base sequence content"? These are the ones I am not quite satisfied with in our data.


                  • pettervikman
                    Member
                    • Nov 2009
                    • 23

                    #10




                    Here are the per base content and duplication levels. Since we've used poly-A tail pulldown, I'm not surprised by the initial increase in A/T. The duplication levels are much higher than I'd accept for a genomic project, but since there's much less diversity in the transcriptome I'm fine with this. Consider that there are hard end points that really can't be changed (the 5' and 3' ends of transcripts), and maybe only 10-15 k transcripts to start with.
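The diversity argument above can be illustrated with a simple occupancy calculation. This is a sketch under strong assumptions: reads are drawn uniformly from a fixed number of possible start positions, a duplicate is any read sharing a start position with an earlier read, and the mean transcript length of 1,500 bp and the 3 Gb genome size are hypothetical round numbers. Real expression is far from uniform, so observed duplication in RNA-seq will be higher than this estimate.

```python
import math

def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads that repeat an already-seen start position
    when n_reads are drawn uniformly at random from n_positions sites."""
    # Expected number of distinct sites hit at least once:
    #   n_positions * (1 - (1 - 1/n_positions) ** n_reads)
    # computed via log1p/expm1 for numerical stability.
    distinct = n_positions * -math.expm1(n_reads * math.log1p(-1.0 / n_positions))
    return 1.0 - distinct / n_reads

n_reads = 6_500_000                    # roughly the per-sample depth in this thread
transcriptome_sites = 12_500 * 1_500   # hypothetical: 12.5 k transcripts x 1.5 kb
genome_sites = 3_000_000_000           # hypothetical: ~3 Gb genome

print(expected_duplicate_fraction(n_reads, transcriptome_sites))
print(expected_duplicate_fraction(n_reads, genome_sites))
```

Even under uniform sampling the transcriptome-sized target gives a far larger expected duplicate fraction than the genome-sized one, which is the point being made about reduced diversity.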

                    Another question, though. After running Cufflinks with RABT (-g), the transcripts created look a lot nicer. That said, does anyone know why some transcripts are labelled OK despite the fact that their FPKM_low is 0? I'm also wondering about transcripts labelled FAIL that have positive numbers in coverage, fpkm and fpkm_high.

                    To sum it up: why are transcripts with positive numbers in coverage, fpkm and fpkm_high, and 0 in fpkm_low, sometimes labelled OK, sometimes LOWDATA and sometimes FAIL?
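One way to see how often each status occurs among such transcripts is to tabulate them. The sketch below does not explain why Cufflinks assigns the labels; it only counts them. It assumes a tab-separated tracking table with columns named `FPKM`, `FPKM_conf_lo` and `FPKM_status`; these names are an assumption, so adjust the arguments to match the header of your own file. The rows in `toy` are invented for illustration and are not real Cufflinks output.

```python
import csv
import io
from collections import Counter

def status_by_zero_low(handle, low_col="FPKM_conf_lo", fpkm_col="FPKM",
                       status_col="FPKM_status"):
    """Count status labels among rows with FPKM > 0 but a lower bound of 0."""
    tally = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row[fpkm_col]) > 0 and float(row[low_col]) == 0:
            tally[row[status_col]] += 1
    return tally

# Made-up rows for illustration only.
toy = io.StringIO(
    "tracking_id\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\tFPKM_status\n"
    "tx1\t1.2\t0\t3.9\tOK\n"
    "tx2\t0.4\t0\t1.5\tLOWDATA\n"
    "tx3\t8.0\t5.1\t10.9\tOK\n"
    "tx4\t2.2\t0\t6.0\tFAIL\n"
)
tally = status_by_zero_low(toy)
print(tally)
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` object.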


                    • cedance
                      Senior Member
                      • Feb 2011
                      • 108

                      #11
                      Thanks again. I am sorry I don't/haven't used cufflinks, yet.
                      1 more question!!: why is poly-A pulldown responsible for initial increase in A/T?


                      • kopi-o
                        Senior Member
                        • Feb 2008
                        • 319

                        #12
                        Petter, those data look super. Did you get them sequenced in Uppsala?


                        • pettervikman
                          Member
                          • Nov 2009
                          • 23

                          #13
                          Thanks! They were sequenced here on "my" HiSeq. We have a HiSeq here at CRC in Malmö, and we're part of Lund University/LUDC (Lund University Diabetes Center).

                          The pulldown uses a poly-T tail, which will bind somewhere in the poly-A tail (just to be super clear), hopefully close to the 3' end of the CDS/3' non-coding region. But if it binds further down, there will be a few As or Ts sequenced before the actual sequence, hence the slight increase in A/T.


                          • pmiguel
                            Senior Member
                            • Aug 2008
                            • 2328

                            #14
                            Originally posted by cedance
                                Thanks again. I am sorry I don't/haven't used cufflinks, yet.
                                1 more question!!: why is poly-A pulldown responsible for initial increase in A/T?

                            It isn't. The non-random base distribution in the first 10 bases is attributed to hexamer-primed 2nd strand synthesis. (The hexamers do not prime perfectly randomly.)

                            --
                            Phillip
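To look at the non-random base distribution at the start of reads directly, per-position base fractions can be computed from the read sequences, which is roughly what a "per base sequence content" plot shows. This is a minimal sketch: `per_base_content` is an illustrative helper, the four toy reads are invented, and parsing reads out of a FASTQ file is left out.

```python
from collections import Counter

def per_base_content(reads, n_positions=15):
    """Fraction of A/C/G/T at each of the first n_positions read positions."""
    tallies = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions].upper()):
            if base in "ACGT":  # ignore N and other ambiguity codes
                tallies[i][base] += 1
    content = []
    for tally in tallies:
        total = sum(tally.values())
        content.append({b: (tally[b] / total if total else 0.0) for b in "ACGT"})
    return content

# Invented reads, for illustration only.
toy_reads = ["ATGCGT", "ATGAGT", "TTGCGA", "ACGCGT"]
content = per_base_content(toy_reads, n_positions=6)
for pos, fracs in enumerate(content, start=1):
    print(pos, {b: round(f, 2) for b, f in fracs.items()})
```

Comparing the first ten or so positions with the rest of the read shows whether the skew is confined to the start, as described above.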


                            • pettervikman
                              Member
                              • Nov 2009
                              • 23

                              #15
                              Thanks pmiguel. I didn't know that. But I've heard that it's much more common in RNA-seq experiments than in DNA-seq, hence the poly-A tail story. But you're saying that it only depends on the 2nd strand synthesis?
