  • Dealing with super abundant transcripts in RNAseq

    I've discovered that my RNAseq libraries contain different levels of a very abundant transcript that is derived from mitochondria but which seems to be present even after polyA selection. It represents about 10-20% of the reads in most libraries, although in one library it represents as few as 3% of the reads and in another it represents 50% of the reads. Clustering of the libraries using DESeq indicates that the library with 50% contamination is an outlier: it doesn't cluster with the other replicates. I'm concerned that this transcript is going to skew the normalization procedure used by DESeq, and I wonder if it would be best to remove the counts for this gene before running DESeq? How are people dealing with libraries that have unusually high levels of ribosomal RNA (rRNA) contamination?

    Cheers

  • #2
    Often in RNA-Seq analyses, normalization is done by simply dividing by the total number of mapped reads from a library. The case that you describe is precisely the reason why this is not a good idea, and why we in the DESeq paper, and independently Oshlack and Robinson in their paper on normalization, advise against it. DESeq's normalization looks at each gene, calculates a normalization factor from this gene's count, and then takes the median of the factors from all the genes. Then, a single gene should have little influence, even if it is very strongly expressed.

    Still, it might be interesting to double-check this. Remove the mitochondrial transcripts from your count table and re-run the analysis. I'd hope that your results for all the other genes won't change much.
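To make the median-of-ratios idea concrete, here is a minimal sketch in plain Python. It is not DESeq's own code (DESeq is an R package); the function name `size_factors` and the toy count table are assumptions made up for illustration. For each library it takes the median, over genes, of the ratio between a gene's count and that gene's geometric mean across libraries, and then repeats the calculation with the abundant transcript removed.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count table."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # a zero count makes the geometric mean zero; skip the gene
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy table: rows are genes, columns are two libraries.
# The ordinary genes have identical counts in both libraries; the last row
# stands in for the super-abundant mitochondrial transcript.
counts = [
    [100, 100],
    [200, 200],
    [50, 50],
    [400, 400],
    [1000, 9000],  # super-abundant transcript
]

with_abundant = size_factors(counts)
without_abundant = size_factors(counts[:-1])

# What a total-count normalization would see instead.
totals = [sum(col) for col in zip(*counts)]
total_ratio = totals[1] / totals[0]

print("size factors, all genes:        ", with_abundant)
print("size factors, transcript removed:", without_abundant)
print("ratio of library totals:         ", round(total_ratio, 2))
```

In this toy table both runs give size factors of about 1 for each library, while the ratio of the library totals is above 5. That gap is the distortion a total-count normalization would introduce.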
    Last edited by Simon Anders; 04-07-2011, 01:05 AM. Reason: sp


    • #3
      I tried removing the offending abundant transcript and running DESeq, but the sample in which 50% of the reads came from that gene was still an outlier with respect to the other replicates. I also tried removing the outlying sample altogether. I have just 3 replicates per condition.

      Anyway, it looks like the normalization implemented by DESeq is pretty robust, because I got similar lists of differentially expressed genes regardless of whether I ran the analysis using all the replicates, after removing the outlying sample, or after removing the very abundant transcript.


      • #4
        Originally posted by kirby
        I'm concerned that this transcript is going to skew the normalization procedure used by DESeq, and I wonder if it would be best to remove the counts for this gene before running DESeq? How are people dealing with libraries that have unusually high levels of ribosomal RNA (rRNA) contamination?

        Cheers
        Kirby,

        I have a very similar problem to yours.

        I am analyzing some Illumina libraries that appear to have a lot of ribosomal RNA contamination.

        I'm using Bowtie to align the reads only to a specific set of sequences, and because of the differing amount of rRNA contamination in each sample, each of them maps a different percentage of reads to the dataset (some half of what others map), ranging from 1% to 0.3%.

        I wonder if the amount of rRNA contamination in the preparation of a library can have an impact on the apparent expression level of a gene -- even though one normalizes its counts against the total number of reads that mapped.

        What's your opinion on this subject?

        Carmen


        • #5
          Originally posted by carmeyeii
          I wonder if the amount of rRNA contamination in the preparation of a library can have an impact on the apparent expression level of a gene -- even though one normalizes its counts against the total number of reads that mapped.
          This is a very nice example where a normalization by total number of reads would lead to wrong results while using one of the normalization methods I mention in post #2 will take care of the issue.
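A small worked example may help show why dividing by the total number of reads goes wrong here. This is a sketch in plain Python with invented numbers: the library names, read counts and the helper `per_million` are all hypothetical. Both libraries contain the gene of interest at the same level relative to their non-rRNA reads, and only the rRNA fraction differs. Scaling by the non-rRNA reads stands in, very roughly, for what a size factor that ignores the contaminant would do.

```python
def per_million(count, library_total):
    """Scale a raw count to reads per million of the given total."""
    return count * 1_000_000 / library_total


# Hypothetical libraries of 10 million reads each. "other" is every read
# that is not rRNA (it includes the reads of the gene of interest).
libraries = {
    "clean": {"rrna": 1_000_000, "other": 9_000_000, "gene": 900},
    "contaminated": {"rrna": 5_500_000, "other": 4_500_000, "gene": 450},
}

for name, lib in libraries.items():
    total = lib["rrna"] + lib["other"]
    lib["by_total"] = per_million(lib["gene"], total)
    lib["by_non_rrna"] = per_million(lib["gene"], lib["other"])

# Fold change between the libraries under each scaling.
apparent_fold = libraries["clean"]["by_total"] / libraries["contaminated"]["by_total"]
corrected_fold = libraries["clean"]["by_non_rrna"] / libraries["contaminated"]["by_non_rrna"]

print("fold change, scaled by all reads:     ", apparent_fold)   # 2.0
print("fold change, scaled by non-rRNA reads:", corrected_fold)  # 1.0
```

With these made-up numbers the gene looks two-fold higher in the clean library when counts are scaled by all reads, although its level relative to the other non-rRNA reads is identical in both.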


          • #6


            Thanks, Simon.

            So the norm factors produced by default in DESeq are indeed calculated in the manner you described above, I assume?

            Carmen


            • #7
              Originally posted by carmeyeii
              So the norm factors produced by default in DESeq are indeed calculated in the manner you described above, I assume?
              Of course.


              • #8
                Hello again,

                I've gone through with the normalization and differential expression analysis for my samples, but it seems I'm still having trouble with the widely varying amounts of rRNA contamination, which I suspect may be obscuring DE effects because of very large differences in counts among replicates.

                The percentage of reads from each sample that mapped to the small index of interest was very different, ranging from 0.2% to 0.99%, presumably because of the great difference in rRNA content in each library. Because of this, the size factors varied widely, ranging from 0.4 to 4 in one set of comparisons. Because of the great difference in rRNA contamination, I did not want to normalize by library size, as the authors of DESeq advise above.

                I am also concerned about the normalization used (the default method in DESeq). It estimates size factors from the changes in counts of each feature while assuming that most features are not differentially expressed, so it may be too conservative if most of the features in the present dataset are in fact upregulated.
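The concern in the previous paragraph can be illustrated with a minimal sketch in plain Python. It is not DESeq's code: `size_factors` is a hypothetical re-implementation of the median-of-ratios idea from post #2, and the count table is invented. Two samples are sequenced to the same depth, but 8 of the 10 features are doubled in the second sample.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a features x samples count table."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # skip features with a zero count in any sample
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical table: 8 features doubled in sample 2, 2 features unchanged.
upregulated = [[100 * (i + 1), 200 * (i + 1)] for i in range(8)]
unchanged = [[300, 300], [500, 500]]
counts = upregulated + unchanged

sf = size_factors(counts)
normalized = [[c / s for c, s in zip(row, sf)] for row in counts]

print("size factors:", [round(s, 3) for s in sf])
print("first upregulated feature after normalization:",
      [round(v, 1) for v in normalized[0]])
print("first unchanged feature after normalization:  ",
      [round(v, 1) for v in normalized[8]])
```

Because the majority of features shift together, the size factors absorb the shift: the doubled features come out equal after normalization, and the two unchanged features look as if they had gone down. Whether this applies to a real dataset depends on how large the fraction of changing features actually is.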

                Unfortunately, I did not find any significantly differentially expressed TEs. Perhaps the heavy contamination of the libraries is an obstacle to finding them, or perhaps I could use another normalization method to even out the rRNA contamination among samples?

                In short, the amount of rRNA contamination is huge and varies a lot between samples, and most of the features being compared MIGHT be differentially expressed, which complicates the analysis a bit.

                Below is one of the size factor vectors obtained and a representative histogram of what I'm getting.

                Any input on this matter would be greatly appreciated!

                Carmen
                > cds = estimateSizeFactors( cds )
                > sizeFactors(cds)
                        1         2         3         4
                0.7007070 0.4144263 0.7905694 3.9685978
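For a sense of scale, the arithmetic below applies the size factors posted above to a made-up count. It is a plain Python sketch: the raw count of 100 per sample is hypothetical, and only the four size factors come from the post. Normalized counts are taken to be raw counts divided by the sample's size factor.

```python
# Size factors copied from the post above (samples 1 to 4).
size_factors = [0.7007070, 0.4144263, 0.7905694, 3.9685978]

# Hypothetical raw count of 100 reads for one feature in every sample.
raw = [100, 100, 100, 100]

# Normalized count = raw count / size factor of that sample.
normalized = [r / s for r, s in zip(raw, size_factors)]

# How far apart the most and least deeply covered samples are.
spread = max(size_factors) / min(size_factors)

for sample, value in enumerate(normalized, start=1):
    print(f"sample {sample}: {value:.1f}")
print(f"largest / smallest size factor: {spread:.1f}")
```

The same raw count therefore ends up almost ten times larger after normalization in sample 2 than in sample 4, which is why low counts in the shallow samples carry so much uncertainty.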


                • #9
                  PoissonSeq (SAMseq) normalization

                  You could try the normalisation method provided in SAMseq (samr package). It can be used as a stand-alone function from the very similar PoissonSeq package (available from CRAN). The usage is simple:

                  PS.Est.Depth(n, iter=5, ct.sum=5, ct.mean=0.5)

                  And you could feed the result to DESeq...

                  Perhaps you can post the result here and say whether this method improved your results.
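For readers curious what this kind of depth estimate does, here is a simplified sketch in plain Python. It is an assumption-laden illustration of the general idea as I understand it (fit depths, keep the features that fit best to a "no change" model, refit), not the PoissonSeq source code, and `estimate_depth`, its `trim` parameter and the toy table are hypothetical. The returned depths sum to 1; the scale used by the actual package may differ.

```python
def estimate_depth(counts, iterations=5, trim=0.25):
    """Iteratively estimate relative sequencing depth per sample.

    Each round scores every feature by how well its counts match
    depth * row total, keeps the middle (1 - 2 * trim) fraction of features
    by that score, and re-estimates the depths from the kept features only.
    """
    n_samples = len(counts[0])
    row_sums = [sum(row) for row in counts]
    usable = [i for i, s in enumerate(row_sums) if s > 0]

    def depth_from(subset):
        total = sum(row_sums[i] for i in subset)
        return [sum(counts[i][j] for i in subset) / total
                for j in range(n_samples)]

    depth = depth_from(usable)  # start from plain total-count proportions
    for _ in range(iterations):
        # Chi-square-like goodness of fit for each feature.
        gof = {
            i: sum((counts[i][j] - depth[j] * row_sums[i]) ** 2
                   / (depth[j] * row_sums[i]) for j in range(n_samples))
            for i in usable
        }
        ranked = sorted(usable, key=lambda i: gof[i])
        cut = int(len(ranked) * trim)
        depth = depth_from(ranked[cut:len(ranked) - cut])
    return depth


# Hypothetical table: 20 ordinary features with a 1:2 ratio between the two
# samples, plus one contaminant that is 20 times higher in sample 2.
counts = [[100 * k, 200 * k] for k in range(1, 21)] + [[3000, 60000]]

totals = [sum(col) for col in zip(*counts)]
naive = [t / sum(totals) for t in totals]
depth = estimate_depth(counts)

print("total-count proportions:", [round(d, 3) for d in naive])
print("trimmed depth estimate: ", [round(d, 3) for d in depth])
```

In this toy case the trimmed estimate recovers the 1:2 depth ratio of the ordinary features, whereas the total-count proportions are pulled toward the contaminant. The sketch assumes every sample has nonzero counts among the kept features.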


                  • #10
                    I will try this, and post any changes to the results here. thanks dietmar!


                    • #11
                      Originally posted by carmeyeii
                      I will try this, and post any changes to the results here. thanks dietmar!
                      Hi Carmen,
                      Did you obtain better results with this second normalization?
                      I'm dealing with a similar problem...
