  • Problem with high background in some RNA-seq samples

    I have been looking at some Illumina RNA-seq samples (84bp single read, bovine samples, 28 samples, 5 timepoints - developmental timecourse).
    In 9 of the samples from this experiment we are seeing a high rate of reads aligning to intergenic regions. This high background means that the RPKM values for transcripts differ dramatically between the "good" samples and the "bad" samples.
    The bad samples do not cluster into a single group - although all samples in one timepoint are affected, and none in another. (The remaining bad samples are scattered among the remaining 3 groups.)

    Has anyone else come across this problem before?
    Could it be caused by problems in tissue collection or sample prep?
    Alternatively could there be a biological explanation for this?
    Are there any suggested lab or bioinformatic approaches to dealing with this problem?

    We are evaluating normalization techniques to make the samples more comparable - does anyone have any thoughts on the validity of this approach?

  • #2
    Has anyone else come across this problem before?
    Could it be caused by problems in tissue collection or sample prep?
    Hm - not exactly the same problem... but some ideas.
    I guess you isolated RNA and performed a DNase treatment, then ran the sequencing protocol without amplification?
    I have some samples at hand where RNA was isolated, DNase treated and then amplified using either random or oligo-dT primers. I saw that with random priming (no mRNA selection) there were a lot of reads matching multiple sites and intergenic regions. My conclusion was that this was mainly due to incomplete DNA digestion (it was also obvious from the coverage pattern). So if you use your RNA directly after DNase treatment, you may see quite some differences between samples depending on how well the DNA digest worked (and I believe this can make quite a big difference).

    Alternatively could there be a biological explanation for this?
    In theory I could think of biological explanations (ncRNA...). Especially in developmental timecourses this might happen (BTW - I'm working on plants, so another person may be more precise here). So I guess it would depend on how exact you were in collecting the same/different timepoints... However, I personally would be careful and first think of the DNA digestion mentioned above... Maybe you could check the length and location of the covered intergenic regions to see whether they are more likely to be DNA leftovers or RNAs.

    Comment


    • #3
      Sequences aligning to intergenic regions most likely correspond to genomic DNA (gDNA) contamination, but they could also represent 'stochastic transcription'. I would agree with 'schmima' that incomplete DNase treatment is a possible cause. We have seen a wide range of intergenic and intronic signal levels across libraries. Sometimes the library creation protocol was different between libraries, which might explain the difference; in other cases the library creation was the same but the quality of the input material was different.

      Our most common library preparation involves total RNA isolation, DNase treatment, polyA+ selection, cDNA synthesis, sonication, etc. The steps most important for low gDNA contamination are probably the early ones. The total RNA isolation procedure itself can result in varying amounts of genomic DNA contamination (did you use a column or non-column based method?). Similarly, the DNase treatment can vary in efficiency (on column or in solution? buffer conditions? etc.). If you select poly(A)+ RNA (and it works well), you should also reduce the presence of gDNA (except near polyA stretches of the genome). gDNA contamination could also be introduced during library construction and have nothing to do with your sample.

      Furthermore, we have found that regardless of how the library construction goes, you will tend to get more intergenic signal if the amount of input RNA is very low. As 'schmima' suggests, the distribution pattern of intergenic reads can hint at their source.

      Regarding dealing with the data you have: one option is to calculate your RPKM-like values considering only the reads mapped within exonic portions of the genome. That is, the number of mapped reads used to normalize across libraries does not include reads mapping outside of exons. Alternatively, you can calculate RPKM as normal and apply normalization afterwards (quantile normalization, for example).
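      A minimal sketch of the exon-restricted RPKM idea above. All gene names, counts and lengths are invented for illustration; the point is only that the library-size denominator counts exon-mapped reads, so intergenic (gDNA) reads never enter it:

```python
# Exon-restricted RPKM: the library-size denominator uses only
# exon-mapped reads, so intergenic (gDNA) reads cannot inflate it.
# All gene names and numbers below are hypothetical.

def rpkm_exonic(exon_counts, exon_lengths_bp):
    """RPKM using only exon-mapped reads as the library-size term."""
    total_exonic = sum(exon_counts.values())  # intergenic reads excluded
    return {
        gene: exon_counts[gene]
        / (exon_lengths_bp[gene] / 1e3)   # per kilobase of exon model
        / (total_exonic / 1e6)            # per million *exonic* reads
        for gene in exon_counts
    }

counts = {"geneA": 500, "geneB": 2000}
lengths_bp = {"geneA": 1000, "geneB": 4000}
rpkm = rpkm_exonic(counts, lengths_bp)
```

      The same counts fed through a standard RPKM (total mapped reads in the denominator) would shrink as the gDNA background grows; restricting the denominator to exons removes that dependence.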

      Comment


      • #4
        Thanks for the suggestions - I didn't prepare the samples myself.
        Having spoken with the wet-lab personnel, I have found that the full RNA sample preparation protocol wasn't followed for some of the timepoints due to small amounts of starting material - so DNase treatment did not occur for all samples.
        The strong suspicion is therefore that what we are seeing is indeed gDNA contamination of these samples.
        Calculating RPKM values using only the reads mapped to exonic portions seems to work in terms of making the samples comparable.

        Comment


        • #5
          I have the same problem in some libraries. As I have not found any published and well-established method to deal with this, I'm exploring some strategies.

          For example, I can calculate the RPKM of the intergenic regions and subtract that value from each gene's RPKM. To calculate the RPKM of intergenic regions, I prepared two GFF files: one containing the positions of all predicted genes (plus 500bp in both the 5' and 3' directions) and another containing the whole contigs (from 1 to contig size). Then I use bedtools to subtract the first GFF from the second, resulting in a BED file containing intergenic regions (intergenic.bed). I tried to remove rRNA regions from this BED file manually.

          Next I use coverageBed to count how many reads map to each intergenic region, and sum them all to get the total number of reads mapped to intergenic regions. I use this number, along with the sum of the sizes of the intergenic regions, to calculate the intergenic RPKM. Finally, I subtract this value from each gene's RPKM (and assign zero to RPKMs that become negative). I typically get values from 2 to 7, depending on the library.
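          The arithmetic of this global-subtraction strategy can be sketched as follows, assuming a single pooled intergenic RPKM per library (all input numbers are invented):

```python
# Global background subtraction: one RPKM computed over all intergenic
# regions pooled, subtracted from every gene's RPKM, floored at zero.
# All input numbers are hypothetical illustration values.

def subtract_intergenic(gene_rpkm, ig_reads, ig_total_bp, lib_reads):
    """Return (intergenic RPKM, background-corrected gene RPKMs)."""
    ig_rpkm = ig_reads / (ig_total_bp / 1e3) / (lib_reads / 1e6)
    corrected = {g: max(r - ig_rpkm, 0.0) for g, r in gene_rpkm.items()}
    return ig_rpkm, corrected

ig_rpkm, corrected = subtract_intergenic(
    {"geneA": 10.0, "geneB": 1.5},   # uncorrected gene RPKMs
    ig_reads=1_000_000,              # reads summed over intergenic.bed
    ig_total_bp=50_000_000,          # summed length of intergenic regions
    lib_reads=10_000_000,            # total mapped reads in the library
)
```

          With these toy numbers the intergenic RPKM comes out at 2 (the low end of the 2-7 range reported above) and geneB is floored to zero.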

          I would like to hear the comments from other seqanswers users on this strategy. Do you think it's reasonable?

          Comment


          • #6
            @glacerda

            I'm not sure I fully understood what you mean (I've never used the bedtools suite to format/change/combine GFF files or to count reads).
            As far as I got it:
            1. get intergenic (IG) regions
            2. sum up the number of reads in the IG regions
            3. calculate the IG-RPKM from sum(IGreads) and length(allIGregions)
            4. subtract this IG-RPKM from the genic RPKMs
            Regarding the IG-RPKM: I assume you have just one RPKM that is in principle the 'expression value' of all IG regions.

            Given the above stuff is correct:

            I don't think the approach makes a lot of sense - I would assume that you introduce more error than you remove. The method makes one assumption that is most probably never close to reality (in your case too - otherwise you would not get negative RPKMs):
            "the sequence coverage caused by genomic contamination is uniform over the whole genome"

            I think that - in case you would like to do something in this way - it would be better to change the assumption towards:
            "the sequence coverage caused by genomic contamination is uniform at certain (small) stretches of the genome and may be different between the stretches"

            This means that you would take, for example, 1 kb up- and downstream of a locus to calculate the IG-RPKM that you subtract from the RPKM of exactly this locus: RPKM(locus) - RPKM(flanking region of locus)
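            This locus-local variant can be sketched like so, assuming 1 kb flanks on each side (all numbers are invented):

```python
# Locus-local background subtraction: estimate the gDNA background from
# a locus's own flanking regions instead of one genome-wide value.
# All numbers below are hypothetical.

def local_corrected_rpkm(locus_rpkm, flank_reads, flank_bp, lib_reads):
    """Subtract the RPKM of the locus's flanks from the locus RPKM."""
    flank_rpkm = flank_reads / (flank_bp / 1e3) / (lib_reads / 1e6)
    return max(locus_rpkm - flank_rpkm, 0.0)

# 1 kb up- and downstream -> 2,000 bp of flank carrying 80 reads
corrected = local_corrected_rpkm(12.0, flank_reads=80,
                                 flank_bp=2_000, lib_reads=10_000_000)
```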

            (Note that the second assumption may also be quite far from reality - the chromatin is very diverse at regions that are transcribed, likely again causing non-uniformity. In addition, the whole thing depends strongly on the protocols used to prepare the samples.)

            Hm - no - I don't have any better idea at the moment
            ... and I guess that there is no 'generally good' method to do this. It will depend on the type of data you have (-> lab procedures)

            Comment


            • #7
              Hi schmima,

              Thank you for commenting on this strategy. What you understood is exactly what I did. Bedtools was used only to calculate the IG-RPKM (which could be calculated another way).

              However, I didn't understand your point about why the coverage caused by genomic contamination is not uniform.
              I won't argue that the coverage is uniform, but it could be approximated by a Poisson distribution (Lander-Waterman theory for WGS genome sequencing projects). I know that sequencing technologies have several biases (for example, GC bias), but the Poisson assumption is still a good approximation in most cases. So, if the genomic contamination is random, the mean of this distribution would be a good estimate of the level of genomic contamination.
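              For reference, a small sketch of the Lander-Waterman arithmetic being invoked here: per-base depth is approximately Poisson with mean c = N*L/G. The read count N is invented; the 84 bp read length comes from the original post, and G ~ 2.7 Gb is my assumed bovine genome size:

```python
# Lander-Waterman sketch: with N reads of length L on a genome of size
# G, depth at a position is ~Poisson(c) with c = N * L / G, and the
# expected fraction of uncovered positions is exp(-c).
# N is invented; G ~ 2.7 Gb is an assumed bovine genome size.
import math

N = 1_000_000            # hypothetical number of gDNA-derived reads
L = 84                   # read length from the original post
G = 2_700_000_000        # assumed bovine genome size (bp)

c = N * L / G                    # mean per-base coverage
frac_uncovered = math.exp(-c)    # expected fraction with zero coverage
```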

              Because of this, I believed that genomic contamination followed the same distribution that is typical of WGS genome sequencing. But I'm only a dry-lab guy, and of course I'm not sure.
              From the wet-lab point of view, do you think there is some major factor that makes the genomic contamination distribution different from the WGS sequencing distribution? Why would coverage depend on the locus?

              Comment


              • #8
                Hi Glacerda

                From the wet-lab point of view, do you think there is some major factor that makes the genomic contamination distribution different from the WGS sequencing distribution? Why would coverage depend on the locus?
                Hm - I'm sorry, but I don't know the WGS protocols - so a rather generic and quick answer: DNA is packed into chromatin (you don't see a lot of naked DNA within a cell). Chromatin is a collection of various proteins that interact with the DNA - in this way they can also "protect" the DNA from experimental procedures (e.g. make certain regions more resistant to degradation). In a cell, chromatin structure and organization influences several DNA-related processes (and is influenced by them). An example is transcription. The makeup of the chromatin at a transcribed locus is normally quite different from that at an untranscribed locus (it also depends on how "untranscribed" the locus is - e.g. temporarily off versus silent for long/ever). This may cause a difference in DNA stability (...?) during the experiment and finally in your data.

                I won't argue that the coverage is uniform, but it could be approximated by a Poisson distribution (Lander-Waterman theory for WGS genome sequencing projects). I know that sequencing technologies have several biases (for example, GC bias), but the Poisson assumption is still a good approximation in most cases. So, if the genomic contamination is random, the mean of this distribution would be a good estimate of the level of genomic contamination.
                I would agree with this (and add that besides the mean there is also the variance - for a Poisson distribution, expected value (mean) = variance). Anyway, you're most probably not going to be punished if you do your analysis this way.

                But personally I would ask myself: does the assumption fit my data?
                (True in all cases - one should not just start calculating and testing before making sure the assumptions hold.)
                In your case it would not be too big a piece of work (I guess so at least - but I don't know the theory in detail). I would try to have a look at the intergenic RPKM distribution - calculate the RPKM for every region between two genes and look at the distribution, mean and variance (quick and dirty - not necessarily correct in the sense of the theory). If it looks good, you could use the strategy you mentioned first. Otherwise? If the assumption does not hold, you should not use it (at least not without commenting on it/explaining why you used it anyway).
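                The mean/variance check suggested here can be sketched as an index-of-dispersion calculation over per-region intergenic counts (the toy counts below are invented):

```python
# Index of dispersion (variance/mean) for per-region intergenic read
# counts: ~1 is consistent with Poisson coverage, while values much
# larger than 1 indicate patchy contamination, arguing against a flat
# subtraction. Toy data only.
import statistics

def dispersion_index(counts):
    m = statistics.mean(counts)
    v = statistics.variance(counts)  # sample variance (n - 1 denominator)
    return v / m

overdispersed = dispersion_index([0, 6, 0, 6])   # patchy coverage example
```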

                Comment


                • #9
                  This may be off topic at this point, but if you did a ribosome depletion based RNAseq run rather than polyA selection based, you can see a lot of intronic reads that appear to arise from lariat RNA. Apparently some lariats are stable enough or abundant enough to stick around in the cell and be picked up by deep sequencing methods.

                  It would be unlikely that you'd see this in PolyA+ based sequencing, but it bears mentioning for other people who are tangling with this issue.

                  Comment


                  • #10
                    Depending on what you wish to do with this data, you could just abandon RPKM altogether. Summarize your count data as reads on exons or reads on genes (i.e. ignore the intergenic mapped reads) and use the raw feature counts for further analysis, or use an alternate normalization scheme. RPKM is just one metric; there are numerous (and arguably better) alternatives for normalizing RNA-seq data.
                    Michael Black, Ph.D.
                    ScitoVation LLC. RTP, N.C.
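                    One such alternative is the median-of-ratios size-factor scheme (the idea behind DESeq's normalization), sketched here on invented gene-level counts; intergenic reads simply never enter the count matrix:

```python
# Median-of-ratios normalization sketch (the DESeq size-factor idea):
# build a pseudo-reference as the per-gene geometric mean across
# samples, then take each sample's median count/reference ratio as its
# size factor. Sample names, gene names and counts are invented.
import math
import statistics

def size_factors(counts):
    """counts: {sample: {gene: raw count}}; all counts > 0 here."""
    genes = list(next(iter(counts.values())))
    # per-gene geometric mean across samples (pseudo-reference sample)
    ref = {g: math.exp(statistics.mean(math.log(counts[s][g])
                                       for s in counts))
           for g in genes}
    return {s: statistics.median(counts[s][g] / ref[g] for g in genes)
            for s in counts}

toy = {"s1": {"a": 100, "b": 200}, "s2": {"a": 200, "b": 400}}
sf = size_factors(toy)
```

                    In this toy example s2 is sequenced twice as deeply, so its size factor comes out twice s1's; dividing the raw counts by these factors makes the samples comparable without any gene-length term.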

                    Comment


                    • #11
                      Originally posted by wjeck
                      This may be off topic at this point, but if you did a ribosome depletion based RNAseq run rather than polyA selection based, you can see a lot of intronic reads that appear to arise from lariat RNA. Apparently some lariats are stable enough or abundant enough to stick around in the cell and be picked up by deep sequencing methods.

                      It would be unlikely that you'd see this in PolyA+ based sequencing, but it bears mentioning for other people who are tangling with this issue.
                      Just to confirm: we saw intergenic reads when we used the NuGEN Ovation protocol. There is no specific polyA+ selection in this protocol (although a dT primer is used in conjunction with some random priming to make cDNA - albeit random priming without sequences that would bind to rRNA). We put this down to random priming of genomic DNA contamination (no DNase treatment).
                      The exact same total RNA samples were also put through the Illumina TruSeq protocol, which has a specific polyA+ enrichment step. That protocol produced pretty much everything mapping back to exonic regions (where we could map), as expected.

                      Comment
