**schmima** · 08-23-2010, 10:12 PM

Has anyone else come across this problem before?
Could it be caused by problems in tissue collection or sample prep?

Hm - not exactly the same problem... But some ideas.
I guess you isolated RNA and performed a DNAse treatment, and then went through the sequencing protocol without amplification?
I have some samples at hand where RNA was isolated, DNAse treated and then amplified using either random or oligo-dT primers. I saw that with random priming (no mRNA selection) there were a (hell of a) lot of reads matching multiple sites and intergenic regions. My conclusion was that this was mainly due to incomplete DNA digestion (which was also obvious from the coverage pattern). So if you use your RNA directly after DNAse treatment you may get quite some differences, depending on how well the DNA digest worked... (and I also believe that this can make quite a big difference).

Alternatively could there be a biological explanation for this?

In theory I could think about biological meanings (ncRNA...). Especially in developmental timecourses this might happen (BTW - I'm working on plants, so another person may be more precise here). So - I guess it would depend on how exact you were in collecting the same/different timepoints... However - I personally would be rather careful and first think of the DNA digestion above... Maybe you could check the length and location of the covered intergenic regions (?) to see if they are more likely to be DNA leftovers or RNAs...

**malachig** · 08-24-2010, 04:12 PM

Sequences aligning to intergenic regions most likely correspond to genomic DNA (gDNA) contamination but could also represent 'stochastic transcription'. I would agree with 'schmima' that incomplete DNAse treatment is a possible cause. We have seen a wide range of intergenic and intronic signal levels across libraries. Sometimes the library creation protocol differed between libraries and this might explain the difference; in other cases the library creation was the same but the quality of input material was different. Our most common library preparation involves total RNA isolation, DNAse treatment, polyA+ selection, cDNA synthesis, sonication, etc. The steps that matter most for keeping gDNA contamination low are probably the early ones. The total RNA isolation procedure itself can result in varying amounts of genomic DNA contamination (did you use a column or non-column based method?). Similarly, the DNAse treatment can vary in efficiency (on column or in solution? Buffer conditions? etc.). If you select poly(A)+ RNA (and it works well) you should also reduce the presence of gDNA (except near polyA stretches of the genome). gDNA contamination could also be introduced during library construction and have nothing to do with your sample. Furthermore, we have found that regardless of how the library construction goes, you will tend to get more intergenic signal if the amount of input RNA is very low. As 'schmima' suggests, the distribution pattern of intergenic reads can hint at their source.

As for dealing with the data you have: one option is to calculate your RPKM-like values by considering only the reads mapped within exonic portions of the genome. That is, the number of mapped reads used to normalize across libraries does not include reads mapping outside of exons... Or you can calculate RPKM as normal and apply a normalization afterwards (quantile normalization, for example)...
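A minimal sketch of the first option, assuming per-gene exonic read counts and exonic lengths are already available as plain dictionaries (the names `exonic_rpkm`, `gene_counts` and `gene_lengths` are hypothetical, not from any particular tool). The only change from ordinary RPKM is that the library-size denominator is the sum of exonic reads, so intergenic and intronic reads do not inflate it.

```python
def rpkm(reads, length_bp, library_size):
    """Reads per kilobase of feature per million reads in the library."""
    return reads * 1e9 / (length_bp * library_size)


def exonic_rpkm(gene_counts, gene_lengths):
    """RPKM-like values normalized by exonic reads only.

    gene_counts:  {gene_id: reads mapped to the gene's exons}
    gene_lengths: {gene_id: summed exon length in bp}
    """
    # Library size = reads inside exons only (assumption of this sketch);
    # reads mapping elsewhere are simply not counted.
    exonic_total = sum(gene_counts.values())
    return {
        gene: rpkm(count, gene_lengths[gene], exonic_total)
        for gene, count in gene_counts.items()
    }
```

Two libraries with very different intergenic background but the same exonic counts would get identical values from `exonic_rpkm`, which is the point of this normalization.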

**pmcget** · 09-21-2010, 02:42 AM

Thanks for the suggestions - I didn't prepare the samples myself.
Having spoken with the wetlab personnel I have found that the full RNA sample preparation protocol wasn't followed for some of the timepoints due to small amounts of starting material - so DNAse treatment did not occur for all samples.
The strong suspicion is therefore that what we are seeing is indeed gDNA contamination of these samples.
Calculating RPKM values using the total number of reads mapped to exonic portions seems to work in terms of making the samples comparable.

**glacerda** · 10-20-2010, 10:40 AM

I have the same problem in some libraries. As I have not found any published, well-established method to deal with it, I'm exploring some strategies.

For example, I can calculate the RPKM of the intergenic regions and subtract that value from each gene's RPKM. To calculate the RPKM of intergenic regions, I prepared two GFF files, one containing the positions of all predicted genes (plus 500bp in both 5' and 3' directions) and another GFF containing the whole contigs (from 1 to contig size). Then I use bedtools to subtract the first GFF (genes) from the second (whole contigs), resulting in a BED file containing intergenic regions (intergenic.bed). I tried to remove rRNA regions from this BED file manually.

Next I use coverageBed to count how many reads map to each intergenic region, and sum them all to get the total number of reads mapped to intergenic regions. I use this number, along with the sum of the sizes of the intergenic regions, to calculate the intergenic RPKM. Finally, I subtract this value from every gene's RPKM (and assign zero to any RPKM that becomes negative). I typically get values from 2 to 7, depending on the library.
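The arithmetic of this strategy can be sketched as follows, assuming the per-region read counts and region sizes have already been extracted (for instance from the coverageBed output). The function names and argument layout are hypothetical; this illustrates the calculation only and is not a validated correction method.

```python
def rpkm(reads, length_bp, total_mapped):
    """Reads per kilobase per million mapped reads."""
    return reads * 1e9 / (length_bp * total_mapped)


def background_corrected(gene_rpkm, ig_reads, ig_lengths, total_mapped):
    """Subtract one genome-wide intergenic RPKM from every gene's RPKM.

    gene_rpkm:    {gene_id: RPKM}
    ig_reads:     read counts, one per intergenic region
    ig_lengths:   region sizes in bp, same order as ig_reads
    total_mapped: total mapped reads in the library
    """
    # One pooled background value: all intergenic reads over all intergenic bp.
    ig_rpkm = rpkm(sum(ig_reads), sum(ig_lengths), total_mapped)
    # Clamp at zero so genes below the background do not go negative.
    corrected = {g: max(0.0, v - ig_rpkm) for g, v in gene_rpkm.items()}
    return ig_rpkm, corrected
```

Note that the single pooled value implicitly treats the background as uniform across the genome.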

I would like to hear the comments from other seqanswers users on this strategy. Do you think it's reasonable?

**schmima** · 10-20-2010, 11:09 PM

@glacerda

I'm not sure if I fully understood what you mean (never used any bed (software^^) tools to format/change/combine GFF files or to calculate reads).
As far as I got it:
1. get intergenic (IG) regions
2. sum up number of reads in IG regions
3. calculate IG-RPKM with sum(IGreads) and length(allIGregions)
4. subtract this IG-RPKM from the genic RPKMs
ad IG-RPKM: I assume you just have one RPKM that is in principle the 'expression value' of all IG regions.

Given the above stuff is correct:

I don't think the approach makes a lot of sense - I would assume that you will introduce more error than you get rid of. The method makes one assumption that is most probably never close to reality (also in your case - otherwise you would not get negative RPKMs):
"the sequence coverage caused by genomic contamination is uniform over the whole genome"

I think that - in case you would like to do something in this way - it would be better to change the assumption towards:
"the sequence coverage caused by genomic contamination is uniform at certain (small) stretches of the genome and may be different between the stretches"

This means you would take, for example, 1 kb up- and downstream of a locus to calculate the IG-RPKM that you subtract from the RPKM of exactly this locus: RPKM(locus) - RPKM(flanking region of locus)

(note that the second assumption may also be quite far from reality - the chromatin is very diverse at regions that are transcribed, which likely causes non-uniformity again. In addition: the whole thing depends heavily on the protocols you used to get the samples)
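A minimal sketch of the per-locus variant, RPKM(locus) - RPKM(flanking region of locus), assuming the reads falling in the flanks have been counted separately. The function name, the clamping at zero, and the idea of passing the flank length explicitly are assumptions of this sketch; in practice the flanks would also need to exclude neighbouring genes.

```python
def rpkm(reads, length_bp, total_mapped):
    """Reads per kilobase per million mapped reads."""
    return reads * 1e9 / (length_bp * total_mapped)


def locally_corrected_rpkm(locus_reads, locus_len, flank_reads, flank_len,
                           total_mapped):
    """RPKM of a locus minus the RPKM of its own flanking region.

    flank_reads / flank_len describe the combined up- and downstream
    flanks (e.g. 1 kb each side gives flank_len = 2000).
    """
    locus = rpkm(locus_reads, locus_len, total_mapped)
    flank = rpkm(flank_reads, flank_len, total_mapped)
    # Clamp at zero, as in the genome-wide version of the subtraction.
    return max(0.0, locus - flank)
```

With short flanks the background estimate rests on few reads, so it will be noisy for lowly covered libraries.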

Hm - no - I don't have any better idea at the moment

... and I guess that there is no 'generally good' method to do this. It will depend on the type of data you have (-> lab procedures)

**glacerda** · 10-21-2010, 03:11 PM

Hi schmima,

Thank you for commenting on this strategy. What you understood is exactly what I did. Bedtools was used only to calculate the IG-RPKM (which could be calculated another way).

However, I didn't understand your point about why the coverage caused by genomic contamination is not uniform.
I won't argue that the coverage is uniform, but it could be approximated by Poisson (Lander-Waterman theory for WGS genome sequencing projects). I know that sequencing technologies have several biases (for example, GC bias), but the Poisson assumption still is a good approximation for most cases. So, if the genomic contamination is random, the mean of this distribution would be a good guess to the level of genomic contamination.

Because of this, I believed that genomic contamination followed the same distribution typical of WGS genome sequencing. But I'm only a dry lab guy, and of course I'm not sure.
From the wet-lab point of view, do you think there is some major point that makes the genomic contamination distribution different from the WGS genome sequencing distribution? Why would coverage be dependent on the locus?

**schmima** · 10-24-2010, 09:57 PM

Hi Glacerda

From the wet-lab point of view, do you think there is some major point that makes the genomic contamination distribution different from the WGS genome sequencing distribution? Why would coverage be dependent on the locus?

Hm - I'm sorry but I don't know the WGS protocols - so a rather generic and quick answer: DNA is packed into chromatin (you don't see a lot of naked DNA within a cell). Chromatin is a collection of various proteins that interact with the DNA - in this way they can also "protect" the DNA from experimental procedures (eg make certain regions more stable to degradation). In a cell, chromatin structure and organization influences several DNA-related processes (and is influenced by them). An example is transcription. The makeup of the chromatin at a transcribed locus is normally quite different from the one at an untranscribed locus (also depends on how "untranscribed" the locus is - e.g. temporary off versus silent for long/ever). This may cause a difference in DNA stability (...?) during the experiment and finally in your data.

I won't argue that the coverage is uniform, but it could be approximated by Poisson (Lander-Waterman theory for WGS genome sequencing projects). I know that sequencing technologies have several biases (for example, GC bias), but the Poisson assumption still is a good approximation for most cases. So, if the genomic contamination is random, the mean of this distribution would be a good guess to the level of genomic contamination.

I would agree on this (and add that besides the mean there is also the variance - for Poisson, "expected value (mean) = variance"). Anyway: you're most probably not going to be punished if you do your analysis this way.

But personally I would ask myself: Does the assumption fit my data?
(True for all cases - one should not just start calculating and testing before making sure the assumptions hold)
In your case it would not be too big a piece of work (I guess so at least - but I don't know the theory in detail). I would maybe try to have a look at the intergenic RPKM distribution - calculate the RPKM for every region between two genes and have a look at the distribution, mean and variance (quick and dirty - not necessarily correct in the sense of the theory). If it looks fine, you could use the strategy you mentioned first. Otherwise? If the assumption does not hold you should not use it (at least not without commenting on it/explaining why you used it anyway).
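One quick-and-dirty way to look at this is the variance-to-mean ratio of read counts, which should be close to 1 if the counts are Poisson. The sketch below assumes the intergenic space has been cut into windows of equal size, so that the counts are comparable; that windowing, and the function name `dispersion_index`, are assumptions of this sketch and not something described in the thread.

```python
from statistics import mean, variance


def dispersion_index(window_counts):
    """Variance-to-mean ratio of read counts in equal-sized windows.

    Around 1 is what a Poisson model predicts; values well above 1
    indicate overdispersion, i.e. non-uniform background coverage.
    """
    m = mean(window_counts)
    if m == 0:
        raise ValueError("no reads in any window")
    # statistics.variance is the sample variance (n - 1 denominator).
    return variance(window_counts) / m
```

A ratio far above 1 would suggest that a single genome-wide background value is a poor description of the data.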

**wjeck** · 11-21-2012, 10:40 AM

This may be off topic at this point, but if you did a ribosome depletion based RNAseq run rather than polyA selection based, you can see a lot of intronic reads that appear to arise from lariat RNA. Apparently some lariats are stable enough or abundant enough to stick around in the cell and be picked up by deep sequencing methods.

It would be unlikely that you'd see this in PolyA+ based sequencing, but it bears mentioning for other people who are tangling with this issue.

**mbblack** · 11-27-2012, 05:35 AM

Depending on what you wish to do with this data, you could just abandon RPKM altogether. Just summarize your count data as reads on exons or reads on genes (i.e. ignore the intergenic mapped reads) and use raw feature count data for further analysis, or use an alternate normalization scheme. RPKM is just one metric you may use, but there are numerous (and arguably better) alternative metrics for normalizing RNA-Seq data.
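As one illustration of an alternative metric computed from exon/gene counts only, here is a minimal sketch of TPM (transcripts per million). TPM is not named in the post above; it is picked here purely as a familiar example, and the input dictionaries are hypothetical. Intergenic reads never enter the calculation because only feature counts are used.

```python
def tpm(gene_counts, gene_lengths):
    """Transcripts per million from per-gene counts and lengths (bp).

    gene_counts:  {gene_id: reads assigned to the gene}
    gene_lengths: {gene_id: gene (exonic) length in bp}
    """
    # Length-normalized rate for each gene.
    rates = {g: c / gene_lengths[g] for g, c in gene_counts.items()}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads assigned to any gene")
    # Rescale so the values sum to one million within the library.
    return {g: r / total * 1e6 for g, r in rates.items()}
```

For differential expression, count-based tools generally expect the raw counts rather than a length-normalized metric like this one.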

**TonyBrooks** · 11-27-2012, 09:09 AM

Originally posted by wjeck:

This may be off topic at this point, but if you did a ribosome depletion based RNAseq run rather than polyA selection based, you can see a lot of intronic reads that appear to arise from lariat RNA. Apparently some lariats are stable enough or abundant enough to stick around in the cell and be picked up by deep sequencing methods.

It would be unlikely that you'd see this in PolyA+ based sequencing, but it bears mentioning for other people who are tangling with this issue.

Just to confirm, we saw intergenic reads when we used the NuGen Ovation protocol. There is no specific PolyA+ selection in this protocol, although a dT primer is used in conjunction with some random priming to make cDNA (the random primers exclude sequences that would bind to rRNA). We put this down to random priming of genomic DNA contamination (no DNase treatment).
The exact same total RNA samples were also put through the Illumina TruSeq protocol, which has a specific polyA+ enrichment step. That protocol produced pretty much everything mapping back to exonic regions (where we could map), as expected.

Problem with high background in some RNA-seq samples