#1
Member
Location: Dublin, Ireland
Join Date: Nov 2007
Posts: 28
I have been looking at some Illumina RNA-seq samples (84 bp single reads, bovine tissue, 28 samples, 5 timepoints in a developmental timecourse).

In 9 of the samples we are seeing a high rate of reads aligning to intergenic regions. This high background means that the RPKM values for transcripts differ dramatically between the "good" and the "bad" samples. The bad samples do not cluster into a single group: all samples in one timepoint are affected and none in another, with the remaining bad samples scattered among the other 3 groups.

Has anyone else come across this problem before? Could it be caused by problems in tissue collection or sample prep, or might there be a biological explanation? Are there any suggested lab or bioinformatic approaches to dealing with it? We are evaluating normalization techniques to make the samples more comparable - does anyone have thoughts on the validity of this approach?
#2
Member
Location: Zürich
Join Date: Apr 2010
Posts: 56
I guess you isolated RNA and performed a DNase treatment, and afterwards ran the sequencing protocol without amplification?

I have some samples at hand where RNA was isolated, DNase treated and then amplified using either random or oligo-dT primers. In the random-primed libraries (no mRNA selection) there were an awful lot of reads matching multiple sites and intergenic regions. My conclusion was that this was mainly due to incomplete DNA digestion (which was also obvious from the coverage pattern).

So if you use your RNA directly after DNase treatment you may get quite some differences depending on how well the DNA digest worked - and I believe this can make quite a big difference.
#3
Senior Member
Location: WashU
Join Date: Aug 2010
Posts: 117
Sequences aligning to intergenic regions most likely correspond to genomic DNA (gDNA) contamination, but could also represent 'stochastic transcription'. I would agree with 'schmima' that incomplete DNase treatment is a possible cause.

We have seen a wide range of intergenic and intronic signal levels across libraries. Sometimes the library creation protocol differed between libraries, which might explain the difference; in other cases the library creation was the same but the quality of the input material differed. Our most common library preparation involves total RNA isolation, DNase treatment, polyA+ selection, cDNA synthesis, sonication, etc.

The steps that matter most for keeping gDNA contamination low are probably the early ones. The total RNA isolation procedure itself can leave varying amounts of genomic DNA (did you use a column or non-column based method?). Similarly, the DNase treatment can vary in efficiency (on column or in solution? buffer conditions? etc.). If you select polyA+ RNA (and it works well) you should also reduce the presence of gDNA, except near polyA stretches of the genome. gDNA could also be introduced during library construction and have nothing to do with your sample. Furthermore, we have found that regardless of how the library construction goes, you tend to get more intergenic signal when the amount of input RNA is very low. As 'schmima' suggests, the distribution pattern of intergenic reads can hint at their source.
Regarding dealing with the data you already have: one option is to calculate your RPKM-like values by considering only the reads mapped within exonic portions of the genome. That is, the number of mapped reads used to normalize across libraries does not include reads mapping outside of exons. Alternatively, you can calculate RPKM as normal and apply a normalization afterwards (quantile normalization, for example).
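A minimal sketch of the first option, assuming you already have per-gene exonic read counts and summed exon lengths in hand (the input names here are hypothetical, e.g. counts produced by htseq-count or similar):

```python
def exon_restricted_rpkm(exon_counts, exon_lengths):
    """exon_counts: {gene: reads overlapping that gene's exons}
    exon_lengths: {gene: summed exon length in bp}"""
    # Normalize by reads mapped to exons only, not by all mapped reads,
    # so gDNA/intergenic background does not deflate every gene's value.
    total_exonic = sum(exon_counts.values())
    per_million = total_exonic / 1e6
    rpkm = {}
    for gene, count in exon_counts.items():
        kb = exon_lengths[gene] / 1000.0
        rpkm[gene] = count / (kb * per_million) if kb and per_million else 0.0
    return rpkm

# Example (made-up numbers):
# counts  = {"geneA": 1200, "geneB": 300}
# lengths = {"geneA": 2500, "geneB": 800}
# print(exon_restricted_rpkm(counts, lengths))
```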
#4
Member
Location: Dublin, Ireland
Join Date: Nov 2007
Posts: 28
Thanks for the suggestions - I didn't prepare the samples myself.
Having spoken with the wet-lab personnel, I have found that the full RNA sample preparation protocol wasn't followed for some of the timepoints due to small amounts of starting material, so DNase treatment was not performed on all samples. The strong suspicion is therefore that what we are seeing is indeed gDNA contamination of these samples. Calculating RPKM values using only the reads mapped to exonic portions of the genome seems to work in terms of making the samples comparable.
#5
Member
Location: Brazil
Join Date: Aug 2008
Posts: 27
I have the same problem in some libraries. As I have not found any published and well-established method to deal with it, I'm exploring some strategies.

For example, I can calculate the RPKM of the intergenic regions and subtract that value from each gene's RPKM. To calculate the RPKM of the intergenic regions, I prepared two GFF files: one containing the positions of all predicted genes (plus 500 bp in both the 5' and 3' directions) and another containing the whole contigs (from position 1 to the contig size). Then I use bedtools to subtract the gene regions from the whole contigs, resulting in a BED file of intergenic regions (intergenic.bed). I tried to remove rRNA regions from this BED file manually. Next I use coverageBed to count how many reads map to each intergenic region, and sum them all to get the total number of reads mapped to intergenic regions. I use this number, along with the sum of the sizes of the intergenic regions, to calculate the intergenic RPKM. Finally, I subtract this value from each gene's RPKM (and assign zero to any RPKM that becomes negative). I typically get intergenic RPKM values from 2 to 7, depending on the library.

I would like to hear comments from other seqanswers users on this strategy. Do you think it's reasonable?
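A rough sketch of that arithmetic, assuming the bedtools subtract / coverageBed steps have already produced per-region intergenic read counts and lengths (all input names are hypothetical):

```python
def intergenic_rpkm(ig_counts, ig_lengths, total_mapped):
    """One pooled RPKM for all intergenic regions: total intergenic
    reads over total intergenic length, normalized per million mapped reads."""
    reads = sum(ig_counts)
    kb = sum(ig_lengths) / 1000.0
    return reads / (kb * (total_mapped / 1e6))

def subtract_background(gene_rpkm, ig_rpkm):
    # Subtract the intergenic "background" RPKM from every gene,
    # flooring negative values at zero as described above.
    return {g: max(v - ig_rpkm, 0.0) for g, v in gene_rpkm.items()}

# Example (made-up numbers):
# ig = intergenic_rpkm(ig_counts=[50, 10, 120],
#                      ig_lengths=[4000, 900, 15000],
#                      total_mapped=20_000_000)
# adjusted = subtract_background({"geneA": 35.0, "geneB": 3.1}, ig)
```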
#6
Member
Location: Zürich
Join Date: Apr 2010
Posts: 56
@glacerda
I'm not sure if I fully understood what you mean (I have never used any of the bedtools to format/change/combine GFF files or to count reads). As far as I got it:

1. get the intergenic (IG) regions
2. sum up the number of reads in the IG regions
3. calculate an IG-RPKM from sum(IG reads) and length(all IG regions)
4. subtract this IG-RPKM from the genic RPKMs

Regarding the IG-RPKM: I assume you have just one RPKM that is in principle the 'expression value' of all IG regions.

Given the above is correct, I don't think the approach makes a lot of sense - I would assume that you will introduce more error than you remove. The method rests on one assumption that is most probably never close to reality (in your case too - otherwise you should not get negative RPKMs): "the sequence coverage caused by genomic contamination is uniform over the whole genome".

If you do want to do something along these lines, I think it would be better to change the assumption to: "the sequence coverage caused by genomic contamination is uniform over certain (small) stretches of the genome and may differ between stretches". That means you would take, for example, 1 kb up- and downstream of a locus to calculate the IG-RPKM that you subtract from the RPKM of exactly that locus: RPKM(locus) - RPKM(flanking region of locus). (Note that this second assumption may also be quite far from reality - chromatin is very diverse at transcribed regions, which again causes non-uniformity. In addition, the whole thing depends heavily on the protocols used to prepare the samples.)

Hm - no - I don't have any better idea at the moment... and I guess there is no 'generally good' method for this. It will depend on the type of data you have (-> lab procedures).
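A small sketch of that locus-level subtraction, assuming you can already count the reads falling in, say, 1 kb flanks on either side of each locus (input names are hypothetical):

```python
def locus_adjusted_rpkm(locus_rpkm, flank_reads, flank_length_bp, total_mapped):
    """locus_rpkm: the gene's ordinary RPKM
    flank_reads: reads in the upstream + downstream flanks of that locus
    flank_length_bp: total flank length in bp (e.g. 2000 for 1 kb each side)"""
    kb = flank_length_bp / 1000.0
    # Local background estimate: RPKM of the flanking regions of this locus.
    flank_rpkm = flank_reads / (kb * (total_mapped / 1e6))
    # RPKM(locus) - RPKM(flanking region of locus), floored at zero.
    return max(locus_rpkm - flank_rpkm, 0.0)

# Example (made-up numbers):
# locus_adjusted_rpkm(locus_rpkm=12.4, flank_reads=18,
#                     flank_length_bp=2000, total_mapped=20_000_000)
```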
#7
Member
Location: Brazil
Join Date: Aug 2008
Posts: 27
Hi schmima,
Thank you for commenting on this strategy. What you understood is exactly what I did. Bedtools was used only to calculate the IG-RPKM (which could be calculated in other ways).

However, I didn't understand your point about why the coverage caused by genomic contamination is not uniform. I won't argue that the coverage is truly uniform, but it could be approximated by a Poisson distribution (Lander-Waterman theory for WGS genome sequencing projects). I know that sequencing technologies have several biases (GC bias, for example), but the Poisson assumption is still a good approximation in most cases. So, if the genomic contamination is random, the mean of this distribution would be a good guess at the level of genomic contamination. Because of this, I believed that genomic contamination followed the same distribution that is typical of WGS genome sequencing.

But I'm only a dry-lab guy, so of course I'm not sure. From the wet-lab point of view, do you think there is some major factor that makes the distribution of genomic contamination different from a WGS sequencing distribution? Why would coverage depend on the locus?
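For reference, a tiny sketch of the Lander-Waterman expectation being invoked here, with made-up numbers (the bovine genome size is only approximate):

```python
import math

def expected_coverage(n_reads, read_len, genome_size):
    # Lander-Waterman: expected per-base coverage c = N * L / G
    return n_reads * read_len / genome_size

def poisson_pmf(c, k):
    # Under the Poisson model, P(a base is covered exactly k times)
    return math.exp(-c) * c ** k / math.factorial(k)

# Example: 1 million contaminating gDNA reads of 84 bp on a ~2.7 Gb genome
# give a very shallow background of about 0.03x coverage.
c = expected_coverage(1_000_000, 84, 2_700_000_000)   # ~0.031
p_uncovered = poisson_pmf(c, 0)                        # ~0.97, most bases untouched
```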
#8
Member
Location: Zürich
Join Date: Apr 2010
Posts: 56
Hi Glacerda
But personally I would ask myself: does the assumption fit my data? (That is true in general - one should not just start calculating and testing before making sure the assumptions hold.) In your case it would not be too big a piece of work (I guess so, at least - I don't know the theory in detail). I would have a look at the intergenic RPKM distribution: calculate an RPKM for every region between two genes and look at the distribution, mean and variance (quick and dirty - not necessarily correct in the sense of the theory). If it looks good, you could use the strategy you mentioned first. Otherwise? If the assumption does not hold you should not use it (at least not without commenting on it and explaining why you used it anyway).
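A quick-and-dirty version of that check, assuming you have binned the intergenic space into equally sized windows and counted reads per window (the input names are hypothetical); for a Poisson-like background the variance should sit close to the mean:

```python
import statistics

def intergenic_dispersion(window_counts):
    """window_counts: reads per fixed-size intergenic window."""
    mean = statistics.mean(window_counts)
    var = statistics.pvariance(window_counts)
    # A variance/mean ratio far above 1 (overdispersion) suggests the
    # background is not uniform, so a single pooled IG-RPKM will
    # over- or under-correct individual genes.
    ratio = var / mean if mean else float("inf")
    return mean, var, ratio

# Example:
# mean, var, ratio = intergenic_dispersion([0, 1, 0, 3, 2, 0, 14, 1, 0, 2])
```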
#9
Member
Location: Chapel Hill, NC
Join Date: Mar 2009
Posts: 39
This may be off topic at this point, but if you did a ribosome-depletion-based RNA-seq run rather than a polyA-selection-based one, you can see a lot of intronic reads that appear to arise from lariat RNA. Apparently some lariats are stable enough, or abundant enough, to stick around in the cell and be picked up by deep sequencing methods.
It would be unlikely that you'd see this in polyA+ based sequencing, but it bears mentioning for other people who are tangling with this issue.
#10
Senior Member
Location: Research Triangle Park, NC
Join Date: Aug 2009
Posts: 245
Depending on what you wish to do with this data, you could just abandon RPKM altogether. Summarize your count data as reads on exons or reads on genes (i.e. ignore the intergenic mapped reads) and use the raw feature counts for further analysis, or use an alternate normalization scheme. RPKM is just one metric; there are numerous (and arguably better) alternatives for normalizing RNA-Seq data.
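As one example of such an alternate scheme, here is a minimal sketch of median-of-ratios size factors (the approach popularized by DESeq) computed on raw exonic counts; the count matrix and numbers are hypothetical:

```python
import numpy as np

def median_of_ratios_factors(counts):
    """counts: genes x samples matrix of raw exonic read counts."""
    counts = np.asarray(counts, dtype=float)
    # Use only genes with a nonzero count in every sample.
    keep = np.all(counts > 0, axis=1)
    logs = np.log(counts[keep])
    # Per-gene geometric mean across samples (a pseudo-reference sample).
    log_geo_mean = logs.mean(axis=1)
    # Per-sample median ratio to the pseudo-reference = size factor.
    return np.exp(np.median(logs - log_geo_mean[:, None], axis=0))

# Example:
# counts = [[1200, 2400], [300, 610], [50, 95]]
# sf = median_of_ratios_factors(counts)          # roughly [0.71, 1.41]
# normalized = np.asarray(counts, float) / sf    # divide each column by its factor
```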
__________________
Michael Black, Ph.D.
ScitoVation LLC, RTP, N.C.
#11
Senior Member
Location: London
Join Date: Jun 2009
Posts: 298
The exact same total RNA samples were also put through the Illumina TruSeq protocol, which has a specific polyA+ enrichment step. That protocol produced libraries in which pretty much everything mapped back to exonic regions (where we could map), as expected.