SEQanswers

10-06-2010, 06:25 AM   #1
Daniel
Junior Member
Location: Detroit, MI
Join Date: Jul 2010
Posts: 7

Nonuniformity of reads across transcript length

Hello,
I have been looking at the alignment of RNA-seq reads (Illumina) from a library that preserved both polyA+ and polyA- transcripts. As expected, a majority (~80%) of the reads appear to come from rRNA (18S, 28S) fragments. When I map these reads to the rRNA sequences (18S is around 1,800 bp, 28S around 5,500 bp), I obtain an extremely uneven distribution of reads. The unevenness takes the form of some relatively large regions with very few reads compared to other regions with many. Additionally, in terms of the exact mapping, even in regions with large numbers of reads, the reads are not evenly distributed (or even approximately Poisson distributed); rather, many reads pile up at a specific bp site, which might have 10X the number of alignments of a neighboring bp.
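One rough way to put a number on this kind of pile-up is to count read starts per position and compare the variance of those counts to their mean. For Poisson-like counts the ratio should be close to 1, while strong pile-ups push it well above 1. The sketch below is only an illustration on made-up toy data; the function names `start_counts` and `dispersion_index` are hypothetical, and real start positions would have to be extracted from the alignment file first.

```python
def start_counts(starts, length):
    """Count how many reads start at each 0-based position of a reference."""
    counts = [0] * length
    for s in starts:
        if 0 <= s < length:
            counts[s] += 1
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio of the per-position counts.

    Roughly 1 for Poisson-like counts, much larger than 1 for pile-ups.
    """
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return 0.0
    var = sum((c - mean) ** 2 for c in counts) / n
    return var / mean


# Toy data (made up): 10 positions, most reads start at position 3.
toy_starts = [3] * 20 + [0, 1, 2, 4, 5, 6, 7, 8, 9, 9]
toy_counts = start_counts(toy_starts, 10)
print(toy_counts)
print(dispersion_index(toy_counts))
```

A single summary ratio hides where the peaks are, so it is probably best used alongside a plot of the per-position counts.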

The overall unevenness I can perhaps understand (degradation?), but the more local drastic peaks and valleys I find more difficult to explain. Some possibilities appear to be sequencing bias (GC bias), or differential PCR amplification. Any ideas from users with more experience than myself would be greatly appreciated.

Also, if anyone is aware of any publicly available human sequencing data in which the polyA- fraction has been retained, which I could look at for comparison, this would be very helpful.

Thanks for any ideas.
10-06-2010, 09:24 AM   #2
steven
Senior Member
Location: Southern France
Join Date: Aug 2009
Posts: 269

Mapping "deserts" can be due to non-unique k-mers in the genomic sequence, where no read can be unambiguously aligned.
As for the unevenness of coverage, I would say the cause could be amplification bias (too many PCR cycles) or fragmentation/shearing bias (cleavage "hotspots", protected regions). Does that make sense?
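The non-unique k-mer explanation can be checked directly on a short reference such as an rRNA sequence. The sketch below lists the start positions of every k-mer that occurs more than once; it is a simplified illustration that assumes exact matches on the forward strand only, and `repeated_kmer_positions` is a hypothetical helper name, not part of any aligner.

```python
from collections import defaultdict


def repeated_kmer_positions(seq, k):
    """Return sorted 0-based start positions of k-mers occurring more than once."""
    seen = defaultdict(list)
    for i in range(len(seq) - k + 1):
        seen[seq[i:i + k]].append(i)
    repeated = [
        p
        for positions in seen.values()
        if len(positions) > 1
        for p in positions
    ]
    return sorted(repeated)


# Toy reference (made up): the 4-mer ACGT occurs at positions 0 and 4.
toy_ref = "ACGTACGTTTGCA"
print(repeated_kmer_positions(toy_ref, 4))
```

Setting `k` to the read length gives a rough idea of where a read could not be placed uniquely within that one sequence; it says nothing about matches elsewhere in the genome.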

Also, you may find this paper interesting, it mentions coverage biases (among others).
10-06-2010, 09:34 AM   #3
mrawlins
Member
Location: Retirement - Not working with bioinformatics anymore.
Join Date: Apr 2010
Posts: 63

It seems that some of the non-uniformity or uneven coverage comes from library preparation. During the cDNA library prep there are a number of factors that can contribute to non-uniformity. Shearing by enzymatic cleavage or sonication tends to cause breaks in some sequences more frequently than others. This is less of a problem with chemical cleavage (and indeed we observe less extreme non-uniformity when we use chemical cleavage). Any preparation that uses random k-mers to attach an adapter sequence to the library will have some non-uniformity introduced in that step, as not all k-mers have the same melting temperature (e.g. GCCGCC has a different melting temperature than ACCTAA, despite both being 6-mers). There are probably other factors that influence the non-uniformity.
Ambiguous alignments can cause some deserts, but in my experience do not account for all the non-uniformity.
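To make the hexamer comparison concrete, here is a small sketch using the Wallace rule of thumb, Tm = 2*(A+T) + 4*(G+C) in degrees C. The rule is intended for short oligos somewhat longer than 6 nt, so the absolute values for hexamers should not be trusted; the point is only the relative difference between a GC-rich and an AT-rich 6-mer. The function name `wallace_tm` is just an illustrative choice.

```python
def wallace_tm(oligo):
    """Rough melting temperature (deg C) by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


# The two example hexamers from the post above.
for hexamer in ("GCCGCC", "ACCTAA"):
    print(hexamer, wallace_tm(hexamer))
```

Under this approximation GCCGCC comes out at 24 and ACCTAA at 16, which is consistent with GC-rich primers annealing more stably than AT-rich ones.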

We have observed, however, that the pattern of non-uniformity is very highly conserved for SOLiD data, even across different conditions. This leads us to believe that the non-uniformity does not influence our specific RNA-Seq experimental design. That may not be true of Illumina reads and all experimental designs, however, so take that with a grain of salt.
10-06-2010, 09:39 AM   #4
steven
Senior Member
Location: Southern France
Join Date: Aug 2009
Posts: 269

Just in case, you may find some non-polyA-selected RNA-seq data in the UCSC Table Browser: assembly=hg18, group=expression, track=CSHL Long RNA-seq.
The tables with a name that ends with "CellTotal" are from whole cellular extracts, not just from the cytosol, so I guess they could contain polyA- transcripts. May be worth a try.
10-06-2010, 11:36 AM   #5
adarob
Member
Location: Berkeley, CA
Join Date: Jul 2010
Posts: 71

Daniel,

The sequence-specific bias correction method we've implemented in Cufflinks 0.9.x takes some of these issues into account when estimating abundances. There are some details on the method on the "How It Works" page.

-Adam
10-06-2010, 05:28 PM   #6
Daniel
Junior Member
Location: Detroit, MI
Join Date: Jul 2010
Posts: 7

I thank everyone for their helpful ideas/suggestions/references. As for the mapping "deserts" being due to repeats in genomic regions, this is something I have already examined, and it does not appear to be an issue here. The fragmentation bias certainly seems to be a possibility; I am just surprised by how much difference a 1 bp shift seems to make in the number of aligned reads (i.e. the number of reads starting at site x compared to the number starting at site x+1). See the article http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2532726/, which discusses some of these issues for Solexa reads and does not find that the start sequence makes a significant difference in read counts. I will try to examine all these possibilities in greater detail.

Daniel
10-06-2010, 05:55 PM   #7
adarob
Member
Location: Berkeley, CA
Join Date: Jul 2010
Posts: 71

The paper you mention refers to DNA sequencing. In RNA sequencing there is an additional step where the single-stranded RNA is reverse transcribed and made into double-stranded cDNA. A substantial sequence-specific bias is introduced at this step, especially when random hexamer priming is used. See nar.oxfordjournals.org/cgi/content/abstract/38/12/e131 for more details. We have since found similar biases in numerous other protocols and will be publishing a paper on our correction method shortly.
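A simple way to look for this kind of priming bias in your own data is to tabulate the nucleotide frequencies at each of the first few positions of the reads and compare them with positions further into the read. The sketch below is a minimal illustration on made-up reads; it is not the method from the linked paper or from Cufflinks, and `start_base_frequencies` is a hypothetical helper name.

```python
def start_base_frequencies(reads, n_positions=6):
    """Fraction of A/C/G/T at each of the first n_positions of the reads.

    Bases other than A/C/G/T (e.g. N) are ignored.
    """
    freqs = []
    for pos in range(n_positions):
        counts = {"A": 0, "C": 0, "G": 0, "T": 0}
        for read in reads:
            if pos < len(read) and read[pos] in counts:
                counts[read[pos]] += 1
        total = sum(counts.values())
        freqs.append(
            {base: (c / total if total else 0.0) for base, c in counts.items()}
        )
    return freqs


# Toy reads (made up): three of the four reads start with G.
toy_reads = ["GGTACA", "GCTTAC", "GGAACT", "ACGTTG"]
toy_freqs = start_base_frequencies(toy_reads, 3)
for pos, f in enumerate(toy_freqs):
    print(pos, f)
```

If the first several positions show frequencies that differ clearly from the rest of the read, that would be consistent with a sequence-specific bias at read starts.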
10-07-2010, 06:06 AM   #8
Daniel
Junior Member
Location: Detroit, MI
Join Date: Jul 2010
Posts: 7

Yes, I see clearly from this article the bias in RNA-seq (as opposed to DNA-seq) that you are referring to. It appears that the specific bias they find at the start sites of Illumina reads corresponds very closely to at least some of the unevenness we are seeing in our read alignments. I will certainly watch out for your correction method when it comes out.
Tags
nonuniformity, rrna reads

All times are GMT -8.