SEQanswers

Old 06-15-2010, 07:46 AM   #1
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30
RNA-seq read distribution

Hi,

I wonder how reads mapped to the genome (contiguously or to junctions) are distributed.

In my own data, a surprisingly high fraction of reads map to introns (over 30% of the reads mapped to known genes). There could be many explanations:

1) pre-mRNA
2) DNA contamination, which I would expect to be relatively uniform across all genes, but that is not what I see. I also found that over 15% of mapped reads were to the mitochondrial genome. It does contain genes (especially rRNA and tRNA), so not all of those reads need be from DNA, but I am not sure what this number really means.
3) erroneous mapping
4) novel exons
5) splicing that retains introns
etc.

Of course, introns are much longer, so if you count reads per unit length, the fraction goes down.
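
To make the per-length point concrete, here is a minimal Python sketch. The 19:1 intron-to-exon length ratio is an illustrative assumption, not a measured value, and `per_length_intronic_share` is a hypothetical helper name.

```python
def per_length_intronic_share(intronic_read_frac, intron_len, exon_len):
    """Share of read density (reads per unit length) that falls in introns.

    intronic_read_frac: fraction of gene-mapped reads that are intronic.
    intron_len, exon_len: total intronic and exonic lengths, in the same units.
    """
    if not 0.0 <= intronic_read_frac <= 1.0:
        raise ValueError("intronic_read_frac must be between 0 and 1")
    if intron_len <= 0 or exon_len <= 0:
        raise ValueError("lengths must be positive")
    # Reads per unit length for each feature class.
    intron_density = intronic_read_frac / intron_len
    exon_density = (1.0 - intronic_read_frac) / exon_len
    return intron_density / (intron_density + exon_density)


# Assumed example: 30% intronic reads, introns 19x longer than exons in total.
share = per_length_intronic_share(0.30, intron_len=19.0, exon_len=1.0)
print(f"{share:.1%}")
```

With these assumed lengths, 30% of reads corresponds to only about 2% of the per-length signal.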

There is also conflicting evidence in the literature:

The Mortazavi (2008) paper reported 4% intronic reads and 93% exonic, while Marioni (2008) had a number similar to what I have seen (32% of reads mapped to genes are intronic).

I am wondering what people on this forum have seen in their experience.

Thanks!

Wen
Old 06-16-2010, 04:52 PM   #2
townway
Member
 
Location: Rockville

Join Date: May 2009
Posts: 40

I got data similar to yours. What is the length of your reads, and what method did you use for purification: poly-A selection, ribo-minus, or something else?
Old 06-16-2010, 06:17 PM   #3
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30

I had a limited amount of RNA (~10 ng), so I amplified it using Ambion's MessageAmp and sequenced the aRNA on a 75x2 GA run.

I read somewhere on this forum that intron retention is very frequent, but I cannot find it anymore...
Old 06-16-2010, 10:08 PM   #4
pzumbo
Member
 
Location: NY

Join Date: Mar 2009
Posts: 11

These metrics are highly annotation-dependent. Consider, for example, the variation in the number of hg18 annotated bases according to the following databases:

knownGene = 79,498,653
refGene = 66,601,430
ensGene = 70,647,021
acembly = 177,417,935

(as retrieved from UCSC Table Browser, May 31, 2010).
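
For scale, the ratios between these totals can be worked out with a few lines of Python. The numbers are the ones quoted above; the dictionary and variable names are only illustrative.

```python
# Annotated hg18 bases per gene track, as quoted above.
annotated_bases = {
    "knownGene": 79_498_653,
    "refGene": 66_601_430,
    "ensGene": 70_647_021,
    "acembly": 177_417_935,
}

# Express every track relative to the smallest one.
smallest_name = min(annotated_bases, key=annotated_bases.get)
smallest = annotated_bases[smallest_name]
relative_size = {name: bases / smallest for name, bases in annotated_bases.items()}

for name, ratio in sorted(relative_size.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {ratio:.2f}x {smallest_name}")
```

acembly annotates roughly 2.7 times as many bases as refGene, so the same read can be counted as exonic under one annotation and as intronic or intergenic under another.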
Old 06-17-2010, 07:15 AM   #5
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30

I am not sure it is so dependent on annotation.

The 30% intronic reads I got was a fraction of the reads mapped to known genes, not of total mapped reads. If you have a less complete annotation, there are fewer exonic reads too.
Old 06-18-2010, 03:40 AM   #6
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269

It is easier to compare if you express the proportion with respect to the total number of mapped reads. The annotation does matter, but it is true that its impact should be limited if you consider the ratio of exonic to intronic reads. It depends more on the protocol. For instance, Li et al. (PNAS, 2008) also reported about 40% exonic and 20% intronic reads, but I think it was about microRNAs. You can find a related thread here
Old 11-08-2010, 01:39 PM   #7
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
intronic reads

I'm getting roughly 17% intronic.

Clearly the % depends on the genome/annotation but I am wondering how people are handling this? This seems to be quite a challenge for Cufflinks (for example) to predict transcripts.

Does anyone have any strategies for filtering intronic reads (particularly ones that are likely to represent background/precursor mRNA)? Such reads seem to be vastly inflating the number of predicted transcripts I get.

Cufflinks does have an option (-j) that is aimed at dealing with this, but I haven't found it to help much. Does anyone have any experience with this? Suggested values for that parameter?

Thanks!

Chris
Old 11-08-2010, 02:54 PM   #8
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30

I don't think intronic read fraction depends on annotation, unless you count "intergenic" reads as intronic.

I did a highly simplified calculation to see the effect of pre-mRNA fraction.

Assuming that exons make up 1/20 of the transcript length (roughly right for bovine), and that reads are uniformly distributed across the transcripts, I got:

Pre-mRNA fraction    Intronic read fraction
1%                   16%
2%                   28%
5%                   49%

I think pre-mRNA "contamination" is a more likely explanation.
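
The table can be reproduced with a short Python sketch under one reading of the assumptions above: the pre-mRNA fraction is a fraction of molecules, and each molecule contributes reads in proportion to its length. That reading is an assumption (it is not spelled out in the calculation), but it does give the same numbers. `intronic_read_fraction` is a hypothetical helper name.

```python
def intronic_read_fraction(pre_mrna_frac, exon_share=1.0 / 20.0):
    """Expected fraction of reads that fall in introns.

    pre_mrna_frac: fraction of molecules that are unspliced pre-mRNA (assumed).
    exon_share: exonic length as a fraction of the full transcript length.
    Reads are assumed uniform along each molecule, so a molecule's read
    count is proportional to its length.
    """
    if not 0.0 <= pre_mrna_frac <= 1.0:
        raise ValueError("pre_mrna_frac must be between 0 and 1")
    if not 0.0 < exon_share <= 1.0:
        raise ValueError("exon_share must be in (0, 1]")
    # Lengths are in units of one full-length primary transcript.
    pre_mrna_reads = pre_mrna_frac * 1.0
    mature_reads = (1.0 - pre_mrna_frac) * exon_share
    intronic_reads = pre_mrna_frac * (1.0 - exon_share)
    return intronic_reads / (pre_mrna_reads + mature_reads)


for p in (0.01, 0.02, 0.05):
    print(f"{p:.0%} pre-mRNA -> {intronic_read_fraction(p):.0%} intronic reads")
```

Because an unspliced molecule is 20 times longer than its spliced counterpart in this model, even a small number of pre-mRNA molecules accounts for a large share of reads.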

I saw the same problem you did: Cufflinks assembled many transcripts. Scripture appeared to outperform an earlier version of Cufflinks in this respect. It seemed to me that Scripture also models the significance of seeing reads above background.
Old 11-08-2010, 03:23 PM   #9
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Scripture

I was thinking, for example, that there could be undescribed exons in the "introns". I am not working on a standard model system... But yes, my assumption is that pre-mRNA is the problem.

Scripture looks interesting ...

But it raises another question:

Scripture seems to rely on paired-end data? (I haven't read closely yet.)

How much improvement in assembly (dealing specifically with pre-mRNA) does one get with paired-end data? Cufflinks too is primarily described for paired-end data, but the manual suggests that it "works well" with single-end. I haven't seen anything in the way of single-end assembly benchmarks.
Old 11-08-2010, 06:00 PM   #10
wenhuang
Member
 
Location: Raleigh, NC

Join Date: Feb 2010
Posts: 30

I think both Cufflinks and Scripture can handle single-end data; at least their strategy (very similar) of stitching alignments together does not seem to need paired-end data. Of course, paired-end data will improve the sizes of assemblies. I personally think junction reads are much more important than paired-end reads in assembling RNA-Seq alignments: since most protocols select an insert size around 300bp, the gain you get by sequencing the other end is probably not that much. Junction reads are also where alignment errors are more likely to occur, which messes up assembly as well. I have seen apparently wrong gene models from Scripture/Cufflinks because of wrong junction alignments.




All times are GMT -8.

