SEQanswers

Old 04-20-2009, 06:53 AM   #1
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default why low mapping rates for RNAseq?

Hi everyone!

I must say, I'm very happy to find a community where we can discuss this new technology.

I have searched the forum, but could not turn up a thread that discusses the issue of unmappable RNAseq reads.

According to the article "The digital generation" by Nathan Blow, Dr. Liu is quoted as saying that it is not unusual that only "40-50%" of the data generated are mappable. There is some mention that perhaps this unmappable sequence is from antisense transcripts or artefacts of the RNA processing.

Interestingly, J. Shendure mentions being able to achieve 95% mapping with genomic DNA.

Losing 50-60% of the RNA-seq data seems quite high. Has anyone looked into this more carefully? Are the majority of these unmappable reads just full of sequencing errors? Could there be contamination? And what is meant by artefacts of making a sequencing library from RNA? What would these artefacts look like, to make them unmappable?

Thanks for any thoughts.
NGSfan is offline   Reply With Quote
Old 04-20-2009, 10:28 PM   #2
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50
Default

Hi,
I think you need to be more precise: what is the length of the reads you want to place, and how many mismatches do you allow per read? And are reads that place at multiple locations counted as placed or not?

Actually, I was wondering about unplaced reads last weekend (much to my girlfriend's delight). In particular, I think that if we exclude the low-quality reads, the unplaced reads hide some non-trivial information...
francesco.vezzi is offline   Reply With Quote
Old 04-21-2009, 02:59 AM   #3
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Hi francesco!

thanks for joining in on the discussion.

For example, I examined some published data with 25bp reads and aligned them to the mouse genome allowing at most 2 mismatches. About 60% of the reads were alignable under those parameters. I looked at the remaining unmapped reads for possible contamination, but only 2% mapped to human or E. coli, for example.

Perhaps I'll take another look at the unmapped reads, allow 3 or 4 mismatches, and see how many more reads I can recover. But I'm sure many will still be unmapped, and I wonder whether this is just because there are more errors than Illumina advertises, or something else.

By non-trivial, do you mean some functional sequences? Any guesses what these other sequences could be? I'm very curious.

In my opinion it seems really odd that, after spending several thousand dollars on an expensive experiment, you only get to use 50-60% of the data, no?
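The "at most 2 mismatches" criterion is easy to picture with a tiny brute-force sketch (the reference and reads below are invented for illustration; real aligners like ELAND or BFAST use indexed search, not this scan):

```python
# Toy illustration of "mappable with at most k mismatches": slide each read
# along a reference and keep it if some window matches with <= k mismatches.

def hamming(a, b):
    """Count mismatching bases between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def is_mappable(read, reference, max_mismatches=2):
    k = len(read)
    return any(hamming(read, reference[i:i + k]) <= max_mismatches
               for i in range(len(reference) - k + 1))

reference = "ACGTACGTTTGCAGGCTAGCTAGGCTAAC"
reads = [
    "TTGCAGGCTAGC",  # exact match
    "TTGCAGGATAGC",  # one mismatch
    "GGGGGGGGGGGG",  # garbage: far from every window
]

mapped = [r for r in reads if is_mappable(r, reference)]
print(f"{len(mapped)} of {len(reads)} reads map")  # 2 of 3 reads map
```

A read carrying a SNP plus one sequencing error already exhausts the budget under this criterion, which is one reason raising the threshold to 3-4 mismatches recovers more reads.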
NGSfan is offline   Reply With Quote
Old 04-21-2009, 04:20 AM   #4
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50
Default

Quote:
Originally Posted by NGSfan View Post
Hi francesco!
In my opinion, it seems really odd, that after spending several thousand dollars on an expensive experiment to only get 50% to 60% of the data, no?
Well, this depends a lot on the reference you are aligning against. We have an Illumina Genome Analyzer, and two years ago we sequenced the grapevine genome. When we sequence the reference plant that we used to build the assembly, we align more than 74% of the reads with parameters quite similar to yours. If we sequence another variety of grapevine, we can align only 60% of the reads. This is not strange, because the two organisms are different.

You are using 25-base reads, which means they are quite old (Illumina can now produce 75bp paired-end reads) and were probably analysed with the old pipeline. You can find one interesting dataset by looking at http://tinyurl.com/68aeq3 and at the article "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads".

As for the non-trivial information in unaligned reads, there are many possibilities: repeated regions with many differences relative to the reference sequence, for example, or entirely new insertions.

Quote:
Originally Posted by NGSfan View Post
In my opinion, it seems really odd, that after spending several thousand dollars on an expensive experiment to only get 50% to 60% of the data, no?
No, I disagree with you. If we consider an Illumina experiment with 7 lanes, 50% of the reads still means more than 2 gigabases of data, and all of it is obtained at a fraction of the cost of the methods available only a year ago.
The problem is the opposite: there is too much data to analyse, and we are more or less using instruments that were developed for totally different kinds of data.

Obviously, this is only the opinion of a four-month PhD student...
francesco.vezzi is offline   Reply With Quote
Old 04-22-2009, 10:35 AM   #5
Melissa
Senior Member
 
Location: Switzerland

Join Date: Aug 2008
Posts: 124
Smile

Unmappable reads are new to me. I'm particularly interested in the difficulties of de novo transcriptome assembly.

Can you define these unmappable reads more precisely? Are you referring to the raw data, or to high-quality reads after filtering? If you are referring to filtered reads, then sequencing errors cannot contribute to this problem. I believe these reads result from technical/experimental problems rather than from the nature of the sequences themselves; for example, low-quality reads due to the platform's temperature problems, or artifacts created at the edges of the flow cell.

Most RNA-seq data usually contain 30-40% rRNA. After filtering out low-quality reads, contaminating reads, and poly-A tails, mapping 50-60% of reads sounds REALLY good to begin with. So the answer is NO, it is not a waste to get only 50-60% of reads mapped. High redundancy is another reason why some reads are not useful at all.

I'm not sure why some reads cannot map to the reference genome; well, they just don't. Maybe some reads that overlap/span exon splice junctions are lost during mapping. RNA processing and other regulatory mechanisms sound like a good explanation.
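As a back-of-the-envelope illustration of that point (every number below is an assumption for the sake of arithmetic, not a measurement from any real run):

```python
# If ~35% of raw reads are rRNA and another ~10% are low-quality or poly-A
# artifacts, then mapping ~52% of RAW reads means you mapped ~95% of the
# reads that could plausibly map. All figures here are illustrative.

raw_reads = 10_000_000
rrna_frac = 0.35   # "most RNA-seq data usually contain 30-40% rRNA"
junk_frac = 0.10   # low-quality / adapter / poly-A artifacts (assumed)

usable = raw_reads * (1 - rrna_frac - junk_frac)
mapped = 5_200_000  # i.e. 52% of raw reads mapped

print(f"usable mRNA-derived reads: {usable:,.0f}")          # 5,500,000
print(f"mapped, as share of usable: {mapped / usable:.0%}") # 95%
```

Whether the rRNA fraction really drops out of the mappable pool depends on the reference, of course; against a whole-genome reference the rRNA reads should themselves map, as discussed further down the thread.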

Cheers,
Melissa
Melissa is offline   Reply With Quote
Old 04-22-2009, 05:12 PM   #6
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by NGSfan View Post
According to the article "The digital generation" by Nathan Blow, Dr. Liu is quoted as saying that it is not unusual that only "40-50%" of the data generated are mappable. [...] Has anyone looked into this more carefully?
What is your reference genome? The whole genome, or just the transcriptome? And if the latter, what's your definition of the transcriptome? In the distant past I used Refseq as a reference, but I now think that's too limiting.

50-60% being "unmappable" sounds strangely high to me.

In my experience, the major issue with RNA-seq is ribosomal RNA contamination. You must do something like a poly-A pull down for these experiments, or the majority of your data will be ribosomal RNA.

For example, in an experiment I ran over a year ago (before RNA-seq was as well established as it is now), I used oligo(dT) cDNA synthesis thinking that would enrich for mRNA sequences enough to keep the rRNA sequences at a low level. Turns out that wasn't the case, and about 70% of my data from that lane of Solexa data aligned to ribosomal RNA sequence.

In that experiment, only an additional 9.5% of the data aligned to Refseq. That means that only 20.5% of my data was "unmappable" for whatever reason. This was also using an older version of our aligner (BFAST), so it's possible more of them would be aligned if I were to re-run the data.

Suffice it to say, in that experiment my major issue was rRNA contamination. As Melissa pointed out, a poly-A pulldown can do a good job of alleviating this problem, and from what I've read it can enrich your sample for mRNA such that you end up with only 30-40% rRNA contamination after a single round of poly-A purification.

That said, if your reference genome doesn't include ribosomal RNA, it's possible the "unmapped" reads are mostly ribosomal RNA. Do you know if your reference contains rRNA?
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Michael.James.Clark is offline   Reply With Quote
Old 04-23-2009, 01:30 AM   #7
jmaury
Junior Member
 
Location: Paris

Join Date: Mar 2008
Posts: 4
Default

Hi,

As mentioned by Melissa, exon/exon junctions will significantly reduce the number of mappable reads! So the number of mapped reads will be determined by the characteristics of your genome; in particular, the number of exons per gene has to be taken into account!
Recently I worked on RNA-Seq of the grapevine genome, and we mapped around 80% of the initial reads; but the average number of exons per gene is less than 5, relatively low compared to mammalian species.
You'll find more information here: http://seqanswers.com/forums/showthread.php?t=1015

Hope this helps,

Cheers,
Jean-Marc
jmaury is offline   Reply With Quote
Old 04-23-2009, 01:34 AM   #8
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50
Default

Quote:
Originally Posted by jmaury View Post
Hi,
Recently I've worked on RNA-Seq from grapevine genome and we've mapped around 80% of the initial reads, but the average number of exons per gene is less than 5, relatively low compared to mammalian species.
We have performed similar experiments on grapevine (I work at IGA in Udine) and obtained the same results. Actually, the number of reads that span exon-intron junctions is really low.
francesco.vezzi is offline   Reply With Quote
Old 04-23-2009, 05:27 AM   #9
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by francesco.vezzi View Post
Well, this depends a lot on the reference you are aligning against. We have an Illumina Genome Analyzer, and two years ago we sequenced the grapevine genome. When we sequence the reference plant that we used to build the assembly, we align more than 74% of the reads with parameters quite similar to yours. If we sequence another variety of grapevine, we can align only 60% of the reads. This is not strange, because the two organisms are different.
Yes, it definitely seems that one can get higher mapping rates when aligning reads generated from genomic DNA.

But for RNA-seq, looking at the literature (the mouse RNA-seq studies of Mortazavi, Pan, Sultan, etc.), the trend is that 50-60% of all generated reads can be mapped when using the entire genome as the reference sequence.

Quote:
Originally Posted by francesco.vezzi View Post
You are using 25-base reads, which means they are quite old (Illumina can now produce 75bp paired-end reads) and were probably analysed with the old pipeline. You can find one interesting dataset by looking at http://tinyurl.com/68aeq3 and at the article "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads".
For sure, longer reads will help, especially for genome assembly and re-sequencing. But the problem for RNA-seq appears to be different.

Quote:
Originally Posted by francesco.vezzi View Post
As for the non-trivial information in unaligned reads, there are many possibilities: repeated regions with many differences relative to the reference sequence, for example, or entirely new insertions.
Actually, the 50-60% mapped reads I mentioned includes reads that map to multiple locations in the genome (i.e. repeat regions, paralogs, etc.). If you count only the reads that map *uniquely* to a single location in the reference genome, the percentage drops to 44%.


Quote:
Originally Posted by francesco.vezzi View Post
No, I disagree with you. If we consider an Illumina experiment with 7 lanes, 50% of the reads still means more than 2 gigabases of data, and all of it is obtained at a fraction of the cost of the methods available only a year ago.
The problem is the opposite: there is too much data to analyse, and we are more or less using instruments that were developed for totally different kinds of data.

Very true. Compared to Sanger sequencing this is much more cost-effective. But if people want to use RNA-seq for de novo profiling or, taken a step further, quantitative gene expression measurement, then they should be aware that, according to published results so far, 40-50% of the data is not usable; that's 40-50% of a ~$6300 sequencing run. This is fine for labs with lots of money to burn, but smaller labs will have to consider it more carefully before they treat RNA-seq as routine.

It would be nice to know and figure out how to recover these reads, and take more advantage of the data.

Last edited by NGSfan; 04-23-2009 at 05:33 AM.
NGSfan is offline   Reply With Quote
Old 04-23-2009, 05:54 AM   #10
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by Melissa View Post
Can you define these unmappable reads more precisely? Are you referring to the raw data, or to high-quality reads after filtering? If you are referring to filtered reads, then sequencing errors cannot contribute to this problem. I believe these reads result from technical/experimental problems rather than from the nature of the sequences themselves; for example, low-quality reads due to the platform's temperature problems, or artifacts created at the edges of the flow cell.
I am referring to the primary sequence data - all the raw reads generated from a RNA-seq experiment.

That is interesting what you mention about a low temp problem and flow cell edges - I did not know about these issues. It would be interesting to know what fraction of reads are unmappable because of those issues.

Quote:
Originally Posted by Melissa View Post
Most RNA-seq data usually contain 30-40% rRNA. After filtering out low-quality reads, contaminating reads, and poly-A tails, mapping 50-60% of reads sounds REALLY good to begin with. So the answer is NO, it is not a waste to get only 50-60% of reads mapped. High redundancy is another reason why some reads are not useful at all.
But if you are aligning to the entire genome, the rRNA reads should align as well, no?

Quote:
Originally Posted by Melissa View Post
I'm not sure why some reads cannot map to the reference genome; well, they just don't. Maybe some reads that overlap/span exon splice junctions are lost during mapping. RNA processing and other regulatory mechanisms sound like a good explanation.
According to most estimates (Mortazavi, for example), only ~3% of all reads (mapped and unmapped) fall on splice junctions, so this is quite small really.
NGSfan is offline   Reply With Quote
Old 04-23-2009, 06:00 AM   #11
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50
Default

Quote:
Originally Posted by NGSfan View Post
It would be nice to know and figure out how to recover these reads, and take more advantage of the data.
It looks like a philosophical problem... what do we do with something that at first sight has no importance?

Where did you get the data that originated this discussion? I want to align them with our tool and see what happens...
francesco.vezzi is offline   Reply With Quote
Old 04-23-2009, 06:01 AM   #12
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by Michael.James.Clark View Post
50-60% being "unmappable" sounds strangely high to me.

In my experience, the major issue with RNA-seq is ribosomal RNA contamination. You must do something like a poly-A pull down for these experiments, or the majority of your data will be ribosomal RNA.

For example, in an experiment I ran over a year ago (before RNA-seq was as well established as it is now), I used oligo(dT) cDNA synthesis thinking that would enrich for mRNA sequences enough to keep the rRNA sequences at a low level. Turns out that wasn't the case, and about 70% of my data from that lane of Solexa data aligned to ribosomal RNA sequence.
Quote:
Originally Posted by Michael.James.Clark View Post
Do you know if your reference contains rRNA?
Yes, I'm using the entire genome, so I would assume the rRNA should be mapping as well, no?

Yes, 50-60% mapped is really odd (and that is simply mapping reads to anything with at most 2 mismatches, not uniquely mapping reads, which are only 44%), especially considering that the whole genome is used as the reference, so things like poly-A tails, repeat regions, and rRNA should still map.
NGSfan is offline   Reply With Quote
Old 04-23-2009, 09:07 AM   #13
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by NGSfan View Post
Yes I'm using the entire genome, so I would assume rRNA should be mapping as well, no?
Yes, it should. Of course, you can check.

Quote:
Yes, 50-60% mapped is really odd (talking about just simply mapping reads to anything, 2 mismatches max, not talking about those uniquely mapping, which is only 44%), especially when considering that the whole genome is used as a reference, so things like poly-A, repeat regions, rRNA, should still be mapping.
Have you considered using something other than whole genome as a reference? Since you're using mouse, it should be easy to do.

Generate an adequate transcriptome reference and align to that. In our lab, we've used both Refseq and UCSC known genes as a reference successfully, although I feel we will do a better job the more permissive we become.

I've definitely seen in my own data even coverage across exon-exon junctions, and used exon coverage to identify splice variants.

For discovering novel transcripts, whole genome is fine. But for looking at expression, splice variants, et cetera, I think we're better off using a transcriptome reference.

Quote:
Originally Posted by NGSfan View Post
According to most estimates (Mortazavi, for example) only 3% of all reads (mapped and unmapped) fall on splice junctions - so this is quite small really.
Can you reference the source for some of these estimates so we can evaluate that claim? I'd just like to take a look at the papers myself. Thanks.

Also, what alignment algorithm are you using? You say you are only robust against two mismatches. That's pretty low: consider that if you have a SNP and a sequencing error, you've already filled your quota. It also means that, since you're using the whole genome with 25-base reads, as soon as you're within 23 bases of an exon junction you won't align any reads. Considering the size of a lot of exons, you'll completely miss quite a few of them.

Again, I think aligning to a better reference will help you out on that front.

Last edited by Michael.James.Clark; 04-23-2009 at 09:21 AM.
Michael.James.Clark is offline   Reply With Quote
Old 04-23-2009, 10:21 PM   #14
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50
Default

Quote:
Originally Posted by Michael.James.Clark View Post
Generate an adequate transcriptome reference and align to that.
We have performed an experiment like that, but the problem is that the exon/intron junctions are often not perfectly known. If you take short reads from the transcriptome and align them against the reference sequence allowing gaps (which lets you place, for example, half of a read in one exon and the other half in the next), you will probably discover that some exon junctions are not totally correct, and maybe you can even identify new exons.

So the experiment you are proposing is perfect in theory, but in practice I'm not sure it will work.

Francesco
francesco.vezzi is offline   Reply With Quote
Old 04-23-2009, 11:06 PM   #15
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by francesco.vezzi View Post
We have performed an experiment like that, but the problem is that the exon/intron junctions are often not perfectly known. If you take short reads from the transcriptome and align them against the reference sequence allowing gaps (which lets you place, for example, half of a read in one exon and the other half in the next), you will probably discover that some exon junctions are not totally correct, and maybe you can even identify new exons.

So the experiment you are proposing is perfect in theory, but in practice I'm not sure it will work.

Francesco
In some cases that's true. For mouse? Not as big a problem.

My advice is to generate a reference genome with all possible splice variants and align to it.

Otherwise you will see a massive drop in coverage at all exon-exon junctions as I said earlier, which will result in a large number of "unmappable reads" when you're only robust against two mismatches.

We've done it this way in our lab with human and it has worked.
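A sketch of what building such a junction reference might look like: for each adjacent exon pair, take the last (read length - 1) bases of one exon and the first (read length - 1) bases of the next, so every junction-crossing read fits entirely inside one junction sequence. The genome string and exon coordinates below are invented for illustration; real gene models would come from UCSC/Refseq annotation.

```python
READ_LEN = 25
FLANK = READ_LEN - 1  # a junction-crossing read overhangs at most L-1 bases

# Made-up genome: two 10 bp "exons" embedded in filler sequence.
genome = "A" * 50 + "CGTACGTACG" + "T" * 40 + "GCATGCATGC" + "A" * 50
exons = [(50, 60), (100, 110)]  # half-open [start, end) coordinates

def junction_sequences(genome, exons, flank=FLANK):
    """One concatenated sequence per adjacent exon pair in a transcript."""
    juncs = []
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        left = genome[max(s1, e1 - flank):e1]   # tail of upstream exon
        right = genome[s2:min(e2, s2 + flank)]  # head of downstream exon
        juncs.append(left + right)
    return juncs

libs = junction_sequences(genome, exons)
print(len(libs), "junction(s); length of first:", len(libs[0]))  # 1 ... 20
```

This is essentially the "library of all known splice events" approach from the Mortazavi paper quoted later in the thread; its obvious limitation, as Francesco says, is that it can only represent junctions the annotation already knows about.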
Michael.James.Clark is offline   Reply With Quote
Old 04-24-2009, 12:43 AM   #16
jmaury
Junior Member
 
Location: Paris

Join Date: Mar 2008
Posts: 4
Default

Hello,

Quote:
Originally Posted by NGSfan View Post
According to most estimates (Mortazavi, for example) only 3% of all reads (mapped and unmapped) fall on splice junctions - so this is quite small really.
I would like to say: "Don't underestimate splice junctions!!"

Like Michael.James.Clark, I'm very surprised by the figure of 3% (could you give us a link to the article)!

For example, suppose you sequence a transcript of 1500nt and obtain 555 reads of 25nt (an average coverage of 9.25X). If your mRNA contains 10 exons, around 25% of the reads will fall on splice junctions (so you'll map only 75% of the reads, without considering sequencing errors and artefacts).
To obtain 3% with 25nt reads, your initial transcript should contain only two exons. So the number of unmapped reads is highly correlated with the number of exons per transcript and with the read length!!

Cheers,

Jean-marc
jmaury is offline   Reply With Quote
Old 04-24-2009, 01:51 AM   #17
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Hi guys, the paper is:

"Mapping and quantifying mammalian transcriptomes by RNA-Seq"

http://www.nature.com/nmeth/journal/...meth.1226.html

"Splice-crossing reads, such as are shown for Myf6 (Fig. 1b), were identified by mapping otherwise unassigned sequence reads to a library of all known splice events in all University of California Santa Cruz genome database (UCSC) Mouse July 2007 (mm9) gene model splices. When we summed over the entire dataset, including all otherwise unmappable reads, splice-spanning reads comprised approximately 3% (Supplementary Table 1), which is consistent with splice frequency in gene models across the genome."
NGSfan is offline   Reply With Quote
Old 04-24-2009, 02:02 AM   #18
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by Michael.James.Clark View Post
In some cases that's true. For mouse? Not as big a problem.

My advice is to generate a reference genome with all possible splice variants and align to it.

Otherwise you will see a massive drop in coverage at all exon-exon junctions as I said earlier, which will result in a large number of "unmappable reads" when you're only robust against two mismatches.

We've done it this way in our lab with human and it has worked.

Wow, this is interesting. What percentage of reads did you recover when you did this? How long are your reads?

One thing is for sure: the longer the reads get (50, 75, 100, etc.), the more likely they are to cross a splice junction, so the recovery effect will be stronger for longer reads.

When I use the Mortazavi dataset (25bp) and compare the genome reference vs. UCSC Known Genes transcripts (no introns), I see a very small increase in reads recovered, ~2%, much as their paper states. Of course, when switching over to a transcript reference, I am also losing reads that fall in places not covered by the annotated transcripts.

Using a reference of transcripts with all splice variants instead of the whole genome has its caveats as well: you will miss novel junctions that are not yet documented. But of course, this will be a very small number of reads.

You could run a newer program called TopHat that handles splice junctions, but it only captures ~80% of them.

The mystery to me is that jmaury's argument makes sense: one would expect more splice-junction reads, so this is quite odd, isn't it?

Last edited by NGSfan; 04-24-2009 at 02:14 AM.
NGSfan is offline   Reply With Quote
Old 04-24-2009, 02:20 AM   #19
klh
Junior Member
 
Location: Cambridge

Join Date: Oct 2008
Posts: 1
Default

Hi Jean-marc,

Quote:
Originally Posted by jmaury View Post
Hello,

For example, suppose you sequence a transcript of 1500nt and obtain 555 reads of 25nt (an average coverage of 9.25X). If your mRNA contains 10 exons, around 25% of the reads will fall on splice junctions (so you'll map only 75% of the reads, without considering sequencing errors and artefacts).

Jean-marc
I'm not quite sure I understand how you obtain your estimate of 25%. A 1500nt transcript with 10 x 150nt exons contains 1476 possible 25mers (1500 - 25 + 1), 216 of which (9 * 24) cross splice boundaries. I make this ~15% of reads crossing splice junctions in this example.

But your point about the dependence on transcript length, number of exons per transcript, and read length is well made. Performing a similar calculation on the 1372 protein-coding transcripts on human chr22 (Ensembl-53) gives:

Read length 25 => ~9% cross splice junctions
Read length 36 => ~13%
Read length 50 => ~18%

However, this calculation assumes (a) uniform coverage across all transcripts, and (b) uniform expression of all transcripts. Both of these are clearly gross simplifications! A figure of 3% might sound low, but if the sample contains a number of highly expressed transcripts with long or few exons, it becomes more reasonable.
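The arithmetic is easy to reproduce in a few lines (a sketch under the same two uniformity assumptions; note the 25mer count comes out to 1476 = 1500 - 25 + 1):

```python
# Fraction of read start positions in a spliced transcript whose reads cross
# an exon-exon junction: a transcript of length T has T - L + 1 possible
# L-mers, and each of the (n_exons - 1) junctions is crossed by L - 1 of
# them (assuming exons of at least L - 1 nt and uniform coverage).

def junction_fraction(transcript_len, n_exons, read_len):
    total = transcript_len - read_len + 1
    crossing = (n_exons - 1) * (read_len - 1)
    return crossing / total

# The 1500 nt, 10-exon transcript with 25 nt reads from this thread:
print(round(junction_fraction(1500, 10, 25), 3))  # 0.146, i.e. ~15%

# A two-exon transcript, per jmaury's point, gets close to the 3% figure:
print(round(junction_fraction(1500, 2, 25), 3))   # 0.016

# And longer reads cross junctions more often:
for L in (25, 36, 50):
    print(L, round(junction_fraction(1500, 10, L), 3))
```

The chr22 percentages quoted above come from summing this kind of count over real transcript models rather than a single toy transcript, so the numbers differ, but the trend with read length is the same.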

Kevin
klh is offline   Reply With Quote
Old 04-24-2009, 02:47 AM   #20
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by klh View Post
However, this calculation assumes (a) uniform coverage across all transcripts, and (b) uniform expression of all transcripts. Both of these are clearly gross simplifications! A figure of 3% might sound low, but if the sample contains a number of highly expressed transcripts with long or few exons, it becomes more reasonable.
Hi Kevin,

Thanks for the nice explanation. I think your points about the assumptions make it much easier to understand why one would observe 3% instead of the expected higher frequency.
NGSfan is offline   Reply With Quote